

Exploring Gradient Boosting vs Linear Regression: Selecting the Best Prediction Tool and Developing a Streamlit App
Creating your project’s environment
Using VSCode for Data Analysis
Data Treatment
Data Analysis
Machine Learning: Gradient Boosting vs Linear Regression
Assessing Performance of Machine Learning Models
Saving the Best Model
Saving Predictions to a Database and Using a Streamlit App
Running your Streamlit App
Final Words
References:

Hey there! I’m Ana, a data enthusiast and a machine learning apprentice. Welcome to my first post on Medium, where I’ll be sharing my journey and insights into the exciting world of data analysis and predictive modeling. In this post, I’d like to take you on a fascinating exploration of two powerful prediction tools: Gradient Boosting and Linear Regression models. We’ll dive into the realm of data preprocessing, model selection, and even build a practical Streamlit app to showcase the power of these models. So grab a cup of coffee and join me on this enlightening adventure!

In the world of data science and machine learning, a dedicated project environment is a MUST. It ensures a smooth workflow, facilitates collaboration, and promotes reproducibility. For instance, managing libraries and packages can be challenging. By creating an environment, you encapsulate dependencies and their specific versions, ensuring a conflict-free project. Furthermore, documenting the exact software setup used in your project enables others to reproduce your work accurately.

Remember, investing time upfront to establish a well-structured environment sets the stage for an efficient and organized project workflow, empowering you to extract meaningful insights from your data.

To create the environment, open a terminal in the directory where you’ll be running your scripts.

Then, create a new virtual environment with a command like this:
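
A minimal example (assuming Python 3 is already installed and on your PATH):

```
python -m venv myenv
```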

Replace “myenv” with the desired name for your virtual environment.

Finally, activate the virtual environment using the appropriate command for your operating system. If you’re using Windows, it could look something like this:
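
For example (on macOS or Linux the equivalent is "source myenv/bin/activate"):

```
myenv\Scripts\activate
```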

You should see your environment name on the left of the prompt, indicating that you’re running inside the chosen virtual environment. Great! We’ve created our own virtual environment.

Make sure to install all the libraries we need for this project. To install a new library, refer to its documentation. Documentation is our best friend in data science, seriously. We’ll see the specific libraries we’ll be using in the next steps when we get hands-on with the programming. Let’s put this environment to use!

Now, let’s set up and use our new virtual environment in Visual Studio Code (VSCode), a popular code editor. I personally love VSCode for its simplicity and ease of use when it comes to coding. It provides a user-friendly interface, an extensive plugin ecosystem, and robust features that make it ideal for data science and machine learning projects. But hey, that’s just my opinion, and it’s what we’ll be using here.

Open the folder you created in VSCode, and you should see the newly created virtual environment there.

Press “Ctrl + Shift + P” and click ‘Python: Select Interpreter.’ Select the environment you created. Voilà! We’re all set.

Before we start talking about the data, open a new Jupyter notebook in VSCode and import the libraries we’ll need for data analysis.
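
A sketch of the imports I’d expect here, based on the libraries used throughout the rest of this post (the exact original list isn’t shown):

```
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
```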

For our studies, we’ll be working with a published dataset available in the StatLib Datasets Archive. This website offers multiple datasets for learning purposes, which is fantastic for anyone who needs data to train their skills. Specifically, we’ll be using the Boston house-price data from Harrison, D. and Rubinfeld, D.L.’s ‘Hedonic prices and the demand for clean air’ (J. Environ. Economics & Management, vol. 5, 81–102, 1978).

But before diving into the code, let’s take a quick peek at what the data looks like.

The data is structured as a CSV file and contains information about the data source, column names, definitions, and the actual data itself.

Now, let’s talk about the unsung hero of data science projects: data treatment. It’s like the behind-the-scenes magician who makes the magic happen. Sure, it might take us hours, days, or even what feels like an eternity to wrangle and clean our data, but trust me, it’s worth it!

Why? Because data treatment is the secret sauce that turns messy, unruly data into a well-behaved, reliable masterpiece. It’s like taming a wild beast: smoothing out inconsistencies, banishing missing values, and waving goodbye to outliers.

Sure, data treatment can be a time-consuming process, and it might feel like we’re lost in a never-ending maze of transformations. BUT, with patience, perseverance, and the right techniques, we can create really cool models, armed with clean, standardized, and feature-rich data.

In our case, the first step is to load the data. Since it’s in CSV format, we’ll use pandas’ read_csv() function. We’ll only load rows after the 22nd row because our data starts from there. Our target variable, what we want to predict, is the “MEDV” column, representing the median value of owner-occupied homes.
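
Here’s a sketch of the loading step. The file path is a placeholder for wherever you saved the dataset; the whitespace separator and the 22 skipped header lines follow the description above:

```
# Load the raw Boston housing file; the first 22 lines are description text
raw_df = pd.read_csv("housing.csv", sep=r"\s+", skiprows=22, header=None)
```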

Sometimes, depending on the source, we may need to extract features and targets separately. Though it’s not essential in this case, I’ll show you how to do it for the purpose of learning. As we saw in the data snippet, each record’s columns are split across two lines. If we were to create a DataFrame as is, we’d encounter many nulls for the empty entries. So, we’ll extract all the columns before the target and arrange them side by side to fix their format.

Here’s how it works:

§ raw_df.values returns a NumPy array representation of the raw_df DataFrame.

§ raw_df.values[::2, :] selects every second row (starting from the first row) and all columns of the raw_df.values array. This creates a new array containing a subset of rows from the original array.

§ raw_df.values[1::2, :2] selects every second row (starting from the second row) and the first two columns of the raw_df.values array (the third is the target). This creates another new array containing a different subset of rows and columns.

We pass these two subsets of arrays as a list to NumPy’s hstack() function, which horizontally stacks them together. The resulting array will have the combined columns from the two subsets. In other words, hstack() concatenates the two subsets horizontally, creating a new array with a wider shape.
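
Putting that together (the variable name data is my own choice):

```
# Stack the two physical lines of each record side by side into one row
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
```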

Now, let’s get our target and column data:
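
The target sits in the third column of every second line, so it can be pulled out like this (the variable name is again my choice):

```
target = raw_df.values[1::2, 2]
```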

Finally, let’s create DataFrames from our newly created arrays, so we have our data organized for further treatment with pandas.
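
A sketch, assuming the standard Boston column names; the ‘MEDV-TARGET’ label matches the one used later in this post:

```
feature_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]

X = pd.DataFrame(data, columns=feature_names)
y = pd.DataFrame(target, columns=["MEDV-TARGET"])
df = pd.concat([X, y], axis=1)  # one table with features and target together
```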

That’s how our data looks now:

Before we wrap up the data treatment, it’s essential to perform data checks. Check for nulls, duplicates, and other potential issues to avoid misleading assumptions and prevent future crashes.
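
A few quick checks might look like this:

```
print(df.isnull().sum())      # missing values per column
print(df.duplicated().sum())  # number of duplicated rows
print(df.dtypes)              # data types per column
```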

While I won’t show all the outputs here, feel free to explore them yourself and see what they tell you. Trust me, the data passes all the checks! In future posts, I’ll delve deeper into data treatment and how to handle missing data.

Data analysis plays a vital role in the machine learning pipeline. It involves exploring correlations, creating plots, and gaining insights into the data. By understanding the relationships between variables, identifying patterns, and making informed decisions, we can build a robust machine learning model.

So, let’s dive into data analysis:

First, we’ll use the describe() function in pandas to get our initial insights into the data. This function provides summary statistics for each column:
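
```
# Summary statistics (count, mean, std, min, quartiles, max) for each column
df.describe()
```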

The output reveals large variations in the ranges of the column values, indicating that scaling the features could potentially improve our model’s performance.

Next, let’s create some visualizations to gain a better understanding of our data. We can start by making a correlation matrix plot using the heatmap() function from the seaborn library:
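
A sketch (the figure size and color map are my own choices):

```
plt.figure(figsize=(12, 10))
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.show()
```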

This heatmap provides a visible representation of the correlations between different features in our dataset. The colour intensity indicates the strength of the correlation, with darker colours representing stronger correlations. The annotations on the heatmap show the correlation values.
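
The next visualization is a pair plot. Here’s a sketch of how it might be produced, assuming seaborn’s pairplot() with regression scatter plots and KDE diagonals as described below:

```
sns.pairplot(df, kind="reg", diag_kind="kde")
plt.show()
```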

The pair plot displays scatter plots for each pair of features and the target variable, with regression lines fitted to capture the overall trend. The kernel density estimation (KDE) plots on the diagonal show the distribution of each variable.

These plots provide valuable insights into the relationships and patterns within our data. They help us identify potential correlations between the features and the target variable, as well as understand the distributions of individual variables.

While some features show higher or lower correlations than others, none of them exhibit very large correlation values. This suggests that individual variables alone may not be sufficient to explain the data. Sometimes it’s worth excluding variables that add more noise than signal, but we won’t be doing that in this case.

Hey there! We’ve reached the exciting part: comparing machine learning models. But before we dive in, let’s quickly refresh our memory on two key concepts: Gradient Boosting and Linear Regression.

So, what’s the difference between Gradient Boosting and Linear Regression?

Gradient Boosting is a powerful algorithm that combines multiple weak learners to create a strong predictive model. It’s flexible, captures non-linear relationships, and can handle different data types. However, its interpretability is limited, and handling outliers requires additional techniques.

On the other hand, Linear Regression predicts a continuous target variable based on linear relationships between the features. It’s simpler and more interpretable, but it’s sensitive to outliers and best suited to numerical features.

In summary, Gradient Boosting is complex and powerful but less interpretable, while Linear Regression is simpler, interpretable, and works well for linear relationships in numerical features. The choice depends on the complexity of the problem, the need for interpretability, and how outliers are handled.

To make things easier, let’s start by creating a root mean square error function that we’ll use later for our testing.
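
A minimal version, built on scikit-learn’s mean_squared_error (the function name is my own):

```
from sklearn.metrics import mean_squared_error

def root_mean_squared_error(y_true, y_pred):
    """Root mean square error between actual and predicted values."""
    return np.sqrt(mean_squared_error(y_true, y_pred))
```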

Now, we’ll follow these steps for both machine learning models:

We’ll use LinearRegression as an example for the step-by-step process. First, we’ll separate the target variable (‘MEDV-TARGET’) from the input features and assign them to the variables X and y, respectively.

We’ll standardize the input features (X) using the StandardScaler() from scikit-learn. This step ensures that all features are on the same scale, preventing any one feature from dominating the learning process. It improves performance and stability, and allows fair comparisons between features.

We’ll split the scaled input features (X_scaled) and the target variable (y) into training and test sets using the train_test_split() function. This allows us to evaluate the model’s performance on unseen data and detect overfitting issues.

We’ll create an instance of the machine learning model and assign it to the variable model. This sets up the model’s structure and algorithms for learning patterns from the data.

The code then performs K-fold cross-validation using the KFold() function with 10 splits. K-fold cross-validation is a robust technique for assessing the performance and generalization of a machine learning model: the data is split into 10 equal parts, or “folds,” and the model is trained and evaluated 10 times, each time using a different fold as the validation set and the remaining folds as the training set. This process provides a more comprehensive evaluation of the model’s performance by considering multiple subsets of the data.

During each iteration of the K-fold cross-validation loop, the model is trained on the training folds and evaluated on the validation fold. The best model is chosen based on the highest R-squared value, which measures how well the model fits the data.

Once the best model is determined through cross-validation, it’s applied to the entire dataset (X_scaled) to generate predictions for the target variable. These predictions are stored in the variable y_pred, and the best model itself is kept in the variable model.

We’ll calculate the R-squared value and the root mean square error to evaluate the performance of the best model. These metrics provide insights into how well the model fits the data and makes predictions.
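
Putting the steps above together, here is a sketch of the whole process for Linear Regression. The variable names, split sizes, and random seeds are my own assumptions, not the original code:

```
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

# Separate the target variable from the input features
X = df.drop(columns=["MEDV-TARGET"])
y = df["MEDV-TARGET"]

# Standardize the input features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Hold out a test set for the visual comparisons later on
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42)

# 10-fold cross-validation, keeping the model with the highest R-squared
kf = KFold(n_splits=10, shuffle=True, random_state=42)
best_r2, model = -np.inf, None
for train_idx, val_idx in kf.split(X_scaled):
    candidate = LinearRegression()
    candidate.fit(X_scaled[train_idx], y.iloc[train_idx])
    r2 = candidate.score(X_scaled[val_idx], y.iloc[val_idx])
    if r2 > best_r2:
        best_r2, model = r2, candidate

# Apply the best model to the whole scaled dataset and evaluate it
y_pred = model.predict(X_scaled)
r2_lr = r2_score(y, y_pred)
rmse_lr = root_mean_squared_error(y, y_pred)
print(f"Linear Regression - R2: {r2_lr:.4f}, Root-MSE: {rmse_lr:.4f}")
```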

We’ll repeat the same approach for our Gradient Boosting model, this time using the GradientBoostingRegressor module from scikit-learn:
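
A sketch of the Gradient Boosting version, reusing the same folds (default GradientBoostingRegressor settings are my assumption):

```
from sklearn.ensemble import GradientBoostingRegressor

best_r2_gb, model_gb = -np.inf, None
for train_idx, val_idx in kf.split(X_scaled):
    candidate = GradientBoostingRegressor(random_state=42)
    candidate.fit(X_scaled[train_idx], y.iloc[train_idx])
    r2 = candidate.score(X_scaled[val_idx], y.iloc[val_idx])
    if r2 > best_r2_gb:
        best_r2_gb, model_gb = r2, candidate

y_pred_gb = model_gb.predict(X_scaled)
r2_gb = r2_score(y, y_pred_gb)
rmse_gb = root_mean_squared_error(y, y_pred_gb)
print(f"Gradient Boosting - R2: {r2_gb:.4f}, Root-MSE: {rmse_gb:.4f}")
```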

Wondering how to check which model is performing better? Let’s start by creating a table to compare the performance metrics of Linear Regression and Gradient Boosting. The table will include the R-squared values and the root mean square error (Root-MSE) values.
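
One way to build such a table, using the metrics computed above:

```
results = pd.DataFrame({
    "Model": ["Linear Regression", "Gradient Boosting"],
    "R-squared": [r2_lr, r2_gb],
    "Root-MSE": [rmse_lr, rmse_gb],
})
print(results)
```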

The results show that for the Linear Regression model, the R-squared value is 0.797895, indicating that roughly 79.79% of the variance in the target variable can be explained by the model. The Root-MSE value is 4.687595, representing the typical deviation between the actual target values and the predicted values, with a lower value indicating better predictive accuracy.

The Gradient Boosting model, on the other hand, demonstrates better performance. It achieves an impressive R-squared value of 0.970070, suggesting that around 97.01% of the variance in the target variable is accounted for by the model. Moreover, the Root-MSE value is significantly lower at 1.589545, indicating a higher level of predictive accuracy compared to the Linear Regression model.

These metrics provide insights into how well the models fit the data and make predictions.

Now, let’s visualize this difference using plots.

First, I’ll create specific training sets for visualization purposes.

Next, we’ll use a scatter plot to visualize the performance of the models.
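
A sketch of how such a plot could be produced with the held-out test set (variable names follow the earlier sketches; styling choices are my own):

```
# Predictions on the held-out test set for both models
y_pred_lr_test = model.predict(X_test)
y_pred_gb_test = model_gb.predict(X_test)

plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred_lr_test, alpha=0.6, label="Linear Regression")
plt.scatter(y_test, y_pred_gb_test, alpha=0.6, label="Gradient Boosting")
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
         "k--", label="Perfect prediction")
plt.xlabel("Actual MEDV")
plt.ylabel("Predicted MEDV")
plt.legend()
plt.show()
```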

The scatter plot compares the predicted values generated by the Linear Regression and Gradient Boosting models with the actual target values from the test set. The points representing the Linear Regression predictions are plotted alongside those of the Gradient Boosting predictions. As expected, the plot clearly shows that the Gradient Boosting model exhibits superior predictive power compared to Linear Regression.

But we’re not done yet. Let’s analyze the residuals!

Residual plots are essential for comparing models, as they provide valuable insights into their performance and effectiveness. Residuals represent the differences between the predicted and actual values in the dataset. By examining the patterns and distribution of the residuals, we can understand how well the models capture the underlying relationships in the data.

Comparing residual plots allows us to assess the strengths and weaknesses of different models. A good model should exhibit random scattering of points around the horizontal zero line, indicating that it captures the variability in the data effectively. The absence of discernible patterns or trends suggests that the model has accounted for all relevant factors.

Now, let’s calculate the residuals for each model using the same test data.

Using seaborn, we can visualize the difference in residuals for the two models.
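
A sketch that computes the residuals on the test set and plots them (the names and styling are my own choices):

```
residuals_lr = y_test - y_pred_lr_test
residuals_gb = y_test - y_pred_gb_test

plt.figure(figsize=(8, 6))
sns.scatterplot(x=y_pred_lr_test, y=residuals_lr, label="Linear Regression")
sns.scatterplot(x=y_pred_gb_test, y=residuals_gb, label="Gradient Boosting")
plt.axhline(0, color="black", linestyle="--")
plt.xlabel("Predicted MEDV")
plt.ylabel("Residual")
plt.legend()
plt.show()
```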

Another way to visualize residuals is through a histogram plot. Ideally, the closer the residuals are to zero, the better the model’s performance.
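
A sketch of the histogram version (the colors are arbitrary):

```
plt.figure(figsize=(8, 6))
sns.histplot(residuals_lr, kde=True, color="steelblue", label="Linear Regression")
sns.histplot(residuals_gb, kde=True, color="darkorange", label="Gradient Boosting")
plt.axvline(0, color="black", linestyle="--")
plt.xlabel("Residual")
plt.legend()
plt.show()
```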

After examining the residual plots, it’s evident that both the scatter plot and the histogram of the Gradient Boosting model show a better alignment with the zero line compared to Linear Regression. This indicates that the Gradient Boosting model captures the underlying patterns and variability in the data more accurately. The scatter plot shows a concentrated distribution of points around the zero line, suggesting predictions closer to the actual values. Similarly, the histogram shows a balanced distribution of residuals around zero, indicating minimal bias in the model’s predictions.

Now that we’ve found our winning model, let’s save our scaled data for the upcoming Streamlit application. Don’t sweat it too much for now; I’ll explain why in just a bit.
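
A sketch, using the ‘X_scaled.csv’ filename that the Streamlit app loads later:

```
# Persist the scaled features so the app can reproduce the same scaling
pd.DataFrame(X_scaled, columns=X.columns).to_csv("X_scaled.csv", index=False)
```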

Saving trained models is crucial in machine learning projects for future use and deployment. Python’s pickle module is a popular choice for model serialization, offering several benefits. Firstly, saving the model enables reusability, eliminating the need to retrain from scratch. This is especially valuable for complex models or large datasets, saving time and computational resources. Secondly, it supports a consistent and reproducible workflow by capturing the model’s parameters, architecture, and preprocessing steps. This allows for precise result replication, even with changes in the training environment or dependencies, promoting reproducibility and collaboration. Moreover, using pickle facilitates model deployment in production environments and sharing with others who lack access to the original code or data. By saving the model as a file, it can be easily distributed and used in diverse applications or systems.

This can be done very simply:
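
For example, saving the Gradient Boosting model under the ‘model_gb_boston.pkl’ name used later by the app:

```
import pickle

with open("model_gb_boston.pkl", "wb") as f:
    pickle.dump(model_gb, f)
```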

Storing predictions in a database offers several benefits in machine learning projects. Integration with other systems and tools becomes seamless, facilitating real-time applications and data transfer. Collaboration and data sharing are enhanced, as multiple team members can access and query the predictions stored in the database. And by pairing a database of predictions with a user-friendly Streamlit application, we can let users make predictions and gain valuable insights effortlessly.

Streamlit is a Python library that lets you define the logic and behavior of your application using Python code. The Python file serves as the entry point for running the Streamlit application. It contains the code to define UI components, process data, and handle user interaction. When you execute the Python file, Streamlit runs a local server that hosts your application and displays it in a web browser.

So, we’ll create a fresh new .py file in the environment and import the required libraries:
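
A sketch of what those imports might look like (the exact original list isn’t shown, so treat this as an assumption):

```
import pickle
import sqlite3

import numpy as np
import pandas as pd
import streamlit as st
from sklearn.preprocessing import StandardScaler
```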

You don’t need all those libraries, but I’ll keep them for learning purposes.

To set up the required functionality, we’ll write functions to establish a SQL connection and create a table using Python’s sqlite3 library. These functions will help us create and interact with our database. Once the database is set up, we’ll create a table to store the predicted values and the corresponding input feature values.

The create_table() function is then used to execute the SQL query and create the ‘output_MEDV’ table in the database. This function takes the database connection object (conn) and the SQL query (query_MEDV_output) as parameters. It executes the query and creates the table if it doesn’t already exist.
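
A sketch of those pieces. The database filename and the exact table schema are my own assumptions; the conn, query_MEDV_output, and output_MEDV names follow the description above:

```
def create_connection(db_file="predictions.db"):
    """Open (or create) the SQLite database and return the connection."""
    return sqlite3.connect(db_file)

def create_table(conn, query):
    """Execute a CREATE TABLE statement if the table doesn't already exist."""
    cur = conn.cursor()
    cur.execute(query)
    conn.commit()

# Table holding the input feature values and the predicted MEDV
query_MEDV_output = """
CREATE TABLE IF NOT EXISTS output_MEDV (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    CRIM REAL, ZN REAL, INDUS REAL, CHAS REAL, NOX REAL, RM REAL, AGE REAL,
    DIS REAL, RAD REAL, TAX REAL, PTRATIO REAL, B REAL, LSTAT REAL,
    prediction REAL
)
"""

conn = create_connection()
create_table(conn, query_MEDV_output)
```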

First, we load the scaled input data from ‘X_scaled.csv’, which contains the values we used to train our model. This data helps us keep the scaling process consistent for new user inputs. Then, we extract the input features from the data and fit a StandardScaler to capture the scaling parameters.

Next, it’s time for user interaction! We’ll prompt the user to enter values for each input feature using Streamlit’s intuitive number input fields. These values will be stored and converted into a NumPy array. To make sure the user input aligns with the model’s expectations, we’ll scale it using the fitted scaler. The scaled input is then ready for prediction: we’ll use our pre-trained model (e.g., ‘model_gb_boston.pkl’) to make predictions based on the scaled user input.

With just the click of a button, Streamlit displays the prediction to the user. Moreover, we’ll store the user input and the corresponding prediction in the database. This allows us to revisit and analyze the predictions later, adding a touch of data management finesse to our application.
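
A sketch of that interaction logic, following the flow described above. The widget defaults, page title, and insert statement are my own choices, not the original implementation:

```
st.title("Boston House Price Prediction")

# Load the saved scaled features and fit a scaler to capture their parameters
X_scaled_df = pd.read_csv("X_scaled.csv")
feature_names = list(X_scaled_df.columns)
scaler = StandardScaler().fit(X_scaled_df.values)

# Load the pre-trained Gradient Boosting model
with open("model_gb_boston.pkl", "rb") as f:
    model = pickle.load(f)

# One number input per feature, collected into a single row
user_values = [st.number_input(name, value=0.0) for name in feature_names]
user_input = np.array(user_values).reshape(1, -1)

if st.button("Predict"):
    user_input_scaled = scaler.transform(user_input)
    prediction = float(model.predict(user_input_scaled)[0])
    st.success(f"Predicted MEDV: {prediction:.2f}")

    # Store the inputs and the prediction in the output_MEDV table
    columns = ", ".join(feature_names + ["prediction"])
    placeholders = ", ".join("?" * (len(feature_names) + 1))
    conn.execute(
        f"INSERT INTO output_MEDV ({columns}) VALUES ({placeholders})",
        user_values + [prediction],
    )
    conn.commit()
```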

Now, let’s run the Streamlit app:

In your terminal, using the environment you created, run the following Streamlit command:
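
For example (app.py here is a placeholder):

```
streamlit run app.py
```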

Replace app.py with the filename of the Python file containing your Streamlit application code.

After running the command, your terminal should display some information and eventually look much like this:

Once the Streamlit app is running, a web browser will open automatically, displaying the application.

You can now interact with the app, enter input values, and view the predictions generated by the model!

Through this post, my aim was to provide a concise and easy-to-understand guide for creating prediction models using machine learning. We covered various key elements of the process, including data preparation, exploratory data analysis, modeling, model comparison, saving the model, creating a database, and building a Streamlit application.

We even spiced things up by comparing two different machine learning models (Gradient Boosting and Linear Regression) and finding the best one for the task at hand. By considering metrics like R-squared and root mean square error, we unlocked the secrets of accuracy and effectiveness in model selection.

By sharing my insights and experiences, I hope this post serves as a helpful starting point for anyone interested in delving into similar projects. Whether you’re new to machine learning or looking to expand your skills, the knowledge gained from this guide will empower you to predict housing prices and explore the exciting field of machine learning with confidence.

If you’ve enjoyed reading this post and would like to stay connected, don’t hesitate to follow me! I’d love to share more insights, tips, and exciting projects with you, and I’m open to any feedback you may have. Your input will help me improve and provide even better content in the future. : )

References:

Pandas: https://pandas.pydata.org/

NumPy: https://numpy.org/doc/

Seaborn: https://seaborn.pydata.org/

Matplotlib: https://matplotlib.org/stable/contents.html

Scikit-learn (sklearn): https://scikit-learn.org/stable/

SQLite3 (Python built-in module): https://docs.python.org/3/library/sqlite3.html

Streamlit: https://docs.streamlit.io/en/stable/

Pickle: https://docs.python.org/3/library/pickle.html

StandardScaler (from scikit-learn): https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

GradientBoostingRegressor (from scikit-learn): https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html

LinearRegression (from scikit-learn): https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

KFold (from scikit-learn): https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html

train_test_split (from scikit-learn): https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

mean_squared_error (from scikit-learn): https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html

r2_score (from scikit-learn): https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html
