
6 Underdog Data Science Libraries That Deserve Much More Attention


Image by me via Midjourney.

While the big names (Pandas, Scikit-learn, NumPy, Matplotlib, TensorFlow, etc.) hog all your attention, it is easy to miss some down-to-earth yet incredible libraries.

They may not be GitHub rock stars or taught in expensive Coursera specializations, but thousands of open-source developers pour their blood and sweat into writing them. From the shadows, they quietly fill the gaps left by popular libraries.

The aim of this article is to shine a light on some of these libraries and marvel together at how powerful the open-source community can be.

Let’s start!

0. Manim

Image from the Manim GitHub page. MIT License.

We're all wowed and stunned by just how beautiful 3Blue1Brown videos are. But most of us don't know that all the animations are created using the Mathematical Animation Engine (Manim) library, written by Grant Sanderson himself. (We take Grant Sanderson a lot for granted.)

Each 3b1b video is powered by thousands of lines of code written in Manim. For example, the legendary "The Essence of Calculus" series took Grant Sanderson over 22k lines of code.

In Manim, each animation is represented by a scene class like the following (don't worry if you don't understand it):

import numpy as np
from manim import *

class FunctionExample(Scene):
    def construct(self):
        axes = Axes(...)
        axes_labels = axes.get_axis_labels()

        # Get the graph of a simple function
        graph = axes.get_graph(lambda x: np.sin(1 / x), color=RED)
        # Set up its label
        graph_label = axes.get_graph_label(
            graph, x_val=1, direction=2 * UP + RIGHT,
            label=r'f(x) = \sin(\frac{1}{x})', color=DARK_BLUE
        )

        # Group the axes components together
        axes_group = VGroup(axes, axes_labels)

        # Animate
        self.play(Create(axes_group), run_time=2)
        self.wait(0.25)
        self.play(Create(graph), run_time=3)
        self.play(Write(graph_label), run_time=2)

Which produces the following animation of the function sin(1/x):

GIF by the author using Manim.

Unfortunately, Manim isn't well-maintained or well-documented, as, understandably, Grant Sanderson spends most of his effort on making the awesome videos.

But there is a community fork of the library by the Manim Community, which provides better support, documentation, and learning resources.

If you've already gotten too excited (you math lover!), here is my gentle but thorough introduction to the Manim API:

Stats and links:

Due to its steep learning curve and complicated installation, Manim gets only a few downloads every month. It deserves so much more attention.

1. PyTorch Lightning

Screenshot of PyTorch Lightning GitHub page. Apache-2.0 license.

When I began learning PyTorch after TensorFlow, I became very grumpy. It was obvious that PyTorch was powerful, but I couldn't help but say "TensorFlow does this better", or "That would have been much shorter in TF", and even worse, "I almost wish I had never learned PyTorch".

That's because PyTorch is a low-level library. Yes, this means PyTorch gives you complete control over the model training process, but it requires a lot of boilerplate code. It's like TensorFlow but five years younger, if I'm not mistaken.

Turns out, there are quite a lot of people who feel this way. More specifically, the almost 830 contributors at Lightning AI who developed PyTorch Lightning.

GIF by PyTorch Lightning GitHub page. Apache-2.0 license.

PyTorch Lightning is a high-level wrapper library built around PyTorch that abstracts away most of its boilerplate code and soothes all its pain points:

  • Hardware-agnostic models
  • Code is extremely readable because engineering code is handled by Lightning modules
  • Flexibility is intact (all Lightning modules are still PyTorch modules)
  • Multi-GPU, multi-node, TPU support
  • 16-bit precision
  • Experiment tracking
  • Early stopping and model checkpointing (finally!)

and nearly 40 other advanced features, all designed to please AI researchers rather than infuriate them.
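To give you a taste, here is a minimal sketch of what a Lightning module might look like; the tiny MNIST-sized network and the hyperparameters are placeholders of my own, not from Lightning's docs:

import torch
from torch import nn
import pytorch_lightning as pl

class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # A placeholder network for 28x28 grayscale images
        self.model = nn.Sequential(
            nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 10)
        )

    def forward(self, x):
        return self.model(x.view(x.size(0), -1))

    def training_step(self, batch, batch_idx):
        # Lightning calls this per batch; no manual training loop needed
        x, y = batch
        loss = nn.functional.cross_entropy(self(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# The Trainer owns the engineering code: devices, checkpointing, and more
# trainer = pl.Trainer(max_epochs=3, accelerator="auto")
# trainer.fit(LitClassifier(), train_dataloaders=train_loader)

Notice how hardware never appears inside the module itself; switching from CPU to multi-GPU is a Trainer argument, not a rewrite.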

Stats and links:

Learn from the official tutorials:

2. Optuna

Yes, hyperparameter tuning with GridSearch is simple, comfortable, and only a single import statement away. But you must surely admit that it's slower than a hungover snail and very inefficient.

Image by me via Midjourney.

For a moment, think of hyperparameter tuning as grocery shopping. Using GridSearch means going down every aisle in the supermarket and checking every product. It's a systematic and orderly approach, but you waste so much time.

On the other hand, if you have an intelligent personal shopping assistant with Bayesian roots, you'll know exactly what you need and where to go. It's a more efficient and targeted approach.

If you'd like that assistant, its name is Optuna. It's a Bayesian hyperparameter optimization framework that searches a given hyperparameter space efficiently and finds the golden set of hyperparameters that gives the best model performance.

Here are some of its best features:

  • Framework-agnostic: tunes models of any machine learning framework you can think of
  • Pythonic API to define search spaces: instead of manually listing possible values for a hyperparameter, Optuna lets you sample them linearly, randomly, or logarithmically from a given range
  • Visualization: supports hyperparameter importance plots, parallel coordinate plots, history plots, and slice plots
  • Control the number or duration of iterations: set the exact number of trials or the maximum time the tuning process may last
  • Pause and resume the search
  • Pruning: stop unpromising trials early
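As a tiny illustration, here is a minimal sketch of an Optuna study tuning a random forest; the model, the ranges, and the trial budget are arbitrary choices of mine:

import optuna
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

def objective(trial):
    # Sample hyperparameters from Pythonic search spaces
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 300),
        "max_depth": trial.suggest_int("max_depth", 2, 16),
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 10),
    }
    model = RandomForestClassifier(**params, random_state=42)
    return cross_val_score(model, X, y, cv=3).mean()

study = optuna.create_study(direction="maximize")
# Cap the search by trial count or by wall-clock seconds
study.optimize(objective, n_trials=50, timeout=600)
print(study.best_params)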

All these features are designed to save time and resources. If you want to see them in action, check out my tutorial on Optuna (it's one of my best-performing articles among 150):

Stats and links:

3. PyCaret

Screenshot of the PyCaret GitHub page. MIT license.

I have enormous respect for Moez Ali for creating this library from the ground up on his own. Currently, PyCaret is the best low-code machine learning library out there.

If PyCaret were advertised on TV, here is what the ad would say:

"Are you tired of spending hours writing virtually the same code for your machine learning workflows? Then PyCaret is the answer!

Our all-in-one machine learning library lets you build and deploy machine learning models in as few lines of code as possible. Think of it as a cocktail containing code from all your favorite machine learning libraries, like Scikit-learn, XGBoost, CatBoost, LightGBM, Optuna, and many others."

Then the ad would show this snippet of code, with dramatic popping noises as each line appears:

# Classification OOP API Example

# loading sample dataset
from pycaret.datasets import get_data
data = get_data('juice')

# init setup
from pycaret.classification import ClassificationExperiment
s = ClassificationExperiment()
s.setup(data, target = 'Purchase', session_id = 123)

# model training and selection
best = s.compare_models()

# evaluate trained model
s.evaluate_model(best)

# predict on hold-out/test set
pred_holdout = s.predict_model(best)

# predict on recent data
new_data = data.copy().drop('Purchase', axis = 1)
predictions = s.predict_model(best, data = new_data)

# save model
s.save_model(best, 'best_pipeline')

The narrator would say in a voiceover as the code is displayed:

"With a few lines of code, you can train and select the best of dozens of models from different frameworks, evaluate them on a hold-out set, and save them for deployment. It's so easy to use, anyone can do it!

Hurry up and grab a copy of our software from GitHub or through pip, and thank us later!"

Stats and links:

4. BentoML

Web developers love FastAPI like their pets. It's one of the most popular GitHub projects and, admittedly, it makes API development stupidly easy and intuitive.

Because of this popularity, it also made its way into machine learning. It's common to see engineers deploying their models as APIs using FastAPI, thinking the whole process couldn't get any better or easier.

But most are under an illusion. Just because FastAPI is so much better than its predecessor (Flask) doesn't mean it's the best tool for the job.

Well, then, what is the best tool for the job? I'm so glad you asked: BentoML!

BentoML, though relatively young, is an end-to-end framework to package and ship models of any machine learning library to any cloud platform.

Image from BentoML home page taken with permission.

FastAPI was designed for web developers, so it has many obvious shortcomings when it comes to deploying ML models. BentoML solves all of them:

  • A unified format to save and load models from any framework
  • A model store to version and keep track of models
  • Containerization of models with a single line of terminal code
  • Serving models on GPUs
  • Deploying models as APIs with a single short script and a few terminal commands to any cloud provider
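For a flavor of the workflow, here is a minimal sketch of a service in the BentoML 1.x API. The model name iris_clf is an assumption of mine; it presumes a Scikit-learn model was saved to the local model store beforehand:

# service.py
# Assumes a model was saved earlier with:
#   bentoml.sklearn.save_model("iris_clf", trained_model)
import bentoml
import numpy as np
from bentoml.io import NumpyNdarray

# Pull the versioned model from the model store and wrap it as a runner
runner = bentoml.sklearn.get("iris_clf:latest").to_runner()
svc = bentoml.Service("iris_classifier", runners=[runner])

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
def classify(input_arr: np.ndarray) -> np.ndarray:
    return runner.predict.run(input_arr)

Serving it locally is then a single terminal command: bentoml serve service.py:svc.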

I've already written a few tutorials on BentoML. Here is one of them:

Stats and links:

5. PyOD

Image by me via Midjourney.

This library is an underdog because the problem it solves, outlier detection, is also an underdog.

Virtually any machine learning course you take teaches only z-scores for outlier detection and then moves on to fancier concepts and tools, like R (sarcasm).

But outlier detection is so much more than plain z-scores. There are modified z-scores, Isolation Forests (cool name), KNN for anomalies, Local Outlier Factor, and 30+ other state-of-the-art anomaly detection algorithms packed into the Python Outlier Detection toolkit (PyOD).

When not detected and handled properly, outliers will skew the mean and standard deviation of features and create noise in the training data, scenarios you don't want happening at all.

That's PyOD's life purpose: to provide tools that facilitate finding anomalies. Apart from its wide selection of algorithms, it's fully compatible with Scikit-learn, making it easy to use in existing machine learning pipelines.
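To show just how Scikit-learn-like it feels, here is a minimal sketch using PyOD's Isolation Forest wrapper on synthetic data; the contamination rate is a made-up choice:

from pyod.models.iforest import IForest
from pyod.utils.data import generate_data

# Synthetic data with 10% outliers mixed in
X_train, X_test, y_train, y_test = generate_data(
    n_train=500, n_test=200, contamination=0.1, random_state=42
)

# Familiar fit/predict interface
clf = IForest(contamination=0.1, random_state=42)
clf.fit(X_train)

labels = clf.predict(X_test)             # 0 = inlier, 1 = outlier
scores = clf.decision_function(X_test)   # higher = more anomalous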

If you're still not convinced about the importance of anomaly detection and the role PyOD plays in it, I highly recommend giving this article a read (written by yours truly):

Stats and links:

6. Sktime

Image from the Sktime GitHub page. BSD-3 Clause License.

Time machines are no longer the stuff of science fiction. They are a reality in the form of Sktime.

Instead of jumping between time periods, Sktime performs the slightly less cool task of time series analysis.

It borrows the best tools from its big brother, Scikit-learn, to perform the following time series tasks:

  • Classification
  • Regression
  • Clustering (this one is fun!)
  • Annotation
  • Forecasting

It features over 30 state-of-the-art algorithms with a familiar Scikit-learn syntax and also offers pipelining, ensembling, and model tuning for both univariate and multivariate time series data.
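Here is a minimal forecasting sketch on the classic airline passengers dataset that ships with the library; the seasonal naive strategy is just an arbitrary pick for illustration:

import numpy as np
from sktime.datasets import load_airline
from sktime.forecasting.base import ForecastingHorizon
from sktime.forecasting.naive import NaiveForecaster

y = load_airline()  # monthly airline passenger counts, 1949-1960

# Familiar fit/predict interface, repurposed for forecasting
forecaster = NaiveForecaster(strategy="last", sp=12)  # seasonal naive
forecaster.fit(y)

# Forecast the next 12 months
fh = ForecastingHorizon(np.arange(1, 13), is_relative=True)
y_pred = forecaster.predict(fh)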

It's also thoroughly maintained; Sktime contributors work like bees.

Here’s a tutorial on it (not mine, alas):

Stats and links:

Wrap

While our daily workflows are dominated by popular tools like Scikit-learn, TensorFlow, or PyTorch, it's important not to overlook the lesser-known libraries.

They may not have the same level of recognition or support, but in the right hands, they provide elegant solutions to problems not addressed by their popular counterparts.

This article focused on only six of them, but you can be sure there are hundreds of others. All you have to do is some exploring!

Loved this article and, let's face it, its bizarre writing style? Imagine having access to dozens more just like it, all written by a brilliant, charming, witty author (that's me, by the way :).

For only a $4.99 membership, you'll get access to not just my stories but a treasure trove of knowledge from the best and brightest minds on Medium. And if you use my referral link, you'll earn my supernova of gratitude and a virtual high-five for supporting my work.

Image by me via Midjourney.
