Announcing PyCaret 3.0 — An open-source, low-code machine learning library in Python


Exploring the Latest Enhancements and Features of PyCaret 3.0

Generated by Moez Ali using Midjourney
  1. Introduction
  2. Stable Time Series Forecasting Module
  3. New Object Oriented API
  4. More options for Experiment Logging
  5. Refactored Preprocessing Module
  6. Compatibility with the newest sklearn version
  7. Distributed Parallel Model Training
  8. Speed up Model Training on CPU
  9. RIP: NLP and Arules module
  10. More Information
  11. Contributors

PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows. It's an end-to-end machine learning and model management tool that exponentially speeds up the experiment cycle and makes you more productive.

Compared with other open-source machine learning libraries, PyCaret is an alternative low-code library that can be used to replace hundreds of lines of code with only a few. This makes experiments exponentially fast and efficient. PyCaret is essentially a Python wrapper around several machine learning libraries and frameworks in Python.

The design and simplicity of PyCaret are inspired by the emerging role of citizen data scientists, a term first used by Gartner. Citizen data scientists are power users who can perform both simple and moderately sophisticated analytical tasks that would previously have required more technical expertise.

To learn more about PyCaret, check out our GitHub or Official Docs.

Check out the full Release Notes for PyCaret 3.0.

PyCaret's Time Series module is now stable and available under 3.0. Currently, it supports forecasting tasks, but time-series anomaly detection and clustering algorithms are planned for a future release.

# load dataset
from pycaret.datasets import get_data
data = get_data('airline')

# init setup
from pycaret.time_series import *
s = setup(data, fh = 12, session_id = 123)

# compare models
best = compare_models()

# forecast plot
plot_model(best, plot = 'forecast')
# forecast plot 36 periods into the future
plot_model(best, plot = 'forecast', data_kwargs = {'fh' : 36})

Although PyCaret is a fantastic tool, it does not adhere to the standard object-oriented programming practices used by Python developers. To address this issue, we had to rethink some of the initial design decisions we made for the 1.0 version. Note that this is a significant change that required considerable effort to implement. Now, let's explore how it will affect you.

# Functional API (Existing)

# load dataset
from pycaret.datasets import get_data
data = get_data('juice')

# init setup
from pycaret.classification import *
s = setup(data, target = 'Purchase', session_id = 123)

# compare models
best = compare_models()

It's great to run experiments in the same notebook, but if you want to run a different experiment with different setup function parameters, this can be a problem. Although it is possible, the previous experiment's settings will be replaced.

However, with our new object-oriented API, you can effortlessly conduct multiple experiments in the same notebook and compare them without any difficulty. This is possible because the parameters are tied to an object and can be associated with various modeling and preprocessing options.

# load dataset
from pycaret.datasets import get_data
data = get_data('juice')

# init setup 1
from pycaret.classification import ClassificationExperiment

exp1 = ClassificationExperiment()
exp1.setup(data, target = 'Purchase', session_id = 123)

# compare models init 1
best = exp1.compare_models()

# init setup 2
exp2 = ClassificationExperiment()
exp2.setup(data, target = 'Purchase', normalize = True, session_id = 123)

# compare models init 2
best2 = exp2.compare_models()


After conducting experiments, you can use the get_leaderboard function to create leaderboards for each experiment, making it easier to compare them.

import pandas as pd

# generate leaderboard
leaderboard_exp1 = exp1.get_leaderboard()
leaderboard_exp2 = exp2.get_leaderboard()
lb = pd.concat([leaderboard_exp1, leaderboard_exp2])

# print pipeline steps
print(exp1.pipeline.steps)
print(exp2.pipeline.steps)

PyCaret 2 can automatically log experiments using MLflow. While it is still the default, there are more options for experiment logging in PyCaret 3. The newly added options in the latest version are wandb, cometml, and dagshub.

To change the logger from the default MLflow to another available option, simply pass one of the following in the log_experiment parameter: 'mlflow', 'wandb', 'cometml', 'dagshub'.
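For example, a minimal sketch of logging to Weights & Biases (this assumes the wandb client is installed and authenticated; the experiment_name value is illustrative):

```python
# load dataset
from pycaret.datasets import get_data
data = get_data('juice')

# init setup with wandb logging instead of the default MLflow
from pycaret.classification import *
s = setup(data, target = 'Purchase', session_id = 123,
          log_experiment = 'wandb',
          experiment_name = 'juice_experiment')

# subsequent training runs are logged to the configured backend
best = compare_models()
```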

The preprocessing module underwent a complete redesign to improve its efficiency and performance, as well as to ensure compatibility with the latest version of Scikit-Learn.

PyCaret 3 includes several new preprocessing functionalities, such as new categorical encoding techniques, support for text features in machine learning modeling, novel outlier detection methods, and advanced feature selection techniques.

Some of the new features are:

  • New categorical encoding methods
  • Handling text features for machine learning modeling
  • New methods to detect outliers
  • New methods for feature selection
  • Guaranteed avoidance of target leakage, as the entire pipeline is now fitted at the fold level
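As an illustrative sketch, several of these options are exposed as setup parameters (the parameter names below are based on the 3.0 API; check the API reference for the full list and defaults):

```python
# load dataset
from pycaret.datasets import get_data
data = get_data('juice')

# init setup with some of the new preprocessing options enabled
from pycaret.classification import *
s = setup(data, target = 'Purchase', session_id = 123,
          remove_outliers = True,     # drop detected outliers from the training data
          feature_selection = True)   # keep only the most informative features
```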

PyCaret 2 relies heavily on scikit-learn 0.23.2, which makes it impossible to use the latest scikit-learn version (1.X) alongside PyCaret in the same environment.

PyCaret is now compatible with the latest version of scikit-learn, and we would like to keep it that way.

To scale on large datasets, you can run the compare_models function on a cluster in distributed mode. To do that, you can use the parallel parameter in the compare_models function.

This was made possible thanks to Fugue, an open-source unified interface for distributed computing that lets users execute Python, Pandas, and SQL code on Spark, Dask, and Ray with minimal rewrites.

# load dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')

# init setup
from pycaret.classification import *
clf1 = setup(data = diabetes, target = 'Class variable', n_jobs = 1)

# create pyspark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# import parallel back-end
from pycaret.parallel import FugueBackend

# compare models
best = compare_models(parallel = FugueBackend(spark))

You can apply Intel optimizations for machine learning algorithms to speed up your workflow. To train models with Intel optimizations, use the sklearnex engine; installation of the Intel sklearnex library is required:

# install sklearnex
pip install scikit-learn-intelex

To use the Intel optimizations, simply pass engine = 'sklearnex' in the create_model function.

# Functional API (Existing)

# load dataset
from pycaret.datasets import get_data
data = get_data('bank')

# init setup
from pycaret.classification import *
s = setup(data, target = 'deposit', session_id = 123)

Model training without intel accelerations:

%%time
lr = create_model('lr')

Model training with intel accelerations:

%%time
lr2 = create_model('lr', engine = 'sklearnex')

There are some differences in model performance (immaterial in most cases), but the improvement in training time is ~60% on a 30K-row dataset. The benefit is much higher when dealing with larger datasets.

NLP is changing fast, and there are many dedicated libraries and companies working exclusively to solve end-to-end NLP tasks. Due to a lack of resources, existing expertise in the team, and new contributors willing to maintain and support NLP and Arules, we have decided to drop them from PyCaret. PyCaret 3.0 does not have the nlp and arules modules. They have also been removed from the documentation. You can still use them with older versions of PyCaret.
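If you still depend on those modules, you can pin an older release, for example 2.3.10 (the last release in the 2.x line):

```shell
# pin the last 2.x release to keep the nlp and arules modules
pip install pycaret==2.3.10
```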

📚 Docs Getting started with PyCaret

📝 API Reference Detailed API docs

⭐ Tutorials New to PyCaret? Check out our official notebooks

📋 Notebooks created and maintained by the community

📙 Blog Tutorials and articles by contributors

📺 Videos Video tutorials and events

🎥 YouTube Subscribe to our YouTube channel

🤗 Slack Join our Slack community

💻 LinkedIn Follow our LinkedIn page

📢 Discussions Engage with the community and contributors

🛠️ Release Notes

Thanks to all the contributors who have participated in PyCaret 3.

@ngupta23
@Yard1
@tvdboom
@jinensetpal
@goodwanghan
@Alexsandruss
@daikikatsuragawa
@caron14
@sherpan
@haizadtarik
@ethanglaser
@kumar21120
@satya-pattnaik
@ltsaprounis
@sayantan1410
@AJarman
@drmario-gh
@NeptuneN
@Abonia1
@LucasSerra
@desaizeeshan22
@rhoboro
@jonasvdd
@PivovarA
@ykskks
@chrimaho
@AnthonyA1223
@ArtificialZeng
@cspartalis
@vladocodes
@huangzhhui
@keisuke-umezawa
@ryankarlos
@celestinoxp
@qubiit
@beckernick
@napetrov
@erwanlc
@Danpilz
@ryanxjhan
@wkuopt
@TremaMiguel
@IncubatorShokuhou
@moezali1
