Scaling Feature Engineering Pipelines with Feast and Ray


In a recent project involving the construction of propensity models to predict customers' prospective purchases, I encountered feature engineering issues that I had seen quite a few times before.

These challenges might be broadly classified into two categories:

1) Inadequate Feature Management

  • Definitions, lineage, and versions of features generated by the team weren’t systematically tracked, thereby limiting feature reuse and reproducibility of model runs.
  • Feature logic was manually maintained across separate training and inference scripts, resulting in a risk of inconsistent features between training and inference (i.e., training-serving skew).
  • Features were stored as flat files (e.g., CSV), which lack schema enforcement and support for low-latency or scalable access.

2) High Feature Engineering Latency

  • Heavy feature engineering workloads often arise when coping with time-series data, where multiple window-based transformations have to be computed.
  • When these computations are executed sequentially rather than optimized for parallel execution, the latency of feature engineering can increase significantly.

In this article, I explain the concepts and implementation of feature stores (Feast) and distributed compute frameworks (Ray) for feature engineering in production machine learning (ML) pipelines.

Contents

You can find the accompanying GitHub repo here.


(i) Objective

To illustrate the capabilities of Feast and Ray, our example scenario involves constructing an ML pipeline to train and serve a 30-day customer purchase propensity model.


(ii) Dataset

We’ll use the UCI Online Retail dataset (CC BY 4.0), which comprises purchase transactions for a UK online retailer between December 2010 and December 2011.

Fig. 1 — Sample rows of UCI Online Retail dataset | Image by author

(iii) Feature Engineering Approach

We will keep the feature engineering scope simple by limiting it to the following features (based on a 90-day lookback window unless otherwise stated):

Recency, Frequency, Monetary Value (RFM) features

  • recency_days: Days since last purchase
  • frequency: Number of distinct orders
  • monetary: Total monetary spend
  • tenure_days: Days since first-ever purchase (all-time)

Customer behavioral features

  • avg_order_value: Mean spend per order
  • avg_basket_size: Mean number of items per order
  • n_unique_products: Product diversity
  • return_rate: Share of cancelled orders
  • avg_days_between_purchases: Mean days between purchases

(iv) Rolling Window Design

The features are computed from a 90-day window before each cutoff date, and purchase labels (1 = at least one purchase, 0 = no purchase) are computed from a 30-day window after each cutoff.

Given that the cutoff dates are spaced 30 days apart, this produces nine snapshots from the dataset:

Fig. 2 — Rolling window timeline for feature generation and prediction labels | Image by author

(i) About Feast

Firstly, let’s understand what a feature store is. 

A feature store is a centralized data repository that manages, stores, and serves machine learning features, acting as a single source of truth for both training and serving.

Feature stores offer key advantages in managing feature pipelines:

  • Enforce consistency between training and serving data
  • Prevent data leakage by ensuring features use only data available at the time of prediction (i.e., point-in-time correct data)
  • Allow cross-team reuse of features and feature pipelines
  • Track feature versions, lineage, and metadata for governance

Feast (short for Feature Store) is an open-source feature store that delivers feature data at scale during training and inference.

It integrates with multiple database backends and ML frameworks and can work on or off cloud platforms.

Fig 3. — Feast architecture. Note that data transformation for feature engineering typically sits outside of the Feast framework | Image used under Apache License 2.0

Feast supports both online (for real-time inference) and offline (for batch prediction) feature serving, though our focus is on offline features, as batch prediction is more relevant for our purchase propensity use case.


(ii) About Ray

Ray is an open-source general-purpose distributed computing framework designed to scale ML applications from a single machine to large clusters. It can run on any machine, cluster, cloud provider, or Kubernetes.

Ray offers a range of capabilities, and the one we'll use is the core distributed runtime called Ray Core.

Fig. 4 — Overview of the Ray framework | Image used under Apache License 2.0

Ray Core provides low-level primitives for the parallel execution of Python functions as distributed tasks and for managing tasks across available compute resources.


Let’s take a look at the areas where Feast and Ray help address feature engineering challenges.

(i) Feature Store Setup with Feast

For our case, we'll set up an offline feature store using Feast. Our RFM and customer behavior features will be registered in the feature store for centralized access.


(ii) Feature Retrieval with Feast and Ray

With our Feast feature store ready, we can enable the retrieval of relevant features from it during both model training and inference.

We must first be clear about these three concepts: Entity, Feature, and Feature View.

  • An entity is the primary key used to retrieve features. It essentially refers to the identifier "object" for each feature row (e.g., user_id, account_id, etc.)
  • A feature is a single typed attribute associated with an entity (e.g., avg_basket_size)
  • A feature view defines a group of related features for an entity, sourced from a dataset. Think of it as a table with a primary key (e.g., user_id) coupled with the relevant feature columns.

Fig. 5 — Example illustration of entity, feature, and feature view | Image by author

Say we now want to obtain these offline features for training or inference. Here's how it is done:

  1. An entity DataFrame is first created, containing the entity keys and an event timestamp for each row. It corresponds to the two left-most columns in Fig. 5 above.
  2. A point-in-time correct join occurs between the entity DataFrame and the feature tables defined by the various feature views.

The output is a combined dataset containing all of the requested features for the desired set of entities and timestamps.
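To make the point-in-time join concrete, here is a toy illustration of the idea using pandas (with hypothetical values; Feast's internals differ, but the semantics are the same): for each entity row, only the latest feature snapshot at or before the event timestamp is joined.

```python
import pandas as pd

# Feature table: snapshots of avg_basket_size per customer per date
features = pd.DataFrame({
    "customer_id": [1001, 1001],
    "event_timestamp": pd.to_datetime(["2011-03-01", "2011-06-01"]),
    "avg_basket_size": [12.0, 15.0],
})

# Entity DataFrame: the customers/timestamps we want features for
entity_df = pd.DataFrame({
    "customer_id": [1001],
    "event_timestamp": pd.to_datetime(["2011-05-15"]),
})

# Point-in-time join: take the most recent feature row at or before
# each entity timestamp (never a future one, avoiding data leakage)
joined = pd.merge_asof(
    entity_df.sort_values("event_timestamp"),
    features.sort_values("event_timestamp"),
    on="event_timestamp",
    by="customer_id",
)
print(joined["avg_basket_size"].iloc[0])  # 12.0 (the 2011-03-01 snapshot)
```

Note that the 2011-06-01 snapshot is ignored even though it exists, because it lies in the future relative to the entity timestamp.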

So where does Ray come in here?

The Ray Offline Store is a distributed compute engine that enables faster, more scalable feature retrieval, especially for large datasets. It does so by parallelizing data access and join operations:

  • Data (I/O) Access: Distributes data reads by splitting Parquet files across multiple workers, where each worker reads a different partition in parallel
  • Join Operations: Splits the entity DataFrame so that each partition independently performs temporal joins to retrieve the feature values per entity before a given timestamp. With multiple feature views, Ray parallelizes the computationally intensive joins to scale efficiently.

(iii) Feature Engineering with Ray

The feature engineering function for generating RFM and customer behavior features must be applied to each 90-day window (i.e., nine independent cutoff dates, each requiring the same computation).

Ray Core turns each function call into a remote task, enabling the feature engineering to run in parallel across available cores (or machines in a cluster).


(4.1) Initial Setup

We install the next Python dependencies:

feast[ray]==0.60.0
openpyxl==3.1.5
psycopg2-binary==2.9.11
ray==2.54.0
scikit-learn==1.8.0
xgboost==3.2.0

As we'll use PostgreSQL for the feature registry, make sure Docker is installed and running before running docker compose up -d to start the PostgreSQL container.


(4.2) Prepare Data 

Besides data ingestion and cleansing, there are two preparation steps to execute:

  • Rolling Cutoff Generation: Creates nine snapshots spaced 30 days apart. Each cutoff date defines a training/prediction point at which features are computed from the 90 days preceding it, and target labels are computed from the 30 days after it.
  • Label Creation: For each cutoff, create a binary target label indicating whether a customer made at least one purchase within the 30-day window after the cutoff.
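The two steps above can be sketched as follows (a simplified illustration, not the repo code; column names such as InvoiceDate and CustomerID follow the UCI dataset, while the helper names are our own):

```python
import pandas as pd

def generate_cutoffs(start: str, n_snapshots: int = 9) -> list:
    # Cutoff dates spaced 30 days apart
    return [pd.Timestamp(start) + pd.Timedelta(days=30 * i) for i in range(n_snapshots)]

def create_labels(df: pd.DataFrame, cutoff: pd.Timestamp) -> pd.DataFrame:
    # Label = 1 if the customer purchased at least once in the 30 days after the cutoff
    window = df[(df["InvoiceDate"] > cutoff)
                & (df["InvoiceDate"] <= cutoff + pd.Timedelta(days=30))]
    buyers = set(window["CustomerID"])
    customers = df["CustomerID"].unique()
    return pd.DataFrame({
        "CustomerID": customers,
        "label": [1 if c in buyers else 0 for c in customers],
    })

# Toy usage with two transactions
df = pd.DataFrame({
    "CustomerID": [1, 2],
    "InvoiceDate": pd.to_datetime(["2011-04-10", "2011-06-01"]),
})
cutoffs = generate_cutoffs("2011-04-01")
labels = create_labels(df, cutoffs[0])  # customer 1 purchased in the window, customer 2 did not
```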

(4.3) Run Ray-Based Feature Engineering

After defining the code to generate RFM and customer behavior features, let's parallelize the execution using Ray for each rolling window.

We start by creating a function (compute_features_for_cutoff) to wrap all the relevant feature engineering steps for each cutoff:

The @ray.remote decorator registers the function as a remote task to be run asynchronously in separate workers.

The data preparation and feature engineering pipeline is then run as follows:

Here's how Ray is involved in the pipeline:

  • ray.init() starts a Ray cluster and enables distributed execution across all local cores by default.
  • ray.put(df) stores the cleaned DataFrame in Ray's shared memory (aka the distributed object store) and returns a reference (ObjectRef) so that all parallel tasks can access the DataFrame without copying it. This helps improve memory efficiency and task launch performance.
  • compute_features_for_cutoff.remote(...) sends our feature computation tasks to Ray's scheduler, where Ray assigns each task to a worker for parallel execution and returns a reference to each task's output.
  • futures = [...] stores all references returned by each .remote() call. They represent all the in-flight parallel tasks that have been launched.
  • ray.get(futures) retrieves all the actual return values from the parallel task executions in one go.
  • The script then extracts and concatenates per-cutoff RFM and behavior features into two DataFrames and saves them as Parquet files locally.
  • ray.shutdown() releases the allocated resources by stopping the Ray runtime.

While our features are stored locally on this case, do note that offline feature data is often stored in data warehouses or data lakes (e.g., S3, BigQuery, etc) in production settings.


(4.4) Set Up Feast Feature Registry

So far, we have covered the transformation and storage aspects of feature engineering. Let us move on to the Feast feature registry.

A feature registry is the centralized catalog of feature definitions and metadata that serves as a single source of truth for feature information.

There are two key components within the registry setup: Definitions and Configuration.


Definitions

We first define the Python objects to represent the features engineered so far. For instance, one of the first objects to define is the Entity (i.e., the primary key that links the feature rows):

Next, we define the data sources in which our feature data is stored:

Note that the timestamp_field is critical because it enables correct point-in-time data views and joins when features are retrieved for training or inference.

After defining entities and data sources, we can define the feature views. Given that we have two sets of features (RFM and customer behavior), we expect to have two feature views:

The schema (field names, dtypes) is very important for ensuring that feature data is correctly validated and registered.

Configuration

The feature registry configuration is defined in a YAML file called feature_store.yaml:
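A sketch of what such a configuration might contain (the project name and connection string are placeholders, and the exact keys should be verified against the Feast documentation for your version):

```yaml
project: purchase_propensity        # placeholder project name
provider: local
registry:
  registry_type: sql
  path: postgresql://user:password@localhost:5432/feast  # placeholder credentials
offline_store:
  type: ray
```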

The configuration tells Feast what infrastructure to use and where its metadata and feature data live, and it generally contains the following:

  • Project name: Namespace for project
  • Provider: Execution environment (e.g., local, Kubernetes, cloud)
  • Registry location: Location of feature metadata storage (file or databases like PostgreSQL)
  • Offline store: Location from which historical feature data is read
  • Online store: Location from which low-latency features are served (not relevant in our case)

In our case, we use PostgreSQL (running in a Docker container) for the feature registry and the Ray offline store for optimized feature retrieval.

Feast Apply

Once the definitions and configuration are set up, we run feast apply to register and synchronize the definitions with the registry and provision the required infrastructure.

The command can be found in the Makefile:

# Step 2: Register Feast feature definitions in PostgreSQL registry
apply:
 cd feature_store && feast apply

(4.5) Retrieve Features for Model Training

Once our feature store is ready, we proceed with training the ML model.

We start by creating the entity spine for retrieval (i.e., the two columns customer_id and event_timestamp), which Feast uses to retrieve the correct feature snapshot.

We then execute the retrieval of features for model training at runtime:

  • FeatureStore is the Feast object that’s used to define, create, and retrieve features at runtime
  • get_historical_features() is designed for offline feature retrieval (versus get_online_features()), and it expects the entity DataFrame and the list of features to retrieve. The distributed reads and point-in-time joins of feature data happen here.

(4.6) Retrieve Features for Inference

We finish off by generating predictions from our trained model.

The feature retrieval code for inference is largely similar to that for training, since we're reaping the benefits of a consistent feature store.

The main difference comes from the different cutoff dates used.


Wrapping It Up

Feature engineering is an important component of building ML models, but it also introduces data management challenges if not properly handled.

In this article, we demonstrated how to use Feast and Ray to improve the management, reusability, and efficiency of feature engineering.

Understanding and applying these concepts will enable teams to build efficient ML pipelines with scalable feature engineering capabilities.
