

A Framework for Constructing a Production-Ready Feature Engineering Pipeline
Introduction
Lessons:
Data Source:
Lesson 1: Batch Serving. Feature Stores. Feature Engineering Pipelines.
Conclusion

Inside the file, we have the main entry point of the pipeline: the run method.

As you can see below, at a high level, the run method follows the exact steps of an ETL pipeline (a sketch of this structure follows the list):

  1. Extract the data from the energy consumption API.
  2. Transform the extracted data.
  3. Build the data validation and integrity suite. You can ignore this step for now, as we will detail it in Lesson 6.
  4. Load the data into the feature store.
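Here is a minimal sketch of that entry point. It assumes helper functions extract, transform, build_expectation_suite, and load, which are sketched throughout the rest of this lesson; the parameter names (export_end, days_delay, days_export) are illustrative, not the project's exact signature:

```python
import logging
from datetime import datetime

logger = logging.getLogger(__name__)


def run(export_end: datetime, days_delay: int = 15, days_export: int = 30) -> dict:
    """Entry point of the feature pipeline: Extract, Transform, (Validate), Load."""
    logger.info("Extracting data from the energy consumption API.")
    data, metadata = extract(export_end, days_delay, days_export)

    logger.info("Transforming data.")
    data = transform(data)

    logger.info("Building the data validation and integrity suite.")
    validation_suite = build_expectation_suite()  # detailed in Lesson 6

    logger.info("Loading data into the feature store.")
    load(data, validation_suite, metadata)

    return metadata
```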

Please note how I used the logger to reflect the system's current state. When your program is deployed and running 24/7, verbose logging is crucial for debugging the system. Also, always use the Python logger instead of the print function, as it lets you choose different logging levels and output streams.
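For example, a minimal logging setup (illustrative, not the project's exact configuration) looks like this:

```python
import logging
import sys

# Configure logging once, at program startup: pick a level and an output stream,
# two things print() cannot give you.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s - %(message)s",
    stream=sys.stdout,
)
logger = logging.getLogger(__name__)

logger.info("Extraction window computed.")  # shown at INFO level and above
logger.debug("Raw API payload: ...")        # hidden unless the level is set to DEBUG
```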

At this level, the pipeline is easy to understand. Let's dive into each component individually.

In the extraction step, we request data for a given window length. The window has a configurable length in days; the first data point of the window marks the start of that interval, and the last data point marks its end.

We use a delay parameter to shift the window based on how far the data lags behind real time. In our use case, the API has a delay of 15 days.

As explained above, the function makes an HTTP GET request to the API to fetch the data. Afterward, the response is decoded and loaded into a Pandas DataFrame.

The function returns the DataFrame plus additional metadata containing information about the data's extraction.
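Below is a minimal sketch of such an extraction step. The API URL, query parameter names, and metadata fields are illustrative assumptions, not the project's exact code:

```python
from datetime import datetime, timedelta

import pandas as pd
import requests

# Hypothetical endpoint standing in for the energy consumption API.
API_URL = "https://api.example.com/energy-consumption"


def extract(export_end: datetime, days_delay: int = 15, days_export: int = 30):
    """Request a window of records ending `days_delay` days before `export_end`."""
    last_datapoint = export_end - timedelta(days=days_delay)
    first_datapoint = last_datapoint - timedelta(days=days_export)

    response = requests.get(
        API_URL,
        params={"start": first_datapoint.isoformat(), "end": last_datapoint.isoformat()},
        timeout=30,
    )
    response.raise_for_status()

    # Decode the JSON payload into a Pandas DataFrame.
    data = pd.DataFrame(response.json()["records"])

    # Metadata describing the extraction, reused later by the feature view step.
    metadata = {
        "export_window_start": first_datapoint.isoformat(),
        "export_window_end": last_datapoint.isoformat(),
        "num_rows": len(data),
    }
    return data, metadata
```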

The transform step takes the raw DataFrame and applies the following transformations (a sketch follows the list):

  • rename the columns to a Python-standardized format
  • cast the columns to their proper types
  • encode the string columns as integers
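A minimal sketch of these three transformations, using illustrative column names (the real dataset's columns may differ):

```python
import pandas as pd


def transform(data: pd.DataFrame) -> pd.DataFrame:
    """Rename, cast, and encode the raw columns."""
    # Rename columns to a Python-standardized snake_case format (illustrative names).
    data = data.rename(
        columns={
            "HourUTC": "datetime_utc",
            "PriceArea": "area",
            "ConsumptionkWh": "energy_consumption",
        }
    )

    # Cast the columns to their proper types.
    data["datetime_utc"] = pd.to_datetime(data["datetime_utc"])
    data["energy_consumption"] = data["energy_consumption"].astype("float64")

    # Encode the string columns as integers.
    data["area"] = data["area"].astype("category").cat.codes.astype("int64")

    return data
```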

Note that we have not included our EDA step (e.g., looking for null values), as our primary focus is on designing the system, not on the standard data science process.

This is where we make sure the data is as expected. In our case, based on our EDA and transformations, we check that (see the sketch below):

  • the data doesn't have any nulls
  • the column types are as expected
  • the ranges of values are as expected

More on this subject in Lesson 6.
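Lesson 6 builds a proper validation suite for these checks; as a simplified stand-in, here is what they look like as plain Pandas assertions (column names are the illustrative ones used above):

```python
import pandas as pd


def validate(data: pd.DataFrame) -> None:
    """Simplified sanity checks; Lesson 6 replaces these with a full validation suite."""
    # The data doesn't have any nulls.
    assert not data.isna().any().any(), "Found null values."

    # The column types are as expected.
    assert pd.api.types.is_datetime64_any_dtype(data["datetime_utc"]), "Wrong dtype for datetime_utc."
    assert pd.api.types.is_float_dtype(data["energy_consumption"]), "Wrong dtype for energy_consumption."

    # The ranges of values are as expected.
    assert (data["energy_consumption"] >= 0).all(), "Energy consumption must be non-negative."
```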

This is where we load our processed DataFrame into the feature store.

Hopsworks has a set of great tutorials, which you can check out here. But let me explain what is happening (see the sketch after the list):

  • We log into our Hopsworks project using our API_KEY.
  • We get a reference to the feature store.
  • We get or create a feature group, which is essentially a database table with all the goodies of a feature store on top of it (read more here [5]).
  • We insert our latest processed data samples.
  • We add a set of feature descriptions for each feature of our data.
  • We tell Hopsworks to compute statistics for each feature.
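A minimal sketch of this load step, based on the public Hopsworks Python API; the feature group name, keys, and descriptions are illustrative assumptions:

```python
import hopsworks
import pandas as pd


def load(data: pd.DataFrame, feature_descriptions: list, feature_group_version: int = 1):
    """Insert the processed DataFrame into a Hopsworks feature group."""
    # Log into the Hopsworks project using the API key.
    project = hopsworks.login(api_key_value="<YOUR_HOPSWORKS_API_KEY>")

    # Get a reference to the feature store.
    feature_store = project.get_feature_store()

    # Get or create the feature group (a table managed by the feature store).
    feature_group = feature_store.get_or_create_feature_group(
        name="energy_consumption",             # illustrative name
        version=feature_group_version,
        description="Hourly energy consumption data.",
        primary_key=["area", "datetime_utc"],  # illustrative keys
        event_time="datetime_utc",
    )

    # Insert the latest processed data samples.
    feature_group.insert(data, write_options={"wait_for_job": True})

    # Add a description for each feature.
    for item in feature_descriptions:
        feature_group.update_feature_description(item["name"], item["description"])

    # Tell Hopsworks to compute statistics for every feature.
    feature_group.statistics_config = {"enabled": True, "histograms": True, "correlations": True}
    feature_group.update_statistics_config()
    feature_group.compute_statistics()

    return feature_group
```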

Check out the video below to see how everything I explained above looks in Hopsworks 👇

Hopsworks overview [Video by the Author].

Awesome! Now we have a Python ETL script that extracts the data from the energy consumption API for a given time window and loads it into the feature store.

One final step is to create a feature view and a training dataset that will later be ingested by the training pipeline.

The feature pipeline is the only process that WRITES to the feature store. Other components will only query the feature store for various datasets. By doing so, we can safely use the feature store as our single source of truth and share the features across the system.

In the file, we have the create() method that runs the following logic (see the sketch after the list):

  1. We load the metadata saved by the feature pipeline. Remember that the feature engineering metadata contains the start and end of the extraction window, the version of the feature group, etc.
  2. We log into the Hopsworks project & create a reference to the feature store.
  3. We delete all the old feature views (normally, you do not have to do this step. Quite the opposite, you want to keep your old datasets. But Hopsworks' free version limits you to 100 feature views, so we did this to stay within the free tier).
  4. We get the feature group based on the given version.
  5. We create a feature view with all the data from the loaded feature group.
  6. We create a training dataset using only the given time window.
  7. We create a snapshot of the metadata and save it to disk.
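Here is a minimal sketch of the core of create() (steps 4-6), again based on the public Hopsworks Python API; the feature view name and metadata keys are illustrative assumptions:

```python
import hopsworks


def create(metadata: dict, feature_group_version: int = 1, feature_view_version: int = 1):
    """Create a feature view and a training dataset for the given extraction window."""
    # Log into the Hopsworks project and get a reference to the feature store.
    project = hopsworks.login(api_key_value="<YOUR_HOPSWORKS_API_KEY>")
    feature_store = project.get_feature_store()

    # Get the feature group for the given version and select all of its features.
    feature_group = feature_store.get_feature_group("energy_consumption", version=feature_group_version)

    # Create a feature view on top of the feature group's query.
    feature_view = feature_store.create_feature_view(
        name="energy_consumption_view",   # illustrative name
        version=feature_view_version,
        query=feature_group.select_all(),
    )

    # Create a training dataset restricted to the extraction window from the metadata.
    feature_view.create_training_data(
        description="Energy consumption training dataset.",
        data_format="csv",
        start_time=metadata["export_window_start"],
        end_time=metadata["export_window_end"],
        write_options={"wait_for_job": True},
    )

    return feature_view
```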

A feature view is a powerful way of combining multiple feature groups into a single "dataset." It is similar to a VIEW in a SQL database. You can read more about feature views here [4].

That was it. You built a feature pipeline that extracts, transforms, and loads the data into a feature store. Based on the data from the feature store, you created a feature view and a training dataset that will later be used throughout the system as the single source of truth.

You need a good knowledge of software engineering principles and patterns to build robust feature engineering pipelines. You can read some hands-on examples here.
