Chronon — A Declarative Feature Engineering Framework
Background
Introducing Chronon
API Overview
Understanding accuracy
Understanding data sources
Event data sources
Entity data sources
Cumulative Event Sources
Understanding computation contexts
Understanding computation types
Understanding Aggregations
Putting it all together

The Airbnb Tech Blog

Nikhil Simha Raprolu

Background

Airbnb uses machine learning in almost every product, from ranking search results to intelligently pricing listings and routing users to the right customer support agents.

We noticed that feature management was a consistent pain point for the ML engineers working on these projects. Rather than focusing on their models, they were spending much of their time gluing together other pieces of infrastructure to manage their feature data, and still encountering issues.

One common issue arose from the log-and-wait approach to generating training data, where a user logs feature values from their serving endpoint, then waits to accumulate enough data to train a model. This wait period can be more than a year for models that need to capture seasonality. This was a major pain point for machine learning practitioners, hindering them from responding quickly to changing user behaviors and product demands.

A common approach to address this wait time is to transform raw data in the warehouse into training data using ETL jobs. However, users encountered a critical problem when they tried to launch their model to production: they needed to write complex streaming jobs or replicate the ETL logic to serve their feature data, and often couldn't guarantee that the feature distribution for serving model inference was consistent with what they trained on. This training-serving skew led to hard-to-debug model degradation and worse-than-expected model performance.

Introducing Chronon

Chronon was built to address these pain points. It allows ML practitioners to define features and centralize the data computation for both model training and production inference, while guaranteeing consistency between the two.

API Overview

This post is focused on the Chronon API and capabilities. At a high level, these include:

  • Ingest data from a variety of sources — event streams, fact/dim tables in the warehouse, table snapshots, Slowly Changing Dimension tables, change data streams, etc.
  • Transform that data — it supports standard SQL-like transformations as well as more powerful time-based aggregations.
  • Produce results both online and offline — online, as low-latency endpoints for feature serving, or offline, as Hive tables for generating training data.
  • Real-time or batch accuracy — you can choose whether feature values are updated in real-time or at fixed intervals with an "Accuracy" parameter. This also ensures the same behavior even while backfilling.
  • A powerful Python API — that treats time-based aggregation and windowing as first-class concepts, along with familiar SQL primitives like Group-By, Join, Select, etc., while retaining the full flexibility and composability offered by Python.

First, let's start with an example. The code snippet below computes the number of times an item is viewed by a user in the last five hours from an activity stream, while applying some additional transformations and filters. It uses concepts like GroupBy, Aggregation, and EventSource.
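The original snippet did not survive extraction; below is a sketch of what such a definition looks like in Chronon's open-source Python API (the table, topic, and column names are illustrative):

    from ai.chronon.api.ttypes import Source, EventSource, Accuracy
    from ai.chronon.query import Query, select
    from ai.chronon.group_by import GroupBy, Aggregation, Operation, Window, TimeUnit

    # Activity stream: a date-partitioned Hive table for backfills, plus a
    # Kafka topic carrying the same events for real-time updates.
    view_events = Source(events=EventSource(
        table="user_activity.user_views",
        topic="user_views_stream",
        query=Query(
            # Transformation: turn item-view events into a 0/1 indicator column.
            selects=select(
                "user_id",
                "item_id",
                view="IF(activity_type = 'item_view', 1, 0)",
            ),
            # Filter: drop events without a user.
            wheres=["user_id IS NOT NULL"],
            time_column="ts",
        ),
    ))

    # Count views per (user, item) over the trailing five hours, refreshed in
    # near real-time (Temporal accuracy) and served from an online endpoint.
    view_count_5h = GroupBy(
        sources=[view_events],
        keys=["user_id", "item_id"],
        aggregations=[Aggregation(
            input_column="view",
            operation=Operation.SUM,
            windows=[Window(length=5, timeUnit=TimeUnit.HOURS)],
        )],
        accuracy=Accuracy.TEMPORAL,
        online=True,
    )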

In the sections below, we'll demystify these concepts.

Understanding accuracy

Some use-cases require derived data to be as up-to-date as possible, while others allow for updating at a daily cadence. For instance, understanding the intent of a user's search session requires accounting for the latest user activity. To display revenue figures on a dashboard for human consumption, it is usually adequate to refresh the results at fixed intervals.

Chronon allows users to express whether a derivation needs to be updated in near real-time or at daily intervals by setting the 'Accuracy' of a computation — which can be either 'Temporal' or 'Snapshot'. In Chronon, this accuracy applies both to online serving of data via low-latency endpoints, and also to offline backfilling via batch computation jobs.

Understanding data sources

Real-world data is ingested into the data warehouse continuously. There are three kinds of ingestion patterns. In Chronon, these ingestion patterns are specified by declaring the "type" of a data source.

Event data sources

Timestamped activity like views, clicks, sensor readings, stock prices, etc. — published into a data stream like Kafka.

In the data lake, these events are stored in date-partitioned tables (Hive). Assuming timestamps are millisecond-precise and the data ingestion is partitioned by date, a date partition '2023-07-04' of click events contains click events that happened between '2023-07-04 00:00:00.000' and '2023-07-04 23:59:59.999'. You can configure the date partition based on your warehouse convention, once globally, as a Spark parameter:

    --conf "spark.chronon.partition.column=date_key"

In Chronon, you can declare an EventSource by specifying two things: a 'table' (Hive) and optionally a 'topic' (Kafka). Chronon can use the 'table' to backfill data — with Temporal accuracy. When a 'topic' is provided, we can update a key-value store in real-time to serve fresh data to applications and ML models.

Entity data sources

Attribute metadata related to business entities. A few examples for a retail business would be user information — with attributes like address, country, etc. — or item information — with attributes like price, available count, etc. This data is usually served online to applications via OLTP databases like MySQL. These tables are snapshotted into the warehouse, usually at daily intervals. So a '2023-07-04' partition contains a snapshot of the item information table taken at '2023-07-04 23:59:59.999'.

However, these snapshots can only support 'Snapshot'-accurate computations; they are insufficient for 'Temporal' accuracy. If you have a change data capture mechanism, Chronon can utilize the change data stream with table mutations to maintain a near real-time refreshed view of computations. If you also capture this change data stream in your warehouse, Chronon can backfill computations at historical points in time with 'Temporal' accuracy.

You can create an entity source by specifying three things: 'snapshotTable', and optionally 'mutationTable' and 'mutationTopic' for 'Temporal' accuracy. If you specify 'mutationTopic' — the data stream with mutations corresponding to the entity — Chronon will be able to maintain a real-time updated view that can be read from with low latency. If you specify 'mutationTable', Chronon will be able to backfill data at historical points in time with millisecond precision.
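As a sketch (again using the open-source Python API; the table, topic, and column names are hypothetical), an entity source for a user dimension might look like:

    from ai.chronon.api.ttypes import Source, EntitySource
    from ai.chronon.query import Query, select

    # 'snapshotTable' alone supports Snapshot accuracy; the mutation table and
    # topic are what unlock Temporal accuracy, offline and online respectively.
    users = Source(entities=EntitySource(
        snapshotTable="dims.users_daily_snapshot",  # daily snapshots of the OLTP table
        mutationTable="dims.users_mutations",       # change data captured into the warehouse
        mutationTopic="users_mutation_stream",      # change data stream, e.g. from CDC
        query=Query(
            selects=select("user_id", "country", "account_age_days"),
        ),
    ))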

Cumulative Event Sources

This data model is typically used to capture the history of values for slowly changing dimensions. Entries of the underlying database table are only ever inserted and never updated, except for a surrogate column (SCD2).

They are also snapshotted into the data warehouse using the same mechanism as entity sources. But because they track all changes, just the latest partition of the snapshot is sufficient for backfilling computations, and no 'mutationTable' is required.

In Chronon, you can specify a Cumulative Event Source by creating an event source with 'table' and 'topic' as before, but also by enabling the 'isCumulative' flag. The 'table' is the snapshot of the online database table that serves application traffic. The 'topic' is the data stream containing all of the insert events.
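A sketch of such a source (hypothetical names), reusing the EventSource type with the flag enabled:

    from ai.chronon.api.ttypes import Source, EventSource
    from ai.chronon.query import Query, select

    # Insert-only (SCD2-style) price history: the latest snapshot partition
    # contains the full history, so no mutationTable is needed.
    item_prices = Source(events=EventSource(
        table="dims.item_prices_snapshot",  # snapshot of the online, insert-only table
        topic="item_price_insert_stream",   # stream of insert events
        isCumulative=True,
        query=Query(
            selects=select("item_id", "price"),
            time_column="ts",
        ),
    ))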

Understanding computation contexts

Chronon can compute in two contexts, online and offline, with the same compute definition.

Offline computation is done over warehouse datasets (Hive tables) using batch jobs. These jobs output new datasets. Chronon is designed to deal with datasets that change — newly arriving data lands in the warehouse as Hive table partitions.

Online, the usage is to serve application traffic at low latency (~10ms) and high QPS. Chronon maintains endpoints that serve features which are updated in real-time, by generating "lambda architecture" pipelines. You can set the parameter "online = True" in Python to enable this.

Under the hood, Chronon orchestrates pipelines using Kafka, Spark/Spark Streaming, Hive, Airflow, and a customizable key-value store to power serving and training data generation.

Understanding computation types

All Chronon definitions fall into three categories: a GroupBy, a Join, or a StagingQuery.

GroupBy — an aggregation primitive similar to SQL's group-by, with native support for windowed and bucketed aggregations. It supports computation in both online and offline contexts and in both accuracy models — Temporal (realtime refreshed) and Snapshot (daily refreshed). GroupBy has a notion of keys by which the aggregations are performed.

Join — joins together data from various GroupBy computations. In online mode, a join query containing keys will be fanned out into queries per GroupBy and external services, and the results will be joined together and returned as a map. In offline mode, a join can be thought of as a list of queries at historical points in time, against which the results need to be computed in a point-in-time-correct fashion. If the left side is Entities, we always compute responses as of midnight.
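A sketch of a join definition (hypothetical names; view_count_5h refers to the GroupBy sketched earlier):

    from ai.chronon.join import Join, JoinPart
    from ai.chronon.api.ttypes import Source, EventSource
    from ai.chronon.query import Query, select

    # The left side supplies the keys and timestamps at which feature values
    # are computed; here, a hypothetical checkout event stream.
    checkouts = Source(events=EventSource(
        table="user_activity.checkouts",
        query=Query(
            selects=select("user_id", "item_id"),
            time_column="ts",
        ),
    ))

    # Offline, this backfills a point-in-time-correct training dataset; online,
    # it fans out to the underlying GroupBy endpoints and merges the results.
    checkout_features = Join(
        left=checkouts,
        right_parts=[JoinPart(group_by=view_count_5h)],
        online=True,
    )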

StagingQuery — allows for arbitrary computation expressed as a Spark SQL query, computed offline daily. Chronon produces partitioned datasets. It is best suited for data pre- or post-processing.
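A sketch (names are hypothetical; the '{{ start_date }}' and '{{ end_date }}' template macros follow Chronon's documented StagingQuery convention):

    from ai.chronon.api.ttypes import StagingQuery, MetaData

    # Arbitrary Spark SQL, recomputed daily over the partition range being
    # processed and materialized as a partitioned Hive table.
    cleaned_views = StagingQuery(
        metaData=MetaData(name="cleaned_views"),
        startPartition="2023-01-01",
        query="""
            SELECT user_id, item_id, ts
            FROM raw.user_views
            WHERE user_id IS NOT NULL
              AND ds BETWEEN '{{ start_date }}' AND '{{ end_date }}'
        """,
    )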

Understanding Aggregations

GroupBys in Chronon essentially aggregate data by given keys. There are several extensions to the traditional SQL group-by that make Chronon aggregations powerful:

  1. Windows — optionally, you can choose to aggregate only recent data within a window of time. This is critical for ML, since un-windowed aggregations tend to grow and shift in their distributions, degrading model performance. It is also critical for placing greater emphasis on recent events over very old ones.
  2. Bucketing — optionally, you can also specify a second level of aggregation on a bucket, besides the group-by keys. The output of a bucketed aggregation is a column of map type, containing the bucket values as keys and aggregates as values.
  3. Auto-unpacking — if the input column contains data nested within an array, Chronon will automatically unpack it.
  4. Time-based aggregations — like first_k, last_k, first, last, etc., when a timestamp is specified in the data source.

You can combine all of these options flexibly to define very powerful aggregations, as sketched below. Chronon internally maintains partial aggregates and combines them to produce features at different points in time, so using very large windows and backfilling training data for large date ranges is not a problem.
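For example, here is a sketch combining windows and buckets in one GroupBy (illustrative names): average purchase price per user, bucketed by item category, over two windows at once. The bucketed output is a map column from category to average.

    from ai.chronon.api.ttypes import Source, EventSource
    from ai.chronon.query import Query, select
    from ai.chronon.group_by import GroupBy, Aggregation, Operation, Window, TimeUnit

    purchases = Source(events=EventSource(
        table="user_activity.purchases",
        query=Query(
            selects=select("user_id", "price", category="item_category"),
            time_column="ts",
        ),
    ))

    purchase_features = GroupBy(
        sources=[purchases],
        keys=["user_id"],
        aggregations=[Aggregation(
            input_column="price",
            operation=Operation.AVERAGE,
            buckets=["category"],  # second-level aggregation per category
            windows=[
                Window(length=7, timeUnit=TimeUnit.DAYS),
                Window(length=30, timeUnit=TimeUnit.DAYS),
            ],
        )],
    )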

Putting it all together

As a user, you declare your computation just once, and Chronon generates all of the infrastructure needed to continuously turn raw data into features for both training and serving. ML practitioners at Airbnb no longer spend months trying to manually implement complex pipelines and feature indexes. They typically spend less than a week to generate new sets of features for their models.

Our core goal has been to make feature engineering as productive and as scalable as possible. Since the release of Chronon, users have developed over ten thousand features powering ML models at Airbnb.

Thanks to: Dave Nagle, Adam Kocoloski, Paul Ellwood, Joy Zhang, Sanjeev Katariya, Mukund Narasimhan, Jack Song, Weiping Peng, Haichun Chen, Atul Kale

Varant Zanoyan, Pengyu Hou, Cristian Figueroa, Haozhen Ding, Sophie Wang, Vamsee Yarlagadda, Evgenii Shapiro, Patrick Yoon

Navjot Sidhu, Xin Liu, Soren Telfer, Cheng Huang, Tom Benner, Wael Mahmoud, Zach Fein, Ben Mendler, Michael Sestito, Yinhe Cheng, Tianxiang Chen, Jie Tang, Austin Chan, Moose Abdool, Kedar Bellare, Mia Zhao, Yang Qi, Kosta Ristovski, Lior Malka, David Staub, Chandramouli Rangarajan, Guang Yang, Jian Chen
