Thoughts on Stateful ML, Online Learning, and Intelligent ML Model Retraining
Definitions
Designing an MVP for online learning
Designing something that scales
Some sensible architectures for intelligent retraining, continuous learning, and online learning
Next steps

Ever since I read Chip Huyen's Real-time machine learning: challenges and solutions, I've been excited about the future of machine learning in production. Short feedback loops, real-time features, and stateful ML model deployments capable of learning online merit a very different kind of systems architecture than most of the stateless ML model deployments I work with today.

Me thinking about stateful ML in Cozumel, MX — Image by Author

For the past few months, I've been conducting informal user research, white-boarding, and doing ad-hoc development to get to the heart of what a real stateful ML system might look like. For the most part, this post tells the story of my thought process as I continue to dive into this space and uncover interesting and unique architectural challenges.

Stateful retraining involves updating model parameters instead of retraining from scratch in order to:

  • Decrease training time
  • Save cost
  • Update models more often
Stateless versus stateful retraining — from Chip Huyen with permission
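To make the difference concrete, here is a minimal, self-contained sketch (all names are made up for illustration) contrasting a stateful update with retraining from scratch, using a trivially simple "model" whose only parameter is a running mean:

```python
# Toy illustration of stateful vs. stateless retraining. The "model" here
# predicts the mean of every target seen so far; its single parameter can
# either be updated incrementally or recomputed from the full history.

class MeanModel:
    """Predicts the mean of all targets seen so far."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0  # the model "parameter"

    def update(self, batch):
        # Stateful retraining: fold in only the new batch, O(len(batch)) work.
        for y in batch:
            self.n += 1
            self.mean += (y - self.mean) / self.n

    def predict(self):
        return self.mean

def retrain_from_scratch(history):
    # Stateless retraining: revisit the full history, O(total data) work.
    model = MeanModel()
    model.update(history)
    return model

stateful = MeanModel()
stateful.update([1.0, 2.0, 3.0])   # initial training
stateful.update([4.0, 5.0])        # cheap incremental update

stateless = retrain_from_scratch([1.0, 2.0, 3.0, 4.0, 5.0])
assert abs(stateful.predict() - stateless.predict()) < 1e-9  # same parameter
```

The incremental update touches only the new batch, while the stateless version pays for the full history on every retrain; the same trade-off applies to warm-starting real model weights.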

Online learning involves learning from ground truth examples in real time in order to:

  • Increase model performance and reactivity
  • Mitigate performance issues that might result from drift/staleness

Right now, most learning in the industry is done offline, in batch.

Intelligent retraining typically refers to automatically retraining models based on some performance metric, rather than on a fixed schedule, in order to:

  • Reduce cost without sacrificing performance

Right now, most models across industries are retrained on a schedule using DAGs.

Intelligent retraining architecture from A Guide To Automated Model Retraining — by Arize AI with permission

In a previous article, I tried to use foundational engineering principles to create a dead-simple online learning architecture. My first thought was to model stateful, online learning architecture after stateful web applications. By treating the "model" as the DB (where predictions are reads and incremental training sessions are writes), I thought I could simplify the design process.

Image by Author

To a degree, I actually did! Using the online learning library River, I built a small, stateful online learning application that let me update a model and serve predictions in real time.

Flask app that shares a model in memory across multiple workers — Image by Author
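The "model as DB" analogy can be sketched with a toy stand-in for a River-style estimator (River's real models expose a similar learn_one / predict_proba_one interface; the class below is hypothetical and written from scratch), where predictions are reads and training steps are writes against shared in-memory state:

```python
import math

# A toy online logistic regression mirroring River's learn_one /
# predict_proba_one style. Predictions are "reads" and incremental
# training steps are "writes" against the shared weights dict.

class OnlineLogReg:
    def __init__(self, lr=0.1):
        self.lr = lr
        self.weights = {}  # feature name -> weight: the shared "DB"

    def predict_proba_one(self, x):
        # The "read" path: score one feature dict.
        z = sum(self.weights.get(f, 0.0) * v for f, v in x.items())
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, x, y):
        # The "write" path: one SGD step on the log-loss gradient.
        error = self.predict_proba_one(x) - y
        for f, v in x.items():
            self.weights[f] = self.weights.get(f, 0.0) - self.lr * error * v

model = OnlineLogReg()
for _ in range(100):
    model.learn_one({"bias": 1.0, "x": 1.0}, 1)   # positive example
    model.learn_one({"bias": 1.0, "x": -1.0}, 0)  # negative example

assert model.predict_proba_one({"bias": 1.0, "x": 1.0}) > 0.5
assert model.predict_proba_one({"bias": 1.0, "x": -1.0}) < 0.5
```

Wrapping something like this in a Flask handler is exactly where the scaling issues below start: the weights dict lives in one process's memory.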

This approach was cool and fun to code — but has some fundamental issues at scale:

  1. We can easily share a model in the memory of a single application — but this approach doesn't scale across multiple pods in orchestration engines like Kubernetes
  2. I don't know (and don't want to be the one to find out) about the caveats of trying to support a deployment that mixes training and serving
  3. Online learning is the most proactive form of machine learning possible, but we haven't even validated that we need it in the first place. There has to be a better place to start…

Let's start from an existing standard: distributed model training. It's fairly common practice to use something like a parameter server as a centralized store while multiple workers calculate a partial/distributed gradient…or something…and reconcile the parameters after the fact.
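As a rough illustration of that reconciliation step (heavily simplified; real parameter servers update asynchronously and shard the parameters), here is a toy synchronous version where two workers compute gradients on their own data shards and a central step averages them:

```python
# Toy synchronous parameter-server step for the 1-parameter model
# y_hat = w * x. "Workers" compute gradients on their own shards; the
# central store averages them into one update.

def gradient(w, shard):
    # Gradient of mean squared error over one worker's shard.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def parameter_server_step(w, shards, lr=0.05):
    grads = [gradient(w, shard) for shard in shards]  # work done by "workers"
    return w - lr * sum(grads) / len(grads)           # reconcile centrally

# True relationship: y = 3x, split across two workers.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = 0.0
for _ in range(200):
    w = parameter_server_step(w, shards)
print(round(w, 3))  # → 3.0
```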

So — I thought I'd try to think about this in the context of real-time model serving deployments, and came up with the dumbest architecture possible.

An architecture that makes no sense — Image by Author

Distributed model training is meant to speed up the training process. However, in this instance there's no real need for both training and serving to be distributed — keeping the training decentralized introduces complexity and serves no purpose in an online training system. It makes far more sense to separate out training entirely.

An architecture that makes slightly more sense — Image by Author

Great! Sort of. At this point I needed to take a step back, as I was making quite a few assumptions and probably getting a bit ahead of myself:

  1. We may not be able to get ground truth in near-real time
  2. Continuous online training may not provide a net benefit over continuous offline training, and may be a premature optimization
  3. Offline/online learning may also not be binary — there are scenarios where we might want/need both!

Let's start from a simpler offline scenario: I want to use some sort of ML observability system to automatically retrain a model based on performance metric degradation. In a scenario where I'm doing continuous training (and model weights don't take long to update), this is feasible without significant business impact.

Intelligent retraining and continuous online learning — Image by Author

Amazing — the first reasonable thing I've drawn all day! This system likely has a lower cost overhead than a stateless training architecture, and is reactive to changes in the model/data. We save a lot of $ by only retraining as needed, and overall it's pretty simple!
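The trigger at the heart of this loop can be prototyped in a few lines; the window size and threshold below are arbitrary placeholders, and a real system would lean on ML observability tooling rather than a hand-rolled class:

```python
from collections import deque

# Sketch of a metric-degradation retraining trigger: retrain only when a
# rolling accuracy drops past a threshold, rather than on a schedule.

class RetrainTrigger:
    def __init__(self, window=100, min_accuracy=0.8):
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = wrong
        self.min_accuracy = min_accuracy

    def record(self, prediction, ground_truth):
        self.outcomes.append(1 if prediction == ground_truth else 0)

    def should_retrain(self):
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough evidence yet
        return sum(self.outcomes) / len(self.outcomes) < self.min_accuracy

trigger = RetrainTrigger(window=10, min_accuracy=0.8)
for _ in range(10):
    trigger.record(1, 1)            # model is healthy
assert not trigger.should_retrain()
for _ in range(5):
    trigger.record(1, 0)            # drift: rolling accuracy falls to 0.5
assert trigger.should_retrain()
```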

This architecture has a big problem though…it's not nearly as fun! What might a system look like that has all of the reactivity of online learning, the cost savings of intelligent retraining, and the resilience of continuous learning?! Hopefully, something like this…

Continuous, online learning — Image by Author

Though there are details I still haven't fleshed out, there are quite a few advantages to this architecture. It allows for mixed online and offline learning (just as feature stores allow access to both streaming features and features computed offline), is extremely robust to changes in data distribution and even to individual user preferences for personalized systems (recsys), and still allows us to integrate ML observability (O11y) tooling to continually measure data distributions and performance.

However, though this might be the most sensible diagram I've created yet, it still leaves quite a few open questions:

  • How/when do we evaluate the model, and with what data, in an online system? If the data distribution is subject to large shifts, we'll need to create new data-driven methodologies and best practices for designing a held-out evaluation set that includes both old data and the most recent data.
  • How do we reconcile an ML model that splits training into batch/offline and online processes? We'll need to experiment with new techniques and system architectures to allow for the complex, computationally heavy operations that large ML models require in a system like this.
  • How do we pull/push the model weights? On a cadence? During some event, or subject to some change in a metric? Each of these architectural decisions will have a significant impact on the performance of our system — and without online A/B testing or other research, it will be difficult to validate these decisions.
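That last question, when to push/pull weights, can at least be prototyped as pluggable policies; both policies below are assumptions for illustration, not recommendations:

```python
# Two hypothetical weight-sync policies: one cadence-based (sync every N
# requests), one metric-based (sync when a monitored metric moves too far).

class CadencePolicy:
    """Sync weights every `every` prediction requests."""
    def __init__(self, every):
        self.every, self.count = every, 0

    def should_sync(self, metric_delta=0.0):
        self.count += 1
        return self.count % self.every == 0

class MetricPolicy:
    """Sync weights when the monitored metric moves more than `threshold`."""
    def __init__(self, threshold):
        self.threshold = threshold

    def should_sync(self, metric_delta=0.0):
        return abs(metric_delta) > self.threshold

cadence = CadencePolicy(every=3)
assert [cadence.should_sync() for _ in range(6)] == [
    False, False, True, False, False, True]

metric = MetricPolicy(threshold=0.05)
assert metric.should_sync(metric_delta=0.01) is False
assert metric.should_sync(metric_delta=0.10) is True
```

Validating which policy is right (and with what parameters) is exactly the A/B-testing problem noted above.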

Of course, one of my next steps is simply to start building some of these things and see what happens. However, I would appreciate insight, ideas, and engagement from any and all folks in the industry to think about what some paths forward might be!

Please reach out on Twitter, LinkedIn, or sign up for the next session of my course on Designing Production ML Systems this May!
