“We discovered that they were not only service providers, but partners who were invested in our goals and outcomes” – Nicolas Kuzak, Senior ML Engineer at Rocket Money.
We created Rocket Money (a personal finance app formerly known as Truebill) to help users improve their financial wellbeing. Users link their bank accounts to the app, which then classifies and categorizes their transactions, identifying recurring patterns to provide a consolidated, comprehensive view of their personal financial life. A critical stage of transaction processing is detecting known merchants and services, some of which Rocket Money can cancel and negotiate the cost of for members. This detection starts with the transformation of short, often truncated and cryptically formatted transaction strings into classes we can use to enrich our product experience.
The Journey Toward a New System
We first extracted brands and products from transactions using regular expression-based normalizers. These were used in tandem with an increasingly intricate decision table that mapped strings to corresponding brands. This system proved effective for the first four years of the company, when classes were tied only to the products we supported for cancellations and negotiations. However, as our user base grew, the subscription economy boomed, and the scope of our product increased, we needed to keep up with the rate of new classes while simultaneously tuning regexes and preventing collisions and overlaps. To address this, we explored various traditional machine learning (ML) solutions, including a bag-of-words model with a model-per-class architecture. This system struggled with maintenance and performance and was mothballed.
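To make the old approach concrete, here is a minimal sketch of what such a regex normalizer and decision table could look like; the patterns and brand mappings are invented for illustration and are not our production rules.

```python
import re
from typing import Optional

# Hypothetical normalizers: each pattern strips or collapses noise in the raw string.
NORMALIZERS = [
    (re.compile(r"[#*]\d+"), ""),                  # strip reference numbers
    (re.compile(r"\b(POS|ACH|DEBIT|WEB)\b"), ""),  # strip processor noise
    (re.compile(r"\s+"), " "),                     # collapse whitespace
]

# Hypothetical decision table mapping normalized prefixes to brands.
DECISION_TABLE = {
    "NETFLIX.COM": "Netflix",
    "SPOTIFY USA": "Spotify",
    "AMZN MKTP": "Amazon",
}

def classify(raw: str) -> Optional[str]:
    text = raw.upper()
    for pattern, replacement in NORMALIZERS:
        text = pattern.sub(replacement, text)
    text = text.strip()
    # Prefix match against the decision table; collisions and overlaps between
    # entries were exactly the maintenance burden described above.
    for key, brand in DECISION_TABLE.items():
        if text.startswith(key):
            return brand
    return None

print(classify("ACH NETFLIX.COM #12345"))  # -> Netflix
```

Every new merchant meant another pattern and another table entry, which is why this approach stopped scaling as classes multiplied.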
We decided to start from a clean slate, assembling both a new team and a new mandate. Our first task was to accumulate training data and construct an in-house system from scratch. We used Retool to build labeling queues, gold standard validation datasets, and drift detection monitoring tools. We explored a number of different model topologies, but ultimately chose a BERT family of models to solve our text classification problem. The bulk of the initial model testing and evaluation was conducted offline within our GCP warehouse. Here we designed and built the telemetry and system we used to measure the performance of a model with 4000+ classes.
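As a rough sketch of how a BERT-family classifier can be fine-tuned for this kind of problem with the transformers and datasets libraries (the checkpoint, tiny inline dataset, and hyperparameters below are placeholders rather than our actual training setup):

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Toy labeled data; in practice the labels come from the labeling queues described above.
labels = ["Netflix", "Spotify", "Amazon"]
train = Dataset.from_dict({
    "text": ["NETFLIX.COM 800-123-4567", "SPOTIFY USA", "AMZN MKTP US*1A2B3"],
    "label": [0, 1, 2],
})

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels)  # the real model has 4000+ classes
)

def tokenize(batch):
    # Transaction strings are short, so a small max_length keeps training and inference cheap.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=32)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tx-classifier", num_train_epochs=1),
    train_dataset=train.map(tokenize, batched=True),
)
trainer.train()
```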
Solving Domain Challenges and Constraints by Partnering with Hugging Face
There are many unique challenges we face within our domain, including entropy injected by merchants, processing/payment companies, institutional differences, and shifts in user behavior. Designing and building effective model performance alerting along with realistic benchmarking datasets has proven to be an ongoing challenge. Another significant hurdle is determining the optimal number of classes for our system – each class represents a significant amount of effort to create and maintain. Therefore, we must consider the value it provides to users and our business.
With a model performing well in offline testing and a small team of ML engineers, we were faced with a new challenge: seamless integration of that model into our production pipeline. The existing regex system processed more than 100 million transactions per month with a very bursty load, so it was crucial to have a high-availability system that could scale dynamically to load and maintain low overall latency within the pipeline, coupled with a system that was compute-optimized for the models we were serving. As a small startup at the time, we chose to buy rather than build the model serving solution. At the time, we didn’t have in-house model ops expertise and we needed to focus the energy of our ML engineers on enhancing the performance of the models within the product. With this in mind, we set out in search of a solution.
To begin with, we auditioned a hand-rolled, in-house model hosting solution we had been using for prototyping, comparing it against AWS Sagemaker and Hugging Face’s new model hosting Inference API. Given that we use GCP for data storage and Google Vertex Pipelines for model training, exporting models to AWS Sagemaker was clunky and bug prone. Thankfully, the setup for Hugging Face was quick and easy, and it was able to handle a small portion of traffic within a week. Hugging Face simply worked out of the gate, and this reduced friction led us to continue down this path.
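For context, this is roughly what a client for a hosted text-classification model looks like over HTTP; the model ID, environment variable, and response shape below are illustrative rather than our production integration.

```python
import os
import requests

# Hypothetical hosted model; a dedicated endpoint would expose its own URL instead.
API_URL = "https://api-inference.huggingface.co/models/your-org/transaction-classifier"
HEADERS = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}

def classify_transactions(texts):
    # Batch raw transaction strings into a single request to reduce per-call overhead.
    response = requests.post(API_URL, headers=HEADERS, json={"inputs": texts}, timeout=10)
    response.raise_for_status()
    # Text-classification models typically return a list of {label, score} dicts per input.
    return response.json()

print(classify_transactions(["NETFLIX.COM 800-123-4567", "SPOTIFY USA"]))
```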
After a detailed three-month evaluation period, we selected Hugging Face to host our models. During this time, we gradually increased transaction volume to their hosted models and ran numerous simulated load tests based on our worst-case scenario volumes. This process allowed us to fine-tune our system and monitor performance, ultimately giving us confidence in the Inference API’s ability to handle our transaction enrichment loads.
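The load tests were along these lines: replay a worst-case burst of transaction strings concurrently and record latency percentiles. This sketch reuses the hypothetical classify_transactions client from above, and the volumes and concurrency are illustrative.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Simulated worst-case burst of raw transaction strings.
SAMPLE = ["NETFLIX.COM 800-123-4567", "SPOTIFY USA", "AMZN MKTP US*1A2B3"] * 1000

def timed_call(text):
    start = time.perf_counter()
    classify_transactions([text])          # hypothetical client from the sketch above
    return time.perf_counter() - start

with ThreadPoolExecutor(max_workers=32) as pool:
    latencies = sorted(pool.map(timed_call, SAMPLE))

p50 = latencies[len(latencies) // 2]
p99 = latencies[int(len(latencies) * 0.99)]
print(f"p50={p50:.3f}s p99={p99:.3f}s")
```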
Beyond technical capabilities, we also established a strong rapport with the team at Hugging Face. We discovered they were not just service providers, but partners who were invested in our goals and outcomes. Early in our collaboration we set up a shared Slack channel, which proved invaluable. We were particularly impressed by their prompt response to issues and proactive approach to problem-solving. Their engineers and CSMs consistently demonstrated their commitment to our success and dedication to doing things right. This gave us an additional layer of confidence when it was time to make the final selection.
Integration, Evaluation, and the Final Selection
“Overall, the experience of working hand in hand with Hugging Face on model deployment has been enriching for our team and has instilled in us the confidence to push for greater scale” – Nicolas Kuzak, Senior ML Engineer at Rocket Money.
Once the contract was signed, we began migrating off our regex-based system, directing an increasing amount of critical path traffic to the transformer model. Internally, we had to build some new telemetry for both model and production data monitoring. Given that this system sits so early in the product experience, any inaccuracies in model outcomes could significantly impact business metrics. We ran a detailed experiment where new users were split equally between the old system and the new model. We assessed model performance in conjunction with broader business metrics, such as paid user retention and engagement. The ML model clearly outperformed in terms of retention, leading us to confidently make the decision to scale the system – first to new users and then to existing users – ramping to 100% over a span of two months.
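For illustration, a deterministic user split of the kind that can drive such an experiment looks like this; the hashing scheme, experiment name, and 50/50 allocation are assumptions, not our actual experimentation framework.

```python
import hashlib

def variant(user_id: str, experiment: str = "transaction-classifier-v1") -> str:
    # Hash the experiment name and user ID together so each user lands in a stable arm.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable 0-99 bucket per user
    return "transformer_model" if bucket < 50 else "regex_system"

print(variant("user_123"))  # the same user always gets the same assignment
```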
With the model fully positioned within the transaction processing pipeline, both uptime and latency became major concerns. Many of our downstream processes depend on classification results, and any complications can result in delayed data or incomplete enrichment, both causing a degraded user experience.
The inaugural year of collaboration between Rocket Money and Hugging Face was not without its challenges. Both teams, however, displayed remarkable resilience and a shared commitment to resolving issues as they arose. One such instance was when we expanded the number of classes in our second production model, which unfortunately led to an outage. Despite this setback, the teams persevered, and we have successfully avoided a recurrence of the same issue. Another hiccup occurred when we transitioned to a new model, but we still received results from the previous one due to caching issues on Hugging Face’s end. This issue was swiftly addressed and has not recurred. Overall, the experience of working hand in hand with Hugging Face on model deployment has been enriching for our team and has instilled in us the confidence to push for greater scale.
Speaking of scale, as we began to witness a significant increase in traffic to our model, it became clear that the cost of inference would surpass our projected budget. We made use of a caching layer prior to inference calls that significantly reduces the cardinality of transactions and attempts to benefit from prior inference. Our problem could technically achieve a 93% cache rate, but we’ve only ever reached 85% in a production setting. With the model serving 100% of predictions, we’ve hit a few milestones on the Rocket Money side – our model has been able to scale to a run rate of over a billion transactions per month and manage the surge in traffic as we climbed to the #1 financial app in the app store and #7 overall, all while maintaining low latency.
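A simplified sketch of how such a caching layer can sit in front of inference, with the normalization and inference call stubbed out as placeholders:

```python
from functools import lru_cache

def normalize(raw: str) -> str:
    # Collapse whitespace and upper-case so near-duplicate strings share one cache key,
    # reducing the cardinality of what actually reaches the model.
    return " ".join(raw.upper().split())

def call_inference_api(key: str) -> str:
    # Placeholder for the hosted-model request (see the earlier client sketch).
    return "Unknown"

@lru_cache(maxsize=1_000_000)
def cached_classify(key: str) -> str:
    return call_inference_api(key)

def classify(raw: str) -> str:
    return cached_classify(normalize(raw))

# cached_classify.cache_info() exposes hits and misses, which is how a production
# hit rate like the 85% mentioned above could be tracked.
```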
Collaboration and Future Plans
“The uptime and confidence we have in the HuggingFace Inference API has allowed us to focus our energy on the value generated by the models and less on the plumbing and day-to-day operation” – Nicolas Kuzak, Senior ML Engineer at Rocket Money.
Post launch, the internal Rocket Money team is now focusing on both class and performance tuning of the model, along with more automated monitoring and training label systems. We add new labels daily and encounter the fun challenges of model lifecycle management, including unique things like company rebranding and new companies and products emerging after Rocket Companies acquired Truebill in late 2021.
We consistently examine whether we have the right model topology for our problem. While LLMs have recently been in the news, we’ve struggled to find an implementation that can outperform our specialized transformer classifiers at present in both speed and cost. We see promise in the early results of using them in the long tail of services (i.e. mom-and-pop shops) – keep an eye out for that in a future version of Rocket Money! The uptime and confidence we have in the HuggingFace Inference API has allowed us to focus our energy on the value generated by the models and less on the plumbing and day-to-day operation. With the help of Hugging Face, we have taken on more scale and complexity within our model and the types of value it generates. Their customer service and support have exceeded our expectations and they’re genuinely a great partner in our journey.
If you want to learn how Hugging Face can manage your ML inference workloads, contact the Hugging Face team here.
