Our MLOps story: Production-Grade Machine Learning for Twelve Brands
Formulating a plan
Model Artefacts and Experiment Tracking and our first ML-OOPS
Model Deployment and training
Monitoring and Alerting
Results
Thinking forward
Closing thoughts

Things we learned building an MLOps platform with limited means at DPG Media in the Netherlands

Deploying a machine learning model once is an easy task; repeatedly bringing machine learning models into production is far harder. To handle this complex process, the concept of MLOps (Machine Learning Operations) emerged. MLOps represents the convergence of DevOps, machine learning, and software engineering practices. There is plenty of nuance to add here, but a broader definition of what exactly MLOps entails is an open discussion and a battleground for vendors to pitch their products. For brevity's sake, I'd rather move on to our MLOps story.

Our MLOps journey began around September 2021. Our team had only been created six months earlier, and we started with a handful of inherited projects. Our team's goal was simple on paper: we were to provide a data/ML platform for Online Services, a part of DPG Media that focuses on websites and communities. Our "portfolio" consisted of a dozen brands, including among others a popular tech news website and its community, two job portals, a community for expecting parents, several websites where one could buy second-hand cars, and a couple more with similar audiences. That is relevant for the remainder of the post.

Back when we began building the platform, we intended to support these twelve (Dutch) brands.

At the time, the ML part of the team was just a Data Scientist and me, a Machine Learning Engineer. The two of us were hired two weeks apart, completely new to the organisation we were expected to roll into.

Back then, our ML production landscape (I use this term loosely) looked as follows:

  • A recommendation job providing suggestions to users, running in Spark via Airflow, for brand #1.
  • A pricing-suggestion model giving price recommendations for cars, also running from Airflow as a batch job, for brand #2.
  • A model to target the right audiences, running in Databricks as a scheduled notebook, once again as a batch job, for brand #3.
  • A single Flask application on EC2 that loaded various models into memory to facilitate demos and showcases of projects for various brands.

With twelve brands, each potentially on their own systems and clouds, we couldn't make strong assumptions about how anything we developed would be used.

Two Data Scientists, now gone, had done all previous projects. Most of their projects lived in Jupyter Notebooks, and the Flask application that served them was hosted on EC2 and went down frequently. A single MLflow server was running on EC2, but it was outdated and there were some security concerns with it. The Databricks project was just a single notebook, not even under version control. We didn't even know who paid for the cluster on which the job was scheduled.

Our team needed to scale, take on more projects, and spend only a little time supporting old ones. We couldn't afford to continue as things were, especially if we wanted to run real-time inference. Good opportunities for machine learning began coming in, and our clients demanded live models. The case for a structured approach towards MLOps was clear.

We could count on some help from Data Engineers (our team had four dedicated data engineers), Architects, and System Engineers across DPG, but it was clear that we should own the platform ourselves. Our data engineers in particular were stretched thin and had their own deadlines and migrations. In the long term we wanted to be the ones managing and updating our systems. We needed to own it ourselves.

We organised a few brainstorming sessions within our team and interviewed a couple of architects across the org. Slowly but surely, things became more set in stone and less scribbled on a whiteboard in some dark corner on the fifth floor of our office. We wanted to give MLflow another chance and try to keep as much on AWS as possible. After all, it was clear that AWS SageMaker was evolving into a more mature platform. We were urged by one of the leads of another team to adopt Terraform (or some form of IaC) early on. Since our data engineers were already using Terraform, this became a cornerstone of our platform.

*slaps whiteboard* this fella can fit just so many features in it. Image by author.

I joined the MLOps community to learn more about the tools and had a few conversations with vendors. I started going to meetups. Luckily, we also had an AWS enterprise support team available to help us, since we are already a big AWS user. A lot of ideas, perhaps too many, began pouring in. The MLOps scene was (and still is) a bit of a mess, and that's to be expected; what was important for us was moving fast and deciding which things mattered most to us.

We eventually decided that the principles of the platform would be roughly the following:

  • Do everything in Terraform, unless we couldn't…
  • Try to stick to data mesh (which our org adopted in 2021), unless…
  • Use managed services where possible, unless…
  • Go serverless, unless…
  • Avoid Kubernetes for as long as possible
  • Build as we go and migrate old projects when we revisit them

Despite the intention to migrate only when needed, one of the first things we tried was migrating the MLflow server to AWS Fargate and putting the database on AWS RDS, instead of running on the same EC2 instance as the server. We decided to host one instance per business unit, with separate servers for test and prod. With four BUs, this would mean 4 x 2 almost identical set-ups.

Example architecture for MLflow on AWS. Taken from https://aws.amazon.com/blogs/machine-learning/managing-your-machine-learning-lifecycle-with-mlflow-and-amazon-sagemaker/

This turned out to be a fairly expensive idea (8 copies of this set-up, all with their own load balancers and databases!) and the number of projects we were doing didn't come close to warranting it yet. We would later bring this down to two instances. Starting out big like this also meant figuring out Terraform, Elastic Load Balancers, IAM, and Route 53 head on. The learning curve was pretty steep, and a lot of time was spent getting accustomed to all the different parts of AWS.

Our first ML-OOPS was setting up a ton of (expensive) instances and maintaining them, way before they were needed.

For experimentation (a.k.a. Jupyter notebooks) we settled on AWS SageMaker Notebook instances. The notebook instances were supplied with a lifecycle script that would set the right MLflow URI as an environment variable. We created some scripts to update these in bulk as well as to monitor the notebooks themselves, such as throwing an alert in our Slack channels if someone left their instance running outside office hours.
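As an illustration, a minimal sketch of such a check could look like the following; the Slack webhook URL and the office-hours window are hypothetical placeholders, and the real script does more than this:

```python
import datetime

import boto3
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # hypothetical placeholder
OFFICE_HOURS = range(7, 19)  # assumption: 07:00-19:00 counts as office hours


def check_notebooks() -> None:
    """Post a Slack alert for every notebook instance still running outside office hours."""
    if datetime.datetime.now().hour in OFFICE_HOURS:
        return

    sagemaker = boto3.client("sagemaker")
    running = sagemaker.list_notebook_instances(StatusEquals="InService")
    for notebook in running["NotebookInstances"]:
        requests.post(
            SLACK_WEBHOOK_URL,
            json={"text": f"Notebook {notebook['NotebookInstanceName']} is still running!"},
        )


if __name__ == "__main__":
    check_notebooks()
```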

It soon became time to deploy our first models. We used SageMaker's model endpoints and settled on a Lambda and API Gateway set-up. SageMaker can essentially be a one-stop shop for deployment, although there is a premium you pay over spinning up your own EC2 instances and hosting your own models. It is more expensive than trying to run everything yourself on a Kubernetes cluster you manage yourself. You get a lot back for the premium you pay, though: SageMaker handles various deployment strategies, autoscaling, and lets us pick GPU types when needed.
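For illustration, deploying a trained TensorFlow SavedModel to such an endpoint with the SageMaker Python SDK can look roughly like this sketch; the bucket, role ARN, framework version, and payload format are hypothetical:

```python
from sagemaker.tensorflow import TensorFlowModel

# Assumptions: the model artefact lives on S3 and an execution role is available.
model = TensorFlowModel(
    model_data="s3://some-bucket/models/text-classifier/model.tar.gz",
    role="arn:aws:iam::123456789012:role/sagemaker-execution-role",
    framework_version="2.8",
)

# One call gives us a managed, autoscalable HTTPS endpoint.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    endpoint_name="text-classifier",
)

# The exact payload format depends on the model's serving signature.
print(predictor.predict({"instances": [[0.1, 0.2, 0.3]]}))
```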

Our models were easily deployed via the SageMaker SDK from both local and cloud environments (e.g. a Jupyter notebook or Airflow). The AWS Lambda gave us extra control over the input and output of the model, and the API Gateway provided a RESTful interface that we could use with API keys for our users. Apart from the model itself, all elements were deployed via Terraform, and adding new models was simply a matter of adding a Terraform module, mainly specifying a name and where to pull the Lambda's code from. This also meant we could improve the (pre-)processing around the model without changing the SageMaker endpoint, and set up CI/CD for the Lambdas individually.
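A minimal sketch of what such a Lambda handler can look like, assuming a JSON payload and a hypothetical endpoint name injected via an environment variable (the real handlers do more pre- and post-processing):

```python
import json
import os

import boto3

# Assumption: the endpoint name is injected by Terraform as an environment variable.
ENDPOINT_NAME = os.environ["SAGEMAKER_ENDPOINT_NAME"]
runtime = boto3.client("sagemaker-runtime")


def handler(event, context):
    """Pre-process the API Gateway request, call the SageMaker endpoint, post-process the reply."""
    payload = json.loads(event["body"])  # API Gateway proxy integration puts the request body here
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    prediction = json.loads(response["Body"].read())
    return {"statusCode": 200, "body": json.dumps(prediction)}
```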

Typical deployment on AWS. Image by author.

Training jobs were also delegated to SageMaker. It took considerable effort to create a template for our first few ML training jobs (LSTM models for text classification built with TensorFlow) and to log them to MLflow from inside the training job's container. We made a generic template for training jobs. SageMaker is fairly opinionated about how it wants to receive training jobs, which means you need to adhere to certain conventions set by the platform, even when using TensorFlow. Luckily, under the hood SageMaker's model serving still uses TensorFlow Extended for TensorFlow models, so there is some intuitive interoperability with SavedModels.
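Launching such a job with the SageMaker SDK can look roughly like the following sketch; the entry point, role, instance type, framework version, and hyperparameters are hypothetical, and the real template handles more configuration (including how the MLflow tracking URI reaches the container):

```python
from sagemaker.tensorflow import TensorFlow

# Assumptions: train.py contains the TensorFlow and MLflow logging code, and
# an execution role plus S3 paths are available.
estimator = TensorFlow(
    entry_point="train.py",
    role="arn:aws:iam::123456789012:role/sagemaker-execution-role",
    instance_count=1,
    instance_type="ml.g4dn.xlarge",
    framework_version="2.8",
    py_version="py39",
    hyperparameters={"epochs": 10, "mlflow_tracking_uri": "https://mlflow.example.com"},
)

# Kicks off a managed training job; the container reads data from the 'train' channel.
estimator.fit({"train": "s3://some-bucket/datasets/text-classification/"})
```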

We orchestrated our training jobs from Airflow, and explicitly did not include retraining on code merges. Some of our models are fairly expensive to train, some are not, but nearly all of them need large amounts of compute or storage. If a retrain is needed before the next scheduled run, we can simply trigger the DAG and run the pipeline.
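A minimal sketch of such a DAG, assuming Airflow's Amazon provider package and a hypothetical, heavily elided training-job config (in practice the config is built from the same template as the SageMaker SDK jobs):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.sagemaker import SageMakerTrainingOperator

# Hypothetical SageMaker training-job configuration.
TRAINING_CONFIG = {
    "TrainingJobName": "text-classifier-{{ ds_nodash }}",
    "RoleArn": "arn:aws:iam::123456789012:role/sagemaker-execution-role",
    # AlgorithmSpecification, InputDataConfig, OutputDataConfig, ResourceConfig
    # and StoppingCondition elided for brevity.
}

with DAG(
    dag_id="train_text_classifier",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@weekly",  # scheduled retrain; can also be triggered manually
    catchup=False,
) as dag:
    train = SageMakerTrainingOperator(
        task_id="train_model",
        config=TRAINING_CONFIG,
        wait_for_completion=True,
    )
```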

The last item on our AWS shopping list was monitoring and alerting. We first tried out Amazon Managed Prometheus and Amazon Managed Grafana, hoping we could somehow get the data we were seeing in CloudWatch into them and save on our CloudWatch costs. It turned out this was possible with exporter tools. We set our sights on YACE (yet-another-cloudwatch-exporter), but that needed to live somewhere. That somewhere would soon be EC2, and later ECS.

We also had some metrics coming in from one of our business units that we wanted to track ourselves. This meant that we needed some kind of interface for them to interact with. This first seemed possible with Managed Prometheus's Remote Write capability, but we wanted more control and were setting up YACE (which was to be scraped every five minutes by Prometheus) anyway. We opted to move YACE and Prometheus to an ECS cluster and set up Remote Write plus a Pushgateway to receive metrics from outside the environment. In the end, we did away with Amazon Managed Prometheus.
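From the business unit's side, pushing a metric to such a Pushgateway can be as simple as the sketch below; the gateway host, job name, and metric are hypothetical examples:

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
# Hypothetical metric: how many recommendations were served in the last batch run.
served = Gauge(
    "recommendations_served_last_batch",
    "Number of recommendations served in the last batch run",
    registry=registry,
)
served.set(1234)

# Push to the Pushgateway running on the ECS cluster; Prometheus scrapes it from there.
push_to_gateway("pushgateway.internal.example.com:9091", job="brand_x_batch", registry=registry)
```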

Unfortunately, YACE didn't support all of the AWS services we used. We lacked exporting for SageMaker and were totally in the dark regarding our model endpoints. Luckily, the Amazon Managed Grafana instance can also pull stats from CloudWatch directly, again at a slight premium.

In Amazon Managed Grafana we made a generic dashboard that we converted into a template by taking the JSON model and parameterizing it. Once that was done, we rolled out dashboards for every model via Terraform. Unfortunately, Amazon Managed Grafana demands an API key for our Terraform and CI/CD to operate, and that key has a maximum lifetime of 30 days. We set up a key-rotation Lambda to destroy and re-create a key every 29 days and store it in an AWS secret that we can read from within our Terraform code. The upshot is that when deploying a model we can now automatically generate an API, monitoring and logging, and a custom dashboard within a few seconds.
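A minimal sketch of what such a rotation Lambda can look like, assuming the Grafana workspace ID and target secret name arrive via environment variables (the variable names and key name are hypothetical):

```python
import os

import boto3

WORKSPACE_ID = os.environ["GRAFANA_WORKSPACE_ID"]      # hypothetical env var
SECRET_ID = os.environ["GRAFANA_API_KEY_SECRET_ID"]    # hypothetical env var
KEY_NAME = "terraform-ci-cd"                           # hypothetical key name

grafana = boto3.client("grafana")
secrets = boto3.client("secretsmanager")


def handler(event, context):
    """Replace the Managed Grafana API key and store the new value in Secrets Manager."""
    # Delete the old key if it exists; a missing key just means this is the first run.
    try:
        grafana.delete_workspace_api_key(keyName=KEY_NAME, workspaceId=WORKSPACE_ID)
    except grafana.exceptions.ResourceNotFoundException:
        pass

    # Keys live at most 30 days, so the Lambda is scheduled to run every 29 days.
    new_key = grafana.create_workspace_api_key(
        keyName=KEY_NAME,
        keyRole="ADMIN",
        secondsToLive=30 * 24 * 3600,
        workspaceId=WORKSPACE_ID,
    )
    secrets.put_secret_value(SecretId=SECRET_ID, SecretString=new_key["key"])
```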

Since Grafana can be set to send alerts when metrics pass a certain threshold, this set-up also allows us to trigger alerts on issues and forward them to Slack or OpsGenie. Since there were still only two of us, we made an on-call schedule with each of us taking a week of on-call at a time. The trick is to rarely define an alert with a high priority.

The most important services that make up the "deployment" platform at a glance. Image by author.

Our resulting "framework" is fairly lightweight, and the keen reader may have noticed it doesn't actually strive to fully automate anything end-to-end. We currently have about 15 models deployed for real-time inference, a year down the road with a team of two. Above is an AWS-centric view of the platform, while the blueprint below, using the template by the AI Infrastructure Alliance, provides an overview of the features of the stack so far.

Our MLOps stack. Image by author.

We wanted to be flexible, and felt that going in with a heavier "do everything for us" framework might be more restrictive and more costly. We try to make no strong assumptions. For instance, not every project has fresh data coming in or has access to feedback from production. We may not always be allowed to store predictions. Not every model is deployed on SageMaker (some can live inside the Lambda perfectly fine!).

Like Lak Lakshmanan's recent post (triggeringly titled "No, you don't need MLOps") and the now-famous MLOps without much Ops blog series ("You don't need a bigger boat"), we have a platform that strives to keep things simple.

That is, if you want to remain flexible and save time, budget, or complexity. Meme by author.

That said, we did have a number of needs when building the platform. A high-level diagram of the ML lifecycle with our needs mapped out shows we now cover most of them.

A visualization of the features we wanted in the platform and the ones we covered. Note that we didn't discuss all features in this blog post.

So far we've talked a lot about what we've built and how it fits into the bigger picture. What's next for us?

The platform still has a blind spot when it comes to testing and validation. We have no formal framework apart from the odd test here and there. It's relatively hard to build these up front and make sure you're not prematurely optimising, especially under pressure. At the same time, it's something we really can't do without from a software development perspective.

Being a media company, we have models that work on text, images, click streams, graphs, vectors, and tabular data. Some models consume multiple data types. Models can be anything from XGBoost and Random Forests to Transformers and various Recurrent and Convolutional Neural Nets. How could we even hope to build or buy something that tests all data types and all models?

Another issue to be addressed is reducing the cold start times of our Lambda functions. Lambda functions are functions-as-a-service that run when invoked. After the first invocation, the Lambda stays alive for about 15 minutes unless another invocation follows, up to a maximum lifespan of about two hours. When the resources and image for the Lambda are allocated the first time around, there is a cold start. Sometimes this is just a few seconds, but toss a TensorFlow import into a Lambda and you shouldn't be surprised to see your APIs time out.

It's a fact of life and inherent to using Lambdas, but it means that we, or the users of an API, must have error handling to deal with cold starts when they go on for too long. While it's generally recommended not to use Lambdas for low volumes of traffic, they are by far the cheapest option we have, and they help to offset the cost premium we pay for SageMaker. Moreover, they're incredibly easy to maintain. Whether the cold start is actually a problem, though, depends entirely on the business context and request volume.
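On the client side, this can be as simple as a generous timeout and a retry on gateway errors. A minimal sketch using the requests library, where the URL, API key, and timeout values are hypothetical:

```python
import requests
from requests.adapters import HTTPAdapter, Retry

API_URL = "https://api.example.com/v1/models/text-classifier/predict"  # hypothetical endpoint
API_KEY = "..."  # issued to the consumer via API Gateway

session = requests.Session()
# Retry once on gateway timeouts, which is typically what a cold start looks like to the caller.
retry = Retry(total=1, status_forcelist=[502, 503, 504], allowed_methods=["POST"])
session.mount("https://", HTTPAdapter(max_retries=retry))

response = session.post(
    API_URL,
    json={"text": "some input document"},
    headers={"x-api-key": API_KEY},
    timeout=30,  # generous enough to survive most cold starts
)
response.raise_for_status()
print(response.json())
```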

I'm hoping to move away from our self-hosted MLflow at some point. It has no role-based access control. This means that every user can see (and delete) every model, and users will need to scroll through potentially hundreds of models in order to find the one they're looking for. There's also a cognitive load to using it; any Data Scientist using it will have to actively pay attention to setting their experiment and calling things like mlflow.log or mlflow.autolog. Since we don't use the option to deploy from MLflow to SageMaker, we can switch to one of the many other tools in this niche. We literally only use MLflow as a way to keep track of past model runs.
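To illustrate that cognitive load, the bookkeeping inside a training script boils down to something like this sketch; the tracking URI, experiment name, and logged values are hypothetical:

```python
import mlflow

# Point the run at the right server and experiment; forgetting either of these
# is exactly the kind of bookkeeping mistake that is easy to make under pressure.
mlflow.set_tracking_uri("https://mlflow.example.com")
mlflow.set_experiment("text-classifier")

mlflow.autolog()  # or explicit mlflow.log_param / mlflow.log_metric calls

with mlflow.start_run(run_name="lstm-baseline"):
    mlflow.log_param("epochs", 10)
    # ... train the model ...
    mlflow.log_metric("val_accuracy", 0.92)  # placeholder value
```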

In summary, in this post I've gone over our process of creating an MLOps stack that suited our needs. We used AWS services and open-source tools to pick a collection of tools that lets us handle the majority of our use cases. It will continue to evolve over time as our team evolves. Our main takeaways are:

  • If your team is relatively small, it's a good idea to go for managed services so that you don't carry the same overhead.
  • SageMaker is great for small teams and excels at deployment (not updates, but that's for another time), but it takes some time to get comfortable with it.
  • Airflow and MLflow are great tools in an ML stack because they enable orchestration and ML bookkeeping, which again lets you focus on the work that matters most.
  • Infrastructure as code pays for itself quickly. No, seriously, it's insane how much work Terraform has saved us from doing.

I hope this fully open discussion helps other teams decide on their stack and evaluate their needs. Hopefully, in the future the platforms and tools we used will become more mature and integrate better.

This post benefitted from help by Gido Schoenmacker, Joost de Wit, Kim Sterenborg, and Amine Ben Slama.

Jeffrey Luppes is a Machine Learning Engineer at DPG Media in Amsterdam, the Netherlands.
