How to Select the Best ML Deployment Strategy: Cloud vs. Edge


The choice between cloud and edge deployment could make or break your project

Photo by Jakob Owens on Unsplash

As a machine learning engineer, I constantly see discussions on social media emphasizing the importance of deploying ML models. I completely agree: model deployment is a critical component of MLOps. As ML adoption grows, there’s a rising demand for scalable and efficient deployment methods, yet the specifics often remain unclear.

So, does that mean model deployment is always the same, regardless of the context? In fact, quite the opposite: I’ve been deploying ML models for about a decade now, and it can be quite different from one project to another. There are many ways to deploy an ML model, and having experience with one method doesn’t necessarily make you proficient with others.

The remaining question is: what are the ways to deploy an ML model, and how do we choose the right one?

Models can be deployed in various ways, but they typically fall into two main categories:

  • Cloud deployment
  • Edge deployment

It may sound simple, but there’s a catch: for both categories, there are actually many subcategories. Here’s a non-exhaustive diagram of the deployment types we’ll explore in this article:

Diagram of the deployment subcategories explored in this article. Image by author.

Before talking about how to select the right method, let’s explore each category: what it is, the pros, the cons, and the typical tech stack. I’ll also share some personal examples of deployments I did in each context. Let’s dig in!

Cloud Deployment

From what I can see, cloud deployment is by far the most popular choice for ML deployment, and it is what you are usually expected to master for model deployment. But cloud deployment often means one of these, depending on the context:

  • API deployment
  • Serverless deployment
  • Batch processing

Each of these subcategories could be broken down further, but we won’t go that far in this post. Let’s have a look at what they mean, their pros and cons, and a typical associated tech stack.

API Deployment

API stands for Application Programming Interface. This is a very popular way to deploy a model on the cloud. Some of the most popular ML models are deployed as APIs: Google Maps and OpenAI’s ChatGPT, for example, can be queried through their APIs.

If you’re not familiar with APIs, know that they are usually called with a simple query. For instance, type the following command in your terminal to get the first 20 Pokémon names:

curl -X GET https://pokeapi.co/api/v2/pokemon
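
If you prefer Python, the same request can be made with the requests package (assuming it is installed):

import requests

# Query the public PokéAPI for the first 20 Pokémon (same call as the curl command above)
response = requests.get("https://pokeapi.co/api/v2/pokemon", timeout=10)
response.raise_for_status()
names = [entry["name"] for entry in response.json()["results"]]
print(names)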

Under the hood, what happens when calling an API can be a bit more complex. API deployments often involve a typical tech stack including load balancers, autoscalers, and interactions with a database:

A typical example of an API deployment within a cloud infrastructure. Image by author.

Note: APIs may have different needs and infrastructure; this example is simplified for clarity.

API deployments are popular for several reasons:

  • Easy to implement and to integrate into various tech stacks
  • Easy to scale: horizontal scaling in the cloud allows you to scale efficiently, and managed services from cloud providers can reduce the need for manual intervention
  • Centralized management of model versions and logging, enabling efficient tracking and reproducibility

While APIs are a very popular option, there are some cons too:

  • There can be latency challenges due to network overhead or geographical distance, and of course it requires an internet connection
  • The cost can climb quickly with high traffic (assuming automatic scaling)
  • Maintenance overhead can get expensive, whether through managed service costs or an infra team

To sum up, API deployment is widely used in startups and tech companies thanks to its flexibility and relatively short time to market. However, the cost can climb quickly with high traffic, and the maintenance cost can also be significant.

Concerning the tech stack: there are many ways to develop APIs, but the most common ones in machine learning are probably FastAPI and Flask. They can then be deployed quite easily on the main cloud providers (AWS, GCP, Azure…), preferably through Docker images. The orchestration can be done through managed services or with Kubernetes, depending on the team’s choice, size, and skills.
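
To give a rough idea, here is a minimal FastAPI sketch wrapping a model behind a prediction endpoint; the model file, its format (a scikit-learn model loaded with joblib), and the feature layout are all hypothetical and would depend on your project:

import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # hypothetical pre-trained scikit-learn model

class PredictionRequest(BaseModel):
    features: list[float]  # flat feature vector expected by the model

@app.post("/predict")
def predict(request: PredictionRequest):
    # scikit-learn expects a 2D array: one row per sample
    prediction = model.predict([request.features])
    return {"prediction": prediction.tolist()}

You could then serve it locally with uvicorn, package it as a Docker image, and deploy it behind the load balancer and autoscaler shown in the diagram earlier.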

As an example of API cloud deployment, I once deployed an ML solution to automate the pricing of an electric vehicle charging station for a customer-facing web app. You can have a look at this project here if you want to know more about it:

Even though that post doesn’t get into the code, it can give you an idea of what can be done with API deployment.

API deployment is very popular because it is simple to integrate into any project. But some projects may need even more flexibility and less maintenance cost: that is where serverless deployment may be the answer.

Serverless Deployment

Another popular, but probably less frequently used option is serverless deployment. Serverless computing means that you run your model (or any code, really) without owning or provisioning any server.

Serverless deployment offers several significant benefits and is quite easy to set up:

  • No need to manage or maintain servers
  • No need to handle scaling in case of higher traffic
  • You only pay for what you use: no traffic means virtually no cost, so no overhead at all

However, it has some limitations as well:

  • It is usually not cost-effective for large numbers of queries compared to managed APIs
  • Cold start latency is a potential issue, as a server may need to be spawned, resulting in delays
  • The memory footprint is usually limited by design: you can’t always run large models
  • The execution time is limited too: it’s impossible to run jobs for more than a few minutes (15 minutes for AWS Lambda, for instance)

In a nutshell, I would say that serverless deployment is a good option when you’re launching something new, don’t expect large traffic, and don’t want to spend much on infra management.

Serverless computing is offered by all major cloud providers under different names: AWS Lambda, Azure Functions, and Google Cloud Functions being the most popular.
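
As a minimal sketch of what this might look like on AWS Lambda (assuming a small scikit-learn model bundled with the function, and a JSON body such as {"features": [...]} coming from API Gateway; all names here are hypothetical):

import json
import joblib

# Load the model once, outside the handler, so warm invocations reuse it
model = joblib.load("model.joblib")  # hypothetical bundled scikit-learn model

def lambda_handler(event, context):
    # Expect a JSON body such as {"features": [1.0, 2.0, 3.0]}
    body = json.loads(event.get("body", "{}"))
    prediction = model.predict([body["features"]])
    return {
        "statusCode": 200,
        "body": json.dumps({"prediction": prediction.tolist()}),
    }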

I personally have never deployed a serverless solution (working mostly with deep learning, I often found myself limited by the serverless constraints mentioned above), but there is a lot of documentation about how to do it properly, such as this one from AWS.

While serverless deployment offers a versatile, on-demand solution, some applications may require a more scheduled approach, like batch processing.

Batch Processing

Another way to deploy on the cloud is through scheduled batch processing. While serverless and APIs are mostly used for live predictions, in some cases batch predictions make more sense.

Whether it’s database updates, dashboard updates, or caching predictions… as soon as there is no need for real-time predictions, batch processing is usually the best option:

  • Processing large batches of data is more resource-efficient and reduces overhead compared to live processing
  • Processing can be scheduled during off-peak hours, reducing the overall load and thus the cost

Of course, it comes with associated drawbacks:

  • Batch processing creates a spike in resource usage, which can lead to system overload if not properly planned
  • Handling errors is critical in batch processing, as you need to handle a full batch gracefully at once

Batch processing should be considered for any task that doesn’t require real-time results: it is usually cheaper. But of course, for any real-time application, it is not a viable option.

It’s widely used in many companies, mostly within ETL (Extract, Transform, Load) pipelines that may or may not contain ML. Some of the most popular tools are:

  • Apache Airflow for workflow orchestration and task scheduling (see the sketch below)
  • Apache Spark for fast, large-scale data processing
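
As a rough sketch (not taken from any real project), a monthly batch prediction job scheduled with Airflow could look like this; the DAG id, task function, and schedule are hypothetical:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_batch_predictions():
    # Hypothetical steps: load the latest data, run the model, write predictions back
    print("Loading data, running the model, storing predictions...")

with DAG(
    dag_id="monthly_batch_predictions",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@monthly",  # run once a month, e.g. during off-peak hours
    catchup=False,
) as dag:
    PythonOperator(
        task_id="run_batch_predictions",
        python_callable=run_batch_predictions,
    )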

As an example of batch processing, I used to work on YouTube video revenue forecasting. Based on the first data points of a video’s revenue, we would forecast the revenue over up to 5 years, using multi-target regression and curve fitting:

Plot representing the initial data, multi-target regression predictions, and curve fitting. Image by author.

For this project, we needed to re-forecast all our data on a monthly basis to make sure there was no drift between our initial forecasts and the most recent ones. For that, we used a managed Airflow, so that every month it would automatically trigger a new forecast based on the most recent data and store the results in our databases. If you want to know more about this project, you can have a look at this article:

After exploring the various strategies and tools available for cloud deployment, it’s clear that this approach offers significant flexibility and scalability. Nevertheless, cloud deployment is not always the best fit for every ML application, particularly when real-time processing, privacy concerns, or financial and resource constraints come into play.

A list of pros and cons for cloud deployment. Image by author.

This is where edge deployment comes into focus as a viable option. Let’s now delve into edge deployment to understand when it might be the best option.

Edge Deployment

From my own experience, edge deployment is rarely considered as the main way of deployment. A few years ago, even I thought it was not really an interesting option. With more perspective and experience now, I think it should be considered as the first option for deployment whenever you can.

Just like cloud deployment, edge deployment covers a wide range of cases:

  • Native phone applications
  • Web applications
  • Edge servers and specific devices

While they all share some similar properties, such as limited resources and horizontal scaling limitations, each deployment choice has its own characteristics. Let’s take a look.

Native Application

We see more and more smartphone apps with integrated AI nowadays, and this will probably keep growing in the future. While some Big Tech companies such as OpenAI or Google have chosen the API deployment approach for their LLMs, Apple is currently working on the iOS app deployment model with solutions such as OpenELM, a tiny LLM. Indeed, this option has several benefits:

  • The infra cost is virtually zero: no cloud to maintain, everything runs on the device
  • Better privacy: you don’t have to send any data to an API, it can all run locally
  • Your model is directly integrated into your app, no need to maintain several codebases

Moreover, Apple has built a fantastic ecosystem for model deployment in iOS: you can run ML models very efficiently with Core ML on their Apple chips (M1, M2, etc.) and take advantage of the Neural Engine for really fast inference. To my knowledge, Android is slightly lagging behind, but also has a great ecosystem.

While this can be a really useful approach in many cases, there are still some limitations:

  • Phone resources limit model size and performance, and are shared with other apps
  • Heavy models may drain the battery pretty fast, which can hurt the overall user experience
  • Device fragmentation, as well as maintaining both iOS and Android apps, makes it hard to cover the whole market
  • Decentralized model updates can be challenging compared to the cloud

Despite its drawbacks, native app deployment is often a strong choice for ML solutions that run inside an app. It may seem more complex during the development phase, but it becomes much cheaper once deployed compared to a cloud deployment.

When it comes to the tech stack, there are actually two main ways to deploy: iOS and Android. They each have their own stacks, but they share the same properties:

  • App development: Swift for iOS, Kotlin for Android
  • Model format: Core ML for iOS, TensorFlow Lite for Android
  • Hardware accelerator: Apple Neural Engine for iOS, Neural Network API for Android

Note: This is a simplification of the tech stack. This non-exhaustive overview only aims to cover the essentials and let you dig in from there if interested.

As a personal example of such a deployment, I once worked on a book reading app for Android, in which they wanted to let the user navigate through the book with phone movements. For example, shake left to go to the previous page, shake right for the next page, and a few more movements for specific commands. For that, I trained a somewhat small movement recognition model on accelerometer features from the phone. It was then deployed directly in the app as a TensorFlow Lite model.
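
As a rough sketch (not the original project’s code), converting a trained Keras model to TensorFlow Lite for on-device use could look like this; the model and file names are hypothetical:

import tensorflow as tf

# Hypothetical trained Keras model for movement recognition
model = tf.keras.models.load_model("movement_classifier.keras")

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # optional quantization to shrink the model
tflite_model = converter.convert()

with open("movement_classifier.tflite", "wb") as f:
    f.write(tflite_model)

The resulting .tflite file can then be bundled in the Android app and executed with the TensorFlow Lite runtime, accelerated by the Neural Network API where available.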

Native applications have strong benefits but are limited to one type of device, and wouldn’t work on laptops, for example. A web application could overcome those limitations.

Web Application

Web application deployment means running the model on the client side. Basically, it means running the model inference on the device used by the browser, whether it’s a tablet, a smartphone, or a laptop (and the list goes on…). This kind of deployment can be really convenient:

  • Your deployment works on any device that can run a web browser
  • The inference cost is virtually zero: no server, no infra to maintain… just the client’s device
  • Only one codebase for all possible devices: no need to maintain an iOS app and an Android app at the same time

Note: Running the model on the server side would be equivalent to one of the cloud deployment options above.

While web deployment offers appealing advantages, it also has significant limitations:

  • Proper resource utilization, especially GPU inference, can be challenging with TensorFlow.js
  • Your web app must work with all devices and browsers: with or without a GPU, Safari or Chrome, an Apple M1 chip or not, etc. This can be a heavy burden with a high maintenance cost
  • You may need a backup plan for slower and older devices: what if the device can’t handle your model because it’s too slow?

Unlike a native app, there is no official size limit for a model. Nevertheless, a small model will be downloaded faster, making the overall experience smoother, so it should be a priority. And a really large model may simply not work at all anyway.

In summary, while web deployment is powerful, it comes with significant limitations and must be used cautiously. One more advantage is that it can be a door to another kind of deployment that I didn’t mention: WeChat Mini Programs.

The tech stack is usually the same as for web development: HTML, CSS, JavaScript (and any frameworks you want), and of course TensorFlow.js for model deployment. If you’re curious about an example of how to deploy ML in the browser, you can have a look at this post where I run a real-time face recognition model in the browser from scratch:

That article goes all the way from model training in PyTorch to a working web app, and may be informative about this specific kind of deployment.
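
As a rough sketch (not the code from that post), one common way to prepare a PyTorch model for in-browser inference is to export it to ONNX, which can then be loaded client-side with a runtime such as onnxruntime-web; the model architecture and shapes below are purely hypothetical:

import torch
import torch.nn as nn

# Hypothetical small network standing in for a trained model
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128), nn.ReLU(), nn.Linear(128, 10))
model.eval()

dummy_input = torch.randn(1, 3, 64, 64)  # example input matching the expected shape
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}},  # allow a variable batch size
)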

In some cases, native and web apps are not a viable option: there may be no such device, no connectivity, or other constraints. This is where edge servers and specific devices come into play.

Edge Servers and Specific Devices

Besides native and web apps, edge deployment also includes other cases:

  • Deployment on edge servers: in some cases, there are local servers running models, such as on factory production lines, in CCTV systems, etc. Mostly because of privacy requirements, this solution is sometimes the only one available
  • Deployment on specific devices: a sensor, a microcontroller, a smartwatch, earbuds, an autonomous vehicle, etc. may run ML models internally

Deployment on edge servers can be really close to a cloud deployment with an API, and the tech stack may be quite similar.

Note: It is also possible to run batch processing on an edge server, as well as just having a monolithic script that does it all.

But deployment on specific devices may involve using FPGAs or low-level languages. That is another, very different skillset, which may differ for each type of device. It is sometimes referred to as TinyML and is a very interesting, growing topic.
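
As a rough sketch (not from any real project), running a converted TensorFlow Lite model on an edge server or a small Linux-based device with the Python interpreter could look like this; microcontrollers would instead use something like TensorFlow Lite for Microcontrollers in C++. The model file and input here are hypothetical:

import numpy as np
import tensorflow as tf

# Load a hypothetical compiled .tflite model and allocate its tensors
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Build a dummy input matching the model's expected shape and dtype
input_data = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], input_data)

interpreter.invoke()
prediction = interpreter.get_tensor(output_details[0]["index"])
print(prediction)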

In both cases, they share some challenges with other edge deployment methods:

  • Resources are limited, and horizontal scaling is generally not an option
  • The battery may be a limitation, as well as the model size and memory footprint

Even with these limitations and challenges, in some cases it’s the only viable solution, or the most cost-effective one.

An example of an edge server deployment I did was for a company that wanted to automatically check whether orders were valid in fast food restaurants. A camera with a top-down view would look at the tray, compare what it sees on it (with computer vision and object detection) with the actual order, and raise an alert in case of a mismatch. For some reason, the company wanted to run that on edge servers located inside the fast food restaurants.

To recap, here is a big-picture view of the main kinds of deployment and their pros and cons:

A list of pros and cons for the main types of deployment. Image by author.

With that in mind, how do you actually select the right deployment method? There’s no single answer to that question, but let’s try to lay out some rules in the next section to make it easier.

Before jumping to the conclusion, let’s build a decision tree to help you select the solution that fits your needs.

Selecting the right deployment requires understanding specific needs and constraints, often through discussions with stakeholders. Keep in mind that each case is specific and may be an edge case. But in the diagram below I tried to outline the most common cases to help you out:

Deployment decision diagram. Note that each use case is specific. Image by author.

This diagram, while quite simplistic, can be reduced to a few questions that should point you in the right direction:

  • Do you need real-time? If not, look at batch processing first; if yes, think about edge deployment
  • Is your solution running on a phone or in the browser? Explore those deployment methods whenever possible
  • Is the processing quite complex and heavy? If yes, consider cloud deployment

Again, that’s quite simplistic but helpful in many cases. Also, note that a few questions were omitted for clarity but are actually more than important in some contexts: Do you have privacy constraints? Do you have connectivity constraints? What is the skillset of your team?

Other questions may arise depending on the use case; with experience and knowledge of your ecosystem, they will come more and more naturally. But hopefully this helps you navigate ML model deployment more easily.

While cloud deployment is often the default for ML models, edge deployment can offer significant benefits: cost-effectiveness and better privacy control. Despite challenges such as processing power, memory, and energy constraints, I believe edge deployment is a compelling option in many cases. Ultimately, the best deployment strategy aligns with your business goals, resource constraints, and specific needs.

If you’ve made it this far, I’d love to hear your thoughts on the deployment approaches you used for your projects.
