AWS vs. Azure: A Deep Dive into Model Training – Part 2


In Part 1 of this series, we discussed how Azure and AWS take fundamentally different approaches to machine learning project management and data storage.

Azure ML uses a workspace-centric structure with user-level role-based access control (RBAC), where permissions are granted to individuals based on their responsibilities. In contrast, AWS SageMaker adopts a job-centric architecture that decouples user permissions from job execution, granting access on the job level through IAM roles. For data storage, Azure ML relies on datastores and data assets inside workspaces to administer connections and credentials behind the scenes, while AWS SageMaker integrates directly with S3 buckets, requiring explicit permission grants for SageMaker execution roles to access data.


Having established how these platforms handle project setup and data access, in Part 2, we’ll examine the compute resources and runtime environments that power the model training jobs.

Compute

Compute is the virtual machine where your model and code run. Together with network and storage, it is one of the fundamental building blocks of cloud computing. Compute resources typically represent the largest cost component of an ML project, as training models—especially large AI models—requires long training times and often specialized compute instances (e.g., GPU instances) with higher costs. For this reason, Azure ML provides a dedicated AzureML Compute Operator role (see details in Part 1) for managing compute resources.

Azure and AWS offer various instance types that differ in the number of CPUs/GPUs, memory, and disk size and type, each designed for specific purposes. Both platforms use a pay-as-you-go pricing model, charging only for active compute time.

Azure virtual machine series are named alphabetically; for example, D family VMs are designed for general-purpose workloads and meet the requirements of most development and production environments. AWS compute instances are also grouped into families based on their purpose; for example, the m5 family contains general-purpose instances for SageMaker ML development. The table below compares compute instances offered by Azure and AWS based on their purpose, hourly pricing, and typical use cases.

Now that we’ve compared compute pricing in AWS and Azure, let’s explore how the two platforms differ in integrating compute resources into ML systems.

Azure ML

Azure Compute for ML

Compute targets are persistent resources in the Azure ML Workspace, typically created once by the AzureML Compute Operator and reused by the data science team. Since compute resources are cost-intensive, this structure allows them to be centrally managed by a role with cloud infrastructure expertise, while data scientists and engineers can focus on development work.

Azure offers a spectrum of compute target options designed for ML development and deployment, depending on the size of the workload. A compute instance is a single-node machine suitable for interactive development and testing in the Jupyter notebook environment. A compute cluster is another type of compute target that spins up multi-node cluster machines. It can be scaled for parallel processing based on workload demand and supports auto-scaling through the parameters min_instances and max_instances. Additionally, there are serverless compute, Kubernetes clusters, and containers that fit different purposes. Here’s a useful visual summary that helps you make the choice based on your use case.

image from “[Explore and configure the Azure Machine Learning workspace DP-100](https://www.youtube.com/watch?v=_f5dlIvI5LQ)”

To create an Azure ML managed compute target, we create an AmlCompute object using the code below:

  • type: use "amlcompute" for a compute cluster. Alternatively, use "computeinstance" for single-node interactive development and "kubernetes" for AKS clusters.
  • name: specify the compute target name.
  • size: specify the instance size.
  • min_instances and max_instances (optional): set the range of instances allowed to run concurrently.
  • idle_time_before_scale_down (optional): automatically scale down the compute cluster after the specified idle time (in seconds) to avoid incurring unnecessary costs.
from azure.ai.ml.entities import AmlCompute

# Create a compute cluster
cpu_cluster = AmlCompute(
    name="cpu-cluster",
    type="amlcompute",
    size="Standard_DS3_v2",
    min_instances=0,
    max_instances=4,
    idle_time_before_scale_down=120
)

# Create or update the compute target in the workspace
ml_client.compute.begin_create_or_update(cpu_cluster)

Once the compute resource is created, anyone in the shared Workspace can use it by simply referencing its name in an ML job, making it easily accessible for team collaboration.

from azure.ai.ml import command

# Use the persisted compute "cpu-cluster" in the job
job = command(
    code='./src',
    command='python code.py',
    compute='cpu-cluster',
    display_name='train-custom-env',
    experiment_name='training'
)

AWS SageMaker AI

AWS Compute Instance

Compute resources are managed by a standalone AWS service, EC2 (Elastic Compute Cloud). When using these compute resources in SageMaker, developers must explicitly configure the instance type for every job; compute instances are then created on demand and terminated when the job finishes. This approach gives developers more flexibility over compute selection per task, but requires more infrastructure knowledge to select and manage the appropriate compute resources. For instance, available instance types differ by job type: ml.t3.medium and ml.t3.large are commonly used to power SageMaker notebooks in interactive development environments, but they are not available for training jobs, which require more powerful instance types from the m5, c5, p3, or g4dn families.

As shown in the code snippet below, AWS SageMaker specifies the compute instance type and the number of instances running concurrently as job parameters. A compute instance of the ml.m5.xlarge type is created during job execution and charged based on the job runtime.

from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri=image_uri,            # training image URI
    role=role,                      # SageMaker execution role
    instance_type="ml.m5.xlarge",
    instance_count=1
)

SageMaker jobs spin up on-demand instances by default. They are charged by the second and provide guaranteed capacity for running time-sensitive jobs. For jobs that can tolerate interruptions and higher latency, spot instances are a more cost-effective option that uses spare compute capacity. The downside is the additional waiting period when no spot instances are available. We use the code snippet below to enable spot instances for a training job.

  • use_spot_instances: set to True to use spot instances; otherwise the job defaults to on-demand instances.
  • max_wait: the maximum amount of time you are willing to wait for available spot instances (waiting time is not charged).
  • max_run: the maximum amount of training time allowed for the job.
  • checkpoint_s3_uri: the S3 bucket URI path for saving model checkpoints, so that training can safely resume after an interruption.
estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    use_spot_instances=True,      # request managed spot capacity
    max_run=3600,                 # maximum training time in seconds
    max_wait=7200,                # maximum total time, including waiting for spot capacity
    checkpoint_s3_uri=""          # S3 URI for checkpoints (placeholder)
)

What does this mean in practice?

  • Azure ML: Azure’s persistent compute approach allows centralized management and sharing across multiple developers, letting data scientists focus on model development rather than infrastructure management.
  • AWS SageMaker AI: SageMaker requires developers to explicitly define the compute instance type for every job, providing more flexibility but also demanding deeper infrastructure knowledge of instance types, costs, and availability constraints.


Environment

Environment defines where the code or job runs, including the software, operating system, program packages, Docker image, and environment variables. While compute is responsible for the underlying infrastructure and hardware choices, the environment setup is crucial for ensuring consistent and reproducible behavior across development and production, mitigating package conflicts and dependency issues when the same code is executed in different runtime setups by different developers. Azure ML and SageMaker both support using their curated environments as well as setting up custom environments.

Azure ML

Similar to Data and Compute, Environment is considered a type of resource and asset in the Azure ML Workspace. Azure ML offers a comprehensive list of curated environments for popular Python frameworks (e.g. PyTorch, TensorFlow, scikit-learn) designed for CPU or GPU/CUDA targets.

The code snippet below retrieves the list of all curated environments in Azure ML. They typically follow a naming convention that includes the framework name, version, operating system, Python version, and compute target (CPU/GPU), e.g. AzureML-sklearn-1.0-ubuntu20.04-py38-cpu indicates scikit-learn version 1.0 running on Ubuntu 20.04 with Python 3.8 for CPU compute.

envs = ml_client.environments.list()
for env in envs:
    print(env.name)
    
    
# >>> Azure ML Curated Environments
"""
AzureML-AI-Studio-Development
AzureML-ACPT-pytorch-1.13-py38-cuda11.7-gpu
AzureML-ACPT-pytorch-1.12-py38-cuda11.6-gpu
AzureML-ACPT-pytorch-1.12-py39-cuda11.6-gpu
AzureML-ACPT-pytorch-1.11-py38-cuda11.5-gpu
AzureML-ACPT-pytorch-1.11-py38-cuda11.3-gpu
AzureML-responsibleai-0.21-ubuntu20.04-py38-cpu
AzureML-responsibleai-0.20-ubuntu20.04-py38-cpu
AzureML-tensorflow-2.5-ubuntu20.04-py38-cuda11-gpu
AzureML-tensorflow-2.6-ubuntu20.04-py38-cuda11-gpu
AzureML-tensorflow-2.7-ubuntu20.04-py38-cuda11-gpu
AzureML-sklearn-1.0-ubuntu20.04-py38-cpu
AzureML-pytorch-1.10-ubuntu18.04-py38-cuda11-gpu
AzureML-pytorch-1.9-ubuntu18.04-py37-cuda11-gpu
AzureML-pytorch-1.8-ubuntu18.04-py37-cuda11-gpu
AzureML-sklearn-0.24-ubuntu18.04-py37-cpu
AzureML-lightgbm-3.2-ubuntu18.04-py37-cpu
AzureML-pytorch-1.7-ubuntu18.04-py37-cuda11-gpu
AzureML-tensorflow-2.4-ubuntu18.04-py37-cuda11-gpu
AzureML-Triton
AzureML-Designer-Score
AzureML-VowpalWabbit-8.8.0
AzureML-PyTorch-1.3-CPU
"""

To run a training job in a curated environment, we create an environment object by referencing its name and version, then pass it as a job parameter.

# Get a curated environment
environment = ml_client.environments.get("AzureML-sklearn-1.0-ubuntu20.04-py38-cpu", version=44)

# Use the curated environment in Job
job = command(
    code=".",
    command="python train.py",
    environment=environment,
    compute="cpu-cluster"
)

ml_client.jobs.create_or_update(job)

Alternatively, we can create a custom environment from a Docker image registered in Docker Hub and use it in a job the same way, as sketched below.

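The original snippet for this step is not reproduced here; the following is a minimal sketch, assuming the ml_client and cpu-cluster compute target from earlier (the environment name and Docker image are illustrative).

from azure.ai.ml import command
from azure.ai.ml.entities import Environment

# Register a custom environment based on a Docker image (image name is illustrative)
custom_env = Environment(
    name="custom-docker-env",
    image="docker.io/my-org/my-training-image:latest",
    description="Custom environment from a Docker Hub image",
)
ml_client.environments.create_or_update(custom_env)

# Use the custom environment in a job
job = command(
    code=".",
    command="python train.py",
    environment=custom_env,
    compute="cpu-cluster"
)

ml_client.jobs.create_or_update(job)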

AWS SageMaker AI

SageMaker’s environment configuration is tightly coupled with job definitions, offering three levels of customization to define the OS, frameworks, and packages required for job execution. These are Built-in Algorithms, Bring Your Own Script (script mode), and Bring Your Own Container (BYOC), ranging from the simplest but most rigid option to the most complex but most customizable one.

Built-in Algorithms

AWS SageMaker Built-in Algorithm

This is the option requiring the least effort from developers to train and deploy machine learning models at scale in AWS SageMaker. Azure does not offer an equivalent built-in algorithm approach through its Python SDK as of February 2026.

SageMaker encapsulates the machine learning algorithm, as well as its Python library and framework dependencies, inside an estimator object. For instance, here we instantiate a KMeans estimator by specifying the algorithm-specific hyperparameter k and passing the training data to fit the model. The training job then spins up an ml.m5.large compute instance, and the trained model is saved to the output location.
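
The original code for this example is not shown in this article; below is a minimal sketch of what it could look like with the SageMaker Python SDK, assuming role is a valid SageMaker execution role, train_data is a NumPy array of features, and the S3 output path is a placeholder.

import numpy as np
from sagemaker import KMeans

# Built-in KMeans estimator: SageMaker manages the algorithm container and its
# dependencies; we only configure the hyperparameter k and the compute resources.
kmeans = KMeans(
    role=role,                       # SageMaker execution role
    instance_count=1,
    instance_type="ml.m5.large",
    k=10,                            # algorithm-specific hyperparameter
    output_path="s3://my-bucket/kmeans-output/",  # placeholder S3 output location
)

# record_set converts the NumPy array into the RecordIO protobuf format expected
# by SageMaker built-in algorithms; fit then launches the training job.
kmeans.fit(kmeans.record_set(train_data.astype(np.float32)))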

Bring Your Own Script

The bring-your-own-script approach (also known as script mode or bring your own model) allows developers to leverage SageMaker’s prebuilt containers for popular Python machine learning frameworks such as scikit-learn, PyTorch, and TensorFlow. It provides the flexibility to customize the training job through your own script without the need to manage the job execution environment, making it the most popular choice when using specialized algorithms not included in SageMaker’s built-in options.

In the example below, we instantiate an estimator using the scikit-learn framework by providing a custom training script train.py and the model’s hyperparameters, together with the framework version and Python version.

from sagemaker.sklearn import SKLearn

sk_estimator = SKLearn(
    entry_point="train.py",
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    py_version="py3",
    framework_version="1.2-1",
    script_mode=True,
    hyperparameters={"estimators": 20},
)

# Train the estimator
sk_estimator.fit({"train": training_data})
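
Under script mode, SageMaker runs train.py inside its prebuilt scikit-learn container, passing the hyperparameters as command-line arguments and exposing the input data channels and model output directory through environment variables such as SM_CHANNEL_TRAIN and SM_MODEL_DIR. The training script itself is not shown in this article; a minimal sketch of what it could look like follows (the file name, column names, and model choice are illustrative).

# train.py -- illustrative sketch of a script-mode entry point
import argparse
import os

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # Hyperparameters passed to the Estimator arrive as command-line arguments
    parser.add_argument("--estimators", type=int, default=10)
    # SageMaker injects the data channel and model output paths as environment variables
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    args = parser.parse_args()

    # Load training data from the "train" channel directory (file name is illustrative)
    df = pd.read_csv(os.path.join(args.train, "train.csv"))
    X, y = df.drop(columns=["label"]), df["label"]

    # Fit the model with the hyperparameter received from the job definition
    model = RandomForestClassifier(n_estimators=args.estimators)
    model.fit(X, y)

    # Save the model to SM_MODEL_DIR so SageMaker uploads it to S3 when the job ends
    joblib.dump(model, os.path.join(args.model_dir, "model.joblib"))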

Bring Your Own Container

This is the approach with the highest level of customization, allowing developers to bring a custom environment as a Docker image. It suits scenarios that depend on unsupported Python frameworks, specialized packages, or other programming languages (e.g. R, Java). The workflow involves building a Docker image that contains all required package dependencies and model training scripts, then pushing it to Elastic Container Registry (ECR), AWS’s container registry service analogous to Docker Hub.

In the code below, we specify the custom Docker image URI as a parameter to create the estimator, then fit the estimator with training data.

from sagemaker.estimator import Estimator

image_uri = ":"   # placeholder for the ECR image URI

byoc_estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    output_path="",            # placeholder for the S3 output location
    sagemaker_session=sess,
)

byoc_estimator.fit(training_data)

What does it mean in practice?

  • Azure ML: Provides support for running training jobs using its extensive collection of curated environments covering popular frameworks such as PyTorch, TensorFlow, and scikit-learn, as well as the ability to build and configure custom environments from Docker images for more specialized use cases. However, it is important to note that Azure ML does not currently offer a built-in algorithm approach that packages popular machine learning algorithms directly into the environment in the same way SageMaker does.
  • AWS SageMaker AI: SageMaker is known for its three levels of customization (built-in algorithms, script mode, and bring your own container), which cover a spectrum of developer requirements. The first two use AWS’s managed environments and integrate tightly with ML algorithms or frameworks; they offer simplicity but are less suitable for highly specialized model training processes.

In Summary

Based on the comparisons of Compute and Environment above, together with what we discussed in AWS vs. Azure: A Deep Dive into Model Training — Part 1 (Project Setup and Data Storage), we can see that the two platforms adopt different design principles to structure their machine learning ecosystems.

Azure ML follows a more modular architecture where Data, Compute, and Environment are treated as independent resources and assets inside the Azure ML Workspace. Since they can be configured and managed individually, this approach is more beginner-friendly, especially for users without extensive cloud computing or permission management knowledge. For instance, a data scientist can create a training job by attaching an existing compute target in the Workspace, without needing infrastructure expertise to manage compute instances.

AWS SageMaker has a steeper learning curve, as multiple services are tightly coupled and orchestrated together as a holistic system for ML job execution. However, this job-centric approach offers clear separation between model training and model deployment environments, as well as the ability to run distributed training at scale. By giving developers more infrastructure control, SageMaker is well suited to large-scale data science and AI teams with high MLOps maturity and a need for CI/CD pipelines.

Take-Home Message

In this series, we compare the two most popular cloud platforms, Azure and AWS, for scalable model training, breaking down the comparison into the following dimensions:

  • Project and Permission Management
  • Data storage
  • Compute
  • Environment

In Part 1, we discussed high-level project setup and permission management, then talked about storing and accessing the data required for model training.

In Part 2, we examined how Azure ML’s persistent, workspace-centric compute resources differ from AWS SageMaker’s on-demand, job-specific approach. In addition, we explored environment customization options, from Azure’s curated and custom environments to SageMaker’s three levels of customization: built-in algorithms, script mode, and bring your own container. This comparison reveals Azure ML’s modular, beginner-friendly architecture vs. SageMaker’s integrated, job-centric design, which offers greater scalability and infrastructure control for teams with MLOps requirements.
