Microsoft Azure and Amazon Web Services (AWS) are the world's two largest cloud computing platforms, providing database, network, and compute resources at global scale. Together, they hold about 50% of the worldwide enterprise cloud infrastructure services market, with AWS at 30% and Azure at 20%. Azure ML and AWS SageMaker are their machine learning services, enabling data scientists and ML engineers to develop and manage the whole ML lifecycle, from data preprocessing and feature engineering to model training, deployment, and monitoring. You can create and manage these ML services in AWS and Azure through console interfaces, the cloud CLIs, or software development kits (SDKs) in your chosen programming language, which is the approach discussed in this article.
Azure ML & AWS SageMaker Training Jobs
While they provide similar high-level functionality, Azure ML and AWS SageMaker have fundamental differences that determine which platform best fits you, your team, or your organization. First, consider the ecosystem of your existing data storage, compute resources, and monitoring services. For example, if your organization's data primarily sits in an AWS S3 bucket, then SageMaker may be a more natural choice for developing your ML services, because it reduces the overhead of connecting to and transferring data across different cloud providers. However, this doesn't mean that other aspects are not worth considering, and we'll dive into the details of how Azure ML differs from AWS SageMaker in a typical ML scenario: training and building models at scale using jobs.
Although Jupyter notebooks are helpful for experimentation and exploration in an interactive development workflow on a single device, they are not designed for productionization or distribution. Training jobs (and other ML jobs) become essential in the ML workflow at this stage, deploying the task to multiple cloud instances in order to run for longer and process more data. This requires setting up the data, code, compute instances, and runtime environments to ensure consistent outputs when the workload is no longer executed on one local machine. Think of it like the difference between developing a dinner recipe (Jupyter notebook) and hiring a catering team to cook it for 500 customers (ML job). Everyone in the catering team needs access to the same ingredients, recipe, and tools, and must follow the same cooking procedure.
Now that we understand the importance of training jobs, let's have a look at how they're defined in Azure ML vs. SageMaker in a nutshell.
Define Azure ML training job
from azure.ai.ml import command
job = command(
    code=...,
    command=...,
    environment=...,
    compute=...,
)
ml_client.jobs.create_or_update(job)
Create SageMaker training job estimator
from sagemaker.estimator import Estimator
estimator = Estimator(
    image_uri=...,
    role=...,
    instance_type=...,
)
estimator.fit(training_data_s3_location)
We'll break down the comparison into the following dimensions:
- Project and Permission Management
- Data storage
- Compute
- Environment
In Part 1, we'll start by comparing the high-level project setup and permission management, then talk about storing and accessing the data required for model training. Part 2 will discuss various compute options on each cloud platform, and how to create and manage runtime environments for training jobs.
Project and Permission Management
Let's start by understanding a typical ML workflow in a medium-to-large team of data scientists, data engineers, and ML engineers. Each member may specialize in a specific role and responsibility, and be assigned to one or more projects. For instance, a data engineer is tasked with extracting data from the source and storing it in a centralized location for data scientists to process. They don't need to spin up compute instances for running training jobs. In this case, they would have read and write access to the data storage location but don't necessarily need permission to create GPU instances for heavy workloads. Depending on data sensitivity and their role in an ML project, team members need different levels of access to the data and the underlying cloud infrastructure. We're going to explore how the two cloud platforms structure their resources and services to balance the requirements of team collaboration and separation of responsibilities.
Azure ML
Project management in Azure ML is Workspace-centric: it starts with creating a Workspace (under your Azure subscription ID and resource group) for storing relevant resources and assets, which is shared across the project team for collaboration.
Permissions to access and manage resources are granted at the user level based on roles, i.e. role-based access control (RBAC). Generic roles in Azure include Owner, Contributor, and Reader. ML-specialized roles include AzureML Data Scientist and AzureML Compute Operator, the latter being responsible for creating and managing compute instances, as they are generally the largest cost element in an ML project. The objective of setting up an Azure ML Workspace is to create a contained environment for storing data, compute, models, and other resources, so that only users within the Workspace are given the relevant access to read or edit data assets and to use existing or create new compute instances based on their responsibilities.
In the code snippet below, we connect to the Azure ML workspace through MLClient by passing the workspace's subscription ID, resource group, and the default credential. Azure follows the hierarchical structure Subscription > Resource Group > Workspace.
Upon workspace creation, associated services like an Azure Storage Account (which stores metadata and artifacts, and can store training data) and an Azure Key Vault (which stores secrets like usernames, passwords, and credentials) are also instantiated automatically.
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
subscription_id = ''
resource_group = ''
workspace = ''
# Connect to the workspace
credential = DefaultAzureCredential()
ml_client = MLClient(credential, subscription_id, resource_group, workspace)
When developers run the code during an interactive development session, the workspace connection is authenticated with the developer's personal credentials. They would then be able to create a training job using the command ml_client.jobs.create_or_update(job) as demonstrated below. To detach personal account credentials in the production environment, it is recommended to use a service principal account to authenticate automated pipelines or scheduled jobs. More information can be found in the article "Authenticate to your workspace using a service principal".
# Define Azure ML training job
from azure.ai.ml import command
job = command(
    code=...,
    command=...,
    environment=...,
    compute=...,
)
ml_client.jobs.create_or_update(job)
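For reference, here is a minimal sketch of what service principal authentication could look like, assuming the service principal's tenant ID, client ID, and secret are exposed through hypothetical environment variables:
import os
from azure.ai.ml import MLClient
from azure.identity import ClientSecretCredential

# Hypothetical environment variables holding the service principal details
credential = ClientSecretCredential(
    tenant_id=os.environ["AZURE_TENANT_ID"],
    client_id=os.environ["AZURE_CLIENT_ID"],
    client_secret=os.environ["AZURE_CLIENT_SECRET"],
)

# Same MLClient call as before, now authenticated as the service principal
ml_client = MLClient(credential, subscription_id, resource_group, workspace)
ml_client.jobs.create_or_update(job)
Note that DefaultAzureCredential also picks up these environment variables through its environment credential, so the interactive code above can often run unchanged in an automated pipeline.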
AWS SageMaker

Roles and permissions in SageMaker are designed on a very different principle, primarily using roles in the AWS Identity and Access Management (IAM) service. Although IAM allows creating user-level (or account-level) access similar to Azure, AWS recommends granting permissions at the job level throughout the ML lifecycle. In this way, your personal AWS permissions are irrelevant at runtime, and SageMaker assumes a role (i.e. the SageMaker execution role) to access the relevant AWS services, such as S3 buckets, SageMaker training pipelines, and compute instances for executing the job.
For instance, here's a quick peek at setting up an Estimator with the SageMaker execution role for running a training job.
import sagemaker
from sagemaker.estimator import Estimator
# Get the SageMaker execution role
role = sagemaker.get_execution_role()
# Define the estimator
estimator = Estimator(
    image_uri=image_uri,
    role=role,  # assume the SageMaker execution role during runtime
    instance_type="ml.m5.xlarge",
    instance_count=1,
)
# Start training
estimator.fit("s3://my-training-bucket/train/")
This means we can set up enough granularity to grant a role permission to run training jobs only in the development environment, without touching the production environment. For example, if the role is given access to an S3 bucket that holds test data and is blocked from the one that holds production data, then a training job that assumes this role won't have the chance to overwrite the production data by accident.
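As an illustration (not an exact recipe from AWS), here is a minimal sketch of attaching an inline policy to a development execution role with boto3, allowing access to a dev bucket while explicitly denying the production bucket. The bucket and role names are hypothetical:
import json
import boto3

# Hypothetical bucket ARNs for illustration only
dev_bucket = "arn:aws:s3:::my-dev-training-bucket"
prod_bucket = "arn:aws:s3:::my-prod-training-bucket"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": [dev_bucket, f"{dev_bucket}/*"],
        },
        {
            "Effect": "Deny",
            "Action": "s3:*",
            "Resource": [prod_bucket, f"{prod_bucket}/*"],
        },
    ],
}

# Attach the inline policy to a hypothetical dev execution role
iam = boto3.client("iam")
iam.put_role_policy(
    RoleName="my-sagemaker-dev-execution-role",
    PolicyName="dev-only-s3-access",
    PolicyDocument=json.dumps(policy),
)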
Permission management in AWS is a complex domain in itself, and I won't pretend I can fully explain this topic. I recommend reading the article "Permissions management" in the official AWS documentation for more best practices.
What does this mean in practice?
- Azure ML: Azure's role-based access control (RBAC) is more intuitive to understand and useful for centralized user access control; it suits companies or teams that manage permissions at the user level.
- AWS SageMaker AI: AWS decouples individual user permissions from job execution, enabling better automation and MLOps practices. It suits large data science teams with granular job and pipeline definitions and isolated environments.
Data Storage
You may have the question: can I store the data in the working directory? At least that's been my question for a long time, and I believe the answer is still yes if you are experimenting or prototyping with a simple script or notebook in an interactive development environment. But the data storage location becomes important to consider in the context of creating ML jobs.
Since the code runs in a cloud-managed environment or a Docker container separate from your local directory, any locally stored data can't be accessed when executing pipelines and jobs in SageMaker or Azure ML. This requires centralized, managed data storage services. In Azure, this is handled through a storage account within the Workspace that supports datastores and data assets.
Datastores contain connection information, while data assets are versioned snapshots of the data used for training or inference. AWS, on the other hand, relies heavily on S3 buckets as centralized storage locations that enable secure, durable, cross-region access across different accounts, and users can access the data through its unique URI path.
Azure ML

Azure ML treats data as attached resources and assets within the Workspace, with one storage account and four built-in datastores automatically created upon the instantiation of each Workspace in order to store files (in Azure File Share) and datasets (in Azure Blob Storage).
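Beyond the built-in datastores, you can also register additional ones pointing at your own storage containers and then reference them by name in job inputs. A rough sketch, assuming hypothetical datastore, storage account, and container names (identity-based access is used when no credentials are supplied):
from azure.ai.ml.entities import AzureBlobDatastore

# Hypothetical names for illustration
blob_datastore = AzureBlobDatastore(
    name="training_blob_store",
    account_name="mystorageaccount",
    container_name="training-data",
)
ml_client.create_or_update(blob_datastore)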

Since datastores securely keep the data connection information and automatically handle the credentials/identity behind the scenes, they decouple data location and access permissions from the code, so that the code stays unchanged even when the underlying data connection changes. Datastores can be accessed through their unique URIs. Here's an example of creating an Input object with the type uri_file by passing the datastore path.
# Create training data input using a Datastore path
from azure.ai.ml import Input

training_data = Input(
    type="uri_file",
    path="",  # a datastore URI, e.g. azureml://datastores/<datastore_name>/paths/<path_to_file>
)
Then this data can be used as the training data for an AutoML classification job.
from azure.ai.ml import automl

classification_job = automl.classification(
    compute='aml-cluster',
    training_data=training_data,
    target_column_name='Survived',
    primary_metric='accuracy',
)
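As with the command job earlier, the AutoML job can then be submitted through the same MLClient, for example:
# Submit the AutoML job to the workspace
returned_job = ml_client.jobs.create_or_update(classification_job)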
Data assets are another option for accessing data in an ML job, especially when it is helpful to keep track of multiple data versions, so data scientists can identify the exact data snapshots used for model building or experimentation. Here is example code for creating an Input object with the AssetTypes.URI_FILE type by passing the data asset path (which includes the data asset name + version number) and using the mode InputOutputModes.RO_MOUNT for read-only access. You can find more information in the documentation "Access data in a job".
# Create training data input using a Data Asset
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes, InputOutputModes

training_data = Input(
    type=AssetTypes.URI_FILE,
    path="azureml:my_train_data:1",  # data asset name + version
    mode=InputOutputModes.RO_MOUNT,
)
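For completeness, the data asset referenced above (my_train_data, version 1) would first need to be registered in the workspace. A minimal sketch, assuming a hypothetical local file that gets uploaded on registration:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# Hypothetical local file; it is uploaded to the workspace's default datastore on registration
train_data_asset = Data(
    name="my_train_data",
    version="1",
    type=AssetTypes.URI_FILE,
    path="./data/train.csv",
)
ml_client.data.create_or_update(train_data_asset)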
AWS SageMaker

AWS SageMaker is tightly integrated with Amazon S3 (Simple Storage Service) for ML workflows, so that SageMaker training jobs, inference endpoints, and pipelines can process input data from S3 buckets and write output data back to them. You may find that creating a SageMaker managed job environment (which will be discussed in Part 2) requires an S3 bucket location as a key parameter; alternatively, a default bucket will be created if unspecified.
Unlike Azure ML's Workspace-centric datastore approach, AWS S3 is a standalone data storage service that provides scalable, durable, and secure cloud storage that can be shared across other AWS services and accounts. This offers more flexibility for permission management at the individual folder level, but at the same time requires explicitly granting the SageMaker execution role access to the S3 bucket.
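Before a training job can read the data, it typically needs to land in S3 first. A minimal sketch using the SageMaker session helpers, with a hypothetical local file and key prefix:
import sagemaker

session = sagemaker.Session()

# Upload a local file to the session's default bucket (created if it doesn't exist)
train_data_uri = session.upload_data(
    path="train.csv",              # hypothetical local file
    bucket=session.default_bucket(),
    key_prefix="training-data",
)
print(train_data_uri)  # s3://sagemaker-<region>-<account-id>/training-data/train.csv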
In this code snippet, we use estimator.fit(train_data_uri) to fit the model on the training data by passing its S3 URI directly; the job then generates the output model and stores it at the specified S3 bucket location. More scenarios can be found in the documentation "Amazon S3 examples using SDK for Python (Boto3)".
import sagemaker
from sagemaker.estimator import Estimator

# Define S3 paths
train_data_uri = ""
output_folder_uri = ""

# Use in training job
estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_type="ml.m5.xlarge",
    output_path=output_folder_uri,
)
estimator.fit(train_data_uri)
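Once training completes, the S3 location of the packaged model artifact (written under the output_path above) can be read back from the estimator, for example:
# S3 URI of the trained model artifact, typically <output_path>/<job-name>/output/model.tar.gz
print(estimator.model_data)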
What does it mean in practice?
- Azure ML: use datastores to manage data connections; they handle the credential/identity information behind the scenes. Therefore, this approach decouples data location and access permissions from the code, allowing the code to remain unchanged when the underlying connection changes.
- AWS SageMaker: use S3 buckets as the primary data storage service for managing the input and output data of SageMaker jobs through their URI paths. This approach requires explicit permission management to grant the SageMaker execution role access to the specified S3 bucket.
Take-Home Message
This series compares Azure ML and AWS SageMaker for scalable model training, focusing on project setup, permission management, and data storage patterns, so teams can better align platform decisions with their existing cloud ecosystem and preferred MLOps workflows.
In Part 1, we compared the high-level project setup and permission management, as well as storing and accessing the data required for model training. Part 2 will discuss various compute options on each cloud platform, and the creation and management of runtime environments for training jobs.
