Amazon SageMaker and Hugging Face

Philipp Schmid

Today, we announce a strategic partnership between Hugging Face and Amazon to make it easier for companies to leverage state-of-the-art machine learning models, and ship cutting-edge NLP features faster.

Through this partnership, Hugging Face is leveraging Amazon Web Services as its Preferred Cloud Provider to deliver services to its customers.

As a first step to enable our common customers, Hugging Face and Amazon are introducing new Hugging Face Deep Learning Containers (DLCs) to make it easier than ever to train Hugging Face Transformer models in Amazon SageMaker.

To learn how to access and use the new Hugging Face DLCs with the Amazon SageMaker Python SDK, check out the guides and resources below.

On July 8th, 2021, we extended the Amazon SageMaker integration to add easy deployment and inference of Transformers models. If you want to learn how you can deploy Hugging Face models easily with Amazon SageMaker, take a look at the new blog post and the documentation.




Features & Benefits 🔥



One Command is All you Need

With the new Hugging Face Deep Learning Containers available in Amazon SageMaker, training cutting-edge Transformers-based NLP models has never been simpler. There are variants specially optimized for TensorFlow and PyTorch, for single-GPU, single-node multi-GPU, and multi-node clusters.



Accelerating Machine Learning from Science to Production

Alongside the Hugging Face DLCs, we created a first-class Hugging Face extension to the SageMaker Python SDK to accelerate data science teams, reducing the time required to set up and run experiments from days to minutes.

You can use the Hugging Face DLCs with the Automatic Model Tuning capability of Amazon SageMaker, in order to automatically optimize your training hyperparameters and quickly increase the accuracy of your models.

Thanks to the SageMaker Studio web-based Integrated Development Environment (IDE), you can easily track and compare your experiments and your training artifacts.



Built-in Performance

With the Hugging Face DLCs, SageMaker customers will benefit from built-in performance optimizations for PyTorch and TensorFlow, to train NLP models faster, with the flexibility to choose the training infrastructure with the best price/performance ratio for your workload.

The Hugging Face DLCs are fully integrated with the SageMaker distributed training libraries, to train models faster than ever before, using the latest generation of instances available on Amazon EC2.




Resources, Documentation & Samples 📄

Below you can find all the important resources for all published blog posts, videos, documentation, and sample notebooks/scripts.



Blog/Video



Documentation



Sample Notebook




Getting started: End-to-End Text Classification 🧭

In this getting started guide, we will use the new Hugging Face DLCs and Amazon SageMaker extension to train a transformer model on binary text classification using the transformers and datasets libraries.

We will use an Amazon SageMaker Notebook Instance for this example. You can learn here how to set up a Notebook Instance.

What are we going to do:

  • set up a development environment and install sagemaker
  • create the training script train.py
  • preprocess our data and upload it to Amazon S3
  • create a HuggingFace Estimator and train our model



Set up a development environment and install sagemaker

As mentioned above, we are going to use SageMaker Notebook Instances for this. To get started, jump into your Jupyter Notebook or JupyterLab and create a new notebook with the conda_pytorch_p36 kernel.

Note: Using Jupyter is optional: we could also launch SageMaker Training jobs from anywhere we have an SDK installed, connectivity to the cloud, and appropriate permissions, such as a laptop, another IDE, or a task scheduler like Airflow or AWS Step Functions.

After that we can install the required dependencies:

pip install "sagemaker>=2.31.0" "transformers==4.6.1" "datasets[s3]==1.6.2" --upgrade

To run training on SageMaker, we need to create a sagemaker Session and provide an IAM role with the right permissions. This IAM role will later be attached to the TrainingJob, enabling it to download data, e.g. from Amazon S3.

import sagemaker

sess = sagemaker.Session()

# sagemaker session bucket -> used for uploading data, models and logs
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")



Create the training script train.py

In a SageMaker TrainingJob, we are executing a python script with named arguments. In this example, we use PyTorch together with transformers. The script will

  • pass the incoming parameters (hyperparameters from HuggingFace Estimator)
  • load our dataset
  • define our compute metrics function
  • set up our Trainer
  • run training with trainer.train()
  • evaluate the training and save our model at the end to S3.

from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from datasets import load_from_disk
import random
import logging
import sys
import argparse
import os
import torch

if __name__ == "__main__":

    parser = argparse.ArgumentParser()

    # hyperparameters sent by the client are passed as command-line arguments to the script
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--train-batch-size", type=int, default=32)
    parser.add_argument("--eval-batch-size", type=int, default=64)
    parser.add_argument("--warmup_steps", type=int, default=500)
    parser.add_argument("--model_name", type=str)
    parser.add_argument("--learning_rate", type=str, default=5e-5)

    # data, model, and output directories provided by the SageMaker training environment
    parser.add_argument("--output-data-dir", type=str, default=os.environ["SM_OUTPUT_DATA_DIR"])
    parser.add_argument("--model-dir", type=str, default=os.environ["SM_MODEL_DIR"])
    parser.add_argument("--n_gpus", type=str, default=os.environ["SM_NUM_GPUS"])
    parser.add_argument("--training_dir", type=str, default=os.environ["SM_CHANNEL_TRAIN"])
    parser.add_argument("--test_dir", type=str, default=os.environ["SM_CHANNEL_TEST"])

    args, _ = parser.parse_known_args()

    
    logger = logging.getLogger(__name__)

    logging.basicConfig(
        level=logging.getLevelName("INFO"),
        handlers=[logging.StreamHandler(sys.stdout)],
        format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    )

    
    # load train and test datasets from the SageMaker input channels
    train_dataset = load_from_disk(args.training_dir)
    test_dataset = load_from_disk(args.test_dir)

    logger.info(f" loaded train_dataset length is: {len(train_dataset)}")
    logger.info(f" loaded test_dataset length is: {len(test_dataset)}")

    
    # compute metrics function for binary classification
    def compute_metrics(pred):
        labels = pred.label_ids
        preds = pred.predictions.argmax(-1)
        precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="binary")
        acc = accuracy_score(labels, preds)
        return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall}

    
    # download model from the Hugging Face model hub
    model = AutoModelForSequenceClassification.from_pretrained(args.model_name)

    
    # define training arguments
    training_args = TrainingArguments(
        output_dir=args.model_dir,
        num_train_epochs=args.epochs,
        per_device_train_batch_size=args.train_batch_size,
        per_device_eval_batch_size=args.eval_batch_size,
        warmup_steps=args.warmup_steps,
        evaluation_strategy="epoch",
        logging_dir=f"{args.output_data_dir}/logs",
        learning_rate=float(args.learning_rate),
    )

    
    # create Trainer instance
    trainer = Trainer(
        model=model,
        args=training_args,
        compute_metrics=compute_metrics,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
    )

    
    # train model
    trainer.train()

    
    # evaluate model
    eval_result = trainer.evaluate(eval_dataset=test_dataset)

    
    # write eval results to a file which will be uploaded to S3 with the output data
    with open(os.path.join(args.output_data_dir, "eval_results.txt"), "w") as writer:
        print(f"***** Eval results *****")
        for key, value in sorted(eval_result.items()):
            writer.write(f"{key} = {value}\n")

    
    # save the model to SM_MODEL_DIR, which SageMaker uploads to S3 after training
    trainer.save_model(args.model_dir)



Preprocess our data and upload it to S3

We use the datasets library to download and preprocess our imdb dataset. After preprocessing, the dataset will be uploaded to the current session’s default S3 bucket, sess.default_bucket(), and used within our training job. The imdb dataset consists of 25,000 training and 25,000 testing highly polar movie reviews.

import botocore
from datasets import load_dataset
from transformers import AutoTokenizer
from datasets.filesystems import S3FileSystem


# tokenizer used in preprocessing
tokenizer_name = 'distilbert-base-uncased'

# filesystem client for S3
s3 = S3FileSystem()

# dataset used
dataset_name = 'imdb'

# S3 key prefix for the data
s3_prefix = 'datasets/imdb'

# load dataset
dataset = load_dataset(dataset_name)


# download tokenizer
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

# tokenizer helper function
def tokenize(batch):
    return tokenizer(batch['text'], padding='max_length', truncation=True)

# load train and test splits
train_dataset, test_dataset = load_dataset('imdb', split=['train', 'test'])
test_dataset = test_dataset.shuffle().select(range(10000))  # reduce the test dataset to 10k samples

# tokenize train and test datasets
train_dataset = train_dataset.map(tokenize, batched=True, batch_size=len(train_dataset))
test_dataset = test_dataset.map(tokenize, batched=True, batch_size=len(test_dataset))

# set dataset format for PyTorch
train_dataset = train_dataset.rename_column("label", "labels")
train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
test_dataset = test_dataset.rename_column("label", "labels")
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])

# save train_dataset to S3
training_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/train'
train_dataset.save_to_disk(training_input_path, fs=s3)

# save test_dataset to S3
test_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/test'
test_dataset.save_to_disk(test_input_path, fs=s3)



Create a HuggingFace Estimator and train our model

To create a SageMaker TrainingJob, we can use a HuggingFace Estimator. The Estimator handles the end-to-end Amazon SageMaker training. In an Estimator, we define which fine-tuning script should be used as entry_point, which instance_type should be used, and which hyperparameters are passed in. In addition, a number of advanced controls are available, such as customizing the output and checkpointing locations, specifying the local storage size, or network configuration.

SageMaker takes care of starting and managing all the required Amazon EC2 instances for us with the Hugging Face DLC. It uploads the provided fine-tuning script, for example our train.py, then downloads the data from the S3 bucket, sess.default_bucket(), into the container. Once the data is ready, the training job starts automatically by running:

/opt/conda/bin/python train.py --epochs 1 --model_name distilbert-base-uncased --train_batch_size 32

The hyperparameters you define in the HuggingFace Estimator are passed in as named arguments.
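To make this mapping concrete, here is a small, runnable sketch (not SageMaker's actual internals) of how a hyperparameters dict becomes the named CLI arguments that the training script parses with argparse:

```python
import argparse

# hyperparameters as defined in the HuggingFace Estimator
hyperparameters = {'epochs': 1, 'train_batch_size': 32, 'model_name': 'distilbert-base-uncased'}

# SageMaker passes each entry to the script as a named argument: --key value
cli_args = []
for key, value in hyperparameters.items():
    cli_args += [f"--{key}", str(value)]

# the training script picks them up with argparse, just like train.py does
parser = argparse.ArgumentParser()
parser.add_argument("--epochs", type=int, default=3)
parser.add_argument("--train_batch_size", type=int, default=32)
parser.add_argument("--model_name", type=str)
args, _ = parser.parse_known_args(cli_args)

print(args.epochs, args.train_batch_size, args.model_name)
```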

from sagemaker.huggingface import HuggingFace


# hyperparameters which are passed into the training job
hyperparameters={'epochs': 1,
                 'train_batch_size': 32,
                 'model_name':'distilbert-base-uncased'
                 }


# create the Estimator
huggingface_estimator = HuggingFace(
      entry_point='train.py',
      source_dir='./scripts',
      instance_type='ml.p3.2xlarge',
      instance_count=1,
      role=role,
      transformers_version='4.6',
      pytorch_version='1.7',
      py_version='py36',
      hyperparameters = hyperparameters
)

To start our training, we call the .fit() method and pass our S3 URIs as input.


huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path})



Additional Features 🚀

In addition to the Deep Learning Containers and the SageMaker SDK, we have implemented other additional features.



Distributed Training: Data-Parallel

You can use the SageMaker Data Parallelism Library out of the box for distributed training. We added the data parallelism functionality directly into the Trainer. If your train.py uses the Trainer API, you only need to define the distribution parameter in the HuggingFace Estimator.


# configuration for running training on smdistributed Data Parallel
distribution = {'smdistributed':{'dataparallel':{ 'enabled': True }}}


huggingface_estimator = HuggingFace(
        entry_point='train.py',
        source_dir='./scripts',
        instance_type='ml.p3dn.24xlarge',
        instance_count=2,
        role=role,
        transformers_version='4.4.2',
        pytorch_version='1.6.0',
        py_version='py36',
        hyperparameters = hyperparameters,
        distribution = distribution
)

The “Getting started: End-to-End Text Classification 🧭” example can be used for distributed training without any changes.



Distributed Training: Model Parallel

You can use the SageMaker Model Parallelism Library out of the box for distributed training. We added the model parallelism functionality directly into the Trainer. If your train.py uses the Trainer API, you only need to define the distribution parameter in the HuggingFace Estimator.
For detailed information about the adjustments, take a look here.


# configuration for running training on smdistributed Model Parallel
mpi_options = {
    "enabled" : True,
    "processes_per_host" : 8
}

smp_options = {
    "enabled":True,
    "parameters": {
        "microbatches": 4,
        "placement_strategy": "spread",
        "pipeline": "interleaved",
        "optimize": "speed",
        "partitions": 4,
        "ddp": True,
    }
}

distribution={
    "smdistributed": {"modelparallel": smp_options},
    "mpi": mpi_options
}

 
huggingface_estimator = HuggingFace(
        entry_point='train.py',
        source_dir='./scripts',
        instance_type='ml.p3dn.24xlarge',
        instance_count=2,
        role=role,
        transformers_version='4.4.2',
        pytorch_version='1.6.0',
        py_version='py36',
        hyperparameters = hyperparameters,
        distribution = distribution
)



Spot instances

With the creation of the HuggingFace Framework extension for the SageMaker Python SDK, we can also leverage the benefits of fully-managed EC2 Spot Instances and save up to 90% of our training cost.

Note: Unless your training job completes quickly, we recommend you use checkpointing with managed spot training; therefore you need to define checkpoint_s3_uri.

To use Spot Instances with the HuggingFace Estimator, we have to set the use_spot_instances parameter to True and define your max_wait and max_run time. You can read more about the managed spot training lifecycle here.


hyperparameters={'epochs': 1,
                 'train_batch_size': 32,
                 'model_name':'distilbert-base-uncased',
                 'output_dir':'/opt/ml/checkpoints'
                 }


huggingface_estimator = HuggingFace(
        entry_point='train.py',
        source_dir='./scripts',
        instance_type='ml.p3.2xlarge',
        instance_count=1,
        checkpoint_s3_uri=f's3://{sess.default_bucket()}/checkpoints',
        use_spot_instances=True,
        max_wait=3600, # should be equal to or greater than max_run in seconds
        max_run=1000, # expected max run in seconds
        role=role,
        transformers_version='4.4',
        pytorch_version='1.6',
        py_version='py36',
        hyperparameters = hyperparameters
)






Git Repositories

When you create a HuggingFace Estimator, you can specify a training script that is stored in a GitHub repository as the entry point for the estimator, so you don’t have to download the scripts locally. If Git support is enabled, then entry_point and source_dir should be relative paths in the Git repo if provided.

As an example, you can use git_config with an example script from the transformers repository.

Be aware that you need to define output_dir as a hyperparameter for the script to save your model to S3 after training. Suggestion: define output_dir as /opt/ml/model, since it is the default SM_MODEL_DIR and will be uploaded to S3.


git_config = {'repo': 'https://github.com/huggingface/transformers.git','branch': 'master'}

 
huggingface_estimator = HuggingFace(
        entry_point='run_glue.py',
        source_dir='./examples/text-classification',
        git_config=git_config,
        instance_type='ml.p3.2xlarge',
        instance_count=1,
        role=role,
        transformers_version='4.4',
        pytorch_version='1.6',
        py_version='py36',
        hyperparameters=hyperparameters
)
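The hyperparameters dict passed to the estimator above is not shown in this snippet, and run_glue.py expects its own arguments. Following the output_dir suggestion, an illustrative set could look like this (the values are assumptions for demonstration, not tested settings):

```python
# illustrative hyperparameters for the run_glue.py example script;
# output_dir points at /opt/ml/model (the default SM_MODEL_DIR), so the
# trained model is uploaded to S3 after training
hyperparameters = {
    'model_name_or_path': 'distilbert-base-uncased',
    'task_name': 'sst2',
    'do_train': True,
    'do_eval': True,
    'output_dir': '/opt/ml/model',
}
```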



SageMaker Metrics

SageMaker Metrics can automatically parse the training logs for metrics and send them to CloudWatch. If you want SageMaker to parse the logs, you have to specify the metrics that you want SageMaker to send to CloudWatch when you configure the training job. You specify the names of the metrics you want to send and the regular expressions that SageMaker uses to parse the logs your algorithm emits to find those metrics.



metric_definitions = [
    {"Name": "train_runtime", "Regex": "train_runtime.*=\D*(.*?)$"},
    {"Name": "eval_accuracy", "Regex": "eval_accuracy.*=\D*(.*?)$"},
    {"Name": "eval_loss", "Regex": "eval_loss.*=\D*(.*?)$"},
]



huggingface_estimator = HuggingFace(
        entry_point='train.py',
        source_dir='./scripts',
        instance_type='ml.p3.2xlarge',
        instance_count=1,
        role=role,
        transformers_version='4.4',
        pytorch_version='1.6',
        py_version='py36',
        metric_definitions=metric_definitions,
        hyperparameters = hyperparameters
)
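Before launching a job, you can sanity-check a metric regex locally against a sample log line with Python's re module. The log line below is an assumed example of what the Trainer prints, not captured output:

```python
import re

# assumed example of a log line emitted at the end of evaluation
log_line = "eval_accuracy = 0.9178"

# the same regex passed to SageMaker in metric_definitions
pattern = r"eval_accuracy.*=\D*(.*?)$"

# SageMaker captures the first group as the metric value
match = re.search(pattern, log_line)
value = float(match.group(1))
print(value)  # 0.9178
```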



FAQ 🎯

You can find the complete Frequently Asked Questions in the documentation.

Q: What are Deep Learning Containers?

A: Deep Learning Containers (DLCs) are Docker images pre-installed with deep learning frameworks and libraries (e.g. transformers, datasets, tokenizers) that make it easy to train models by letting you skip the complicated process of building and optimizing your environments from scratch.

Q: Do I have to use the SageMaker Python SDK to use the Hugging Face Deep Learning Containers?

A: You can use the HF DLC without the SageMaker Python SDK and launch SageMaker Training jobs with other SDKs, such as the AWS CLI or boto3. The DLCs are also available through Amazon ECR and can be pulled and used in any environment of choice.

Q: Why should I use the Hugging Face Deep Learning Containers?

A: The DLCs are fully tested, maintained, optimized deep learning environments that require no installation, configuration, or maintenance.

Q: Why should I use SageMaker Training to train Hugging Face models?

A: SageMaker Training provides numerous benefits that will boost your productivity with Hugging Face: (1) first, it is cost-effective: the training instances live only for the duration of your job and are paid per second. No more risk of leaving GPU instances up all night: the training cluster stops right at the end of your job! It also supports EC2 Spot capacity, which enables up to 90% cost reduction. (2) SageMaker also comes with a lot of built-in automation that facilitates teamwork and MLOps: training metadata and logs are automatically persisted to a serverless managed metastore, and I/O with S3 (for datasets, checkpoints, and model artifacts) is fully managed. Finally, SageMaker also allows you to drastically scale up and out: you can launch multiple training jobs in parallel, but also launch large-scale distributed training jobs.

Q: Once I’ve trained my model with Amazon SageMaker, can I use it with 🤗 Transformers?

A: Yes, you can download your trained model from S3 and use it directly with transformers or upload it to the Hugging Face Model Hub.
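SageMaker stores the trained model as a model.tar.gz archive at the estimator's model_data S3 URI. A minimal sketch of the unpack-and-load flow, using a locally created stand-in archive since the real one lives in your bucket:

```python
import os
import tarfile

archive_path = "model.tar.gz"   # in practice: aws s3 cp <huggingface_estimator.model_data> .
extract_dir = "my_model"

# --- stand-in: build a dummy archive so the sketch runs without S3 access ---
os.makedirs("tmp_model", exist_ok=True)
with open("tmp_model/config.json", "w") as f:
    f.write("{}")
with tarfile.open(archive_path, "w:gz") as tar:
    tar.add("tmp_model/config.json", arcname="config.json")
# ---------------------------------------------------------------------------

# unpack the archive; the extracted directory can then be loaded with
# AutoModelForSequenceClassification.from_pretrained(extract_dir)
with tarfile.open(archive_path, "r:gz") as tar:
    tar.extractall(extract_dir)

print(os.listdir(extract_dir))  # ['config.json']
```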

Q: How is my data and code secured by Amazon SageMaker?

A: Amazon SageMaker provides numerous security mechanisms, including encryption at rest and in transit, Virtual Private Cloud (VPC) connectivity, and Identity and Access Management (IAM). To learn more about security in the AWS cloud and with Amazon SageMaker, you can visit Security in Amazon SageMaker and AWS Cloud Security.

Q: Is this available in my region?

A: For a list of the supported regions, please visit the AWS region table for all AWS global infrastructure.

Q: Do I need to pay for a license from Hugging Face to use the DLCs?

A: No – the Hugging Face DLCs are open source and licensed under Apache 2.0.

Q: How can I run inference on my trained models?

A: You have multiple options to run inference on your trained models. One option is to use the Hugging Face Accelerated Inference API hosted service: start by uploading the trained models to your Hugging Face account to deploy them publicly or privately. Another great option is to use SageMaker Inference to run your own inference code in Amazon SageMaker. We are working on offering an integrated solution for Amazon SageMaker with Hugging Face Inference DLCs in the future – stay tuned!

Q: Do you offer premium support or support SLAs for this solution?

A: AWS Technical Support tiers are available from AWS and cover development and production issues for AWS products and services – please refer to AWS Support for specifics and scope.

If you have questions which the Hugging Face community can help answer and/or benefit from, please post them in the Hugging Face forum.

If you need premium support from the Hugging Face team to accelerate your NLP roadmap, our Expert Acceleration Program offers direct guidance from our open source, science, and ML Engineering teams – contact us to learn more.

Q: What are you planning next through this partnership?

A: Our common goal is to democratize state-of-the-art machine learning. We will continue to innovate to make it easier for researchers, data scientists, and ML practitioners to manage, train, and run state-of-the-art models. If you have feature requests for integration in AWS with Hugging Face, please let us know in the Hugging Face community forum.

Q: I use Hugging Face with Azure Machine Learning or Google Cloud Platform, what does this partnership mean for me?

A: A foundational goal for Hugging Face is to make the latest AI accessible to as many people as possible, whichever framework or development environment they work in. While we are focusing integration efforts with Amazon Web Services as our Preferred Cloud Provider, we will continue to work hard to serve all Hugging Face users and customers, no matter what compute environment they run on.


