Train BART/T5 for Summarization using 🤗 Transformers and Amazon SageMaker

Philipp Schmid

In case you missed it: on March 25 we announced a collaboration with Amazon SageMaker to make it easier to create state-of-the-art machine learning models and to ship cutting-edge NLP features faster.

Along with the SageMaker team, we built 🤗 Transformers optimized Deep Learning Containers to speed up training of Transformers-based models. Thanks AWS friends!🤗 🚀

With the new HuggingFace estimator in the SageMaker Python SDK, you can start training with a single line of code.


The announcement blog post provides all the information you need to know about the integration, including a “Getting Started” example and links to documentation, examples, and features.

If you’re not familiar with Amazon SageMaker: “Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning (ML) models quickly. SageMaker removes the heavy lifting from each step of the machine learning process to make it easier to develop high-quality models.” [REF]




Tutorial

We are going to use the new Hugging Face DLCs and Amazon SageMaker extension to train a distributed Seq2Seq transformer model on the summarization task using the transformers and datasets libraries, and then upload the model to huggingface.co and test it.

As the distributed training strategy we are going to use SageMaker Data Parallelism, which has been built into the Trainer API. To use data parallelism we only need to define the distribution parameter in our HuggingFace estimator.


distribution = {'smdistributed':{'dataparallel':{ 'enabled': True }}}

In this tutorial, we’ll use an Amazon SageMaker Notebook Instance for running our training job. You can learn here how to set up a Notebook Instance.

What are we going to do:

  • Set up a development environment and install sagemaker
  • Select 🤗 Transformers examples/ script
  • Configure distributed training and hyperparameters
  • Create a HuggingFace estimator and start training
  • Upload the fine-tuned model to huggingface.co
  • Test inference



Model and Dataset

We’re going to fine-tune facebook/bart-large-cnn on the samsum dataset. “BART is a sequence-to-sequence model trained with denoising as pretraining objective.” [REF]

The samsum dataset contains about 16k messenger-like conversations with summaries.

{"id": "13818513",
 "summary": "Amanda baked cookies and will bring Jerry some tomorrow.",
 "dialogue": "Amanda: I baked cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)"}
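Note how the record stores the whole dialogue as a single string, with \r\n separating the speaker turns. A minimal sketch (plain Python, no libraries needed) showing how to recover the individual turns:

```python
# A samsum record keeps the full dialogue in one string; speaker turns
# are separated by "\r\n".
record = {
    "id": "13818513",
    "summary": "Amanda baked cookies and will bring Jerry some tomorrow.",
    "dialogue": "Amanda: I baked cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)",
}

turns = record["dialogue"].split("\r\n")
print(len(turns))  # 3
print(turns[1])    # Jerry: Sure!
```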



Set up a development environment and install sagemaker

After our SageMaker Notebook Instance is running, we can select either Jupyter Notebook or JupyterLab and create a new Notebook with the conda_pytorch_p36 kernel.

Note: The use of Jupyter is optional: We could also launch SageMaker Training jobs from anywhere we have an SDK installed, connectivity to the cloud, and appropriate permissions, such as a laptop, another IDE, or a task scheduler like Airflow or AWS Step Functions.

After that we can install the required dependencies:

!pip install transformers "datasets[s3]" sagemaker --upgrade

Then we install git-lfs for model upload:

!curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.rpm.sh | sudo bash
!sudo yum install git-lfs -y
!git lfs install

To run training on SageMaker we need to create a sagemaker Session and provide an IAM role with the right permissions. This IAM role will later be attached to the TrainingJob, enabling it to download data, e.g. from Amazon S3.

import sagemaker

sess = sagemaker.Session()
role = sagemaker.get_execution_role()

print(f"IAM role arn used for running training: {role}")
print(f"S3 bucket used for storing artifacts: {sess.default_bucket()}")



Select 🤗 Transformers examples/ script

The 🤗 Transformers repository contains several examples/ scripts for fine-tuning models on tasks from language-modeling to token-classification. In our case, we’re using run_summarization.py from the seq2seq/ examples.

Note: you can use this tutorial as-is to train your model on a different examples script.

Since the HuggingFace Estimator has git support built-in, we can specify a training script stored in a GitHub repository as entry_point and source_dir.

We’re going to use the transformers 4.4.2 DLC, which means we need to configure v4.4.2 as the branch to pull the compatible example scripts.



git_config = {'repo': 'https://github.com/philschmid/transformers.git','branch': 'v4.4.2'} 



Configure distributed training and hyperparameters

Next, we’ll define our hyperparameters and configure our distributed training strategy. As hyperparameters, we can define any Seq2SeqTrainingArguments and the ones defined in run_summarization.py.


hyperparameters={
    'per_device_train_batch_size': 4,
    'per_device_eval_batch_size': 4,
    'model_name_or_path':'facebook/bart-large-cnn',
    'dataset_name':'samsum',
    'do_train':True,
    'do_predict': True,
    'predict_with_generate': True,
    'output_dir':'/opt/ml/model',
    'num_train_epochs': 3,
    'learning_rate': 5e-5,
    'seed': 7,
    'fp16': True,
}


distribution = {'smdistributed':{'dataparallel':{ 'enabled': True }}}

Since we’re using SageMaker Data Parallelism, our total_batch_size will be per_device_train_batch_size * n_gpus.
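As a quick sanity check, here is that arithmetic for the setup used in this tutorial (two ml.p3dn.24xlarge instances with 8 GPUs each):

```python
# Effective batch size under SageMaker Data Parallelism: every GPU worker
# processes per_device_train_batch_size samples per optimization step.
per_device_train_batch_size = 4
gpus_per_instance = 8    # ml.p3dn.24xlarge has 8 GPUs
instance_count = 2

n_gpus = gpus_per_instance * instance_count
total_batch_size = per_device_train_batch_size * n_gpus
print(total_batch_size)  # 64
```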




Create a HuggingFace estimator and start training

The last step before training is creating a HuggingFace estimator. The Estimator handles the end-to-end Amazon SageMaker training. We define which fine-tuning script should be used as entry_point, which instance_type should be used, and which hyperparameters are passed in.

from sagemaker.huggingface import HuggingFace


huggingface_estimator = HuggingFace(
      entry_point='run_summarization.py', 
      source_dir='./examples/seq2seq', 
      git_config=git_config,
      instance_type='ml.p3dn.24xlarge',
      instance_count=2,
      transformers_version='4.4.2',
      pytorch_version='1.6.0',
      py_version='py36',
      role=role,
      hyperparameters = hyperparameters,
      distribution = distribution
)

As instance_type we’re using ml.p3dn.24xlarge, which contains 8x NVIDIA V100 GPUs, with an instance_count of 2. This means we’re going to run training on 16 GPUs with a total_batch_size of 16*4=64. We’re going to train a 400-million-parameter model with a total_batch_size of 64, which is just wow.
To start our training we call the .fit() method.


huggingface_estimator.fit()
2021-04-01 13:00:35 Starting - Starting the training job...
2021-04-01 13:01:03 Starting - Launching requested ML instancesProfilerReport-1617282031: InProgress
2021-04-01 13:02:23 Starting - Preparing the instances for training......
2021-04-01 13:03:25 Downloading - Downloading input data...
2021-04-01 13:04:04 Training - Downloading the training image...............
2021-04-01 13:06:33 Training - Training image download completed. Training in progress
....
....
2021-04-01 13:16:47 Uploading - Uploading generated training model
2021-04-01 13:27:49 Completed - Training job completed
Training seconds: 2882
Billable seconds: 2882

The training seconds are 2882 because they are multiplied by the number of instances. If we calculate 2882/2=1441, that is the duration from “Downloading the training image” to “Training job completed”.
Converted to real money, our training on 16 NVIDIA Tesla V100 GPUs for a state-of-the-art summarization model comes down to ~28$.
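As a rough sketch of where that figure comes from (the hourly rate below is an assumption for illustration; check the current SageMaker pricing for ml.p3dn.24xlarge in your region):

```python
# Billable seconds are already summed over both instances, so the cost is
# simply the billable time converted to hours times the hourly rate.
billable_seconds = 2882
price_per_hour = 35.89   # assumed on-demand USD rate for ml.p3dn.24xlarge

cost = billable_seconds / 3600 * price_per_hour
print(f"~${cost:.2f}")   # roughly $28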




Upload the fine-tuned model to huggingface.co

Since our model achieved a pretty good score, we’re going to upload it to huggingface.co, create a model_card, and test it with the Hosted Inference widget. To upload a model you need to create an account here.

We can download our model from Amazon S3 and unzip it using the following snippet.

import os
import tarfile
from sagemaker.s3 import S3Downloader

local_path = 'my_bart_model'

os.makedirs(local_path, exist_ok = True)


S3Downloader.download(
    s3_uri=huggingface_estimator.model_data, 
    local_path=local_path, 
    sagemaker_session=sess 
)


tar = tarfile.open(f"{local_path}/model.tar.gz", "r:gz")
tar.extractall(path=local_path)
tar.close()
os.remove(f"{local_path}/model.tar.gz")

Before we upload our model to huggingface.co, we need to create a model_card. The model_card describes the model, includes hyperparameters and results, and specifies which dataset was used for training. To create a model_card we create a README.md in our local_path.


import json

with open(f"{local_path}/eval_results.json") as f:
    eval_results_raw = json.load(f)
    eval_results={}
    eval_results["eval_rouge1"] = eval_results_raw["eval_rouge1"]
    eval_results["eval_rouge2"] = eval_results_raw["eval_rouge2"]
    eval_results["eval_rougeL"] = eval_results_raw["eval_rougeL"]
    eval_results["eval_rougeLsum"] = eval_results_raw["eval_rougeLsum"]

with open(f"{local_path}/test_results.json") as f:
    test_results_raw = json.load(f)
    test_results={}
    test_results["test_rouge1"] = test_results_raw["test_rouge1"]
    test_results["test_rouge2"] = test_results_raw["test_rouge2"]
    test_results["test_rougeL"] = test_results_raw["test_rougeL"]
    test_results["test_rougeLsum"] = test_results_raw["test_rougeLsum"]

After we have extracted all of the metrics we want to include, we create our README.md. In addition to the automatically generated results table, we add the metrics manually to the metadata of our model card under model-index.

import json

MODEL_CARD_TEMPLATE = """
---
language: en
tags:
- sagemaker
- bart
- summarization
license: apache-2.0
datasets:
- samsum
model-index:
- name: {model_name}
  results:
  - task: 
      name: Abstractive Text Summarization
      type: abstractive-text-summarization
    dataset:
      name: "SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization" 
      type: samsum
    metrics:
       - name: Validation ROUGE-1
         type: rouge-1
         value: 42.621
       - name: Validation ROUGE-2
         type: rouge-2
         value: 21.9825
       - name: Validation ROUGE-L
         type: rouge-l
         value: 33.034
       - name: Test ROUGE-1
         type: rouge-1
         value: 41.3174
       - name: Test ROUGE-2
         type: rouge-2
         value: 20.8716
       - name: Test ROUGE-L
         type: rouge-l
         value: 32.1337
widget:
- text: | 
    Jeff: Can I train a 🤗 Transformers model on Amazon SageMaker? 
    Philipp: Sure you can use the new Hugging Face Deep Learning Container. 
    Jeff: okay.
    Jeff: and how can I start? 
    Jeff: where can I find documentation? 
    Philipp: okay, okay you can find everything here. https://huggingface.co/blog/the-partnership-amazon-sagemaker-and-hugging-face 
---

## `{model_name}`

This model was trained using Amazon SageMaker and the new Hugging Face Deep Learning container.

For more information check out:
- [🤗 Transformers Documentation: Amazon SageMaker](https://huggingface.co/transformers/sagemaker.html)
- [Example Notebooks](https://github.com/huggingface/notebooks/tree/master/sagemaker)
- [Amazon SageMaker documentation for Hugging Face](https://docs.aws.amazon.com/sagemaker/latest/dg/hugging-face.html)
- [Python SDK SageMaker documentation for Hugging Face](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/index.html)
- [Deep Learning Container](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#huggingface-training-containers)

## Hyperparameters

    {hyperparameters}


## Usage
    from transformers import pipeline
    summarizer = pipeline("summarization", model="philschmid/{model_name}")

    conversation = '''Jeff: Can I train a 🤗 Transformers model on Amazon SageMaker? 
    Philipp: Sure you can use the new Hugging Face Deep Learning Container. 
    Jeff: okay.
    Jeff: and how can I start? 
    Jeff: where can I find documentation? 
    Philipp: okay, okay you can find everything here. https://huggingface.co/blog/the-partnership-amazon-sagemaker-and-hugging-face
    '''
    summarizer(conversation)

## Results

| key | value |
| --- | ----- |
{eval_table}
{test_table}



"""


model_card = MODEL_CARD_TEMPLATE.format(
    model_name=f"{hyperparameters['model_name_or_path'].split('/')[1]}-{hyperparameters['dataset_name']}",
    hyperparameters=json.dumps(hyperparameters, indent=4, sort_keys=True),
    eval_table="\n".join(f"| {k} | {v} |" for k, v in eval_results.items()),
    test_table="\n".join(f"| {k} | {v} |" for k, v in test_results.items()),
)

with open(f"{local_path}/README.md", "w") as f:
    f.write(model_card)

After we have our unzipped model and model card located in my_bart_model, we can either use the huggingface_hub SDK to create a repository and upload it to huggingface.co, or simply go to https://huggingface.co/new, create a new repository, and upload it there.

from getpass import getpass
from huggingface_hub import HfApi, Repository

hf_username = "philschmid" 
hf_email = "philipp@huggingface.co" 
repository_name = f"{hyperparameters['model_name_or_path'].split('/')[1]}-{hyperparameters['dataset_name']}" 
password = getpass("Enter your password:") 


token = HfApi().login(username=hf_username, password=password)


repo_url = HfApi().create_repo(token=token, name=repository_name, exist_ok=True)


model_repo = Repository(use_auth_token=token,
                        clone_from=repo_url,
                        local_dir=local_path,
                        git_user=hf_username,
                        git_email=hf_email)


model_repo.push_to_hub()



Test inference

After we have uploaded our model, we can access it at https://huggingface.co/{hf_username}/{repository_name}

print(f"https://huggingface.co/{hf_username}/{repository_name}")

And use the “Hosted Inference API” widget to test it.

https://huggingface.co/philschmid/bart-large-cnn-samsum
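The widget can also be reproduced programmatically. Below is a minimal standard-library sketch that posts text to the hosted Inference API; the token placeholder and the exact response shape are assumptions here, so check the Inference API documentation before relying on it:

```python
import json
from urllib import request

API_URL = "https://api-inference.huggingface.co/models/philschmid/bart-large-cnn-samsum"

def summarize(text: str, token: str) -> str:
    """POST the text to the hosted Inference API and return the summary."""
    req = request.Request(
        API_URL,
        data=json.dumps({"inputs": text}).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        # Summarization models typically return [{"summary_text": "..."}]
        return json.loads(resp.read())[0]["summary_text"]

# Example call (requires a valid API token):
# print(summarize("Jeff: Can I train a model on SageMaker?\r\nPhilipp: Sure!", token="<your-hf-token>"))
```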



