Fine-tune MPT-7B on Amazon SageMaker
1. Install dependencies and set S3 paths
2. Construct a fine-tuning dataset
3. SageMaker Training job
4. Summary

Learn how to prepare a dataset and create a training job to fine-tune MPT-7B on Amazon SageMaker

New large language models (LLMs) are being announced every week, each attempting to beat its predecessor and take over the evaluation leaderboards. One of the most recent models out there is MPT-7B, released by MosaicML. Unlike other models of its kind, this 7-billion-parameter model is open-source and licensed for commercial use (Apache 2.0 license) 🚀.

Foundation models like MPT-7B are pre-trained on datasets with trillions of tokens (100 tokens ~ 75 words) crawled from the web and, when prompted well, they can produce impressive outputs. However, to truly unlock the value of large language models in real-world applications, smart prompt-engineering may not be enough to make them work for your use case and, therefore, fine-tuning a foundation model on a domain-specific dataset is required.

LLMs have billions of parameters and, consequently, fine-tuning such large models is challenging. The good news is that fine-tuning is much cheaper and faster compared to pre-training the foundation model, given that 1) the domain-specific datasets are "small" and 2) fine-tuning requires only a few passes over the training data.

In this article, you will learn how to:

  • Create and structure a dataset for fine-tuning a large language model.
  • Configure a distributed training job with Fully Sharded Data Parallel (FSDP).
  • Define a 😊 HuggingFace estimator.
  • Launch a training job in Amazon SageMaker that fine-tunes MPT-7B.

1. Install dependencies and set S3 paths

Let's start by installing the SageMaker Python SDK and several other packages. This SDK makes it possible to train and deploy machine learning models on AWS with a few lines of Python code. The code below is available in the sagemaker_finetuning.ipynb notebook on GitHub. Run the notebook in SageMaker Studio, a SageMaker notebook instance, or on your laptop after authenticating to an AWS account.

!pip install "sagemaker==2.162.0" s3path boto3 --quiet

from sagemaker.huggingface import HuggingFace
from sagemaker.inputs import TrainingInput
from sagemaker import s3_utils
import sagemaker
import boto3
import json

The next step is to define the paths where the data will be saved in S3 and to create a SageMaker session.

# Define S3 paths
bucket = ""  # Fill in with the name of your S3 bucket
training_data_path = f"s3://{bucket}/toy_data/train/data.jsonl"
test_data_path = f"s3://{bucket}/toy_data/test/data.jsonl"
output_path = f"s3://{bucket}/outputs"
code_location = f"s3://{bucket}/code"

# Create SageMaker session
sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()

2. Construct a fine-tuning dataset

We'll create a dummy dataset to demonstrate how to fine-tune MPT-7B. Since training models of this size on a complete dataset takes long and is expensive, it's a good idea to first test and debug the training job on a small dataset, and only then scale training to the complete dataset.

  • The dataset should be formatted as a list of dictionaries, where each example has a key-value structure, e.g.,
{
"prompt": "What's a Pastel de Nata?",
"response": "A Pastel de Nata is a Portuguese egg custard tart pastry, optionally dusted with cinnamon."
}

The prompt is the input given to the model (e.g., a question). The response is the output that the model is trained to predict (e.g., the answer to the question in the prompt). The raw prompt is commonly preprocessed to fit into a prompt template that helps the model generate better outputs. Note that the model is trained for causal language modelling, so you can think of it as a "document completer". It's a good idea to design the prompt template in such a way that the model thinks it is completing a document. Andrej Karpathy explains this mechanism well in his talk State of GPT.

prompt_template = """Write a response that appropriately answers the question below.
### Question:
{question}

### Response:
"""

dataset = [
    {"prompt": "What is a Pastel de Nata?",
     "response": "A Pastel de Nata is a Portuguese egg custard tart pastry, optionally dusted with cinnamon."},
    {"prompt": "Which museums are famous in Amsterdam?",
     "response": "Amsterdam is home to various world-famous museums, and no trip to the city is complete without stopping by the Rijksmuseum, Van Gogh Museum, or Stedelijk Museum."},
    {"prompt": "Where is the European Parliament?",
     "response": "Strasbourg is the official seat of the European Parliament."},
    {"prompt": "How is the weather in The Netherlands?",
     "response": "The Netherlands is a country that boasts a typical maritime climate with mild summers and cold winters."},
    {"prompt": "What are Poffertjes?",
     "response": "Poffertjes are a traditional Dutch batter treat. Resembling small, fluffy pancakes, they are made with yeast and buckwheat flour."},
]

# Format prompt based on template
for example in dataset:
    example["prompt"] = prompt_template.format(question=example["prompt"])

training_data, test_data = dataset[0:4], dataset[4:]

print(f"Size of coaching data: {len(training_data)}nSize of test data: {len(test_data)}")

  • Once the training and test sets are ready and formatted as a list of dictionaries, we upload them to S3 as JSON lines using the utility function below:
def write_jsonlines_to_s3(data, s3_path):
    """Writes a list of dictionaries as a JSON lines file to S3"""
    json_string = ""
    for d in data:
        json_string += json.dumps(d) + "\n"

    s3_client = boto3.client("s3")

    bucket, key = s3_utils.parse_s3_url(s3_path)
    s3_client.put_object(
        Body=json_string,
        Bucket=bucket,
        Key=key,
    )

write_jsonlines_to_s3(training_data, training_data_path)
write_jsonlines_to_s3(test_data, test_data_path)
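
If you want to double-check what landed in S3, a quick read-back like the sketch below works. This snippet is not part of the original notebook; it simply reads the training file back with the clients already imported above and prints the first example.

# Optional sanity check: read the training file back from S3 and print the first example
s3_client = boto3.client("s3")
bucket_name, key = s3_utils.parse_s3_url(training_data_path)
body = s3_client.get_object(Bucket=bucket_name, Key=key)["Body"].read().decode("utf-8")
print(json.loads(body.splitlines()[0]))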

3. SageMaker Training job

With the datasets available in S3, we'll now create a training job in Amazon SageMaker. For that, we have to create an entry point script, modify the configuration file specifying the training settings, and define a HuggingFace estimator. We'll (re-)use the training script from LLM Foundry and the Composer library's CLI launcher, which sets up the distributed training environment. Both packages are maintained by MosaicML, the company behind MPT-7B. The working folder should be structured like:

└── /
    ├── launcher.sh
    ├── finetuning_config.yaml
    └── sagemaker_finetuning.ipynb

We'll now dive deep into each of these files.

  • The template provided in the LLM Foundry repository is a good starting point, specifically the mpt-7b-dolly-sft.yaml file. However, depending on your dataset size and training instance, you might need to adjust some of these configurations, such as the batch size. I have modified the file to fine-tune the model in SageMaker (check finetuning_config.yaml). The parameters that you should pay attention to are the following:
max_seq_len: 512
global_seed: 17

...
# Dataloaders
train_loader:
  name: finetuning
  dataset:
    hf_name: json
    hf_kwargs:
      data_dir: /opt/ml/input/data/train/
...

eval_loader:
  name: finetuning
  dataset:
    hf_name: json
    hf_kwargs:
      data_dir: /opt/ml/input/data/test/

...
max_duration: 3ep
eval_interval: 1ep
...
global_train_batch_size: 128

...
# FSDP
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE
  activation_checkpointing: true
  activation_checkpointing_reentrant: false
  activation_cpu_offload: false
  limit_all_gathers: true
  verbose: false

# Checkpoint to local filesystem or remote object store
save_folder: /tmp/checkpoints
dist_timeout: 2000

The max_seq_len parameter indicates the maximum number of tokens of the input (remember that 100 tokens ~ 75 words). The training and test data will be loaded using the 😊 Datasets library from the /opt/ml/input/data/{train, test} directory inside the container associated with the training job. Check out the SageMaker Training Storage Folders documentation to understand how the container directories are structured. max_duration specifies the number of epochs for fine-tuning; two to three epochs is usually a good choice. eval_interval indicates how often the model will be evaluated on the test set.
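
For intuition, the snippet below is a rough sketch (not code from the training script) of what the hf_name: json and hf_kwargs settings above amount to: the finetuning dataloader reads the JSON lines files with the 😊 Datasets library. The container paths are only valid inside the training job; locally, you would point data_dir at your own copy of the data.

from datasets import load_dataset

# Illustrative only: load the JSON lines training data the way the config points to it
train_dataset = load_dataset("json", data_dir="/opt/ml/input/data/train/", split="train")
print(train_dataset[0]["prompt"])
print(train_dataset[0]["response"])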

The distributed training strategy is Fully Sharded Data Parallel (FSDP), which enables efficient training of large models like MPT-7B. Unlike the traditional data parallel strategy, which keeps a copy of the model in each GPU, FSDP shards model parameters, optimizer states, and gradients across data parallel workers. If you want to learn more about FSDP, check out this insightful PyTorch intro post. FSDP is integrated in Composer, the distributed training library used by LLM Foundry.
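
To make the fsdp_config block more concrete, here is a minimal, illustrative sketch in plain PyTorch of what a FULL_SHARD wrap looks like; Composer performs an equivalent setup internally from the YAML configuration, and the toy Transformer below merely stands in for MPT-7B. It assumes a distributed launch (e.g., via torchrun) on a GPU machine.

import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)

# Assumes the process group is set up by a distributed launcher on GPU instances
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Transformer(d_model=512, num_encoder_layers=2).cuda()  # toy stand-in for MPT-7B

# FULL_SHARD: parameters, gradients, and optimizer states are sharded across workers
fsdp_model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    mixed_precision=MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    ),
    limit_all_gathers=True,
)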

save_folder determines where the model checkpoint (.pt file) is saved. We set it to the temporary folder /tmp/checkpoints.

  • A bash script is used as entry point. The bash script clones the LLM Foundry repository, installs the requirements, and, more importantly, runs the training script using the Composer library's distributed launcher. Note that, typically, training jobs in SageMaker run the training script using a command like python train.py. However, it is possible to pass a bash script as entry point, which provides more flexibility in our scenario. Finally, we convert the model checkpoint saved to /tmp/checkpoints to the HuggingFace model format and save the final artifacts into /opt/ml/model/. SageMaker will compress all files in this directory, create a tarball model.tar.gz, and upload it to S3. The tarball can then be used for inference.
# Clone llm-foundry package from MosaicML
# That is where the training script is hosted
git clone https://github.com/mosaicml/llm-foundry.git
cd llm-foundry

# Install required packages
pip install -e ".[gpu]"
pip install git+https://github.com/mosaicml/composer.git@dev

# Run training script with fine-tuning configuration
composer scripts/train/train.py /opt/ml/code/finetuning_config.yaml

# Convert Composer checkpoint to HuggingFace model format
python scripts/inference/convert_composer_to_hf.py \
  --composer_path /tmp/checkpoints/latest-rank0.pt \
  --hf_output_path /opt/ml/model/hf_fine_tuned_model \
  --output_precision bf16

# Print content of the model artifact directory
ls /opt/ml/model/

  • The Estimator sets the Docker container used to run the training job. We'll use an image with PyTorch 2.0.0 and Python 3.10. The bash script and the configuration file are automatically uploaded to S3 and made available inside the container (handled by the SageMaker Python SDK). We set the training instance to g5.48xlarge, which has 8x NVIDIA A10G GPUs. The p4d.24xlarge is also a good alternative; although it is more expensive, it comes with 8x NVIDIA A100 GPUs. We also indicate the metrics to track on the training and test sets (Cross Entropy and Perplexity). The values of these metrics are captured via regular expressions and sent to Amazon CloudWatch.
# Define container image for the training job
training_image_uri = f"763104351884.dkr.ecr.{region}.amazonaws.com/huggingface-pytorch-training:2.0.0-transformers4.28.1-gpu-py310-cu118-ubuntu20.04-v1.1"

# Define metrics to send to CloudWatch
metrics = [
    # Note: each regex must match the corresponding metric line in the Composer training logs
    # On training set
    {"Name": "train:LanguageCrossEntropy",
     "Regex": r"Train metrics\/train\/LanguageCrossEntropy: ([+-]?((\d+\.?\d*)|(\.\d+)))"},
    {"Name": "train:LanguagePerplexity",
     "Regex": r"Train metrics\/train\/LanguagePerplexity: ([+-]?((\d+\.?\d*)|(\.\d+)))"},
    # On test set
    {"Name": "test:LanguageCrossEntropy",
     "Regex": r"Eval metrics\/eval\/LanguageCrossEntropy: ([+-]?((\d+\.?\d*)|(\.\d+)))"},
    {"Name": "test:LanguagePerplexity",
     "Regex": r"Eval metrics\/eval\/LanguagePerplexity: ([+-]?((\d+\.?\d*)|(\.\d+)))"},
]

estimator_args = {
    "image_uri": training_image_uri,     # Training container image
    "entry_point": "launcher.sh",        # Launcher bash script
    "source_dir": ".",                   # Directory with launcher script and configuration file
    "instance_type": "ml.g5.48xlarge",   # Instance type
    "instance_count": 1,                 # Number of training instances
    "base_job_name": "fine-tune-mpt-7b", # Prefix of the training job name
    "role": role,                        # IAM role
    "volume_size": 300,                  # Size of the EBS volume attached to the instance (GB)
    "py_version": "py310",               # Python version
    "metric_definitions": metrics,       # Metrics to track
    "output_path": output_path,          # S3 location where the model artifact will be uploaded
    "code_location": code_location,      # S3 location where the source code will be saved
    "disable_profiler": True,            # Do not create profiler instance
    "keep_alive_period_in_seconds": 240, # Enable Warm Pools while experimenting
}

huggingface_estimator = HuggingFace(**estimator_args)

⚠️ Make sure to request the respective quotas for SageMaker Training, along with the Warm Pools quota in case you're making use of this cool feature. If you plan to run many jobs in SageMaker, take a look at SageMaker Savings Plans.

  • We now have everything set to start the training job on Amazon SageMaker:
huggingface_estimator.fit({
    "train": TrainingInput(
        s3_data=training_data_path,
        content_type="application/jsonlines"),
    "test": TrainingInput(
        s3_data=test_data_path,
        content_type="application/jsonlines"),
}, wait=True)

The training time will depend on the size of your dataset. With our dummy dataset, training does not take long to finish. Once the model is trained and converted to 😊 HuggingFace format, SageMaker will upload the model tarball (model.tar.gz) to the S3 output_path. I found that in practice the uploading step takes rather long (>1h), which is likely due to the size of the model artifacts to compress (~25GB).
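
If you want to grab the tarball's location programmatically, the estimator exposes it once the job has completed (a standard attribute of the SageMaker Python SDK), which is handy when you later set up inference:

# S3 URI of the compressed model artifact produced by the training job
print(huggingface_estimator.model_data)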

4. Summary

In this article, I showed how you can prepare a dataset and create a training job in SageMaker to fine-tune MPT-7B for your use case. The implementation leverages the training script from LLM Foundry and uses the Composer library's distributed training launcher. Once you have fine-tuned your model and want to deploy it, I recommend checking out the blog posts by Philipp Schmid; there are plenty of examples of how to deploy LLMs in SageMaker. Have fun with your fine-tuned MPT-7B model! 🎉
