Deploy GPT-J 6B for inference using Hugging Face Transformers and Amazon SageMaker



Philipp Schmid

Almost 6 months ago to the day, EleutherAI released GPT-J 6B, an open-source alternative to OpenAI's GPT-3. GPT-J 6B is the 6 billion parameter successor to EleutherAI's GPT-Neo family, a family of transformer-based language models based on the GPT architecture for text generation.

EleutherAI's primary goal is to train a model that is equivalent in size to GPT-3 and make it available to the public under an open license.

Over the last 6 months, GPT-J has attracted a lot of interest from researchers, data scientists, and even software developers, but it remained very difficult to deploy GPT-J into production for real-world use cases and products.

There are some hosted solutions for using GPT-J for production workloads, like the Hugging Face Inference API, or for experimenting using EleutherAI's 6b playground, but there are fewer examples of how to easily deploy it into your own environment.

In this blog post, you will learn how to easily deploy GPT-J using Amazon SageMaker and the Hugging Face Inference Toolkit with a few lines of code for scalable, reliable, and secure real-time inference using a regular-sized GPU instance with an NVIDIA T4 (~$500/month).

But before we get into it, I want to explain why deploying GPT-J into production is difficult.




Background

The weights of the 6 billion parameter model have a ~24GB memory footprint. To load the model in float32, you would need at least 2x the model size in CPU RAM: 1x for the initial weights and another 1x to load the checkpoint. So for GPT-J, it would require at least 48GB of CPU RAM just to load the model.
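
As a quick back-of-the-envelope check (a minimal sketch, not from the original post), the numbers work out as follows:

# rough memory estimate for loading GPT-J in float32
num_parameters = 6e9          # ~6 billion parameters
bytes_per_param_fp32 = 4      # float32 uses 4 bytes per parameter

model_size_gb = num_parameters * bytes_per_param_fp32 / 1e9
print(f"model weights: ~{model_size_gb:.0f} GB")                # ~24 GB
print(f"CPU RAM needed to load: ~{2 * model_size_gb:.0f} GB")   # 2x -> ~48 GB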

To make the model more accessible, EleutherAI also provides float16 weights, and transformers has new options to reduce the memory footprint when loading large language models. Combining all of this, it should take roughly 12.1GB of CPU RAM to load the model.

from transformers import GPTJForCausalLM
import torch

# load the fp16 weights with a reduced CPU memory footprint
model = GPTJForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    revision="float16",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True
)

The caveat of this example is that it takes a very long time until the model is loaded into memory and ready for use. In my experiments, it took 3 minutes and 32 seconds to load the model with the code snippet above on a P3.2xlarge AWS EC2 instance (the model was not stored on disk). This duration can be reduced by storing the model on disk beforehand, which brings the load time down to 1 minute and 23 seconds, which is still very long for production workloads where you need to consider scaling and reliability.

For example, Amazon SageMaker has a 60s limit for requests to respond, meaning the model needs to be loaded and the predictions run within 60s, which in my opinion makes a lot of sense to keep the model/endpoint scalable and reliable for your workload. If you have longer-running predictions, you could use batch transform.
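
For longer-running predictions, a batch transform job could look roughly like the sketch below. This is an assumption-based sketch, not from the original post: it reuses the huggingface_model object created later in this post, and the S3 path to a JSON Lines input file is made up.

# create a batch transform job from the HuggingFaceModel defined later in this post
batch_job = huggingface_model.transformer(
    instance_count=1,
    instance_type="ml.g4dn.xlarge",
    strategy="SingleRecord",
)

# run the job on a JSON Lines file with one {"inputs": "..."} object per line (path is an assumption)
batch_job.transform(
    data="s3://my-bucket/gpt-j/input.jsonl",
    content_type="application/json",
    split_type="Line",
)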

In Transformers, models loaded with the from_pretrained method follow PyTorch's recommended practice, which takes around 1.97 seconds for BERT [REF]. PyTorch offers an additional, alternative way of saving and loading models using torch.save(model, PATH) and torch.load(PATH).

“Saving a model in this way will save the entire module using Python's pickle module. The disadvantage of this approach is that the serialized data is bound to the specific classes and the exact directory structure used when the model is saved.”

This means that when we save a model with transformers==4.13.2, it could potentially be incompatible when we try to load it with transformers==4.15.0. However, loading models this way reduces the loading time by ~12x, down to 0.166s for BERT.

Applying this to GPT-J means that we can reduce the loading time from 1 minute and 23 seconds down to 7.7 seconds, which is ~10.5x faster.
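
If you want to reproduce this kind of comparison yourself, here is a minimal sketch that times both loading paths for BERT; the exact numbers will vary with your hardware and disk caching.

import time
import torch
from transformers import AutoModel

# time the standard from_pretrained path
start = time.time()
model = AutoModel.from_pretrained("bert-base-uncased")
print(f"from_pretrained: {time.time() - start:.3f}s")

# save the whole module with torch.save and time torch.load
torch.save(model, "bert.pt")
start = time.time()
model = torch.load("bert.pt")
print(f"torch.load: {time.time() - start:.3f}s")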

Figure 1. Model load time of BERT and GPT-J



Tutorial

With this approach to saving and loading models, we achieve model loading performance for GPT-J that is compatible with production scenarios. But we need to keep one thing in mind:

Align the PyTorch and Transformers versions used when saving the model with torch.save(model, PATH) and when loading the model with torch.load(PATH) to avoid incompatibilities; a quick version check is sketched below.
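
One simple way to guard against a version mismatch is to record the versions at save time and assert them at load time. This is a minimal sketch; the pinned versions below are just the ones used later in this post.

import torch
import transformers

# record the versions used when saving the model ...
print(f"transformers=={transformers.__version__}, torch=={torch.__version__}")

# ... and assert them before calling torch.load(PATH) in your inference code
assert transformers.__version__ == "4.12.3", "load with the same transformers version used for saving"
assert torch.__version__.startswith("1.9"), "load with the same PyTorch version used for saving"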



Save GPT-J using torch.save

To create our torch.load() compatible model file, we load GPT-J using Transformers and the from_pretrained method, and then save it with torch.save().

from transformers import AutoTokenizer, GPTJForCausalLM
import torch

# load the fp16 model
model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16)

# save the whole model with torch.save
torch.save(model, "gptj.pt")

Now we are able to load our GPT-J model with torch.load() to run predictions.

from transformers import AutoTokenizer, pipeline
import torch

# load the model saved with torch.save
model = torch.load("gptj.pt")

# load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

# create a text-generation pipeline on GPU (device=0)
gen = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)

# run a prediction
gen("My Name is philipp")




Create model.tar.gz for the Amazon SageMaker real-time endpoint

Since we can now load our model quickly and run inference on it, let's deploy it to Amazon SageMaker.

There are two ways to deploy transformers to Amazon SageMaker. You can either “Deploy a model from the Hugging Face Hub” directly or “Deploy a model with model_data stored on S3”. Since we are not using the default Transformers method, we need to go with the second option and deploy our endpoint with the model stored on S3.
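
For reference only, the first option looks roughly like the sketch below. This is an assumption-based sketch and not what we use here, since deploying GPT-J straight from the Hub runs into the load-time problem described above; role is defined as in the deployment section later in this post.

from sagemaker.huggingface import HuggingFaceModel

# option 1: deploy directly from the Hugging Face Hub via environment variables
hub_model = HuggingFaceModel(
    env={"HF_MODEL_ID": "EleutherAI/gpt-j-6B", "HF_TASK": "text-generation"},
    role=role,                      # IAM role, created as in the deployment section below
    transformers_version="4.12.3",
    pytorch_version="1.9.1",
    py_version="py38",
)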

For this, we need to create a model.tar.gz artifact containing our model weights and the additional files we need for inference, e.g. tokenizer.json.

We provide uploaded and publicly accessible model.tar.gz artifacts, which can be used with the HuggingFaceModel to deploy GPT-J to Amazon SageMaker.

See “Deploy GPT-J as Amazon SageMaker Endpoint” for how to use them.

If you still want or need to create your own model.tar.gz, e.g. because of compliance guidelines, you can use the helper script convert_gptj.py for this purpose, which creates the model.tar.gz and uploads it to S3.


# clone the sample repository
git clone https://github.com/philschmid/amazon-sagemaker-gpt-j-sample.git

# change directory into it
cd amazon-sagemaker-gpt-j-sample

# install the requirements
pip3 install -r requirements.txt

# create the model.tar.gz and upload it to your S3 bucket
python3 convert_gptj.py --bucket_name {model_storage}

The convert_gptj.py script should print out an S3 URI similar to this: s3://hf-sagemaker-inference/gpt-j/model.tar.gz.
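
If you built a model.tar.gz some other way and only need to get it into S3, the sagemaker SDK's S3Uploader can be used for the upload step. This is a minimal sketch; the bucket name and local path are assumptions.

from sagemaker.s3 import S3Uploader

# upload a locally built model.tar.gz to your own S3 bucket (bucket and paths are assumptions)
model_uri = S3Uploader.upload(
    local_path="model.tar.gz",
    desired_s3_uri="s3://my-model-storage/gpt-j",
)
print(model_uri)  # e.g. s3://my-model-storage/gpt-j/model.tar.gz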



Deploy GPT-J as Amazon SageMaker Endpoint

To deploy our Amazon SageMaker Endpoint, we are going to use the Amazon SageMaker Python SDK and the HuggingFaceModel class.

The snippet below uses get_execution_role, which is only available inside Amazon SageMaker Notebook Instances or Studio. If you want to deploy a model outside of them, check the documentation.
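
Outside of SageMaker, a common pattern is to look up an existing SageMaker execution role via boto3 instead. This is a sketch under my own assumptions; the role name must match a role that already exists in your AWS account.

import boto3

# look up an existing SageMaker execution role by name (role name is an assumption)
iam = boto3.client("iam")
role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]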

The model_uri defines the location of our GPT-J model artifact. We are going to use the publicly available one provided by us.

from sagemaker.huggingface import HuggingFaceModel
import sagemaker

# IAM role with permissions to create an endpoint
role = sagemaker.get_execution_role()

# public S3 URI to the GPT-J artifact
model_uri = "s3://huggingface-sagemaker-models/transformers/4.12.3/pytorch/1.9.1/gpt-j/model.tar.gz"

# create the Hugging Face Model class
huggingface_model = HuggingFaceModel(
    model_data=model_uri,
    transformers_version='4.12.3',
    pytorch_version='1.9.1',
    py_version='py38',
    role=role,
)

# deploy the model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type='ml.g4dn.xlarge'
)

If you want to use your own model.tar.gz, just replace the model_uri with your S3 URI.

The deployment should take around 3-5 minutes.
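
As a side note, if your notebook kernel restarts you do not have to redeploy: you can re-attach a predictor to the running endpoint by name. This is a sketch; the endpoint name below is made up and can be found in the SageMaker console.

from sagemaker.huggingface import HuggingFacePredictor

# re-create a predictor for an already running endpoint (endpoint name is an assumption)
predictor = HuggingFacePredictor(endpoint_name="huggingface-pytorch-inference-2021-12-01-12-00-00-000")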



Run predictions

We can run predictions using the predictor instance created by our .deploy method. To send a request to our endpoint, we use predictor.predict with our inputs.

predictor.predict({
    "inputs": "Can you please let us know more details about your "
})

If you want to customize your predictions using additional kwargs like min_length, check out “Usage best practices” below.



Usage best practices

When using generative models, most of the time you want to configure or customize your prediction to fit your needs, for example by using beam search, configuring the max or min length of the generated sequence, or adjusting the temperature to reduce repetition. The Transformers library provides different strategies and kwargs to do this, and the Hugging Face Inference Toolkit offers the same functionality using the parameters attribute of your request payload. Below you can find examples of how to generate text without parameters, with beam search, and with custom configurations. If you want to learn about the different decoding strategies, check out this blog post.



Default request

This is an example of a default request using greedy search.

Inference time after the first request: 3s

predictor.predict({
    "inputs": "Can you please let us know more details about your "
})



Beam search request

This is an example of a request using beam search with 5 beams.

Inference time after the first request: 3.3s

predictor.predict({
    "inputs": "Can you please let us know more details about your ",
    "parameters": {
        "num_beams": 5,
    }
})



Parameterized request

This is an example of a request using custom parameters, e.g. max_length for generating a sequence of up to 512 tokens and temperature to control the randomness.

Inference time after the first request: 38s

predictor.predict({
    "inputs": "Can you please let us know more details about your ",
    "parameters": {
        "max_length": 512,
        "temperature": 0.9,
    }
})



Few-Shot example (advanced)

This is an example of how you could use eos_token_id to stop the generation on a certain token, e.g. \n, . or ### for few-shot predictions. Below is a few-shot example for generating tweets for keywords.

Inference time after the first request: 15-45s

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

# stop generating when the end sequence (###) is produced
end_sequence = "###"
temperature = 4
max_generated_token_length = 25
prompt = """key: markets
tweet: Take feedback from nature and markets, not from people.
###
key: children
tweet: Maybe we die so we can come back as children.
###
key: startups
tweet: Startups shouldn’t worry about how to put out fires, they should worry about how to start them.
###
key: hugging face
tweet:"""

response = predictor.predict({
    "inputs": prompt,
    "parameters": {
        "max_length": int(len(prompt) + max_generated_token_length),
        "temperature": float(temperature),
        "eos_token_id": int(tokenizer.convert_tokens_to_ids(end_sequence)),
        "return_full_text": False
    }
})
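
Because generation stops on the ### token, the returned text can still contain the end sequence. A small post-processing step keeps only the generated tweet; this is a sketch assuming the text-generation pipeline's list-of-dicts response format.

# the endpoint returns a list of {"generated_text": ...} dicts
raw = response[0]["generated_text"]

# cut at the end sequence (###) and strip whitespace to keep only the generated tweet
tweet = raw.split(end_sequence)[0].strip()
print(tweet)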

To delete your endpoint, you can run:

predictor.delete_endpoint()



Conclusion

We successfully managed to deploy GPT-J, a 6 billion parameter language model created by EleutherAI, using Amazon SageMaker. We reduced the model load time from 3.5 minutes down to 8 seconds to be able to run scalable, reliable inference.

Remember that using torch.save() and torch.load() can create incompatibility issues. If you want to learn more about scaling out your Amazon SageMaker Endpoints, check out my other blog post: “MLOps: End-to-End Hugging Face Transformers with the Hub & SageMaker Pipelines”.


Thanks for reading! If you have any questions, feel free to contact me through GitHub or on the forum. You can also connect with me on Twitter or LinkedIn.




