A running document to showcase how to deploy and fine-tune DeepSeek R1 models with Hugging Face on AWS.
What’s DeepSeek-R1?
If you've ever struggled with a tough math problem, you know how useful it is to think a little longer and work through it carefully. OpenAI's o1 model showed that when LLMs are trained to do the same, by using more compute during inference, they get significantly better at solving reasoning tasks like mathematics, coding, and logic.
However, the recipe behind OpenAI's reasoning models has been a well-kept secret. That is, until last week, when DeepSeek released their DeepSeek-R1 model and promptly broke the internet (and the stock market!).
DeepSeek AI open-sourced DeepSeek-R1-Zero, DeepSeek-R1, and six dense models distilled from DeepSeek-R1, based on the Llama and Qwen architectures. You can find them all in the DeepSeek R1 collection.
We collaborate with Amazon Web Services to make it easier for developers to deploy the latest Hugging Face models on AWS services to build better generative AI applications.
Let's review how you can deploy and fine-tune DeepSeek R1 models with Hugging Face on AWS.
Deploy DeepSeek R1 models
Deploy on AWS with Hugging Face Inference Endpoints
Hugging Face Inference Endpoints offers an easy and secure way to deploy Machine Learning models on dedicated compute for use in production on AWS. Inference Endpoints empowers developers and data scientists alike to create AI applications without managing infrastructure: simplifying the deployment process to a few clicks, including handling large volumes of requests with autoscaling, reducing infrastructure costs with scale-to-zero, and offering advanced security.
With Inference Endpoints, you can deploy any of the six distilled models from DeepSeek-R1, as well as a quantized version of DeepSeek R1 made by Unsloth: https://huggingface.co/unsloth/DeepSeek-R1-GGUF.
On the model page, click on Deploy, then on HF Inference Endpoints. You will be redirected to the Inference Endpoints page, where we have pre-selected an optimized inference container and the recommended hardware to run the model. Once you have created your endpoint, you can send queries to DeepSeek R1 for $8.3 per hour on AWS 🤯.
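Once the endpoint is running, you can query it programmatically. Here is a minimal sketch using the huggingface_hub client; the endpoint URL and token below are placeholders that you will find on your endpoint page.
from huggingface_hub import InferenceClient

# Placeholders: copy the URL from your Inference Endpoint page and use a token with access to it
client = InferenceClient(
    "https://<your-endpoint>.endpoints.huggingface.cloud",
    token="hf_xxx",
)

# R1 models reason out loud inside <think> ... </think> tags before giving the final answer
output = client.text_generation(
    "What is 1+1? Think step by step.",
    max_new_tokens=512,
)
print(output)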
You can find DeepSeek R1 and its distilled models, as well as other popular open LLMs, ready to deploy on optimized configurations in the Inference Endpoints Model Catalog.
| Note: The team is working on enabling deployment of DeepSeek models on Inferentia instances. Stay tuned!
Deploy on Amazon Bedrock Marketplace
You can deploy the DeepSeek distilled models on Amazon Bedrock via the Marketplace, which deploys an endpoint in Amazon SageMaker AI under the hood. Here's a video showing how to navigate through the AWS console:
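Once the Bedrock Marketplace deployment is live, you can invoke the resulting endpoint with the bedrock-runtime client. This is an illustrative sketch: the endpoint ARN is a placeholder, and the payload assumes the TGI-style schema served by the underlying SageMaker endpoint.
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Placeholder: use the endpoint ARN shown in your Bedrock Marketplace deployment details
endpoint_arn = "arn:aws:sagemaker:us-east-1:111122223333:endpoint/deepseek-r1-distill-llama-70b"

response = bedrock_runtime.invoke_model(
    modelId=endpoint_arn,
    body=json.dumps({
        "inputs": "What is the capital of France?",
        "parameters": {"max_new_tokens": 128, "temperature": 0.7},
    }),
)
print(json.loads(response["body"].read()))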
Deploy on Amazon SageMaker AI with Hugging Face LLM DLCs
DeepSeek R1 on GPUs
| Note: The team is working on enabling DeepSeek-R1 deployment on Amazon SageMaker AI with the Hugging Face LLM DLCs on GPU. Stay tuned!
Distilled models on GPUs
You can deploy the DeepSeek distilled models on Amazon SageMaker AI with Hugging Face LLM DLCs, using SageMaker JumpStart directly or the SageMaker Python SDK.
Here's a video showing how to navigate through the AWS console:
Now that we have seen how to deploy using JumpStart, let's walk through the SageMaker Python SDK deployment of DeepSeek-R1-Distill-Llama-70B.
Code snippets are available on the model page under the Deploy button!
First, let's go through a few prerequisites. Make sure you have a SageMaker Domain configured, sufficient quota in SageMaker, and a JupyterLab space. For DeepSeek-R1-Distill-Llama-70B, you should raise the default quota for ml.g6.48xlarge for endpoint usage to 1.
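If you are unsure about your current limits, you can check (and request) that quota with the Service Quotas API. This is a sketch, assuming the quota is listed under the SageMaker service with the name "ml.g6.48xlarge for endpoint usage"; verify the exact name in your account.
import boto3

quotas = boto3.client("service-quotas")

# Find the SageMaker endpoint-usage quota for ml.g6.48xlarge (quota name assumed, check your account)
paginator = quotas.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="sagemaker"):
    for quota in page["Quotas"]:
        if quota["QuotaName"] == "ml.g6.48xlarge for endpoint usage":
            print(quota["QuotaName"], "->", quota["Value"])
            # Uncomment to request an increase to 1 if the current value is 0
            # quotas.request_service_quota_increase(
            #     ServiceCode="sagemaker",
            #     QuotaCode=quota["QuotaCode"],
            #     DesiredValue=1,
            # )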
For reference, here are the hardware configurations we recommend for each of the distilled variants:
| Model | Instance Type | # of GPUs per replica |
|---|---|---|
| deepseek-ai/DeepSeek-R1-Distill-Llama-70B | ml.g6.48xlarge | 8 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | ml.g6.12xlarge | 4 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | ml.g6.12xlarge | 4 |
| deepseek-ai/DeepSeek-R1-Distill-Llama-8B | ml.g6.2xlarge | 1 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | ml.g6.2xlarge | 1 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | ml.g6.2xlarge | 1 |
Once in a notebook, make sure to install the latest version of the SageMaker SDK.
!pip install sagemaker --upgrade
Then, instantiate a sagemaker_session, which is used to determine the current region and execution role.
import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
sess = sagemaker.Session()  # SageMaker session, used to determine the current region

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]
Create the SageMaker Model object with the Python SDK:
model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-70B"
model_name = model_id.split("/")[-1].lower()

hub = {
    "HF_MODEL_ID": model_id,
    "SM_NUM_GPUS": json.dumps(8),
}

huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="3.0.1"),
    env=hub,
    role=role,
)
Deploy the model to a SageMaker endpoint and test the endpoint:
endpoint_name = f"{model_name}-ep"
predictor = huggingface_model.deploy(
    endpoint_name=endpoint_name,
    initial_instance_count=1,
    instance_type="ml.g6.48xlarge",
    container_startup_health_check_timeout=2400,
)
predictor.predict({"inputs": "What's the meaning of life?"})
That’s it, you deployed a Llama 70B reasoning model!
Since you are using a TGI v3 container under the hood, the most performant parameters for the given hardware will be automatically selected.
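If you prefer to pin these limits yourself, one option is to set the usual TGI environment variables in the hub configuration before deploying. A sketch with illustrative values (omit them to keep the auto-tuned defaults):
hub = {
    "HF_MODEL_ID": model_id,
    "SM_NUM_GPUS": json.dumps(8),
    # Optional TGI overrides; values below are illustrative, not tuned recommendations
    "MAX_INPUT_TOKENS": "8192",
    "MAX_TOTAL_TOKENS": "16384",
}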
Make sure you delete the endpoint once you have finished testing it.
predictor.delete_model()
predictor.delete_endpoint()
Distilled models on Neuron
Let's now walk through the deployment of DeepSeek-R1-Distill-Llama-70B on a Neuron instance, such as AWS Trainium 2 and AWS Inferentia 2.
Code snippets are available on the model page under the Deploy button!
The prerequisites to deploy to a Neuron instance are the same. Make sure you have a SageMaker Domain configured, sufficient quota in SageMaker, and a JupyterLab space. For DeepSeek-R1-Distill-Llama-70B, you should raise the default quota for ml.inf2.48xlarge for endpoint usage to 1.
Then, instantiate a sagemaker_session, which is used to determine the current region and execution role.
import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
sess = sagemaker.Session()  # SageMaker session, used to determine the current region

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]
Create the SageMaker Model object with the Python SDK:
image_uri = get_huggingface_llm_image_uri("huggingface-neuronx", version="0.0.25")
model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-70B"
model_name = model_id.split("/")[-1].lower()

hub = {
    "HF_MODEL_ID": model_id,
    "HF_NUM_CORES": "24",
    "HF_AUTO_CAST_TYPE": "bf16",
    "MAX_BATCH_SIZE": "4",
    "MAX_INPUT_TOKENS": "3686",
    "MAX_TOTAL_TOKENS": "4096",
}

huggingface_model = HuggingFaceModel(
    image_uri=image_uri,
    env=hub,
    role=role,
)
Deploy the model to a SageMaker endpoint and test the endpoint:
endpoint_name = f"{model_name}-ep"
predictor = huggingface_model.deploy(
    endpoint_name=endpoint_name,
    initial_instance_count=1,
    instance_type="ml.inf2.48xlarge",
    container_startup_health_check_timeout=3600,
    volume_size=512,
)

predictor.predict(
    {
        "inputs": "What is the capital of France?",
        "parameters": {
            "do_sample": True,
            "max_new_tokens": 128,
            "temperature": 0.7,
            "top_k": 50,
            "top_p": 0.95,
        },
    }
)
That's it, you deployed a Llama 70B reasoning model on a Neuron instance! Under the hood, it downloaded a pre-compiled model from Hugging Face to speed up the endpoint start time.
Make sure you delete the endpoint once you have finished testing it.
predictor.delete_model()
predictor.delete_endpoint()
Deploy on EC2 Neuron with the Hugging Face Neuron Deep Learning AMI
This guide details how to export, deploy, and run DeepSeek-R1-Distill-Llama-70B on an inf2.48xlarge AWS EC2 instance.
First, let's go through a few prerequisites. Make sure you have subscribed to the Hugging Face Neuron Deep Learning AMI on the AWS Marketplace. It provides you with all the necessary dependencies to train and deploy Hugging Face models on Trainium & Inferentia. Then, launch an inf2.48xlarge instance in EC2 with the AMI and connect through SSH. You can check our step-by-step guide if you have never done it before.
Once connected to the instance, you can deploy the model on an endpoint with this command:
docker run -p 8080:80 \
  -v $(pwd)/data:/data \
  --device=/dev/neuron0 \
  --device=/dev/neuron1 \
  --device=/dev/neuron2 \
  --device=/dev/neuron3 \
  --device=/dev/neuron4 \
  --device=/dev/neuron5 \
  --device=/dev/neuron6 \
  --device=/dev/neuron7 \
  --device=/dev/neuron8 \
  --device=/dev/neuron9 \
  --device=/dev/neuron10 \
  --device=/dev/neuron11 \
  -e HF_BATCH_SIZE=4 \
  -e HF_SEQUENCE_LENGTH=4096 \
  -e HF_AUTO_CAST_TYPE="bf16" \
  -e HF_NUM_CORES=24 \
  ghcr.io/huggingface/neuronx-tgi:latest \
  --model-id deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
  --max-batch-size 4 \
  --max-total-tokens 4096
It will take a few minutes to download the compiled model from the Hugging Face cache and launch a TGI endpoint.
Then, you can test the endpoint:
curl localhost:8080/generate \
  -X POST \
  -d '{"inputs":"Why is the sky dark at night?"}' \
  -H 'Content-Type: application/json'
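Recent TGI versions also expose an OpenAI-compatible Messages API on the same port, so you can point an OpenAI client at the instance. A minimal sketch, assuming the openai Python package is installed:
from openai import OpenAI

# The local TGI endpoint serves /v1/chat/completions; the API key is not verified
client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")

chat = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
    messages=[{"role": "user", "content": "Why is the sky dark at night?"}],
    max_tokens=256,
)
print(chat.choices[0].message.content)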
Make sure you pause the EC2 instance once you are done testing it.
| Note: The team is working on enabling DeepSeek R1 deployment on Trainium & Inferentia with the Hugging Face Neuron Deep Learning AMI. Stay tuned!
Fine-tune DeepSeek R1 models
Fine-tune on Amazon SageMaker AI with Hugging Face Training DLCs
| Note: The team is working on enabling fine-tuning of all DeepSeek models with the Hugging Face Training DLCs. Stay tuned!
Fine-tune on EC2 Neuron with the Hugging Face Neuron Deep Learning AMI
| Note: The team is working on enabling fine-tuning of all DeepSeek models with the Hugging Face Neuron Deep Learning AMI. Stay tuned!





