Speed up a World of LLMs on Hugging Face with NVIDIA NIM

Neal Vaidya


AI builders want a choice of the latest large language model (LLM) architectures and specialized variants to use in AI agents and other applications, but handling all that variety can slow testing and deployment pipelines. In particular, managing and optimizing different inference software frameworks to achieve the best performance across varied LLMs and serving requirements is a time-consuming bottleneck to getting performant AI applications into the hands of end users.

NVIDIA AI customers and ecosystem partners leverage NVIDIA NIM inference microservices to streamline deployment of the latest AI models on NVIDIA accelerated infrastructure, including LLMs, multimodal, and domain-specific models from NVIDIA, Meta, Mistral AI, Google, and many other innovative model builders. We’ve seen customers and partners deliver more innovation, faster, with a simplified, reliable approach to model deployment, and today we’re excited to unlock over 100,000 LLMs on Hugging Face for rapid, reliable deployment with NIM.



A Single NIM Microservice for Deploying a Broad Range of LLMs

NIM now provides a single Docker container for deploying a broad range of LLMs supported by leading inference frameworks from NVIDIA and the community, including NVIDIA TensorRT-LLM, vLLM, and SGLang. When an LLM is provided to the NIM container, it performs several steps for deployment and performance optimization, without manual configuration:

LLM adaptation phase: What NIM does

Model Evaluation: NIM automatically identifies the model’s format, including Hugging Face models, TensorRT-LLM checkpoints, or pre-built TensorRT-LLM engines, ensuring compatibility.
Architecture and Quantization Detection: NIM identifies the model’s architecture (e.g., Llama, Mistral) and quantization format (e.g., FP16, FP8, INT4).
Backend Selection: Based on this evaluation, NIM selects an inference backend (NVIDIA TensorRT-LLM, vLLM, or SGLang).
Performance Setup: NIM applies pre-configured settings for the chosen model and backend and then starts the inference server, reducing manual tuning effort.

Table 1. NVIDIA NIM LLM adaptation phases and functionality

The single NIM container supports common LLM weight formats, including the following (a sketch of how each format maps to NIM_MODEL_NAME follows this list):

  • Hugging Face Transformers Checkpoints: LLMs can be deployed directly from Hugging Face repositories with .safetensors files, removing the need for complex conversions.
  • GGUF Checkpoints: Quantized GGUF checkpoints for supported model architectures can be deployed directly from Hugging Face or from locally downloaded files.
  • TensorRT-LLM Checkpoints: Models packaged inside a trtllm_ckpt directory, optimized for TensorRT-LLM, can be deployed.
  • TensorRT-LLM Engines: Pre-built TensorRT-LLM engines from a trtllm_engine directory can be used for peak performance on NVIDIA GPUs.
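
As a rough sketch, these formats map onto the NIM_MODEL_NAME variable used in the examples below roughly as follows (the paths are placeholders; check the NIM documentation for the exact directory layout expected for TensorRT-LLM checkpoints and engines):

# Hugging Face repository, pulled at startup
NIM_MODEL_NAME="hf://<org>/<model-repo>"
# Locally downloaded Hugging Face checkpoint (.safetensors files)
NIM_MODEL_NAME="/path/to/model/dir"
# Directory containing a trtllm_ckpt/ TensorRT-LLM checkpoint
NIM_MODEL_NAME="/path/to/model/dir"
# Directory containing a trtllm_engine/ pre-built TensorRT-LLM engine
NIM_MODEL_NAME="/path/to/model/dir"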



Getting Started

To use NIM, ensure your environment has NVIDIA GPUs with appropriate drivers (CUDA 12.1+), Docker installed, an NVIDIA NGC account and API key for pulling NIM Docker images, and a Hugging Face account and API token for models requiring authentication. Learn more about environment prerequisites in the NIM documentation.
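
Before pulling the image, a quick sanity check along these lines can help confirm the prerequisites (a minimal sketch, assuming the NIM image is pulled from nvcr.io, the NGC container registry):

# confirm the GPUs and driver are visible on the host
nvidia-smi

# log in to the NGC registry so Docker can pull the NIM image
# (username is the literal string $oauthtoken; password is your NGC API key)
docker login nvcr.io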

Environment setup involves setting environment variables and creating a persistent cache directory. Make sure the nim_cache directory has correct Unix permissions, ideally owned by the same Unix user who launches the Docker container, to prevent permission issues. The commands below pass -u $(id -u) to manage this.
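
For example, creating the cache directory up front as the launching user keeps its ownership consistent with the -u $(id -u) flag used in the commands below:

mkdir -p "$(pwd)/nim_cache"   # owned by the current user, so the container can write to it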

For ease of use, let’s store some of the frequently used values in environment variables:


NIM_IMAGE=llm-nim
HF_TOKEN=<your Hugging Face access token>



Example 1: Deploying a Model

The following command deploys Codestral-22B directly from Hugging Face:

docker run --rm --gpus all \
  --shm-size=16GB \
  --network=host \
  -u $(id -u) \
  -v $(pwd)/nim_cache:/opt/nim/.cache \
  -v $(pwd):$(pwd) \
  -e HF_TOKEN=$HF_TOKEN \
  -e NIM_TENSOR_PARALLEL_SIZE=1 \
  -e NIM_MODEL_NAME="hf://mistralai/Codestral-22B-v0.1" \
  $NIM_IMAGE

For locally downloaded models, point NIM_MODEL_NAME to the path and mount the directory:

docker run --rm --gpus all \
  --shm-size=16GB \
  --network=host \
  -u $(id -u) \
  -v $(pwd)/nim_cache:/opt/nim/.cache \
  -v $(pwd):$(pwd) \
  -v /path/to/model/dir:/path/to/model/dir \
  -e HF_TOKEN=$HF_TOKEN \
  -e NIM_TENSOR_PARALLEL_SIZE=1 \
  -e NIM_MODEL_NAME="/path/to/model/dir/mistralai-Codestral-22B-v0.1" \
  $NIM_IMAGE

While deploying a model, feel free to inspect the output logs to get a sense of the choices NIM made during deployment. Deployed models are served at http://localhost:8000, with interactive API documentation at http://localhost:8000/docs.
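
As a quick smoke test (assuming the default port and NIM’s OpenAI-compatible API; the prompt and max_tokens value are arbitrary), you can list the served model ID and send a chat completion request:

# list the model IDs the server exposes
curl -s http://localhost:8000/v1/models

# send a chat completion request; the model field should match an ID returned above
# (shown here as the Hugging Face repository name)
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistralai/Codestral-22B-v0.1",
        "messages": [{"role": "user", "content": "Write a Python function that reverses a string."}],
        "max_tokens": 128
      }'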

Additional arguments are exposed by the underlying engine. You can inspect the full list of such arguments by running nim-run --help in the container, as shown below.

docker run --rm --gpus all \
  --network=host \
  -u $(id -u) \
  $NIM_IMAGE nim-run --help



Example 2: Specifying a Backend

To check which backends are compatible with a model, or to select a specific one, use list-model-profiles:

docker run --rm --gpus all \
  --shm-size=16GB \
  --network=host \
  -u $(id -u) \
  -v $(pwd)/nim_cache:/opt/nim/.cache \
  -v $(pwd):$(pwd) \
  -e HF_TOKEN=$HF_TOKEN \
  $NIM_IMAGE list-model-profiles --model "hf://meta-llama/Llama-3.1-8B-Instruct"

This command shows compatible profiles, including profiles for LoRA adapters. To deploy with a specific backend such as vLLM, set the NIM_MODEL_PROFILE environment variable to one of the profile IDs reported by list-model-profiles:

docker run --rm --gpus all \
  --shm-size=16GB \
  --network=host \
  -u $(id -u) \
  -v $(pwd)/nim_cache:/opt/nim/.cache \
  -v $(pwd):$(pwd) \
  -e HF_TOKEN=$HF_TOKEN \
  -e NIM_TENSOR_PARALLEL_SIZE=1 \
  -e NIM_MODEL_NAME="hf://meta-llama/Llama-3.1-8B-Instruct" \
  -e NIM_MODEL_PROFILE="e2f00b2cbfb168f907c8d6d4d40406f7261111fbab8b3417a485dcd19d10cc98" \
  $NIM_IMAGE



Example 3: Quantized Model Deployment

NIM also simplifies deploying quantized models. It automatically detects the quantization format (e.g., GGUF, AWQ) and selects a suitable backend, using the same standard deployment commands, as shown below.
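
Here, MODEL points to the quantized checkpoint to deploy; the value below is a placeholder rather than a specific recommended repository:

MODEL="hf://<org>/<quantized-model-repo>"   # e.g., a GGUF or AWQ checkpoint on Hugging Face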

docker run --rm --gpus all \
  --shm-size=16GB \
  --network=host \
  -u $(id -u) \
  -v $(pwd)/nim_cache:/opt/nim/.cache \
  -v $(pwd):$(pwd) \
  -e HF_TOKEN=$HF_TOKEN \
  -e NIM_TENSOR_PARALLEL_SIZE=1 \
  -e NIM_MODEL_NAME=$MODEL \
  $NIM_IMAGE

For advanced users, NIM offers customization through environment variables such as NIM_MAX_MODEL_LEN for context length. For large LLMs, NIM_TENSOR_PARALLEL_SIZE enables multi-GPU deployment. Ensure --shm-size is passed to Docker for multi-GPU communication.
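
As a sketch of how these options combine (the model, context length, and GPU count below are illustrative), a two-GPU deployment with a capped context length might look like this:

docker run --rm --gpus all \
  --shm-size=16GB \
  --network=host \
  -u $(id -u) \
  -v $(pwd)/nim_cache:/opt/nim/.cache \
  -v $(pwd):$(pwd) \
  -e HF_TOKEN=$HF_TOKEN \
  -e NIM_TENSOR_PARALLEL_SIZE=2 \
  -e NIM_MAX_MODEL_LEN=8192 \
  -e NIM_MODEL_NAME="hf://meta-llama/Llama-3.1-8B-Instruct" \
  $NIM_IMAGE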

The NIM container supports the broad range of LLMs covered by NVIDIA TensorRT-LLM, vLLM, and SGLang, including popular LLMs and specialized variants on Hugging Face. For more details on supported LLMs, see the documentation.



Build with Hugging Face and NVIDIA

NIM is designed to simplify AI model deployment on NVIDIA accelerated infrastructure, speeding innovation and time to value for high-performance AI builders and enterprise AI teams. We look forward to engagement and feedback from the Hugging Face community.

Get started with a developer example in an NVIDIA-hosted computing environment at build.nvidia.com.


