Setting Up Training, Fine-Tuning, and Inferencing of LLMs with NVIDIA GPUs and CUDA


The field of artificial intelligence (AI) has witnessed remarkable advancements in recent years, and at the heart of it lies the powerful combination of graphics processing units (GPUs) and the CUDA parallel computing platform.

Models such as GPT, BERT, and more recently Llama and Mistral are capable of understanding and generating human-like text with unprecedented fluency and coherence. However, training these models requires vast amounts of data and computational resources, making GPUs and CUDA indispensable tools in this endeavor.

This comprehensive guide will walk you through the process of setting up an NVIDIA GPU on Ubuntu, covering the installation of essential software components such as the NVIDIA driver, CUDA Toolkit, cuDNN, PyTorch, and more.

The Rise of CUDA-Accelerated AI Frameworks

GPU-accelerated deep learning has been fueled by the development of popular AI frameworks that leverage CUDA for efficient computation. Frameworks such as TensorFlow, PyTorch, and MXNet have built-in support for CUDA, enabling seamless integration of GPU acceleration into deep learning pipelines.

According to the NVIDIA Data Center Deep Learning Product Performance study, CUDA-accelerated deep learning models can achieve up to hundreds of times faster performance compared with CPU-based implementations.

NVIDIA’s Multi-Instance GPU (MIG) technology, introduced with the Ampere architecture, allows a single GPU to be partitioned into multiple secure instances, each with its own dedicated resources. This feature enables efficient sharing of GPU resources amongst multiple users or workloads, maximizing utilization and reducing overall costs.
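
For instance, on a MIG-capable GPU such as an A100 and with administrative privileges, MIG mode can be enabled and the available GPU-instance profiles listed with nvidia-smi; this is a minimal sketch, and the exact profiles depend on the GPU model:

# Enable MIG mode on GPU 0 (takes effect after a GPU reset or reboot)
sudo nvidia-smi -i 0 -mig 1
# List the GPU instance profiles this GPU supports
sudo nvidia-smi mig -lgip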

Accelerating LLM Inference with NVIDIA TensorRT

While GPUs have been instrumental in training LLMs, efficient inference is equally crucial for deploying these models in production environments. NVIDIA TensorRT, a high-performance deep learning inference optimizer and runtime, plays a significant role in accelerating LLM inference on CUDA-enabled GPUs.

According to NVIDIA’s benchmarks, TensorRT can provide up to 8x faster inference performance and 5x lower total cost of ownership compared with CPU-based inference for large language models like GPT-3.

NVIDIA’s commitment to developer tooling and open-source initiatives has been a driving force behind the widespread adoption of CUDA in the AI research community. Libraries such as cuDNN, cuBLAS, and NCCL are freely available (NCCL as open source), enabling researchers and developers to leverage the full potential of CUDA for their deep learning work.

Installation

When setting up an environment for AI development, the latest drivers and libraries may not always be the best choice. For instance, while the latest NVIDIA driver (545.xx) supports CUDA 12.3, PyTorch and other libraries may not yet support this version. Therefore, we will use driver version 535.146.02 with CUDA 12.2 to ensure compatibility.

Installation Steps

1. Install NVIDIA Driver

First, identify your GPU model. For this guide, we use an NVIDIA GPU. Visit the NVIDIA Driver Download page, select the appropriate driver for your GPU, and note the driver version.

To check for prebuilt GPU driver packages on Ubuntu, run:

sudo ubuntu-drivers list --gpgpu
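
Then install the driver. With driver branch 535 chosen for this guide, the following should work on Ubuntu 22.04 (the exact package name may differ depending on your release and on the output of the command above):

sudo apt install nvidia-driver-535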

Reboot your computer and confirm the installation:

nvidia-smi

2. Install CUDA Toolkit

The CUDA Toolkit provides the event environment for creating high-performance GPU-accelerated applications.

For a general (non-LLM/deep learning) setup, you can use:

sudo apt install nvidia-cuda-toolkit

However, to ensure compatibility with BitsAndBytes, we will follow these steps:
git clone https://github.com/TimDettmers/bitsandbytes.git
cd bitsandbytes/
bash install_cuda.sh 122 ~/local 1

Confirm the installation:

~/local/cuda-12.2/bin/nvcc --version

Set the environment variables:

export CUDA_HOME=/home/roguser/local/cuda-12.2/
export LD_LIBRARY_PATH=/home/roguser/local/cuda-12.2/lib64
export BNB_CUDA_VERSION=122
export CUDA_VERSION=122
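
These variables only apply to the current shell session. To make them persist, you can append the same export lines to your shell profile (for example ~/.bashrc) and reload it:

echo 'export CUDA_HOME=/home/roguser/local/cuda-12.2/' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/home/roguser/local/cuda-12.2/lib64' >> ~/.bashrc
echo 'export BNB_CUDA_VERSION=122' >> ~/.bashrc
echo 'export CUDA_VERSION=122' >> ~/.bashrc
source ~/.bashrc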

3. Install cuDNN

Download the cuDNN package from the NVIDIA Developer website. Install it with:

sudo apt install ./cudnn-local-repo-ubuntu2204-8.9.7.29_1.0-1_amd64.deb

Follow the instructions to add the keyring:

sudo cp /var/cudnn-local-repo-ubuntu2204-8.9.7.29/cudnn-local-08A7D361-keyring.gpg /usr/share/keyrings/

Install the cuDNN libraries:

sudo apt update
sudo apt install libcudnn8 libcudnn8-dev libcudnn8-samples
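
To confirm that the cuDNN packages were installed, you can query dpkg (the exact version string will depend on the package you downloaded):

dpkg -l | grep libcudnn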

4. Setup Python Virtual Environment

Ubuntu 22.04 comes with Python 3.10. Install venv:

sudo apt-get install python3-pip
sudo apt install python3.10-venv

Create and activate the virtual environment:

cd
mkdir test-gpu
cd test-gpu
python3 -m venv venv
source venv/bin/activate

5. Install BitsAndBytes from Source

Navigate to the BitsAndBytes directory and build from source:

cd ~/bitsandbytes
CUDA_HOME=/home/roguser/local/cuda-12.2/ \
LD_LIBRARY_PATH=/home/roguser/local/cuda-12.2/lib64 \
BNB_CUDA_VERSION=122 \
CUDA_VERSION=122 \
make cuda12x
CUDA_HOME=/home/roguser/local/cuda-12.2/ \
LD_LIBRARY_PATH=/home/roguser/local/cuda-12.2/lib64 \
BNB_CUDA_VERSION=122 \
CUDA_VERSION=122 \
python setup.py install
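
BitsAndBytes depends on PyTorch at runtime, so complete step 6 first; afterwards, recent versions of the library provide a diagnostic entry point you can run from the activated virtual environment to check that your CUDA build is picked up:

python -m bitsandbytes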

6. Install PyTorch

Install PyTorch with the following command:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
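
A quick way to confirm that PyTorch sees the GPU is to query the CUDA runtime from Python:

python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda, torch.cuda.get_device_name(0))"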

7. Install Hugging Face and Transformers

Install the transformers and accelerate libraries:

pip install transformers
pip install accelerate
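
As a quick smoke test (a minimal sketch using the small gpt2 checkpoint; run it from the activated virtual environment), you can load a text-generation pipeline on the GPU:

from transformers import pipeline

# device=0 places the pipeline on the first CUDA device
generator = pipeline("text-generation", model="gpt2", device=0)
print(generator("CUDA-accelerated inference is", max_new_tokens=20)[0]["generated_text"])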

The Power of Parallel Processing

At their core, GPUs are highly parallel processors designed to handle thousands of concurrent threads efficiently. This architecture makes them well suited to the computationally intensive tasks involved in training deep learning models, including LLMs. The CUDA platform, developed by NVIDIA, provides a software environment that allows developers to harness the full potential of these GPUs, enabling them to write code that leverages the parallel processing capabilities of the hardware.

Accelerating LLM Training with GPUs and CUDA

Training large language models is a computationally demanding task that requires processing vast amounts of text data and performing numerous matrix operations. GPUs, with their thousands of cores and high memory bandwidth, are ideally suited to these tasks. By leveraging CUDA, developers can optimize their code to take advantage of the parallel processing capabilities of GPUs, significantly reducing the time required to train LLMs.

For instance, the training of GPT-3, one of the largest language models to date, was made possible through the use of thousands of NVIDIA GPUs running CUDA-optimized code. This allowed the model to be trained on an unprecedented amount of data, resulting in its impressive performance on natural language tasks.

import torch
import torch.optim as optim
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load pre-trained GPT-2 model and tokenizer
model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no padding token by default

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Define training data and hyperparameters
train_data = [...]  # Your training data (a list of text strings)
batch_size = 32
num_epochs = 10
learning_rate = 5e-5

# Define the optimizer (the model computes its own cross-entropy loss when labels are passed)
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Training loop
model.train()
for epoch in range(num_epochs):
    for i in range(0, len(train_data), batch_size):
        # Tokenize the batch of input sequences and move it to the GPU
        batch = train_data[i:i+batch_size]
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        inputs = inputs.to(device)
        # Forward pass: for causal language modeling, the labels are the input ids
        outputs = model(**inputs, labels=inputs["input_ids"])
        loss = outputs.loss
        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {loss.item()}')

In this example code snippet, we demonstrate training a GPT-2 language model using PyTorch on a CUDA-enabled GPU. The model is loaded onto the GPU (if available), and the training loop leverages the parallelism of the GPU to perform efficient forward and backward passes, accelerating the training process.

CUDA-Accelerated Libraries for Deep Learning

In addition to the CUDA platform itself, NVIDIA and the open-source community have developed a range of CUDA-accelerated libraries that enable efficient implementation of deep learning models, including LLMs. These libraries provide optimized implementations of common operations, such as matrix multiplications, convolutions, and activation functions, allowing developers to focus on the model architecture and training process rather than low-level optimization.

One such library is cuDNN (CUDA Deep Neural Network library), which provides highly tuned implementations of standard routines used in deep neural networks. By leveraging cuDNN, developers can significantly accelerate the training and inference of their models, achieving performance gains of up to several orders of magnitude compared with CPU-based implementations.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.cuda.amp import autocast

class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels))

    def forward(self, x):
        # Run the block under autocast so eligible ops execute in mixed precision
        with autocast():
            out = F.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            out += self.shortcut(x)
            out = F.relu(out)
        return out

In this code snippet, we define a residual block for a convolutional neural network (CNN) using PyTorch. The autocast context manager from PyTorch’s Automatic Mixed Precision (AMP) package is used to enable mixed-precision training, which can provide significant performance gains on CUDA-enabled GPUs while maintaining high accuracy. The convolution and batch-normalization operations inside the block are accelerated by cuDNN, ensuring efficient execution on the GPU.
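
In a full training loop, autocast is typically paired with a gradient scaler so that small gradients do not underflow in float16. A minimal sketch of one training step, assuming model, optimizer, criterion, and train_loader are already defined, looks like this:

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
for inputs, targets in train_loader:
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad()
    # Run the forward pass in mixed precision
    with autocast():
        outputs = model(inputs)
        loss = criterion(outputs, targets)
    # Scale the loss before backprop, then step the optimizer through the scaler
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()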

Multi-GPU and Distributed Training for Scalability

As LLMs and deep learning models continue to grow in size and complexity, the computational requirements for training these models also increase. To address this challenge, researchers and developers have turned to multi-GPU and distributed training techniques, which allow them to leverage the combined processing power of multiple GPUs across multiple machines.

CUDA and associated libraries, such as NCCL (NVIDIA Collective Communications Library), provide efficient communication primitives that enable seamless data transfer and synchronization across multiple GPUs, making distributed training possible at an unprecedented scale.


import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize distributed training (one process per GPU, NCCL backend)
dist.init_process_group(backend='nccl', init_method='...')
# On a single node with one process per GPU, the global rank doubles as the local rank
local_rank = dist.get_rank()
torch.cuda.set_device(local_rank)

# Create the model and move it to this process's GPU
model = MyModel().cuda()

# Wrap the model with DDP so gradients are synchronized across processes
model = DDP(model, device_ids=[local_rank])

# Training loop (distributed)
for epoch in range(num_epochs):
    for data in train_loader:
        inputs, targets = data
        inputs = inputs.cuda(non_blocking=True)
        targets = targets.cuda(non_blocking=True)
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

In this example, we demonstrate distributed training using PyTorch’s DistributedDataParallel (DDP) module. The model is wrapped in DDP, which automatically handles data parallelism, gradient synchronization, and communication across multiple GPUs using NCCL. This approach enables efficient scaling of the training process across multiple machines, allowing researchers and developers to train larger and more complex models in a reasonable amount of time.
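
To launch such a script, one process is started per GPU. With recent PyTorch releases this is typically done with torchrun, which also sets the environment variables that init_process_group needs when the default env:// initialization is used; the script name below is an assumption:

torchrun --nproc_per_node=4 train.py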

Deploying Deep Learning Models with CUDA

While GPUs and CUDA have primarily been used for training deep learning models, they are also crucial for efficient deployment and inference. As deep learning models become increasingly complex and resource-intensive, GPU acceleration is essential for achieving real-time performance in production environments.

NVIDIA’s TensorRT is a high-performance deep learning inference optimizer and runtime that provides low-latency and high-throughput inference on CUDA-enabled GPUs. TensorRT can optimize and accelerate models trained in frameworks like TensorFlow, PyTorch, and MXNet, enabling efficient deployment on various platforms, from embedded systems to data centers.

import tensorrt as trt

# Load the pre-trained model (assumed to have been exported to ONNX at model_path)
model = load_model(...)

# Create a TensorRT builder and network
logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network()
parser = trt.OnnxParser(network, logger)

# Parse the ONNX model and build an optimized engine
success = parser.parse_from_file(model_path)
engine = builder.build_cuda_engine(network)

# Run inference on the GPU
context = engine.create_execution_context()
inputs, outputs, bindings, stream = allocate_buffers(engine)

# Set input data and run inference
set_input_data(inputs, input_data)
context.execute_async_v2(bindings=bindings, stream_handle=stream.ptr)

# Process the output
# ...

In this example, we demonstrate using TensorRT to deploy a pre-trained deep learning model on a CUDA-enabled GPU. The model is first parsed and optimized by TensorRT, which generates a highly optimized inference engine tailored to the specific model and hardware. This engine can then be used to perform efficient inference on the GPU, leveraging CUDA for accelerated computation.
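
The step not shown above is producing the ONNX file that TensorRT parses. A minimal sketch of exporting a trained PyTorch model to ONNX, assuming model and a representative input shape, could look like this:

import torch

# Export the trained PyTorch model to ONNX so TensorRT can parse it
dummy_input = torch.randn(1, 3, 224, 224).cuda()  # example input shape; adjust for your model
torch.onnx.export(
    model,            # the trained model, in eval mode and on the GPU
    dummy_input,      # a representative example input
    "model.onnx",     # output path, corresponding to model_path above
    input_names=["input"],
    output_names=["output"],
    opset_version=17,
)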

Conclusion

The combination of GPUs and CUDA has been instrumental in driving the advancements in large language models, computer vision, speech recognition, and various other domains of deep learning. By harnessing the parallel processing capabilities of GPUs and the optimized libraries provided by CUDA, researchers and developers can train and deploy increasingly complex models with high efficiency.

As the field of AI continues to evolve, the importance of GPUs and CUDA will only grow. With even more powerful hardware and software optimizations, we can expect to see further breakthroughs in the development and deployment of AI systems, pushing the boundaries of what is possible.
