The field of artificial intelligence (AI) has witnessed remarkable advancements in recent years, and at the heart of it lies the powerful combination of graphics processing units (GPUs) and the CUDA parallel computing platform.
Models such as GPT, BERT, and more recently Llama and Mistral are capable of understanding and generating human-like text with unprecedented fluency and coherence. However, training these models requires vast amounts of data and computational resources, making GPUs and CUDA indispensable tools in this endeavor.
This comprehensive guide will walk you through the process of setting up an NVIDIA GPU on Ubuntu, covering the installation of essential software components such as the NVIDIA driver, the CUDA Toolkit, cuDNN, PyTorch, and more.
The Rise of CUDA-Accelerated AI Frameworks
GPU-accelerated deep learning has been fueled by the development of popular AI frameworks that leverage CUDA for efficient computation. Frameworks such as TensorFlow, PyTorch, and MXNet have built-in support for CUDA, enabling seamless integration of GPU acceleration into deep learning pipelines.
According to the NVIDIA Data Center Deep Learning Product Performance study, CUDA-accelerated deep learning models can achieve up to hundreds of times faster performance compared to CPU-based implementations.
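To get a rough feel for this kind of speedup on your own hardware, you can run a quick PyTorch comparison like the sketch below. The matrix size, repeat count, and resulting ratio are illustrative choices, not figures from NVIDIA’s study; absolute numbers will vary widely between machines:

import time
import torch

def time_matmul(device: str, size: int = 4096, repeats: int = 10) -> float:
    """Time repeated matrix multiplications on the given device."""
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)
    # Warm-up run so one-time kernel launch overhead is not measured
    torch.matmul(a, b)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        torch.matmul(a, b)
    if device == "cuda":
        torch.cuda.synchronize()  # wait for all queued GPU work to finish
    return (time.perf_counter() - start) / repeats

cpu_time = time_matmul("cpu")
print(f"CPU: {cpu_time:.4f} s per matmul")
if torch.cuda.is_available():
    gpu_time = time_matmul("cuda")
    print(f"GPU: {gpu_time:.4f} s per matmul ({cpu_time / gpu_time:.1f}x faster)")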
NVIDIA’s Multi-Instance GPU (MIG) technology, introduced with the Ampere architecture, allows a single GPU to be partitioned into multiple secure instances, each with its own dedicated resources. This feature enables efficient sharing of GPU resources amongst multiple users or workloads, maximizing utilization and reducing overall costs.
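MIG instances are created and managed with nvidia-smi rather than from application code, but once an instance is exposed to a process (for example via CUDA_VISIBLE_DEVICES), frameworks such as PyTorch treat it like any other CUDA device. A minimal sketch for enumerating whatever devices are visible to the current process:

import torch

# Each visible device may be a full GPU or a MIG instance, depending on
# how CUDA_VISIBLE_DEVICES is configured for this process.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"Device {i}: {props.name}, "
          f"{props.total_memory / 1024**3:.1f} GiB, "
          f"{props.multi_processor_count} SMs")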
Accelerating LLM Inference with NVIDIA TensorRT
While GPUs have been instrumental in training LLMs, efficient inference is equally crucial for deploying these models in production environments. NVIDIA TensorRT, a high-performance deep learning inference optimizer and runtime, plays a significant role in accelerating LLM inference on CUDA-enabled GPUs.
According to NVIDIA’s benchmarks, TensorRT can provide up to 8x faster inference performance and 5x lower total cost of ownership compared to CPU-based inference for large language models like GPT-3.
NVIDIA’s investment in the developer ecosystem has been a driving force behind the widespread adoption of CUDA in the AI research community. Libraries such as cuDNN, cuBLAS, and NCCL are freely available to developers (with NCCL also released as open source), enabling researchers and developers to leverage the full potential of CUDA for their deep learning work.
Installation
When setting up an AI development environment, using the latest drivers and libraries may not always be the best choice. For instance, while the latest NVIDIA driver (545.xx) supports CUDA 12.3, PyTorch and other libraries might not yet support this version. Therefore, we will use driver version 535.146.02 with CUDA 12.2 to ensure compatibility.
Installation Steps
1. Install NVIDIA Driver
First, identify your GPU model. Visit the NVIDIA Driver Download page, select the appropriate driver for your GPU, and note the driver version.
To check for prebuilt GPU driver packages on Ubuntu, run:
sudo ubuntu-drivers list --gpgpu
Install the listed driver that matches the version chosen above (for example, sudo apt install nvidia-driver-535), then reboot your computer and confirm the installation:
nvidia-smi
2. Install CUDA Toolkit
The CUDA Toolkit provides the development environment for creating high-performance GPU-accelerated applications.
For a non-LLM/deep learning setup, you can use:
sudo apt install nvidia-cuda-toolkit

However, to ensure compatibility with BitsAndBytes, we will follow these steps:

git clone https://github.com/TimDettmers/bitsandbytes.git
cd bitsandbytes/
bash install_cuda.sh 122 ~/local 1
Confirm the installation:
~/local/cuda-12.2/bin/nvcc --version
Set the environment variables (replace /home/roguser with your own home directory):
export CUDA_HOME=/home/roguser/local/cuda-12.2/
export LD_LIBRARY_PATH=/home/roguser/local/cuda-12.2/lib64
export BNB_CUDA_VERSION=122
export CUDA_VERSION=122
3. Install cuDNN
Download the cuDNN package from the NVIDIA Developer website. Install it with:
sudo apt install ./cudnn-local-repo-ubuntu2204-8.9.7.29_1.0-1_amd64.deb
Follow the instructions to add the keyring:
sudo cp /var/cudnn-local-repo-ubuntu2204-8.9.7.29/cudnn-local-08A7D361-keyring.gpg /usr/share/keyrings/
Install the cuDNN libraries:
sudo apt update
sudo apt install libcudnn8 libcudnn8-dev libcudnn8-samples
4. Setup Python Virtual Environment
Ubuntu 22.04 comes with Python 3.10. Install venv:
sudo apt-get install python3-pip
sudo apt install python3.10-venv
Create and activate the virtual environment:
cd
mkdir test-gpu
cd test-gpu
python3 -m venv venv
source venv/bin/activate
5. Install BitsAndBytes from Source
Navigate to the BitsAndBytes directory and build from source:
cd ~/bitsandbytes
CUDA_HOME=/home/roguser/local/cuda-12.2/ LD_LIBRARY_PATH=/home/roguser/local/cuda-12.2/lib64 BNB_CUDA_VERSION=122 CUDA_VERSION=122 make cuda12x
CUDA_HOME=/home/roguser/local/cuda-12.2/ LD_LIBRARY_PATH=/home/roguser/local/cuda-12.2/lib64 BNB_CUDA_VERSION=122 CUDA_VERSION=122 python setup.py install
6. Install PyTorch
Install PyTorch with the following command:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
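Once the installation finishes, a short Python check confirms that PyTorch was built against the expected CUDA version and can see both the GPU and cuDNN:

import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA version used by PyTorch:", torch.version.cuda)
print("cuDNN version:", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))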
7. Install Hugging Face and Transformers
Install the transformers and accelerate libraries:
pip install transformers
pip install accelerate
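With transformers, accelerate, and the BitsAndBytes build from step 5 in place, you can load a model in 8-bit precision to confirm that the whole stack works together. This is a minimal sketch using the standard transformers/bitsandbytes integration; facebook/opt-125m is just an arbitrary small model chosen so the download is quick, and the prompt is purely illustrative:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "facebook/opt-125m"  # arbitrary small model, used here only as a smoke test
quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,  # 8-bit weights via bitsandbytes
    device_map="auto",                 # accelerate places the weights on the GPU
)

inputs = tokenizer("CUDA makes GPUs", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))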
The Power of Parallel Processing
At their core, GPUs are highly parallel processors designed to handle thousands of concurrent threads efficiently. This architecture makes them well-suited for the computationally intensive tasks involved in training deep learning models, including LLMs. The CUDA platform, developed by NVIDIA, provides a software environment that allows developers to harness the full potential of these GPUs, enabling them to write code that can leverage the parallel processing capabilities of the hardware.
Accelerating LLM Training with GPUs and CUDA
Training large language models is a computationally demanding task that requires processing vast amounts of text data and performing numerous matrix operations. GPUs, with their thousands of cores and high memory bandwidth, are ideally suited to these tasks. By leveraging CUDA, developers can optimize their code to take advantage of the parallel processing capabilities of GPUs, significantly reducing the time required to train LLMs.
For example, the training of GPT-3, one of the largest language models to date, was made possible through the use of thousands of NVIDIA GPUs running CUDA-optimized code. This allowed the model to be trained on an unprecedented amount of data, resulting in its impressive performance on natural language tasks.
import torch
import torch.optim as optim
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load pre-trained GPT-2 model and tokenizer
model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no padding token by default

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Define training data and hyperparameters
train_data = [...]  # Your training data (a list of text strings)
batch_size = 32
num_epochs = 10
learning_rate = 5e-5

# Define optimizer (the language-modeling loss is computed inside the model)
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Training loop
model.train()
for epoch in range(num_epochs):
    for i in range(0, len(train_data), batch_size):
        # Tokenize a batch of text and move it to the GPU
        batch = train_data[i:i + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        inputs = inputs.to(device)

        # Forward pass; using the input IDs as labels trains next-token prediction
        outputs = model(**inputs, labels=inputs["input_ids"])
        loss = outputs.loss

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {loss.item()}')
In this example code snippet, we demonstrate training a GPT-2 language model using PyTorch on a CUDA-enabled GPU. The model is moved to the GPU (if available), and the training loop leverages the parallelism of the GPU to perform efficient forward and backward passes, accelerating the training process.
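After training (or even with the pre-trained weights alone), the same model can generate text directly on the GPU. Below is a short, self-contained sketch using the pre-trained gpt2 checkpoint; the prompt and sampling settings are arbitrary choices for illustration:

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Same model/tokenizer/device setup as the training example above
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
model.eval()

prompt = "Deep learning on GPUs"  # arbitrary prompt
inputs = tokenizer(prompt, return_tensors="pt").to(device)
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.95)
print(tokenizer.decode(generated[0], skip_special_tokens=True))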
CUDA-Accelerated Libraries for Deep Learning
In addition to the CUDA platform itself, NVIDIA and the open-source community have developed a range of CUDA-accelerated libraries that enable efficient implementation of deep learning models, including LLMs. These libraries provide optimized implementations of common operations, such as matrix multiplications, convolutions, and activation functions, allowing developers to focus on the model architecture and training process rather than low-level optimization.
One such library is cuDNN (CUDA Deep Neural Network library), which provides highly tuned implementations of standard routines used in deep neural networks. By leveraging cuDNN, developers can significantly accelerate the training and inference of their models, achieving performance gains of up to several orders of magnitude compared to CPU-based implementations.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.cuda.amp import autocast

class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1,
                          stride=stride, bias=False),
                nn.BatchNorm2d(out_channels))

    def forward(self, x):
        with autocast():
            out = F.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            out += self.shortcut(x)
            out = F.relu(out)
            return out
In this code snippet, we define a residual block for a convolutional neural network (CNN) using PyTorch. The autocast context manager from PyTorch’s Automatic Mixed Precision (AMP) is used to enable mixed-precision training, which can provide significant performance gains on CUDA-enabled GPUs while maintaining high accuracy. The convolution and batch-normalization layers are backed by cuDNN kernels, ensuring efficient execution on GPUs.
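To use the block above in an actual training loop, autocast is normally paired with a gradient scaler, which rescales the loss so that small gradients are not flushed to zero in float16. The sketch below runs one mixed-precision training step on random data; the layer sizes, loss, and optimizer settings are placeholders, not part of any particular model:

import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

torch.backends.cudnn.benchmark = True  # let cuDNN auto-tune convolution algorithms

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = ResidualBlock(64, 64).to(device)   # the block defined above
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = GradScaler(enabled=device.type == "cuda")

inputs = torch.randn(8, 64, 32, 32, device=device)   # dummy batch
targets = torch.randn(8, 64, 32, 32, device=device)  # dummy targets

optimizer.zero_grad()
with autocast(enabled=device.type == "cuda"):
    outputs = model(inputs)
    loss = criterion(outputs, targets)
scaler.scale(loss).backward()   # scale the loss, backprop in mixed precision
scaler.step(optimizer)          # unscale gradients and update weights
scaler.update()                 # adjust the scale factor for the next step
print(f"loss: {loss.item():.4f}")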
Multi-GPU and Distributed Training for Scalability
As LLMs and deep learning models continue to grow in size and complexity, the computational requirements for training them also increase. To address this challenge, researchers and developers have turned to multi-GPU and distributed training techniques, which allow them to leverage the combined processing power of multiple GPUs across multiple machines.
CUDA and associated libraries, such as NCCL (NVIDIA Collective Communications Library), provide efficient communication primitives that enable seamless data transfer and synchronization across multiple GPUs, enabling distributed training at an unprecedented scale.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize distributed training
dist.init_process_group(backend='nccl', init_method='…')
local_rank = dist.get_rank()  # on a single node, the global rank doubles as the local GPU index
torch.cuda.set_device(local_rank)

# Create model and move to GPU
model = MyModel().cuda()

# Wrap model with DDP
model = DDP(model, device_ids=[local_rank])

# Training loop (distributed)
for epoch in range(num_epochs):
    for data in train_loader:
        inputs, targets = data
        inputs = inputs.cuda(non_blocking=True)
        targets = targets.cuda(non_blocking=True)

        outputs = model(inputs)
        loss = criterion(outputs, targets)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
In this example, we demonstrate distributed training using PyTorch’s DistributedDataParallel (DDP) module. The model is wrapped in DDP, which automatically handles data parallelism, gradient synchronization, and communication across multiple GPUs using NCCL. This approach enables efficient scaling of the training process across multiple machines, allowing researchers and developers to train larger and more complex models in a reasonable amount of time.
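One detail the loop above glosses over is how train_loader splits the dataset between processes. With DDP, each rank should see a different shard of the data, which is what DistributedSampler provides. A brief sketch, assuming train_dataset is an ordinary PyTorch Dataset and num_epochs is defined as above:

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Each rank receives a distinct, non-overlapping shard of the dataset
sampler = DistributedSampler(train_dataset, shuffle=True)
train_loader = DataLoader(train_dataset, batch_size=32, sampler=sampler,
                          num_workers=4, pin_memory=True)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # reshuffle differently every epoch, consistently across ranks
    for inputs, targets in train_loader:
        ...  # same training step as in the DDP loop above

Such scripts are typically launched with torchrun, which sets the rank and world-size environment variables that init_process_group can read.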
Deploying Deep Learning Models with CUDA
While GPUs and CUDA have primarily been used for training deep learning models, they are also crucial for efficient deployment and inference. As deep learning models become increasingly complex and resource-intensive, GPU acceleration is essential for achieving real-time performance in production environments.
NVIDIA’s TensorRT is a high-performance deep learning inference optimizer and runtime that provides low-latency and high-throughput inference on CUDA-enabled GPUs. TensorRT can optimize and accelerate models trained in frameworks like TensorFlow, PyTorch, and MXNet, enabling efficient deployment on various platforms, from embedded systems to data centers.
import tensorrt as trt

# Load the pre-trained model (placeholder: framework-specific loading / ONNX export)
model = load_model(...)

# Create the TensorRT builder, network, and ONNX parser
logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network()
parser = trt.OnnxParser(network, logger)

# Parse and optimize the model (model_path points to the exported ONNX file)
success = parser.parse_from_file(model_path)
engine = builder.build_cuda_engine(network)
# Note: newer TensorRT releases build engines through a builder config
# (build_serialized_network) rather than build_cuda_engine.

# Run inference on the GPU
context = engine.create_execution_context()
inputs, outputs, bindings, stream = allocate_buffers(engine)  # user-defined helper

# Set input data and run inference
set_input_data(inputs, input_data)  # user-defined helper
context.execute_async_v2(bindings=bindings, stream_handle=stream.ptr)

# Process output
# ...
In this example, we demonstrate using TensorRT to deploy a pre-trained deep learning model on a CUDA-enabled GPU. The model is first parsed and optimized by TensorRT, which generates a highly optimized inference engine tailored to the specific model and hardware. This engine can then be used to perform efficient inference on the GPU, leveraging CUDA for accelerated computation.
Conclusion
The combination of GPUs and CUDA has been instrumental in driving advancements in large language models, computer vision, speech recognition, and various other domains of deep learning. By harnessing the parallel processing capabilities of GPUs and the optimized libraries provided by CUDA, researchers and developers can train and deploy increasingly complex models with high efficiency.
As the field of AI continues to evolve, the importance of GPUs and CUDA will only grow. With even more powerful hardware and software optimizations, we can expect to see further breakthroughs in the development and deployment of AI systems, pushing the boundaries of what is possible.