CUDA

Optimizing Token Generation in PyTorch Decoder Models

Autoregressive decoder models have pervaded nearly every facet of our day-to-day lives. These models apply compute-heavy kernel operations to churn out tokens one after the other in a way...

AI in Multiple GPUs: Understanding the Host and Device Paradigm

This article is part of a series about distributed AI across multiple GPUs: Part 1: Understanding the Host and Device Paradigm (this article); Part 2: Point-to-Point and Collective Operations; Part 3: How GPUs Communicate; Part 4: Gradient...

Pipelining AI/ML Training Workloads with CUDA Streams

This is the ninth post in our series on performance profiling and optimization in PyTorch, which emphasizes the critical role of performance evaluation and optimization in machine learning development. Throughout the series we've reviewed a wide range of practical...

Master CUDA: For Machine Learning Engineers

CUDA for Machine Learning: Practical Applications. Structure of a CUDA C/C++ application, where the host (CPU) code manages the execution of parallel code on the device (GPU). Now that we have covered the fundamentals, let's explore...
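The host/device split described in that excerpt can be sketched in a minimal CUDA C++ program (illustrative only; the `square` kernel, array size, and launch configuration are assumptions, not taken from the article):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Device code: each GPU thread squares one array element.
__global__ void square(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * data[i];
}

int main() {
    const int n = 256;
    float host[n];
    for (int i = 0; i < n; ++i) host[i] = (float)i;

    // Host code manages device memory and launches the parallel kernel.
    float *device;
    cudaMalloc(&device, n * sizeof(float));
    cudaMemcpy(device, host, n * sizeof(float), cudaMemcpyHostToDevice);

    // Launch enough 128-thread blocks to cover all n elements.
    square<<<(n + 127) / 128, 128>>>(device, n);

    // Copy results back to the host and release device memory.
    cudaMemcpy(host, device, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(device);

    printf("host[3] = %.1f\n", host[3]);
    return 0;
}
```

Compiled with `nvcc`, this shows the pattern the excerpt names: the CPU orchestrates allocation, transfers, and kernel launches, while the GPU runs the parallel code.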

Setting Up Training, Fine-Tuning, and Inference of LLMs with NVIDIA GPUs and CUDA

The field of artificial intelligence (AI) has witnessed remarkable advancements in recent years, and at the heart of it lies the powerful combination of graphics processing units (GPUs) and the CUDA parallel computing platform. Models such as GPT, BERT,...
