CUDA

Optimizing Token Generation in PyTorch Decoder Models

Autoregressive decoder models have pervaded nearly every facet of our day-to-day lives. These models apply compute-heavy kernel operations to churn out tokens one after the other in a way...

AI in Multiple GPUs: Understanding the Host and Device Paradigm

This article is part of a series about distributed AI across multiple GPUs: Part 1: Understanding the Host and Device Paradigm (this article); Part 2: Point-to-Point and Collective Operations; Part 3: How GPUs Communicate; Part 4: Gradient...

Pipelining AI/ML Training Workloads with CUDA Streams

This is the ninth post in our series on performance profiling and optimization in PyTorch, which emphasizes the critical role of performance evaluation and optimization in machine learning development. Throughout the series we've reviewed a wide range of practical...

Master CUDA: For Machine Learning Engineers

CUDA for Machine Learning: Practical Applications. Structure of a CUDA C/C++ application, where the host (CPU) code manages the execution of parallel code on the device (GPU). Now that we have covered the fundamentals, let's explore...
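The host/device split described in that excerpt can be sketched in a minimal CUDA C++ program (illustrative only; the `square` kernel, array size, and launch configuration are assumptions, not taken from the article):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Device code: each GPU thread squares one array element.
__global__ void square(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * data[i];
}

int main() {
    const int n = 256;
    float host[n];
    for (int i = 0; i < n; ++i) host[i] = (float)i;

    // Host code manages device memory and launches the parallel kernel.
    float *device;
    cudaMalloc(&device, n * sizeof(float));
    cudaMemcpy(device, host, n * sizeof(float), cudaMemcpyHostToDevice);

    // Launch enough 128-thread blocks to cover all n elements.
    square<<<(n + 127) / 128, 128>>>(device, n);

    // Copy results back to the host and release device memory.
    cudaMemcpy(host, device, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(device);

    printf("host[3] = %.1f\n", host[3]);
    return 0;
}
```

Compiled with `nvcc`, this shows the pattern the excerpt names: the CPU orchestrates allocation, transfers, and kernel launches, while the GPU runs the parallel code.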

Setting Up Training, Fine-Tuning, and Inference of LLMs with NVIDIA GPUs and CUDA

The field of artificial intelligence (AI) has witnessed remarkable advancements in recent years, and at the heart of it lies the powerful combination of graphics processing units (GPUs) and the CUDA parallel computing platform. Models such as GPT, BERT,...
