Enhancing Communication Observability of AI Workloads with NCCL Inspector



When using the NVIDIA Collective Communication Library (NCCL) to run a deep learning training or inference workload that uses collective operations (such as AllReduce, AllGather, and ReduceScatter), it can be difficult to determine how NCCL is performing during the actual workload run.

This post introduces the NCCL Inspector profiler plugin, which addresses this problem. It provides comprehensive, low-overhead, always-on performance observability for distributed deep learning training and inference workloads.

What is NCCL Inspector?

NCCL Inspector is a profiling and analysis tool that provides detailed, per-communicator, per-collective performance and metadata logging. The tool operates in two phases: data collection and data analysis.

NCCL Inspector can help answer questions on a wide range of topics, including:

  • Intra-job collective performance comparison: How are AllReduce, AllGather, ReduceScatter, and other collectives performing within the data parallel domain compared with the tensor parallel domain?
  • Inter-job collective performance comparison: Did network congestion yesterday cause collectives to perform poorly? Was it the reason for the decrease in compute performance?
  • Compute-network performance correlation: If there is an overall dip in compute performance (TFLOPS), was a dip in network performance the cause?

NCCL Inspector logs the collective bandwidth and duration for each rank within the communicator to disk at regular intervals. After the job has completed, this performance data is analyzed and correlated over the lifetime of the job, characterizing the performance of the NCCL collectives through the lifetime of the multi-GPU job.
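
As a sketch of this kind of post-job correlation, records from all ranks of a communicator can be grouped by collective sequence number to find stragglers. The field names below match the Inspector JSON output shown in this post; the record values themselves are made up for illustration, and this is not the Inspector's own analysis code.

```python
# Hypothetical sketch: given parsed Inspector records from every rank of one
# communicator, group them by collective sequence number (coll_sn) and find
# the slowest rank for each collective. Field names match the Inspector
# JSON output; the record values are invented for illustration.

from collections import defaultdict

records = [
    {"rank": 0, "coll_sn": 1407, "coll_exec_time_us": 61974},
    {"rank": 1, "coll_sn": 1407, "coll_exec_time_us": 62110},
    {"rank": 0, "coll_sn": 1408, "coll_exec_time_us": 30500},
    {"rank": 1, "coll_sn": 1408, "coll_exec_time_us": 29980},
]

by_sn = defaultdict(list)
for rec in records:
    by_sn[rec["coll_sn"]].append(rec)

# For each collective, the straggler is the rank with the longest exec time.
stragglers = {
    sn: max(recs, key=lambda r: r["coll_exec_time_us"])["rank"]
    for sn, recs in by_sn.items()
}
```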

Because NCCL is a critical component for multi-GPU and multi-node communication, every framework using NCCL can benefit from the detailed observability provided by NCCL Inspector.

NCCL Inspector leverages the plugin interface introduced in NCCL 2.23 to enable always-on observability for production workloads, while minimizing performance overheads.

During the data collection phase, the NCCL Inspector library tells NCCL which collective events it should emit. NCCL users (for example, DL frameworks) can load the library through the NCCL_PROFILER_PLUGIN environment variable. NCCL Inspector then listens to the subscribed events emitted by NCCL and generates structured JSON output for each of them, enabling deep insights into the performance characteristics of NCCL collectives.

Post-job analysis and visualization are performed with example Python scripts provided in the NCCL repo. The JSON output can also be fed into other analysis scripts and observability platforms to provide insight into NCCL performance during a production workload run.
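
The first step of any such script is simply reading the Inspector records back from the dump directory. A minimal sketch follows, assuming one JSON object per line in `.jsonl` files; the actual file naming and layout produced by the Inspector may differ.

```python
# Minimal sketch of reading Inspector JSON records from a dump directory.
# Assumes one JSON object per line in *.jsonl files; the actual file
# layout written by the plugin may differ.

import glob
import json
import os

def load_inspector_records(dump_dir):
    records = []
    for path in glob.glob(os.path.join(dump_dir, "*.jsonl")):
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line:
                    records.append(json.loads(line))
    return records
```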

Key features of NCCL Inspector

Some of the standout features that make NCCL Inspector useful include:

  • Per-communicator tracking: NCCL Inspector maintains separate tracking for every NCCL communicator. This is especially valuable in complex distributed applications like AI workloads, where multiple communicators may be used for different purposes, such as distinct parallelism domains.
  • Always-on, low overhead: NCCL Inspector's low-overhead performance tracking means it can be enabled in production workloads, providing “always-on” continuous observability of NCCL performance without significant performance degradation.
  • Performance metrics: NCCL Inspector calculates and reports key performance metrics including:
    • Algorithmic bandwidth
    • Bus bandwidth
    • Execution time in microseconds
    • Message sizes and collective types
  • Network technology agnostic: NCCL Inspector uses the plugin interface to integrate with NCCL, so it is agnostic to the various network technologies supported by NCCL (RoCE, InfiniBand, EFA, and so on).

Data collection phase

For the data collection phase, NCCL Inspector is configured through several environment variables.

Required variables:

  • NCCL_PROFILER_PLUGIN: Path to the plugin library binary.
  • NCCL_INSPECTOR_ENABLE=1
  • NCCL_INSPECTOR_DUMP_THREAD_INTERVAL_MICROSECONDS: Sets the interval for output writing

Optional variables:

  • NCCL_INSPECTOR_DUMP_DIR: Output directory for logs
  • NCCL_INSPECTOR_DUMP_VERBOSE: Enables verbose output with event trace information

Example use (SLURM)

To enable NCCL Inspector and begin the data collection phase, add the following environment variables to the SBATCH script in SLURM:

export NCCL_PROFILER_PLUGIN=/path/to/nccl/ext-profiler/inspector/libnccl-profiler-inspector.so
export NCCL_INSPECTOR_ENABLE=1
export NCCL_INSPECTOR_DUMP_THREAD_INTERVAL_MICROSECONDS=500
export NCCL_INSPECTOR_DUMP_DIR=/path/to/logs/${SLURM_JOB_ID}/

srun your_nccl_application

Example output format

{
  "header": {
    "id": "0x7f8c496ae9f661", // communicator id
    "rank": 2,
    "n_ranks": 8,
    "nnodes": 1
  },
  "metadata": {
    "inspector_output_format_version": "v4.0",
    "git_rev": "",
    "rec_mechanism": "profiler_plugin",
    "dump_timestamp_us": 1748030377748202,
    "hostname": "hostname",
    "pid": 1639453
  },
  "coll_perf": {
    "coll": "AllReduce",
    "coll_sn": 1407,
    "coll_msg_size_bytes": 17179869184,
    "coll_exec_time_us": 61974,
    "coll_algobw_gbs": 277.210914,
    "coll_busbw_gbs": 485.119099
  }
}
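
The bandwidth fields in this record can be cross-checked by hand: coll_algobw_gbs is the message size divided by the execution time, and for AllReduce the bus bandwidth applies the 2*(n-1)/n correction factor familiar from nccl-tests. This derivation is an illustration of the convention, not code taken from the Inspector source.

```python
# Reproduce the bandwidth fields of the example record above.
# The 2*(n-1)/n bus-bandwidth factor for AllReduce follows the
# convention used by nccl-tests; other collectives use other factors.

msg_size_bytes = 17179869184
exec_time_us = 61974
n_ranks = 8

algobw_gbs = msg_size_bytes / (exec_time_us * 1e3)      # bytes/us -> GB/s
busbw_gbs = algobw_gbs * (2 * (n_ranks - 1) / n_ranks)  # AllReduce factor
```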

Verbose output

When verbose mode is enabled with NCCL_INSPECTOR_DUMP_VERBOSE=1, the output includes per-kernel (SM) performance, as follows:

{
  "header": {
    "id": "0xe62dedaa97644a", // communicator id
    "rank": 4,
    "n_ranks": 8,
    "nnodes": 1
  },
  "metadata": {
    "inspector_output_format_version": "v4.0",
    "git_rev": "9019a1912-dirty",
    "rec_mechanism": "nccl_profiler_interface",
    "dump_timestamp_us": 1752867229276385,
    "hostname": "hostname",
    "pid": 438776
  },
  "coll_perf": {
    "coll": "ReduceScatter",
    "coll_sn": 1231,
    "coll_msg_size_bytes": 2147483648,
    "coll_exec_time_us": 41057,
    "coll_timing_source": "kernel_gpu",
    "coll_algobw_gbs": 418.439467,
    "coll_busbw_gbs": 366.134533,
    "event_trace_sn": {
      "coll_start_sn": 1,
      "coll_stop_sn": 2,
      "kernel_events": [
        {
          "channel_id": 0,
          "kernel_start_sn": 3,
          "kernel_stop_sn": 48,
          "kernel_record_sn": 47
        }
      ]
    },
    "event_trace_ts": {
      "coll_start_ts": 1752867229235059,
      "coll_stop_ts": 1752867229235064,
      "kernel_events": [
        {
          "channel_id": 0,
          "kernel_start_ts": 1752867229235181,
          "kernel_stop_ts": 1752867229275811,
          "kernel_record_ts": 1752867229275811
        }
      ]
    }
  }
}
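
The event_trace_ts section makes it possible to derive per-channel kernel durations directly, since the timestamps are microsecond epoch values. The following sketch (an illustration, not Inspector code) uses the kernel_events values from the verbose record above:

```python
# Compute per-channel kernel durations (in microseconds) from the
# event_trace_ts section of the verbose record above. Timestamps are
# microsecond epoch values, so a simple difference gives the duration.

kernel_events = [
    {
        "channel_id": 0,
        "kernel_start_ts": 1752867229235181,
        "kernel_stop_ts": 1752867229275811,
    }
]

durations_us = {
    ev["channel_id"]: ev["kernel_stop_ts"] - ev["kernel_start_ts"]
    for ev in kernel_events
}
```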

Data analysis phase

NCCL Inspector includes a comprehensive example performance analysis and visualization tool that processes log files and generates detailed performance reports. The Performance Summary Exporter provides rich visualizations and statistical analysis of collective communication performance.

Performance Summary Exporter 

The stand-alone Performance Summary Exporter is a Python-based analysis tool located in ext-profiler/inspector/exporter/example/. It performs the following tasks:

  • Processes NCCL Inspector logs in multiple formats (.log, .log.gz, .jsonl, .jsonl.gz)
  • Exports data to Parquet format for efficient processing
  • Generates statistical summaries for collective operations
  • Creates visualizations including scatter plots, histograms, and box plots
  • Classifies communication patterns
    • single-rank
    • nvlink-only
    • hca-only
    • mixed
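
A heavily simplified sketch of the statistical-summary step is shown below: group parsed records by collective type and summarize bus bandwidth. The real exporter also handles Parquet export, plotting, and pattern classification; the record values here are invented for illustration.

```python
# Simplified sketch of a statistical summary over Inspector records:
# group by collective type and report min/mean/max bus bandwidth.
# Record values are made up for illustration.

from collections import defaultdict
from statistics import mean

records = [
    {"coll": "AllReduce", "coll_busbw_gbs": 485.1},
    {"coll": "AllReduce", "coll_busbw_gbs": 472.3},
    {"coll": "ReduceScatter", "coll_busbw_gbs": 366.1},
]

busbw_by_coll = defaultdict(list)
for rec in records:
    busbw_by_coll[rec["coll"]].append(rec["coll_busbw_gbs"])

summary = {
    coll: {"min": min(v), "mean": round(mean(v), 1), "max": max(v)}
    for coll, v in busbw_by_coll.items()
}
```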

Dashboard integration

The NVIDIA team has integrated data from NCCL Inspector into dashboards that provide a per-SLURM-job overview of NCCL performance.

Figure 1. Per-job collective performance integration with Elastic dashboard (IB-network-only ReduceScatter collectives with communicator size 32; message sizes 8.39 MB, 8.40 MB, 5.85 MB)
Figure 2. Per-collective-type performance, for example NVLink-only AllGather and ReduceScatter collectives used for tensor parallelism (communicator size 8; message sizes 50.33 MB, 64 B)

Use cases and applications

You can leverage NCCL Inspector for a variety of applications and use cases, including performance analysis, research and development, and production monitoring.

Performance analysis

NCCL Inspector enables detailed analysis of collective communication performance, helping identify bottlenecks and optimization opportunities in distributed training workloads.

Research and development

Researchers can use the detailed event traces and performance metrics to develop new communication patterns and algorithms.

Production monitoring

The always-on nature of NCCL Inspector makes it suitable for continuous monitoring of production workloads, providing insights into communication performance over time.

Get started with NCCL Inspector

NCCL Inspector is a powerful tool for understanding and optimizing collective communication performance in distributed training workloads. Its low-overhead design makes it suitable for production use, while its detailed event tracing and performance metrics enable deep analysis of communication patterns.

To get started and learn more about NCCL and related tools, visit the NVIDIA/nccl GitHub repo and explore the NVIDIA Magnum IO documentation.


