Enhancing Communication Observability of AI Workloads with NCCL Inspector



When using the NVIDIA Collective Communication Library (NCCL) to run a deep learning training or inference workload that uses collective operations (such as AllReduce, AllGather, and ReduceScatter), it can be difficult to determine how NCCL is performing during the actual workload run.

This post introduces the NCCL Inspector profiler plugin, which addresses this problem. It provides comprehensive, low-overhead, always-on performance observability for distributed deep learning training and inference workloads.

What is NCCL Inspector?

NCCL Inspector is a profiling and analysis tool that provides detailed, per-communicator, per-collective performance and metadata logging. The tool operates in two phases: data collection and data analysis.

NCCL Inspector can help answer questions on a wide range of topics, including:

  • Intra-job collective performance comparison: How are AllReduce, AllGather, ReduceScatter, and other collectives performing within the data parallel domain compared with the tensor parallel domain?
  • Inter-job collective performance comparison: Did network congestion yesterday cause collectives to perform poorly? Was it the reason for the decrease in compute performance?
  • Compute-network performance correlation: If there is an overall dip in compute performance (TFLOPS), was a dip in network performance the cause?

NCCL Inspector logs the collective bandwidth and duration for each rank within the communicator to disk at regular intervals. After the job has completed, this performance data is analyzed and correlated over the lifetime of the job, characterizing the performance of the NCCL collectives through the lifetime of the multi-GPU job.
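
As a sketch of this kind of post-job correlation, records from all ranks of a communicator can be grouped by collective sequence number to find stragglers. The field names below match the Inspector JSON output shown in this post; the record values themselves are made up for illustration, and this is not the Inspector's own analysis code.

```python
# Hypothetical sketch: given parsed Inspector records from every rank of one
# communicator, group them by collective sequence number (coll_sn) and find
# the slowest rank for each collective. Field names match the Inspector
# JSON output; the record values are invented for illustration.

from collections import defaultdict

records = [
    {"rank": 0, "coll_sn": 1407, "coll_exec_time_us": 61974},
    {"rank": 1, "coll_sn": 1407, "coll_exec_time_us": 62110},
    {"rank": 0, "coll_sn": 1408, "coll_exec_time_us": 30500},
    {"rank": 1, "coll_sn": 1408, "coll_exec_time_us": 29980},
]

by_sn = defaultdict(list)
for rec in records:
    by_sn[rec["coll_sn"]].append(rec)

# For each collective, the straggler is the rank with the longest exec time.
stragglers = {
    sn: max(recs, key=lambda r: r["coll_exec_time_us"])["rank"]
    for sn, recs in by_sn.items()
}
```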

Because NCCL is a critical component for multi-GPU and multi-node communication, every framework using NCCL can benefit from the detailed observability provided by NCCL Inspector.

NCCL Inspector leverages the plugin interface introduced in NCCL 2.23 to enable always-on observability for production workloads, while minimizing performance overheads.

During the data collection phase, the NCCL Inspector library tells NCCL which collective events it should emit. NCCL users (for example, DL frameworks) can load the library through the NCCL_PROFILER_PLUGIN environment variable. NCCL Inspector then listens to the subscribed events emitted by NCCL and generates structured JSON output for each of them, enabling deep insights into the performance characteristics of NCCL collectives.

Post-job analysis and visualization are performed with example Python scripts provided in the NCCL repo. The JSON output can also be fed into other analysis scripts and observability platforms to provide insight into NCCL performance during a production workload run.
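
The first step of any such script is simply reading the Inspector records back from the dump directory. A minimal sketch follows, assuming one JSON object per line in `.jsonl` files; the actual file naming and layout produced by the Inspector may differ.

```python
# Minimal sketch of reading Inspector JSON records from a dump directory.
# Assumes one JSON object per line in *.jsonl files; the actual file
# layout written by the plugin may differ.

import glob
import json
import os

def load_inspector_records(dump_dir):
    records = []
    for path in glob.glob(os.path.join(dump_dir, "*.jsonl")):
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line:
                    records.append(json.loads(line))
    return records
```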

Key features of NCCL Inspector

Some of the standout features that make NCCL Inspector useful include:

  • Per-communicator tracking: NCCL Inspector maintains separate tracking for every NCCL communicator. This is especially valuable in complex distributed applications like AI workloads, where multiple communicators may be used for different purposes, such as distinct parallelism domains.
  • Always-on, low overhead: NCCL Inspector's low-overhead performance tracking means it can be enabled in production workloads, providing “always-on” continuous observability of NCCL performance without significant performance degradation.
  • Performance metrics: NCCL Inspector calculates and reports key performance metrics including:
    • Algorithmic bandwidth
    • Bus bandwidth
    • Execution time in microseconds
    • Message sizes and collective types
  • Network technology agnostic: NCCL Inspector uses the plugin interface to integrate with NCCL, so it is agnostic to the various network technologies supported by NCCL (RoCE, InfiniBand, EFA, and so on).

Data collection phase

For the data collection phase, NCCL Inspector is configured through several environment variables.

Required variables:

  • NCCL_PROFILER_PLUGIN: Path to the plugin library binary.
  • NCCL_INSPECTOR_ENABLE=1
  • NCCL_INSPECTOR_DUMP_THREAD_INTERVAL_MICROSECONDS: Sets the interval for output writing

Optional variables:

  • NCCL_INSPECTOR_DUMP_DIR: Output directory for logs
  • NCCL_INSPECTOR_DUMP_VERBOSE: Enables verbose output with event trace information

Example use (SLURM)

To enable NCCL Inspector and begin the data collection phase, add the following environment variables to the SBATCH script in SLURM:

export NCCL_PROFILER_PLUGIN=/path/to/nccl/ext-profiler/inspector/libnccl-profiler-inspector.so
export NCCL_INSPECTOR_ENABLE=1
export NCCL_INSPECTOR_DUMP_THREAD_INTERVAL_MICROSECONDS=500
export NCCL_INSPECTOR_DUMP_DIR=/path/to/logs/${SLURM_JOB_ID}/

srun your_nccl_application

Example output format

{
  "header": {
    "id": "0x7f8c496ae9f661", // communicator id
    "rank": 2,
    "n_ranks": 8,
    "nnodes": 1
  },
  "metadata": {
    "inspector_output_format_version": "v4.0",
    "git_rev": "",
    "rec_mechanism": "profiler_plugin",
    "dump_timestamp_us": 1748030377748202,
    "hostname": "hostname",
    "pid": 1639453
  },
  "coll_perf": {
    "coll": "AllReduce",
    "coll_sn": 1407,
    "coll_msg_size_bytes": 17179869184,
    "coll_exec_time_us": 61974,
    "coll_algobw_gbs": 277.210914,
    "coll_busbw_gbs": 485.119099
  }
}
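
The bandwidth fields in this record can be cross-checked by hand: coll_algobw_gbs is the message size divided by the execution time, and for AllReduce the bus bandwidth applies the 2*(n-1)/n correction factor familiar from nccl-tests. This derivation is an illustration of the convention, not code taken from the Inspector source.

```python
# Reproduce the bandwidth fields of the example record above.
# The 2*(n-1)/n bus-bandwidth factor for AllReduce follows the
# convention used by nccl-tests; other collectives use other factors.

msg_size_bytes = 17179869184
exec_time_us = 61974
n_ranks = 8

algobw_gbs = msg_size_bytes / (exec_time_us * 1e3)      # bytes/us -> GB/s
busbw_gbs = algobw_gbs * (2 * (n_ranks - 1) / n_ranks)  # AllReduce factor
```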

Verbose output

When verbose mode is enabled with NCCL_INSPECTOR_DUMP_VERBOSE=1, the output includes per-kernel (SM) performance, as follows:

{
  "header": {
    "id": "0xe62dedaa97644a", // communicator id
    "rank": 4,
    "n_ranks": 8,
    "nnodes": 1
  },
  "metadata": {
    "inspector_output_format_version": "v4.0",
    "git_rev": "9019a1912-dirty",
    "rec_mechanism": "nccl_profiler_interface",
    "dump_timestamp_us": 1752867229276385,
    "hostname": "hostname",
    "pid": 438776
  },
  "coll_perf": {
    "coll": "ReduceScatter",
    "coll_sn": 1231,
    "coll_msg_size_bytes": 2147483648,
    "coll_exec_time_us": 41057,
    "coll_timing_source": "kernel_gpu",
    "coll_algobw_gbs": 418.439467,
    "coll_busbw_gbs": 366.134533,
    "event_trace_sn": {
      "coll_start_sn": 1,
      "coll_stop_sn": 2,
      "kernel_events": [
        {
          "channel_id": 0,
          "kernel_start_sn": 3,
          "kernel_stop_sn": 48,
          "kernel_record_sn": 47
        }
      ]
    },
    "event_trace_ts": {
      "coll_start_ts": 1752867229235059,
      "coll_stop_ts": 1752867229235064,
      "kernel_events": [
        {
          "channel_id": 0,
          "kernel_start_ts": 1752867229235181,
          "kernel_stop_ts": 1752867229275811,
          "kernel_record_ts": 1752867229275811
        }
      ]
    }
  }
}
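
The event_trace_ts section makes it possible to derive per-channel kernel durations directly, since the timestamps are microsecond epoch values. The following sketch (an illustration, not Inspector code) uses the kernel_events values from the verbose record above:

```python
# Compute per-channel kernel durations (in microseconds) from the
# event_trace_ts section of the verbose record above. Timestamps are
# microsecond epoch values, so a simple difference gives the duration.

kernel_events = [
    {
        "channel_id": 0,
        "kernel_start_ts": 1752867229235181,
        "kernel_stop_ts": 1752867229275811,
    }
]

durations_us = {
    ev["channel_id"]: ev["kernel_stop_ts"] - ev["kernel_start_ts"]
    for ev in kernel_events
}
```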

Data analysis phase

NCCL Inspector includes a comprehensive example performance analysis and visualization tool that processes log files and generates detailed performance reports. The Performance Summary Exporter provides rich visualizations and statistical analysis of collective communication performance.

Performance Summary Exporter 

The stand-alone Performance Summary Exporter is a Python-based analysis tool located in ext-profiler/inspector/exporter/example/. It performs the following tasks:

  • Processes NCCL Inspector logs in multiple formats (.log, .log.gz, .jsonl, .jsonl.gz)
  • Exports data to Parquet format for efficient processing
  • Generates statistical summaries for collective operations
  • Creates visualizations including scatter plots, histograms, and box plots
  • Classifies communication patterns
    • single-rank
    • nvlink-only
    • hca-only
    • mixed
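
A heavily simplified sketch of the statistical-summary step is shown below: group parsed records by collective type and summarize bus bandwidth. The real exporter also handles Parquet export, plotting, and pattern classification; the record values here are invented for illustration.

```python
# Simplified sketch of a statistical summary over Inspector records:
# group by collective type and report min/mean/max bus bandwidth.
# Record values are made up for illustration.

from collections import defaultdict
from statistics import mean

records = [
    {"coll": "AllReduce", "coll_busbw_gbs": 485.1},
    {"coll": "AllReduce", "coll_busbw_gbs": 472.3},
    {"coll": "ReduceScatter", "coll_busbw_gbs": 366.1},
]

busbw_by_coll = defaultdict(list)
for rec in records:
    busbw_by_coll[rec["coll"]].append(rec["coll_busbw_gbs"])

summary = {
    coll: {"min": min(v), "mean": round(mean(v), 1), "max": max(v)}
    for coll, v in busbw_by_coll.items()
}
```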

Dashboard integration

The NVIDIA team has integrated data from NCCL Inspector into dashboards that provide a per-SLURM-job overview of NCCL performance.

Figure 1. Per-job collective performance integration with Elastic dashboard (IB-network-only ReduceScatter collectives with communicator size 32; message sizes 8.39 MB, 8.40 MB, 5.85 MB)
Figure 2. Per-collective-type performance, for example NVLink-only AllGather and ReduceScatter collectives used for tensor parallelism (communicator size 8; message sizes 50.33 MB, 64 B)

Use cases and applications

You can leverage NCCL Inspector for a variety of applications and use cases, including performance analysis, research and development, and production monitoring.

Performance analysis

NCCL Inspector enables detailed analysis of collective communication performance, helping identify bottlenecks and optimization opportunities in distributed training workloads.

Research and development

Researchers can use the detailed event traces and performance metrics to develop new communication patterns and algorithms.

Production monitoring

The always-on nature of NCCL Inspector makes it suitable for continuous monitoring of production workloads, providing insights into communication performance over time.

Get started with NCCL Inspector

NCCL Inspector is a powerful tool for understanding and optimizing collective communication performance in distributed training workloads. Its low-overhead design makes it suitable for production use, while its detailed event tracing and performance metrics enable deep analysis of communication patterns.

To get started and learn more about NCCL and related tools, visit the NVIDIA/nccl GitHub repo and explore the NVIDIA Magnum IO documentation.


