A challenge with calculating an application’s performance is that real-world and theoretical performance can differ. With an ecosystem of products with growing performance needs, such as High Performance Computing (HPC), gaming, or, in the present landscape, Large Language Models (LLMs), it is crucial to measure an application’s performance accurately.
Simply measuring theoretical GFLOP/s (billions of Floating-Point Operations Per Second) isn’t enough, as applications rarely reach these maximums in the real world. That is where the Roofline Model comes in, offering a clear visual method to estimate an application’s performance and highlighting the critical role of hardware-specific optimizations.
Why simple metrics aren’t enough
When we think about measuring performance, a few metrics come to mind:
- Execution time: This tells you how long a task took but offers no insight into why.
- Cycles per Instruction (CPI): This only measures the processor’s compute performance.
- Serial vs parallel execution: This measures compute performance without accounting for any hardware optimizations.
- Floating Point Operations Per Second (FLOP/s): This only represents a theoretical maximum, which is often not achievable in a real-world scenario.
While these are useful metrics, they typically don’t provide enough information on their own. For example, peak FLOP/s is a theoretical limit that is rarely reached in practice, so using it as the sole metric isn’t enough because it ignores a common performance limiter – data movement.
Roofline Modeling
The Roofline Model is a powerful tool that visually maps an application’s performance against the capabilities of a specific hardware architecture, such as a CPU or GPU. The model gets its name from the shape of the graph it produces, which includes a “roof” composed of a slanted line and a flat, horizontal line. This shape represents the ultimate performance limits imposed by the hardware.
From this modeling technique, there are two parameters that define the achievable limits of the hardware:
- Data movement: The time it takes to move data, calculated as the total data size divided by the system’s peak memory bandwidth.
- Computation: The time required for calculations, determined by dividing the total number of floating-point operations by the system’s peak compute performance (commonly measured in GFLOP/s).
The overall execution time of an application is determined by the greater of these two values: max{data_movement, computation}.
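As a quick sketch of this max-of-two-times model (the peak bandwidth and compute figures below are hypothetical, not tied to any particular GPU):

```python
# Roofline time estimate: execution time is bounded by the slower of
# data movement and computation. Peak figures here are hypothetical.
PEAK_BANDWIDTH = 900e9  # bytes/s
PEAK_COMPUTE = 15e12    # FLOP/s

def roofline_time(total_bytes, total_flops):
    t_data = total_bytes / PEAK_BANDWIDTH    # time to move the data
    t_compute = total_flops / PEAK_COMPUTE   # time to do the math
    return max(t_data, t_compute)

# 1 GB moved, 10 GFLOP performed: data movement dominates in this case
t = roofline_time(1e9, 10e9)
```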
Even when the hardware has high compute performance, data movement can often become the bottleneck. To capture this, Roofline Modeling introduces the concept of Arithmetic Intensity (AI): the ratio of floating-point operations performed for each byte of data moved from memory.
- An algorithm with high Arithmetic Intensity is considered compute-hungry. Its performance is limited by how quickly calculations can be performed.
- An algorithm with low Arithmetic Intensity is considered data-hungry. Its performance is limited by how quickly data can be moved.
Understanding the graph
[Figure: Roofline model graph. Image licensed under Creative Commons Attribution-Share Alike 4.0 International]
A Roofline graph plots Attainable FLOP/s (y-axis) against Arithmetic Intensity (x-axis). The “roof” itself shows the hardware’s limitations. The slanted part of the roof represents the peak memory bandwidth (in GB/s), while the flat part represents the peak computational performance (in GFLOP/s). Note that both axes use a logarithmic scale.
- Points below the roof: Indicate suboptimal performance, with room for improvement.
- Points hitting the slanted line: A data-hungry application; its performance is limited by memory bandwidth.
- Points hitting the flat line: A compute-hungry application; it is using the full computational power of the processor.
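These three cases follow from the naive roofline formula: attainable FLOP/s = min(peak compute, AI × peak bandwidth). A minimal sketch, again with hypothetical peak values:

```python
# Naive roofline: attainable performance is the lower of the flat compute
# roof and the bandwidth-limited slanted line. Peaks are hypothetical.
PEAK_BW_GBS = 900.0     # GB/s, the slanted part of the roof
PEAK_GFLOPS = 15000.0   # GFLOP/s, the flat part of the roof

def attainable_gflops(ai):
    """ai: Arithmetic Intensity in FLOPs per byte."""
    return min(PEAK_GFLOPS, ai * PEAK_BW_GBS)

# The "ridge point" is the AI where the two roofs meet; applications with
# lower AI are bandwidth-bound, those with higher AI are compute-bound.
ridge = PEAK_GFLOPS / PEAK_BW_GBS
```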
Why is Roofline Modeling important?
Roofline Modeling provides a visual, intuitive way to understand application performance, showing key characteristics like Operational Intensity, GPU capabilities, and attainable FLOP/s. This type of modeling helps the programmer make targeted optimizations to their application for the hardware at hand, yielding better results.
- Bottleneck analysis: Having a visual aid makes it easy for the developer to work out where the bottleneck is – memory or compute. If the application is memory intensive, a developer can focus on improving data locality with techniques like caching or loop tiling. If it’s compute intensive, the focus can shift to enabling more parallel computations or leveraging compiler optimizations.
- Hardware and software design: Software engineers shouldn’t fear the underlying hardware. Instead, they can use insights from Roofline Modeling to embrace and optimize for the specific architecture they’re using.
Roofline Modeling in Action
To perform Roofline Modeling, we need to profile the application to understand its performance. From profiling, we can get metrics such as Floating Point Operations (FLOPs) and memory bandwidth usage, both of which are required for Roofline Modeling. This article explores two of those tools – Nvidia’s ncu, the Nsight Compute CLI for GPU analysis, and PyTorch’s profiler, specifically for applications using PyTorch.
For detailed CUDA kernel optimization and precise FLOP/byte calculations, ncu provides direct GPU hardware counter information. In contrast, torch.profiler.profile offers a higher-level perspective inside PyTorch, helping with the understanding of operator-level performance, tensor memory usage, and overall application behavior encompassing both CPU and GPU activities.
Profiling with ncu
ncu is the command line interface used for profiling CUDA kernels [2]. It can display results directly in the terminal or save them to a log file for later analysis. To build a Roofline model, we need to capture the specific metrics that will allow us to calculate Arithmetic Intensity.
We’ll use the PyTorch ImageNet repository [3] as our example. It’s a good choice because it’s easy to understand, well-documented by PyTorch, and works with their profiler, so we can really dig into the performance.
Step 1: Run the ncu command to collect metrics
The first step is to run the application through ncu to collect the necessary hardware-level data. The command looks like this:
ncu --log-file <log_file> --metrics <metrics> --target-processes all python3 <script>
- log-file: The log file in which we want to store the results.
- metrics: This is the most important parameter and specifies the metrics that we want to capture. For calculating Arithmetic Intensity, we consider:
- dram__sectors_write.sum: sum of DRAM sectors written
- dram__sectors_read.sum: sum of DRAM sectors read
- smsp__sass_thread_inst_executed_op_fadd_pred_on.sum: sum of floating-point additions
- smsp__sass_thread_inst_executed_op_fmul_pred_on.sum: sum of floating-point multiplications
- smsp__sass_thread_inst_executed_op_ffma_pred_on.sum: sum of floating-point fused multiply-add operations
- target-processes: The all flag ensures that we profile the entire application.
Our ncu command becomes:
ncu --log-file logs_example --metrics dram__sectors_write.sum,
dram__sectors_read.sum,
smsp__sass_thread_inst_executed_op_fadd_pred_on.sum,
smsp__sass_thread_inst_executed_op_fmul_pred_on.sum,
smsp__sass_thread_inst_executed_op_ffma_pred_on.sum
--target-processes all python3
main.py /imagenet --arch resnet50 --epochs 1 --batch-size 10
--print-freq 10 --seed 42
Step 2: Calculating FLOPs from the metrics
Once the profiler has run, we can aggregate the collected metrics to calculate the total floating-point operations. The formula is:
FLOPs = 2 * FMA_count + FADD_count + FMUL_count
- FLOPs: Count of floating-point operations.
- FMA_count: Fused Multiply-Add (FMA) operations typically count as 2 FLOPs (one multiplication and one addition). This is represented by the smsp__sass_thread_inst_executed_op_ffma_pred_on.sum metric.
- FADD_count: This is represented by the smsp__sass_thread_inst_executed_op_fadd_pred_on.sum metric.
- FMUL_count: This is represented by the smsp__sass_thread_inst_executed_op_fmul_pred_on.sum metric.
Step 3: Calculate the bytes transferred
Next, we calculate the total data transferred to and from DRAM. The ncu metrics provide the number of DRAM sectors read and written. Assuming a typical sector size of 32 bytes for modern GPUs:
Total_DRAM_bytes = (dram__sectors_read.sum + dram__sectors_write.sum) * 32
Step 4: Calculate the Arithmetic Intensity
With the FLOPs and total bytes, we can now calculate the Arithmetic Intensity:
AI = FLOPs / Total_DRAM_bytes
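Steps 2–4 are simple arithmetic once the counter sums are extracted from the ncu log. A sketch with made-up counter values:

```python
# Steps 2-4: FLOPs, DRAM bytes, and Arithmetic Intensity from ncu counters.
# The counter values below are illustrative, not from a real profile.
SECTOR_BYTES = 32        # typical DRAM sector size on modern GPUs

fma = 1_000_000          # ..._op_ffma_pred_on.sum
fadd = 250_000           # ..._op_fadd_pred_on.sum
fmul = 250_000           # ..._op_fmul_pred_on.sum
sectors_read = 40_000    # dram__sectors_read.sum
sectors_write = 10_000   # dram__sectors_write.sum

flops = 2 * fma + fadd + fmul                       # FMA counts as 2 FLOPs
total_dram_bytes = (sectors_read + sectors_write) * SECTOR_BYTES
ai = flops / total_dram_bytes                       # FLOPs per byte
```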
Step 5: Calculate execution time
To find the application’s performance in FLOP/s, we also need the execution time. For this, we can use NVIDIA Nsight Systems (nsys), a system-wide profiler that can accurately measure the runtime of application segments. We run our application again, this time with nsys, to generate a time-based report. From this report, we can extract the total GPU running time.
nsys profile -f true -o <output_file> python3 <script>
Our nsys command becomes:
nsys profile -f true -o time.qdrep python3 main.py /imagenet
--arch resnet50 --epochs 1 --batch-size 10 --print-freq 10
--seed 42
After running this command, we can get the GPU_RUNNING_TIME.
Step 6: Calculate the application performance
Finally, we calculate the achieved performance in FLOP/s by dividing the total FLOPs by the execution time:
FLOP/s = FLOPs / GPU_RUNNING_TIME
This value gives us the “attainable FLOP/s” that we will plot on our Roofline graph.
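Continuing the sketch with an illustrative FLOP count and GPU running time (not real measurements):

```python
# Step 6: achieved FLOP/s = total FLOPs / GPU running time.
total_flops = 4.0e12      # from Step 2 (illustrative value)
gpu_running_time = 2.5    # seconds, from the nsys report (illustrative)

achieved_flops = total_flops / gpu_running_time
# The point (AI, achieved FLOP/s) is what gets plotted under the roof.
```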
Profiling with torch
For applications written in PyTorch, the built-in torch.profiler.profile offers a user-friendly way to gather performance data. There are two options provided to developers:
- Use the Profiler Context Manager
- Targeted profiling for specific neural network layers
Profiler Context Manager
The part of the code that we want to profile can be wrapped within the with torch.profiler.profile() context manager. In the with statement, you can define the activities to trace (CPU, CUDA, or both), set a schedule to profile specific training steps, and choose whether to record tensor shapes, memory usage, or FLOPs. Once inside the context, you must call prof.step() at the end of every iteration to signal the profiler to advance, especially when a schedule is used.
with profile(
    activities=[...],                        # CPU, CUDA, or both
    schedule=torch.profiler.schedule(...),
    record_shapes=...,                       # True or False
    profile_memory=...,
    with_flops=...
) as prof:
    ...                                      # the work being profiled
    prof.step()                              # advance the profiler schedule
- activities: Specify whether to profile the CPU, CUDA, or both.
- schedule: Useful for profiling multiple steps in the training loop. If the schedule parameter is used, prof.step() must be called to move to the next step.
- record_shapes: Whether to record the shapes of the tensors.
- profile_memory: Whether to capture memory usage.
- with_flops: Experimental; used to estimate FLOPs for operators.
Our profiler call becomes:
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=2),
    record_shapes=True,
    profile_memory=True,
    with_flops=True
) as prof:
    ...          # training loop body
    prof.step()
Targeted profiling for specific neural network layers
The profiler can also be used in a more targeted manner to analyze specific layers of a neural network. This is useful to check whether a specific layer contributes more to the runtime than the other layers, giving the developer the option of modifying that layer. While this approach is very easy to use, in most cases the first option works better. The PyTorch profiler results can also be exported and visualized in TensorBoard.
prof = torch.profiler.profile()
prof.start()
x = self.conv2(x)
prof.stop()
LLMs and Roofline Modeling
Coming to the topic everyone has been waiting for – does Roofline Modeling help with LLM performance analysis? The short answer is yes.
LLMs are complex neural network architectures with billions of parameters and huge datasets to process. While training is a very resource-intensive task, inference and fine-tuning also need to be efficient.
- Bottlenecks: During inference, LLMs can suffer from bottlenecks due to the sheer number of parameters they work with. These parameters are the weights of the model, and moving them causes memory bandwidth pressure. Using Roofline Modeling, the specific layers can be profiled for bottlenecks.
- Hardware selection: As most organizations fine-tune existing models rather than training them from scratch, selecting the right infrastructure is crucial for managing costs. For instance, selecting hardware based on your LLM architecture, or optimizing your model to run on a specific architecture, can cut training and inference costs.
Conclusion
The Roofline Model offers a powerful visual analysis of application performance. By visualizing application performance against memory and compute limits, it provides clear guidance in choosing the best way to approach optimizations. While this article only considered naive Roofline Models, there are more advanced techniques such as Hierarchical Roofline Models or adding ceilings for specific compute optimizations.
References
[1] https://docs.nersc.gov/tools/performance/roofline/
[2] https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html
[3] https://github.com/pytorch/examples/tree/main/imagenet