As deep learning models grow larger and datasets expand, practitioners face an increasingly common bottleneck: GPU memory bandwidth. While cutting-edge hardware offers FP8 precision to speed up training and inference, most data scientists and ML engineers work with older GPUs that lack this capability.
This gap in the ecosystem is what motivated me to build Feather, an open-source library that uses a software-based approach to deliver FP8-like performance improvements on widely available hardware. I created this tool to make efficient deep learning more accessible to the broader ML community, and I welcome contributions.
Notation & Abbreviations
- FPX: X-bit floating point number
- UX: X-bit unsigned integer
- GPU: Graphics processing unit
- SRAM: Static RAM (on-chip GPU Cache)
- HBM: High bandwidth memory (GPU VRAM)
- GEMV: General Matrix-Vector multiplication
Motivation
FP8 processing has proven effective in the deep learning community [1]; however, only specific recent hardware architectures (e.g. Ada and Blackwell) support it, which limits how many practitioners and researchers can take advantage of it. I myself have an older GPU, which unfortunately does not support FP8 operations at the hardware level.
Inspired by software-based workarounds such as software rendering (accelerating rendering in software on computers without native hardware acceleration for gaming), this article proposes a solution that makes the benefits of FP8 datatypes available without native hardware support.
Packing FP8 & FP16 in FP32 containers
Building on bitwise operations and packing techniques, this article presents an algorithm that packs two FP16s or four FP8s into a single FP32. This fits two or four times as many values into the same amount of memory, lowering the memory footprint while sacrificing only a small amount of precision.
One might argue that we are performing redundant computation by packing and unpacking on every access. However, consider deep learning operations: most of the time, they are memory-bound rather than compute-bound. This is the same bottleneck that algorithms like FlashAttention address; however, FlashAttention uses tiling to keep data in fast SRAM, whereas Feather compresses data to reduce memory traffic.
GPU Memory Hierarchy
Take a look at this diagram. SRAM is the fastest accessible GPU memory region and has the highest bandwidth (excluding the registers themselves), but it is limited to only about 20 MB. HBM can be viewed as the VRAM of the GPU itself, and it has roughly one-seventh the bandwidth of SRAM.
The GPU cores are fast enough to finish the computation almost immediately, but they spend most of their time sitting idle, waiting for data to finish loading and being written back. That is what I mean by memory-bound: the bottleneck is not the math, but the data transfer across the GPU's memory hierarchy.
Lower Precision Types & Bandwidth
Most of the time, the values involved in a computation are confined to ranges around zero because of normalisation. Engineers therefore developed lower-precision types such as FP8 and FP16, which allow for higher effective bandwidth. One might wonder how lowering the precision increases bandwidth: for the same number of bytes moved, we are effectively loading two values in the place of one with FP16, and four values in the place of one with FP8. We are trading precision for higher effective bandwidth to tackle memory-bound operations.
Hardware Level Support
Just like AVX-512 instructions, which are supported only on a limited set of CPUs, hardware support for low-precision types such as FP16 and FP8 is likewise limited, and FP8 in particular is available only on recent architectures. If you are on an RTX 30- or RTX 20-series GPU from NVIDIA, you will not be able to take advantage of the lower-precision FP8 type. This is exactly the problem that Feather attempts to solve.
Packing Method
Using bitwise operators, one can easily pack two FP16 values into a single FP32. The algorithm is described below, followed by a short NumPy sketch after the unpacking steps.
Packing FP16
- Cast the input FP32 into an FP16; this step can be done easily with NumPy's astype function.
- Reinterpret each FP16 as a U16 and then widen it to a U32; this sets the upper 16 bits to zeros and the lower 16 bits to the FP16 bit pattern.
- Shift one of them left by 16 using the shift operator and combine the two using the bitwise OR operator.
Unpacking FP16
- Extract the lower 16 bits using the bitwise AND operator with the mask 0xFFFF.
- Extract the upper 16 bits by shifting right by 16 and then applying a bitwise AND with the mask 0xFFFF.
- Reinterpret both U16 values back to FP16, and cast to FP32 if needed.
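Here is a minimal NumPy sketch of the steps above. It is illustrative only; the function names are mine, not Feather's API.

```python
import numpy as np

def pack_fp16_pair(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pack two FP32 arrays into one uint32 array holding two FP16s each."""
    a16 = a.astype(np.float16).view(np.uint16).astype(np.uint32)  # lower 16 bits
    b16 = b.astype(np.float16).view(np.uint16).astype(np.uint32)  # upper 16 bits
    return (b16 << 16) | a16

def unpack_fp16_pair(packed: np.ndarray):
    """Reverse of pack_fp16_pair: recover both values as FP32."""
    lo = (packed & 0xFFFF).astype(np.uint16).view(np.float16).astype(np.float32)
    hi = ((packed >> 16) & 0xFFFF).astype(np.uint16).view(np.float16).astype(np.float32)
    return lo, hi

x = np.array([1.5, -2.25], dtype=np.float32)
y = np.array([0.125, 3.0], dtype=np.float32)
assert np.allclose(unpack_fp16_pair(pack_fp16_pair(x, y)), (x, y))
```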
Packing FP8
FP8 has two widely used formats: E5M2 and E4M3. One cannot reuse the exact algorithm used for packing two FP16s into an FP32, because the CPU does not support FP8 types natively (whereas it does support FP16, i.e. half precision); this is also why NumPy has no native FP8 dtype.

Casting an FP16 to FP8-E5M2 is simple, as seen in the figure, because both have the same number of exponent bits and differ only in their fraction bits.
FP8-E5M2 Packing
- Cast the input FP32 into an FP16 (again, easily done with NumPy's astype), or take the input directly as FP16.
- Reinterpret it as a U16 and shift right by 8 to isolate the upper 8 bits; these 8 bits (sign, 5 exponent bits, and the top 2 fraction bits) are exactly the E5M2 value.
- Do this for all four FP32s (or FP16s).
- Now, using the left-shift operator, shift them by 0, 8, 16, and 24 bits, and combine them using the bitwise OR operator.
Once again, unpacking is straightforward; it is the exact reverse of packing.
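A short NumPy sketch of the E5M2 path, under the same caveat that the function names are illustrative rather than Feather's API:

```python
import numpy as np

def pack_fp8_e5m2(a, b, c, d):
    """Pack four FP32/FP16 arrays into one uint32 array (one E5M2 byte each)."""
    def to_e5m2_byte(x):
        # Keep the top 8 bits of the FP16 pattern: sign, 5 exponent bits, 2 fraction bits.
        return (np.asarray(x, dtype=np.float16).view(np.uint16) >> 8).astype(np.uint32)
    return (to_e5m2_byte(a)
            | (to_e5m2_byte(b) << 8)
            | (to_e5m2_byte(c) << 16)
            | (to_e5m2_byte(d) << 24))

def unpack_fp8_e5m2(packed):
    """Reverse of packing: place each byte back in the upper half of an FP16."""
    out = []
    for shift in (0, 8, 16, 24):
        bits = ((packed >> shift) & 0xFF).astype(np.uint16) << 8
        out.append(bits.view(np.float16).astype(np.float32))
    return out
```

Note that truncating to the top byte is a round-toward-zero cast; rounding to nearest would cost a few extra bit operations.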
Packing an FP8-E4M3 is not as simple as packing an FP16 or FP8-E5M2, because the number of exponent bits does not match FP16's.

Instead of implementing the E4M3 casting math from scratch, Feather relies on an existing library that already implements it.
That library provides casting support for commonly used FP8 formats, such as E5M2 and E4M3, for NumPy arrays. Using the same astype function, we can perform the cast just as we did for FP16. The rest of the packing algorithm is identical to the E5M2 case above, so I'm skipping it here.
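For readers who want to see the shape of it anyway, here is a purely illustrative sketch. The post does not name the casting library; the code below assumes `ml_dtypes`, one library that exposes NumPy-compatible FP8 dtypes such as `float8_e4m3fn`, and the function name is again mine.

```python
import numpy as np
import ml_dtypes  # assumption: a stand-in for the (unnamed) FP8 casting library

def pack_fp8_e4m3(a, b, c, d):
    """Pack four FP32 arrays into one uint32 array, one E4M3 byte per value."""
    def to_byte(x):
        # astype handles the E4M3 rounding; view reinterprets the single byte.
        return x.astype(ml_dtypes.float8_e4m3fn).view(np.uint8).astype(np.uint32)
    return to_byte(a) | (to_byte(b) << 8) | (to_byte(c) << 16) | (to_byte(d) << 24)
```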
Triton GPU Kernels
After packing, we need an algorithm (a kernel) that understands this packed datatype and performs the computation. Passing the packed data to a kernel written for FP32 or FP64 results in undefined computation, because the bits no longer represent valid FP32 or FP64 values. Writing a kernel that takes the packed datatype as input in CUDA is not an easy task and is error-prone. This is exactly where Triton shines: it is a domain-specific language that leverages a custom intermediate representation for GPU kernels. In layman's terms, it lets you write GPU kernels in Python itself, without having to write CUDA kernels in C.
Triton kernels do exactly what was mentioned previously; the algorithm is as follows:
- Load the packed array into memory
- Unpack the values and upcast them to FP32 for accumulation
- Perform the computation
Note that upcasting is used during the computation to prevent overflow, so from a compute perspective there is no advantage. From a bandwidth perspective, however, we are moving two or four times as many values through the same amount of memory traffic.
Triton Kernel Implementation (pseudocode)
@triton.jit
def gemv_fp8_kernel(packed_matrix_ptr, packed_vector_ptr, out_ptr):
    # Get the current row to process
    row_id = get_program_id()
    # Initialize the accumulator for the dot product
    accumulator = 0
    # Iterate over the row in blocks
    for each block in row:
        # Load packed FP32 values (each contains 4 FP8s)
        packed_matrix = load(packed_matrix_ptr)
        packed_vector = load(packed_vector_ptr)
        # Unpack each FP32 into 4 FP8 values
        m_a, m_b, m_c, m_d = unpack_fp8(packed_matrix)
        v_a, v_b, v_c, v_d = unpack_fp8(packed_vector)
        # Upcast to FP32 and accumulate the partial dot products
        accumulator += (m_a * v_a) + (m_b * v_b) + (m_c * v_c) + (m_d * v_d)
    # Store the result
    store(out_ptr, accumulator)
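Below is a minimal runnable sketch in the spirit of this pseudocode, not Feather's actual kernel. It assumes the packed buffers are passed as contiguous int32 tensors (four E5M2 bytes per word), one program per output row, and names like `gemv_packed_e5m2_kernel`, `n_packed`, and `BLOCK` are mine.

```python
import triton
import triton.language as tl

@triton.jit
def gemv_packed_e5m2_kernel(packed_matrix_ptr, packed_vector_ptr, out_ptr,
                            n_packed, BLOCK: tl.constexpr):
    # One program instance per output row.
    row_id = tl.program_id(axis=0)
    acc = tl.zeros((BLOCK,), dtype=tl.float32)
    for start in range(0, n_packed, BLOCK):
        offs = start + tl.arange(0, BLOCK)
        mask = offs < n_packed
        # Each int32 word holds four packed E5M2 bytes.
        packed_m = tl.load(packed_matrix_ptr + row_id * n_packed + offs, mask=mask, other=0)
        packed_v = tl.load(packed_vector_ptr + offs, mask=mask, other=0)
        for k in tl.static_range(4):
            # Move byte k into the upper half of a 16-bit word, bitcast to FP16
            # (E5M2 shares FP16's exponent layout), then upcast to FP32.
            m_val = ((((packed_m >> (8 * k)) & 0xFF) << 8).to(tl.uint16)
                     .to(tl.float16, bitcast=True).to(tl.float32))
            v_val = ((((packed_v >> (8 * k)) & 0xFF) << 8).to(tl.uint16)
                     .to(tl.float16, bitcast=True).to(tl.float32))
            acc += m_val * v_val
    # Reduce the per-lane partial sums and write one FP32 output per row.
    tl.store(out_ptr + row_id, tl.sum(acc, axis=0))
```

On the host side, a packed uint32 buffer can be reinterpreted as int32 (for example via NumPy's `view(np.int32)` or PyTorch's `tensor.view(torch.int32)`) before launching the kernel with a grid of `(num_rows,)`.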
Results
Hardware:
CUDA Version: 13.0
Python Version: 3.13.9
GEMV Benchmark (M = 16384, N = 16384) (MxN matrix)
| Implementation | Time (microseconds) | Speedup |
| --- | --- | --- |
| PyTorch (FP32) | 5,635 | (baseline) |
| Feather (FP8-E4M3) | 2,703 | 2.13x |
| Feather (FP8-E5M2) | 1,679 | 3.3x |
The theoretical performance boost is 4x, so 3.3x compares well; the remaining overhead primarily stems from pack/unpack operations and kernel launch costs.
E5M2 is faster than E4M3 because of its simpler unpacking, while E4M3 offers higher precision. However, E4M3 is significantly more complex to unpack (Feather uses a separate GPU kernel to unpack the E4M3 format).
Flash Attention Benchmark (Sequence Length = 8192, Embedding Dimension = 512)
| Implementation | Time (microseconds) | Speedup |
| --- | --- | --- |
| PyTorch (FP32) | 33,290 | (baseline) |
| Feather (FP8-E5M2) | 9,887 | ~3.3x |
Accuracy & Precision
Testing with random matrices (integer distributions in the range [-3, 3] and standard normal distributions) shows that both E4M3 and E5M2 keep numerical results within practical tolerances for deep learning operations. Accumulation errors remain manageable for typical workload sizes; however, users requiring strict numerical precision should validate their specific use case.
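As a quick, hypothetical spot check (not Feather's test suite), one can round-trip values through the truncation-based E5M2 cast and inspect the error directly:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)

# Truncation-based E5M2 round trip: keep only the top byte of the FP16 pattern.
bits = x.astype(np.float16).view(np.uint16)
x_e5m2 = (bits & np.uint16(0xFF00)).view(np.float16).astype(np.float32)

rel_err = np.abs(x_e5m2 - x) / np.maximum(np.abs(x), 1e-6)
print(f"max relative error: {rel_err.max():.3f}, mean: {rel_err.mean():.3f}")
```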
When should you use Feather?
Feather's use cases are not limited to a fixed list; you can use it wherever FP8 packing and unpacking give an advantage, such as:
- Large matrix-vector products, where loading and unloading are the bottlenecks.
- Attention-like memory-bound kernels.
- Inference or fine-tuning on RTX 30- or 20-series GPUs.
- Batch processing, where packing overhead is amortised.
When should you not use Feather?
- You have an RTX 40-series or H100 GPU (native FP8 is faster).
- Your workloads are compute-bound rather than bandwidth- or memory-bound.
- You need guaranteed precision.
Limitations of Feather
Feather is currently in the early stages of prototyping, with several areas for improvement.
- Limited operation coverage: Feather currently supports only the dot product, the GEMV subroutine, and FlashAttention.
- Accuracy validation for complete ML workloads: Feather's accuracy is currently validated only for individual operations, not for end-to-end ML workloads.
- Limited integration: Feather is a standalone implementation. Integration with PyTorch and support for autograd would make it more production-ready.
The project is open source, and community contributions are welcome! You can check out the code by following the instructions on GitHub.
Image License: all images are made by the author. Adaptation sources are clearly mentioned in the respective captions.
