Data Format Fundamentals — Single Precision (FP32) vs Half Precision (FP16)
Now, let’s take a closer look at the FP32 and FP16 formats. FP32 and FP16 are IEEE 754 formats that represent floating-point numbers using 32 bits and 16 bits of binary storage, respectively. Each format comprises three parts: a) a sign bit, b) exponent bits, and c) mantissa bits. FP32 and FP16 differ in the number of bits allocated to the exponent and mantissa (FP32 uses 8 exponent bits and 23 mantissa bits; FP16 uses 5 exponent bits and 10 mantissa bits), which results in different value ranges and precisions.
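As a quick illustration of this bit layout, the sketch below (plain Python plus NumPy, a tooling choice of mine rather than anything prescribed above) extracts the sign, exponent, and mantissa fields from a concrete value in both formats:

```python
import struct
import numpy as np

def fp32_fields(x):
    """Split an FP32 value into its sign (1 bit), exponent (8 bits), and mantissa (23 bits)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]        # raw 32-bit pattern
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def fp16_fields(x):
    """Split an FP16 value into its sign (1 bit), exponent (5 bits), and mantissa (10 bits)."""
    bits = int(np.array(x, dtype=np.float16).view(np.uint16))  # raw 16-bit pattern
    return bits >> 15, (bits >> 10) & 0x1F, bits & 0x3FF

print(fp32_fields(-6.5))   # (1, 129, 5242880)
print(fp16_fields(-6.5))   # (1, 17, 640)
```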
How do we convert FP16 and FP32 bit patterns to real values? According to the IEEE 754 standard, the decimal value for FP32 = (-1)^(sign) × 2^(decimal exponent − 127) × (implicit leading 1 + decimal mantissa), where 127 is the exponent bias. For FP16, the formula becomes (-1)^(sign) × 2^(decimal exponent − 15) × (implicit leading 1 + decimal mantissa), where 15 is the corresponding exponent bias. (The bias is what lets the unsigned exponent bits encode both negative and positive exponents.)
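For example, plugging the FP32 fields of −6.5 extracted above (sign = 1, exponent = 129, mantissa = 5242880) into the formula recovers the original value. A minimal sketch in plain Python (the variable names are mine, purely for illustration):

```python
# FP32 fields of -6.5 from the previous snippet: sign=1, exponent=129, mantissa=5242880
sign, exponent, mantissa = 1, 129, 5242880

# (-1)^sign * 2^(exponent - 127) * (1 + mantissa / 2^23), valid for normal (non-zero exponent) numbers
value = (-1) ** sign * 2.0 ** (exponent - 127) * (1 + mantissa / 2 ** 23)
print(value)   # -6.5

# The FP16 version only changes the bias (15) and the mantissa width (10 bits); its fields are (1, 17, 640)
print((-1) ** 1 * 2.0 ** (17 - 15) * (1 + 640 / 2 ** 10))   # -6.5
```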
In this sense, the value range for FP32 is roughly [-2¹²⁷, 2¹²⁷] ≈ [-1.7*1e38, 1.7*1e38] (about 3.4*1e38 once the mantissa term is included), and the value range for FP16 is roughly [-2¹⁵, 2¹⁵] ≈ [-32768, 32768] (the largest finite FP16 value is 65504). Note that the decimal exponent for FP32 lies between 0 and 255, and the largest value 0xFF is excluded because it is reserved for infinity and NaN. That is why the largest usable decimal exponent is 254 − 127 = 127. An analogous rule applies to FP16.
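These limits are easy to verify empirically. The sketch below uses NumPy (again an assumption on my part; any IEEE 754 implementation would do) to print each format’s largest finite value and to show FP16 overflowing to infinity:

```python
import numpy as np

print(np.finfo(np.float32).max)   # ~3.4028235e+38, i.e. (2 - 2**-23) * 2**127
print(np.finfo(np.float16).max)   # 65504.0,        i.e. (2 - 2**-10) * 2**15

# Casting a value beyond the FP16 range overflows to infinity
# (NumPy may also emit an overflow RuntimeWarning here).
print(np.float16(70000.0))        # inf
```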
For precision, note that both the exponent and the mantissa contribute to the smallest representable magnitude (via denormalized, also called subnormal, numbers), so FP32 can represent values as small as 2^(-23)*2^(-126) = 2^(-149), and FP16 can represent values as small as 2^(-10)*2^(-14) = 2^(-24).
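A short NumPy sketch (assuming NumPy ≥ 1.22, which exposes smallest_subnormal; this is my tooling choice) confirms these limits and shows FP16 underflowing to zero:

```python
import numpy as np

print(np.finfo(np.float32).smallest_subnormal)   # ~1.4e-45, i.e. 2**-149
print(np.finfo(np.float16).smallest_subnormal)   # ~6.0e-08, i.e. 2**-24

# A value well below 2**-24 cannot be represented in FP16 and underflows to zero.
print(np.float16(2.0 ** -26))                    # 0.0
```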
The difference between the FP32 and FP16 representations brings up the key concerns of mixed precision training: different layers and operations of a deep learning model are either insensitive or sensitive to value range and precision, and they need to be addressed separately.
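To make that sensitivity concrete, here is one illustrative case of my own (not from the text above): above 2048, consecutive FP16 values are 2 apart, so a small update can be lost entirely in FP16 while FP32 keeps it:

```python
import numpy as np

# FP16 spacing at 2048 is 2, so adding 1 rounds back to 2048 and the update vanishes.
print(np.float16(2048.0) + np.float16(1.0))   # 2048.0
print(np.float32(2048.0) + np.float32(1.0))   # 2049.0
```

Effects like this are one reason why range- and precision-sensitive operations are treated differently in mixed precision training.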