What Happens When You Construct an LLM Using Only 1s and 0s


Introduction

The progress of Artificial Intelligence up until now has been defined by a straightforward, albeit expensive, rule: bigger is always better. As Large Language Models (LLMs) scale into the trillions of parameters, they show reasoning capabilities that were unimaginable just a few years ago, and they keep improving.

However, this growth has collided with a physical reality. The energy and hardware required to run these models have become unsustainable, to the point where corporations like Google and Meta are exploring nuclear power solutions just to meet their future energy demands (The Guardian) [2].

Bigger is NOT Always Better

To combat this issue, the industry has relied on compression techniques and quantization. In simple terms, this involves taking a model trained in high precision (16-bit) and rounding its weights down to lower precision (such as 8-bit or 4-bit) for inference (Frantar et al., 2022) [3]. Though this method works, it remains a makeshift solution to the larger problem, because the model was never designed to be small in the first place.

In a recent paper titled "The Era of 1-bit LLMs: All Large Language Models Are in 1.58 Bits" (Ma et al., 2024) [1], researchers from Microsoft propose a very different perspective on how LLMs are constructed. They introduce BitNet b1.58, an architecture that, instead of merely compressing a model after the fact, trains it in an extremely aggressive low-precision regime from the get-go. It forces the model to operate using only three possible values: {−1, 0, 1}. This article explores how such a severe restriction is feasible, the mathematical innovations behind the approach, and whether this method could be a viable alternative to the expensive floating-point operations that are the de facto standard in modern AI.

The Architecture: Designing a 1-Bit Brain

To understand the innovation of BitNet b1.58, we must look at the fundamental operation of a layer in a standard neural network. In modern LLMs, the nn.Linear layer stores information in a weight matrix of high-precision floating-point numbers (e.g., FP16/FP32). BitNet replaces this with a specialized BitLinear layer, which uses just three integers to store the same amount of information as a normal linear layer.

1. Achieving Ternary Weights

The core constraint of BitNet b1.58 is that every single parameter in the network's weight matrices must resolve to one of three integers: {−1, 0, 1}. Unlike Post-Training Quantization, which compresses a model after it has been trained, BitNet enforces this constraint during the training process itself.

The authors utilize an Absmean Quantization function to map continuous values to this ternary set. The process involves the following two steps: scaling and rounding.

  • Scaling: The weight matrix is first normalized by its average absolute value (γ). This ensures that the distribution of weights stays centered and consistent. The scaling factor is calculated as:

γ = (1 / (n·m)) · Σᵢⱼ |Wᵢⱼ|

n, m: Number of rows and columns of the matrix, respectively.
Wᵢⱼ: Parameter in the matrix at the i-th row and j-th column.
  • Rounding: The scaled values are then rounded to the nearest integer and clipped to ensure they fall strictly within the range [−1, 1]:

W̃ = Clip(Round(W / (γ + ϵ)), −1, 1)

W: Original weight matrix.
ϵ: Small value added to prevent division-by-zero errors.
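As a concrete illustration, here is a minimal PyTorch sketch of the two steps above; the function name and shapes are my own, not the authors' reference implementation.

```python
import torch

def absmean_quantize(W: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Map a full-precision weight matrix to ternary values {-1, 0, 1}.

    Follows the two steps described above: scale by the mean absolute
    value (gamma), then round and clip to [-1, 1].
    """
    # Scaling: gamma is the average absolute value of all entries of W.
    gamma = W.abs().mean()
    # Rounding: divide by gamma (eps avoids division by zero), round to
    # the nearest integer, and clip to the ternary range.
    W_scaled = W / (gamma + eps)
    return W_scaled.round().clamp(-1, 1)

# Example: a small random weight matrix collapses to {-1, 0, 1}.
W = torch.randn(4, 4)
print(absmean_quantize(W))
```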

2. The Training Paradox: How to Differentiate Integers

The most significant challenge the authors faced in designing the 1-bit architecture was the training process. Standard optimization algorithms, such as Stochastic Gradient Descent (SGD) or Adam, rely on a continuous and differentiable loss landscape. They calculate the gradient of the loss function and adjust the weights by a tiny amount (e.g., 0.001) in the opposite direction.

This creates a paradox:

How do you “nudge” an integer to include the changes suggested by the gradients?

For instance: if a weight is 1 and the gradient suggests moving it by −0.001, the result is 0.999. If we enforce integer states only, this value snaps right back to 1, the model never updates, and hence, it never learns.
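In code, the problem looks like this toy example (pure Python, purely illustrative):

```python
w = 1.0                     # a ternary weight
grad_step = -0.001          # update suggested by the gradient
w_updated = w + grad_step   # 0.999 in full precision
print(round(w_updated))     # snaps back to 1 -> the update is lost
```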

BitNet solves this using a Latent Weight architecture (Bengio et al., 2013) [5].

2.1 The Latent Weight Mechanism

(Source: Author)
Flowchart depicting how the authors decouple ternary and master weights to enable model training.

The model maintains two versions of all of its parameters during training:

  1. Master Weights (High-Precision): These are standard FP16/FP32 numbers that can capture small updates.
  2. Quantized Weights (Ternary): These are the discrete {−1, 0, 1} values derived from the Master Weights and used for the actual forward pass and inference.

2.2 The Forward Pass

During the forward pass, the master weights are first converted to ternary weights by the operations described above (scaling and rounding). The model then uses these ternary weights to generate the output. This ensures that the model's predictions are always representative of the constrained weights it will actually use at inference, instead of the full-precision master weights.

2.3 The Backward Pass and Update

During backpropagation, the gradients flow backward from the loss function. These gradients are then applied to the Master Weights, not the Ternary Weights.

This allows the Master Weights to accumulate small changes over many training steps. For example, consider a Master Weight whose value is 0.4 (which corresponds to a 0 in the ternary set). After several updates, it might shift to 0.45, then 0.49. It still rounds to 0, so the model's behavior doesn't change yet. However, once it crosses the rounding threshold (e.g., reaching 0.51), it will round to 1.

This mechanism, often referred to as the straight-through estimator, allows the model to learn via standard gradient descent while still ensuring that the final trained model consists exclusively of the efficient ternary weights.
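Below is a minimal PyTorch sketch of how such a layer might be wired up, assuming the absmean quantization described above and a straight-through estimator for the backward pass; the class name TernaryLinear and all details are illustrative, not the paper's reference implementation (which also quantizes activations).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TernaryLinear(nn.Module):
    """Sketch of a BitLinear-style layer with latent (master) weights."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # Master weights: full-precision, updated by the optimizer.
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        # Quantize the master weights to {-1, 0, 1} via absmean scaling.
        gamma = w.abs().mean()
        w_ternary = (w / (gamma + 1e-5)).round().clamp(-1, 1)
        # Straight-through estimator: the forward pass uses w_ternary,
        # while the backward pass treats quantization as identity, so
        # gradients flow to the master weights.
        w_ste = w + (w_ternary - w).detach()
        return F.linear(x, w_ste)

# One training step: the optimizer updates the master weights only.
layer = TernaryLinear(8, 4)
opt = torch.optim.Adam(layer.parameters(), lr=1e-3)
out = layer(torch.randn(2, 8))
out.pow(2).mean().backward()
opt.step()
```

The expression w + (w_ternary - w).detach() is the key trick: the output is computed from the ternary values, yet the gradient bypasses the non-differentiable rounding step and reaches the full-precision master weights.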

3. Elimination of Matrix Multiplication

The most significant and immediate benefit of forcing weights into {−1, 0, 1} is the elimination of floating-point multiplication, which is the most expensive operation in modern deep learning hardware.

(Source: Adapted from Ma et al., 2024 [1], Figure 1)
Eliminating floating-point numbers from the weight matrices eliminates the need for floating-point multiplications, the most expensive operation for GPUs.

In a standard Transformer (Vaswani et al., 2017) [4], the GPU must perform billions of Multiply-Accumulate (MAC) operations, where one floating-point number is multiplied by another. However, when one of the two inputs is restricted to the ternary set, multiplication ceases to be necessary:

  • Multiplication by 1 is just an addition.
  • Multiplication by −1 is just a subtraction.
  • Multiplication by 0 avoids computation entirely.

This architectural shift transforms the bulk of the computation from complex floating-point multiplications into simple additions. This drastically reduces the energy footprint of the model, as integer addition is orders of magnitude cheaper to perform than floating-point multiplication.
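To make this concrete, here is a toy NumPy sketch of a matrix-vector product with ternary weights that uses only additions and subtractions; it is purely illustrative, and a real INT8 kernel would be written very differently.

```python
import numpy as np

def ternary_matvec(W_ternary: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Multiply a ternary weight matrix by a vector using only adds/subs.

    Entries of x are added where the weight is +1, subtracted where it
    is -1, and skipped where it is 0; no multiplications are needed.
    """
    out = np.zeros(W_ternary.shape[0], dtype=x.dtype)
    for i, row in enumerate(W_ternary):
        acc = 0.0
        for w, xj in zip(row, x):
            if w == 1:
                acc += xj        # multiplication by 1 -> addition
            elif w == -1:
                acc -= xj        # multiplication by -1 -> subtraction
            # w == 0 -> skipped entirely
        out[i] = acc
    return out

W = np.array([[1, 0, -1], [-1, 1, 0]])
x = np.array([0.5, -2.0, 3.0])
print(ternary_matvec(W, x))  # matches W @ x, with no multiplications
print(W @ x)
```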

Results: The Pareto Improvement

The primary objective of the BitNet b1.58 research was not only to create a model that is smaller in size, but also to prove that extreme quantization does not have to come at the expense of intelligence. The authors compared their architecture against FP16 LLaMA models (Touvron et al., 2023) [6] on various downstream tasks, and observed some interesting findings:

1. Performance Parity with Full-Precision Models

Perhaps the most crucial finding is that the BitNet b1.58 model can perform on par with standard FP16 models. When evaluated on zero-shot accuracy on benchmarks like ARC-Challenge, Hellaswag, and Winogrande, the b1.58 model demonstrated performance comparable to that of FP16 LLaMA models.

As evident from the table below, this parity begins to manifest strongly at the 3 billion parameter mark. While the smaller models struggled slightly against the LLaMA baselines, BitNet b1.58 3B outperforms its counterpart on average zero-shot accuracy. This lends credibility to the authors' hypothesis that a ternary representation of the weight matrices is sufficient to capture the nuances and intricacies of language modeling without the need for high-precision floating-point weights.

(Source: Adapted from Ma et al., 2024 [1], Table 2)
For the smaller models (700M and 1.3B), BitNet still lags behind the standard LLaMA models, but for the 3B variant, BitNet's performance is virtually equivalent, if not superior on some benchmarks.

2. Redefining Latency and Memory Footprint

By reducing weight precision from 16 bits down to 1.58 bits, the memory footprint of both training and inference drops drastically, as expected. As shown below, BitNet b1.58 requires 3.55x less GPU memory than its LLaMA counterpart at the 3B parameter size. This reduction also alleviates the memory-bandwidth bottleneck, which is a primary constraint during LLM inference.
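For intuition, here is a back-of-the-envelope estimate of the weight storage alone (my own arithmetic, not a figure from the paper; the measured end-to-end savings are smaller because activations and the KV cache remain in higher precision):

```python
params = 70e9                            # a 70B-parameter model
fp16_gb = params * 16 / 8 / 1e9          # 16-bit weights   -> ~140 GB
ternary_gb = params * 1.58 / 8 / 1e9     # 1.58-bit weights -> ~13.8 GB
print(fp16_gb, ternary_gb, fp16_gb / ternary_gb)  # weights-only ratio ~10x
```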

A smaller memory footprint translates directly into lower latency as well. The authors observed a 2.71x reduction in inference latency at the 3B model size. Moreover, this latency gap between FP16 LLaMA and BitNet b1.58 widens as we scale the model upwards: at 70B parameters, it grows to 4.10x. This suggests a very promising scaling trend, where the larger the model, the more it benefits from the BitNet architecture.

(Source: Adapted from Ma et al., 2024 [1], Figure 2)
Latency and memory plotted against model size. The gap between standard LLaMA and BitNet widens as model size increases, which is a sign of a favorable scaling trend.

3. Energy Consumption and Arithmetic Efficiency

Apart from the efficiency gains from reducing precision, we also get profound energy savings thanks to the elimination of floating-point multiplications. By using ternary weights, BitNet relies on INT8 operations instead of FP16 ones, which sharply reduces arithmetic energy costs.

The authors applied an energy model to estimate the cost of these operations on 7nm chips. They observed that as the model size scales up, BitNet becomes increasingly efficient. Since the nn.Linear layers (where the vast majority of the savings occur) constitute a larger percentage of the total computation in larger models, the energy gap between standard LLaMA and BitNet grows with scale. For a 70B model, the end-to-end energy cost is more than 41x lower, addressing one of the most prominent environmental concerns about the deployment of large-scale AI models.

(Source: Adapted from Ma et al., 2024 [1], Figure 3)
Energy consumption plotted against model size. The combined effects of eliminating floating-point multiplications and aggressive quantization yield enormous energy savings.

4. Throughput Maximization

In real-world production environments, throughput (tokens generated per second) is often a more important metric than single-stream latency. BitNet's smaller memory footprint allows much larger batch sizes on the same GPUs.

On two 80GB A100 GPUs, the authors found that they could run a BitNet b1.58 70B model with a batch size 11 times larger than what was possible with FP16 LLaMA 70B. This resulted in an 8.9x increase in overall throughput. This finding is vital for serving infrastructure, as it implies that 1-bit LLMs could serve nearly nine times as many users as current models on the same hardware. That headroom matters for a vast range of use cases, such as real-time translation, autonomous driving, fast code generation, and many more.

(Source: Adapted from Ma et al., 2024 [1], Table 3)
At 70B, BitNet b1.58 supports an 11x larger batch size than FP16 LLaMA and generates tokens nearly 9x faster.


Conclusion

As impressive as these results are, they still represent the floor for 1-bit architectures, not the ceiling. It is important to note that the benchmarks and performance gains discussed above were measured on hardware (NVIDIA A100s) that was designed for floating-point multiplication. This means we are currently running BitNet b1.58 on chips that are not optimized for the INT8 additions on which the entire architecture stands.

This suggests that there are still efficiency gains left unexplored. If BitNet can achieve an 8-9x speedup on suboptimal hardware, then the potential gains on hardware specifically designed for integer addition, such as Groq's LPUs, could be even more substantial.

This architecture also offers a practical pathway toward deploying large 70B+ parameter models directly on local edge devices like mobile phones and laptops, without compromising intelligence.

References

[1] Ma, Shuming, et al. "The Era of 1-bit LLMs: All Large Language Models Are in 1.58 Bits." arXiv, 27 Feb. 2024, arxiv.org/abs/2402.17764.
[2] The Guardian. "Meta Signs Deal With Nuclear Plant to Power AI and Datacenters for 20 Years." 4 June 2025, www.theguardian.com/technology/2025/jun/03/meta-nuclear-power-ai.
[3] Frantar, Elias, et al. "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." arXiv, 31 Oct. 2022, arxiv.org/abs/2210.17323.
[4] Vaswani, Ashish, et al. "Attention Is All You Need." arXiv, 12 June 2017, arxiv.org/abs/1706.03762.
[5] Bengio, Yoshua, et al. "Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation." arXiv, 15 Aug. 2013, arxiv.org/abs/1308.3432.
[6] Touvron, Hugo, et al. "Llama 2: Open Foundation and Fine-Tuned Chat Models." arXiv, 18 July 2023, arxiv.org/abs/2307.09288.
