A series of powerful, universal, and fine-tunable 1.58-bit language models.


In this blog post, we present the key highlights and rationale behind the Falcon-Edge series – a collection of powerful, universal, and fine-tunable language models available in ternary format, based on the BitNet architecture.

Drawing from our experience with BitNet, Falcon-Edge introduces and validates a new pre-training paradigm that delivers a full-scope output from a single training process, concurrently yielding both non-quantized and quantized model variants. This comprehensive approach produces a non-BitNet model in bfloat16 format, the native BitNet model, and a pre-quantized BitNet variant specifically engineered for effortless fine-tuning, enabling users and developers to precisely tailor these models to their specific applications and needs.

Available now in two sizes – 1 billion and 3 billion parameters – each size comes in both base and instruction-tuned variants. Discover the Falcon-Edge series on our dedicated Hugging Face collection.



Introduction

Large Language Models (LLMs), by design, are inherently large and resource-intensive. As demand grows to deploy these models efficiently on edge devices, research into model compression has accelerated. Recent efforts, such as those by DeepSeek and Llama 4, explore training with reduced-precision formats – down to FP8 – to enhance deployment scalability. However, many state-of-the-art methods emphasize post-training quantization. In contrast to these approaches, BitNet introduces a fundamentally different paradigm: unlike reduced-precision training, which still relies on floating-point formats, and post-training quantization, which adjusts weights after full-precision training, BitNet operates with the lowest possible precision – ternary weights ({-1, 0, 1}) – directly during training, enabling an end-to-end ultra-efficient model design.

These ternary weights pave the way for a “matmul-free” LLM design that is notably faster and remarkably memory-efficient in practice. The main challenge of this approach is the need to pre-train BitNet models, which can be computationally demanding and costly for typical users.
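To make the “matmul-free” intuition concrete, here is a minimal illustrative sketch (not taken from the Falcon-Edge codebase): since a ternary weight matrix only contains {-1, 0, 1}, each output element is just a sum and difference of activations, with no real multiplications needed.

import torch

# Ternary weights: every entry is -1, 0 or 1.
w = torch.tensor([[1, 0, -1],
                  [0, 1, 1]], dtype=torch.float32)
x = torch.tensor([0.5, -2.0, 3.0])  # activations

# A standard matmul ...
y_matmul = w @ x

# ... is equivalent to adding activations where w == 1 and subtracting them
# where w == -1, with no multiplications involved.
y_addsub = torch.stack([x[row == 1].sum() - x[row == -1].sum() for row in w])

assert torch.allclose(y_matmul, y_addsub)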



Falcon-Edge, a series of powerful models

Leveraging the learnings from our center's pre-training data strategies, we pre-train our models on an internal data mixture of roughly 1.5 trillion tokens. We use the classic WSD (Warmup-Stable-Decay) learning rate scheduler for pre-training.
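For readers unfamiliar with WSD, the schedule warms the learning rate up linearly, keeps it constant for most of training, and only decays it at the very end. Below is a minimal sketch of such a schedule; the warmup and decay fractions are illustrative defaults, not the ones used for Falcon-Edge:

def wsd_lr(step, total_steps, peak_lr, warmup_frac=0.01, decay_frac=0.1):
    # Warmup-Stable-Decay: linear warmup, constant plateau, then a linear
    # decay to zero over the last `decay_frac` of training.
    warmup_steps = int(warmup_frac * total_steps)
    decay_steps = int(decay_frac * total_steps)
    if step < warmup_steps:
        return peak_lr * step / max(warmup_steps, 1)
    if step < total_steps - decay_steps:
        return peak_lr
    return peak_lr * (total_steps - step) / max(decay_steps, 1)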

We evaluate our models (base and instruct versions) on the previous Hugging Face leaderboard v2 benchmark and report the normalized results below compared with other models of comparable size:

[Figure: normalized leaderboard v2 results for the base models]

[Figure: normalized leaderboard v2 results for the instruct models]

Additional results (leaderboard v1) comparing our instruct models with Microsoft's latest BitNet model:

[Figure: leaderboard v1 results – Falcon-Edge instruct models vs. Microsoft's BitNet model]

Falcon-Edge demonstrates on-par or better performance than models of comparable size on the leaderboard v2 tasks, showing that it is possible to train powerful BitNet models on desired domains while remaining competitive on other tasks.



Falcon-Edge, a series of universal models

If we look closer at the formula of the BitNet linear layer for inference (expressed as Python code):

import torch
import torch.nn as nn
import torch.nn.functional as F


def activation_norm_quant(x):
    # Quantize the activations to int8 with a per-row scale.
    scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp_(min=1e-5)
    y = (x * scale).round().clamp_(-128, 127)
    return y, scale


class BitLinear(nn.Linear):

    def post_quant_process(self, input, input_scale, weight_scale):
        # Rescale the int8 x ternary matmul output back to half precision.
        out = input / (input_scale * weight_scale)
        return out

    def forward(self, input):
        w = self.weight
        # Unpack the packed ternary weights ({-1, 0, 1}) into `self.dtype`.
        w_quant = unpack_weights(w, dtype=self.dtype)
        input_quant, input_scale = activation_norm_quant(input)
        y = F.linear(input_quant.to(self.dtype), w_quant)
        y = self.post_quant_process(y, input_scale, self.weight_scale)
        if self.bias is not None:
            y += self.bias.view(1, -1).expand_as(y)
        return y

The activation_norm_quant function quantizes the activations to int8 format; the activation is then computed back in half precision by dividing it by x_scale. Because the model has been trained with fake 8-bit activation quantization, we argue that it is possible to make the approximation:

x_quant, x_scale = activation_norm_quant(x)
x ~= (x_quant / x_scale)
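As a quick sanity check (a minimal sketch, not part of the original code), one can verify numerically that the quantize/dequantize round trip stays close to the original activations, using the activation_norm_quant function defined above:

import torch

x = torch.randn(4, 512)  # dummy activations
x_quant, x_scale = activation_norm_quant(x)

# Dequantizing recovers the activations up to the int8 rounding error,
# which is bounded by 0.5 / scale per element.
x_approx = x_quant / x_scale
print((x - x_approx).abs().max())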

Therefore, instead of quantizing the model post-training, injecting the weight scale after quantizing the weights should lead to a good enough “approximation” of the non-BitNet version of the model:

def _weight_quant(w):
    # Quantize the weights to ternary values ({-1, 0, 1}) with a global scale.
    scale = 1.0 / w.abs().mean().clamp_(min=1e-5)
    u = (w * scale).round().clamp_(-1, 1)
    return u, scale


state_dict_quant = {}

for param_name, param_value in state_dict.items():
    # Keep parameters that are not quantized (e.g. embeddings, norms) as-is.
    if _is_param_to_not_quantize(param_name):
        state_dict_quant[param_name] = param_value
        continue

    # Quantize to ternary, then inject the weight scale back.
    param_value, param_scale = _weight_quant(param_value)
    param_value = param_value / param_scale

    state_dict_quant[param_name] = param_value

We confirm this by running end-to-end evaluations on the bfloat16 variants of our 1B and 3B base models; the results are shown below:

[Figure: evaluation results of the bfloat16 variants of the 1B and 3B base models]

The bfloat16 counterparts of the models can be loaded directly via Hugging Face transformers by passing revision="bfloat16" to the from_pretrained method:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/Falcon-E-1B-Base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    revision="bfloat16"
)
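Once loaded, the bfloat16 variant behaves like any regular causal language model. Continuing from the snippet above, a minimal usage sketch:

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))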



Falcon-Edge, a series of fine-tunable BitNet models

To the best of our knowledge, with the exception of the most recent release from Microsoft, previous BitNet releases have focused only on releasing the final quantized model, making it usable only for inference. Similarly to the Microsoft release, we propose to extend the accessibility of BitNet research and applications by releasing the pre-quantized weights of our models. That way, users can either fine-tune them on their target domain or continue pre-training the BitNet checkpoint, as long as nn.Linear layers are replaced by BitnetLinear layers and the model is quantized into BitNet format after training. Since the weights correspond to the pre-quantized weights, performing text generation without replacing the nn.Linear layers with BitnetLinear layers will produce gibberish output.

The pre-quantized weights can be downloaded via Hugging Face's transformers library by setting the revision argument to prequantized:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/Falcon-E-1B-Base"

tokenizer = AutoTokenizer.from_pretrained(model_id, revision="prequantized")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    revision="prequantized"
)
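As noted above, generating directly with this checkpoint through plain nn.Linear layers would produce gibberish. Below is a minimal sketch of running generation on the pre-quantized weights, assuming the replace_linear_with_bitnet_linear utility from onebitllms (covered in the next section) applies the same on-the-fly quantization used during training:

from onebitllms import replace_linear_with_bitnet_linear

# Replace nn.Linear layers with BitnetLinear layers so that the
# pre-quantized weights are interpreted correctly at inference time.
model = replace_linear_with_bitnet_linear(model)

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))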

In this way, we hope to help foster an ecosystem around the first powerful 1-bit fine-tunes from the community. We offer the community the tools to easily get started and fine-tune their own versions of powerful BitNet models by packaging all the needed utility methods for fine-tuning the pre-quantized weights into a Python package called onebitllms, which we cover in the next section.



Introducing onebitllms – a lightweight Python toolkit for 1-bit LLM training


In this release, we also introduce onebitllms – a lightweight Python package that can be plugged into your favorite LLM fine-tuning tools in order to fine-tune any pre-quantized BitNet model. At the time of writing, onebitllms exposes the following main functionalities:

  • A utility method to convert the pre-quantized model checkpoints into BitNet training format so they can be passed to your favorite LLM fine-tuning framework. We have currently tested the library with Hugging Face's trl library.
  • A utility method to quantize the trained checkpoint into BitNet format, as well as into the usual bfloat16 format.
  • For more fine-grained control: bare BitnetLinear layers and Triton kernels that can be injected and used in your pre-training framework.

Currently, only full fine-tuning is supported through this framework. While the model sizes in this release are relatively small, supporting Parameter-Efficient Fine-Tuning (PEFT) methods for BitNet models remains an exciting and impactful open question for upcoming BitNet models.

To get started, simply install the package through pip or from source, and check out the examples/ folder inside the source code.

import torch

from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer
from onebitllms import replace_linear_with_bitnet_linear, quantize_to_1bit

model_id = "tiiuae/Falcon-E-1B-Base"

# Load the pre-quantized checkpoint.
tokenizer = AutoTokenizer.from_pretrained(model_id, revision="prequantized")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    revision="prequantized"
)
# Swap nn.Linear layers for BitnetLinear layers before fine-tuning.
model = replace_linear_with_bitnet_linear(model)

trainer = SFTTrainer(
    model,
    ...
)

trainer.train()

# Quantize the fine-tuned checkpoint back into BitNet format.
quantize_to_1bit(output_directory)

With this package, we hope to accelerate research and development around ternary-format LLMs, and we hope to see many derivatives of Falcon-Edge and other future powerful BitNet models developed by the community.



Going further

We believe this release opens up multiple interesting directions – among all the possible follow-up directions, we currently think that the following open questions will make BitNet models much more impactful in the near future:

  • Writing more powerful GPU inference kernels for the BitNet architecture: leveraging the same core ideas behind bitnet.cpp, we hope that this release will encourage the research community to focus on developing powerful BitNet inference kernels for faster inference on GPUs – thus making them faster than native models on GPUs.
  • Supporting PEFT methods for BitNet fine-tuning: this remains an unexplored research question that would open up multiple new possibilities for BitNet models.
  • More rigorous investigation of the universality of BitNet checkpoints: while we observe that simply injecting the weight scale leads to a decent non-BitNet checkpoint, we believe more research can be done to reduce the performance gap between the BitNet checkpoint and its bfloat16 counterpart, ideally making the conversion fully degradation-free.
  • Multi-modal BitNet models: we hope these BitNet foundation models, together with the onebitllms package, can serve as foundational work for creating the first multi-modal BitNet VLMs (Vision Language Models).
  • More optimized BitNet training kernels: to write our kernels, we took a two-stage approach, first computing the global maximum and then reusing it block-wise for normalization (a sketch of this idea is shown after this list). This approach can be revised to write more efficient kernels. In our tests, we estimate the overhead of BitNet pre-training over non-BitNet pre-training to be around ~20%. We will soon release more extensive numbers on the training overhead introduced by BitNet.
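To illustrate the two-stage idea mentioned in the last point, here is a minimal PyTorch sketch (not the actual Triton kernel): stage one computes the global per-row absolute maximum, and stage two quantizes the activations block by block while reusing that global scale for normalization.

import torch

def two_stage_activation_quant(x, block_size=256):
    # Stage 1: one full pass over each row to get the global abs-max scale.
    scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp_(min=1e-5)

    # Stage 2: process each row in fixed-size blocks (as a GPU kernel would),
    # normalizing every block with the global scale computed in stage 1.
    blocks = []
    for start in range(0, x.shape[-1], block_size):
        block = x[..., start:start + block_size]
        blocks.append((block * scale).round().clamp_(-128, 127))
    return torch.cat(blocks, dim=-1), scale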



Citation

If you find this work useful for your research, please consider citing our work, as well as all the foundational work behind BitNet models:

@misc{tiionebitllms,
    title = {Falcon-E, a series of powerful, universal and fine-tunable 1.58bit language models.},
    author = {Falcon-LLM Team},
    month = {May},
    url = {https://falcon-lm.github.io/blog/falcon-edge},
    year = {2025}
}
More References
@misc{ma2025bitnetb1582b4ttechnical,
      title={BitNet b1.58 2B4T Technical Report},
      author={Shuming Ma and Hongyu Wang and Shaohan Huang and Xingxing Zhang and Ying Hu and Ting Song and Yan Xia and Furu Wei},
      year={2025},
      eprint={2504.12285},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.12285},
}
@misc{wang2025bitnetcppefficientedgeinference,
      title={Bitnet.cpp: Efficient Edge Inference for Ternary LLMs},
      author={Jinheng Wang and Hansong Zhou and Ting Song and Shijie Cao and Yan Xia and Ting Cao and Jianyu Wei and Shuming Ma and Hongyu Wang and Furu Wei},
      year={2025},
      eprint={2502.11880},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.11880},
}
@misc{mekkouri2024extremequantization,
      title={1.58-Bit LLM: A New Era of Extreme Quantization},
      author={Mohamed Mekkouri and Marc Sun and Leandro von Werra and Thomas Wolf},
      year={2024},
}
@misc{ma2024era1bitllmslarge,
      title={The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits},
      author={Shuming Ma and Hongyu Wang and Lingxiao Ma and Lei Wang and Wenhui Wang and Shaohan Huang and Li Dong and Ruiping Wang and Jilong Xue and Furu Wei},
      year={2024},
      eprint={2402.17764},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2402.17764},
}
@misc{wang2023bitnetscaling1bittransformers,
      title={BitNet: Scaling 1-bit Transformers for Large Language Models},
      author={Hongyu Wang and Shuming Ma and Li Dong and Shaohan Huang and Huaijie Wang and Lingxiao Ma and Fan Yang and Ruiping Wang and Yi Wu and Furu Wei},
      year={2023},
      eprint={2310.11453},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2310.11453},
}


