In this blog post, we present the key highlights and rationale behind the Falcon-Edge series – a set of powerful, universal, and fine-tunable language models available in ternary format, based on the BitNet architecture.
Drawing on our experience with BitNet, Falcon-Edge introduces and validates a new pre-training paradigm that delivers a full-scope output from a single training run, yielding both non-quantized and quantized model variants at once. This approach produces a non-BitNet model in bfloat16 format, the native BitNet model, and a pre-quantized BitNet variant specifically engineered for effortless fine-tuning, enabling users and developers to tailor these models to their specific applications and needs.
Available now in two sizes—1 billion and 3 billion parameters—each size comes in both base and instruction-tuned variants. Discover the Falcon-Edge series in our dedicated Hugging Face collection.
Introduction
Large Language Models (LLMs), by design, are inherently large and resource-intensive. As demand grows to deploy these models efficiently on edge devices, research into model compression has accelerated. Recent efforts, such as those by DeepSeek and Llama 4, explore training with reduced precision formats—down to FP8—to enhance deployment scalability. However, many state-of-the-art methods emphasize post-training quantization. In contrast to these approaches, BitNet introduces a fundamentally different paradigm: unlike reduced-precision training, which still relies on floating-point formats, and post-training quantization, which adjusts weights after full-precision training, BitNet operates with the lowest possible precision — ternary weights ({-1, 0, 1}) — directly during training, enabling an end-to-end ultra-efficient model design.
These ternary weights pave the way for a “matmul-free” LLM design that is notably faster and remarkably memory-efficient in practice. The main challenge of this approach is the need to pre-train BitNet models, which can be computationally demanding and costly for typical users.
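To make “ternary weights directly during training” concrete, here is a minimal sketch of ternary “fake quantization” with a straight-through estimator, mirroring the weight-quantization formula shown later in this post; it is an illustrative assumption of the general recipe, not the exact BitNet training code:

import torch

def ternary_fake_quant(w: torch.Tensor) -> torch.Tensor:
    # Scale by the mean absolute value, round to {-1, 0, 1}, then rescale.
    scale = 1.0 / w.abs().mean().clamp(min=1e-5)
    w_q = (w * scale).round().clamp(-1, 1) / scale
    # Straight-through estimator: the forward pass sees ternary weights, while
    # gradients flow unchanged to the latent full-precision weights.
    return w + (w_q - w).detach()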
Falcon-Edge, a series of powerful models
Leveraging the lessons learned from pre-training data strategies at our center, we pre-train our models on an internal data mixture of approximately 1.5 trillion tokens, using the classic WSD learning rate scheduler.
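For reference, here is a minimal sketch of a WSD (Warmup-Stable-Decay) schedule; the function name, warmup length, decay fraction, and linear decay shape below are illustrative placeholders, not our actual training hyperparameters:

def wsd_lr(step, total_steps, peak_lr, warmup_steps=1000, decay_fraction=0.1):
    # Warmup-Stable-Decay: linear warmup, long constant plateau, short final decay.
    decay_start = int(total_steps * (1.0 - decay_fraction))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps   # linear warmup
    if step < decay_start:
        return peak_lr                         # stable plateau
    progress = (step - decay_start) / max(1, total_steps - decay_start)
    return peak_lr * (1.0 - progress)          # decay to zero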
We evaluate our models (base and instruct versions) on the previous Hugging Face leaderboard v2 benchmark and report the normalized results below compared with other models of comparable size:
Additional results (leaderboard v1) comparing our instruction-tuned models with Microsoft’s latest BitNet model:
Falcon-Edge demonstrates on-par or better performance than models of comparable size on the leaderboard v2 tasks, showing that it is possible to train powerful BitNet models on targeted domains while remaining competitive on other tasks.
Falcon-Edge, a series of universal models
If we look closer at the formulation of the BitNet linear layer for inference (expressed as Python code):
import torch
import torch.nn as nn
import torch.nn.functional as F

def activation_norm_quant(x):
    # Per-token absmax quantization of the activations to int8 ([-128, 127]).
    scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp_(min=1e-5)
    y = (x * scale).round().clamp_(-128, 127)
    return y, scale

class BitLinear(nn.Linear):
    def post_quant_process(self, input, input_scale, weight_scale):
        # Rescale the int8 x ternary matmul output back to half precision.
        out = input / (input_scale * weight_scale)
        return out

    def forward(self, input):
        w = self.weight
        # Unpack the packed ternary weights ({-1, 0, 1}) stored in the checkpoint.
        w_quant = unpack_weights(w, dtype=self.dtype)
        # self.activation_quant behaves like activation_norm_quant above.
        input_quant, input_scale = self.activation_quant(input)
        y = F.linear(input_quant.to(self.dtype), w_quant)
        y = self.post_quant_process(y, self.weight_scale, input_scale)
        if self.bias is not None:
            y += self.bias.view(1, -1).expand_as(y)
        return y
The activation_norm_quant function quantizes the activations into int8 format; the activation is then recovered in half precision by dividing it by x_scale. Since the model has been trained with fake 8-bit activation quantization, we argue that the following approximation holds:
x_quant, x_scale = activation_norm_quant(x)
x ~= (x_quant / x_scale)
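As a quick sanity check, the approximation can be verified numerically with the activation_norm_quant function defined above (the tensor shape below is chosen arbitrarily for illustration):

import torch

x = torch.randn(4, 16)
x_quant, x_scale = activation_norm_quant(x)
x_approx = x_quant / x_scale
# The element-wise error is bounded by the rounding step, i.e. at most 0.5 / x_scale.
print((x - x_approx).abs().max())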
Therefore, instead of quantizing the model post-training, injecting the weight scale after quantizing the weights should yield a good enough “approximation” of the non-BitNet version of the model:
def _weight_quant(w):
    # Scale by the mean absolute value and round the weights to {-1, 0, 1}.
    scale = 1.0 / w.abs().mean().clamp_(min=1e-05)
    u = (w * scale).round().clamp_(-1, 1)
    return u, scale

# state_dict: the pre-quantized BitNet checkpoint;
# state_dict_quant: the resulting bfloat16 approximation.
for param_name, param_value in state_dict.items():
    if _is_param_to_not_quantize(param_name):
        continue
    # Quantize to ternary, then divide by the scale to approximate the original weights.
    param_value, param_scale = _weight_quant(param_value)
    param_value = param_value / param_scale
    state_dict_quant[param_name] = param_value
We confirm this by running end-to-end evaluations on the bfloat16 variants of our 1B and 3B base models; the results are reported below:
The bfloat16 counterparts of the models can be loaded directly via Hugging Face transformers by passing revision="bfloat16" to the from_pretrained function:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/Falcon-E-1B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    revision="bfloat16",
)
Falcon-Edge, a series of fine-tunable BitNet models
To the best of our knowledge, with the exception of Microsoft’s most recent release, previous BitNet releases only ship the final quantized model, making it usable for inference only. Similarly to Microsoft’s release, we propose to extend the accessibility of BitNet research and applications by releasing the pre-quantized weights. That way, users can either fine-tune the model on their target domain or do continued pre-training from the BitNet checkpoint, as long as the nn.Linear layers are replaced by BitnetLinear layers and the model is quantized back to the BitNet format after training. Since these weights are the pre-quantized ones, performing text generation without replacing the nn.Linear layers with BitnetLinear layers will produce gibberish output.
The pre-quantized weights can be downloaded via Hugging Face’s transformers library by setting the revision argument to "prequantized":
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/Falcon-E-1B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_id, revision="prequantized")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    revision="prequantized",
)
This way, we hope to help foster an ecosystem around the first powerful 1-bit fine-tunes built by the community. We provide the community with the tools to easily get started and fine-tune their own versions of powerful BitNet models, by packaging all the utility methods needed to fine-tune the pre-quantized weights in a Python package called onebitllms, which we cover in the next section.
Introducing onebitllms – a lightweight Python toolkit for training 1-bit LLMs
In this release, we also introduce onebitllms – a lightweight Python package that can be plugged into your favorite LLM fine-tuning tools in order to fine-tune any pre-quantized BitNet model. At the time of writing, onebitllms exposes the following main functionalities:
- A utility method to convert pre-quantized model checkpoints into the BitNet training format so that they can be passed to any of your favorite LLM fine-tuning frameworks. We currently tested our library with Hugging Face’s trl library.
- A utility method to quantize the trained checkpoint into the BitNet format as well as into the usual bfloat16 format.
- For more fine-grained control: bare BitnetLinear layers and Triton kernels that can be injected and used in your pre-training framework.
Currently, only full fine-tuning is supported through this framework. While the model sizes in this release are relatively small, supporting Parameter-Efficient Fine-Tuning (PEFT) methods for BitNet models remains an exciting and impactful open question for upcoming BitNet models.
To get started, simply install the package through pip or from source, and check out the examples/ folder inside the source code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer
from onebitllms import replace_linear_with_bitnet_linear, quantize_to_1bit

model_id = "tiiuae/Falcon-E-1B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_id, revision="prequantized")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    revision="prequantized",
)
# Swap nn.Linear layers for BitnetLinear layers before fine-tuning.
model = replace_linear_with_bitnet_linear(model)

trainer = SFTTrainer(
    model,
    ...
)
trainer.train()

# Once training is done, quantize the saved checkpoint back to the BitNet format.
quantize_to_1bit(output_directory)
With this package, we hope to accelerate research and development around LLMs in ternary format, and we hope to see many derivatives of Falcon-Edge and other future powerful BitNet models developed by the community.
Going further
We believe this release opens up multiple interesting directions – among all the possible follow-ups, we currently think that the following open questions would make BitNet models much more impactful in the near future:
- Writing more powerful GPU inference kernels for the BitNet architecture: leveraging the same core ideas behind bitnet.cpp, we hope this release will convince the research community to focus on developing powerful BitNet inference kernels for faster inference on GPUs – thus making them faster than native models on GPUs.
- Supporting PEFT methods for BitNet fine-tuning: this remains an unexplored research question that could open up multiple new possibilities for BitNet models.
- A more rigorous investigation of the universality of BitNet checkpoints: while we observe that simply injecting the weight scale yields a decent non-BitNet checkpoint, we believe more research can be done to reduce the performance gap between the BitNet checkpoint and its bfloat16 counterpart, ideally making the conversion fully degradation-free.
- Multi-modal BitNet models: we hope that these BitNet foundation models, together with the onebitllms package, can serve as a foundation for building the first multi-modal BitNet VLMs (Vision Language Models) and beyond.
- More optimized BitNet training kernels: to write our kernels, we took a two-stage approach, first computing the global maximum and then reusing it block-wise for normalization. This approach can be revisited to write more efficient kernels (see the sketch after this list). In our tests, we estimate the overhead of BitNet pre-training over non-BitNet pre-training to be around ~20%. We will soon release more extensive numbers on the training overhead introduced by BitNet.
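As a concrete illustration of the two-stage idea, here is a pure-PyTorch sketch of the structure (an assumption for exposition, not the actual Triton kernels shipped with onebitllms): the per-row maximum is computed in a first pass, then reused block by block, as a fused kernel would do:

import torch

def two_stage_activation_quant(x: torch.Tensor, block_size: int = 256):
    # Stage 1: a single reduction pass computes the per-row absolute maximum.
    scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp(min=1e-5)
    # Stage 2: each block is quantized independently, reusing the precomputed scale.
    blocks = x.split(block_size, dim=-1)
    y = torch.cat([(b * scale).round().clamp(-128, 127) for b in blocks], dim=-1)
    return y, scale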
Citation
If you find this work useful for your research, please consider citing it, as well as all the foundational work behind BitNet models:
@misc{tiionebitllms,
    title = {Falcon-E, a series of powerful, universal and fine-tunable 1.58bit language models.},
    author = {Falcon-LLM Team},
    month = {May},
    year = {2025},
    url = {https://falcon-lm.github.io/blog/falcon-edge},
}
More References
@misc{ma2025bitnetb1582b4ttechnical,
    title = {BitNet b1.58 2B4T Technical Report},
    author = {Shuming Ma and Hongyu Wang and Shaohan Huang and Xingxing Zhang and Ying Hu and Ting Song and Yan Xia and Furu Wei},
    year = {2025},
    eprint = {2504.12285},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL},
    url = {https://arxiv.org/abs/2504.12285},
}
@misc{wang2025bitnetcppefficientedgeinference,
    title = {Bitnet.cpp: Efficient Edge Inference for Ternary LLMs},
    author = {Jinheng Wang and Hansong Zhou and Ting Song and Shijie Cao and Yan Xia and Ting Cao and Jianyu Wei and Shuming Ma and Hongyu Wang and Furu Wei},
    year = {2025},
    eprint = {2502.11880},
    archivePrefix = {arXiv},
    primaryClass = {cs.LG},
    url = {https://arxiv.org/abs/2502.11880},
}
@misc{,
    title = {1.58-Bit LLM: A New Era of Extreme Quantization},
    author = {Mohamed Mekkouri and Marc Sun and Leandro von Werra and Thomas Wolf},
    year = {2024},
}
@misc{ma2024era1bitllmslarge,
    title = {The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits},
    author = {Shuming Ma and Hongyu Wang and Lingxiao Ma and Lei Wang and Wenhui Wang and Shaohan Huang and Li Dong and Ruiping Wang and Jilong Xue and Furu Wei},
    year = {2024},
    eprint = {2402.17764},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL},
    url = {https://arxiv.org/abs/2402.17764},
}
@misc{wang2023bitnetscaling1bittransformers,
    title = {BitNet: Scaling 1-bit Transformers for Large Language Models},
    author = {Hongyu Wang and Shuming Ma and Li Dong and Shaohan Huang and Huaijie Wang and Lingxiao Ma and Fan Yang and Ruiping Wang and Yi Wu and Furu Wei},
    year = {2023},
    eprint = {2310.11453},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL},
    url = {https://arxiv.org/abs/2310.11453},
}






