Microsoft’s Inference Framework Brings 1-Bit Large Language Models to Local Devices

On October 17, 2024, Microsoft announced BitNet.cpp, an inference framework designed to run 1-bit quantized Large Language Models (LLMs). BitNet.cpp is a significant step forward in generative AI, enabling 1-bit LLMs to run efficiently on standard CPUs without requiring expensive GPUs. This development democratizes access to LLMs, making them available on a wide range of devices and opening up new possibilities for on-device AI applications.

Understanding 1-bit Large Language Models

Large Language Models (LLMs) have traditionally required significant computational resources because of their use of high-precision floating-point numbers (typically FP16 or BF16) for model weights. This necessity has made deploying LLMs expensive and energy-intensive.

At their core, 1-bit LLMs use extreme quantization techniques to represent model weights using only three possible values: -1, 0, and 1. This is why they are often described as "1.58-bit" models: encoding three states requires log2(3) ≈ 1.58 bits per weight.

Ternary Weight System

The Concept

The 1-bit quantization in BitNet.cpp is a ternary weight system. BitNet operates with only three possible values for every parameter:

  • -1 (negative)
  • 0 (neutral)
  • 1 (positive)

This results in a storage requirement of around 1.58 bits per parameter, hence the name BitNet b1.58. This drastic reduction in parameter bit width yields a substantial reduction in memory usage and computational complexity, as most floating-point multiplications are replaced with simple additions and subtractions.
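
To see why multiplications turn into additions and subtractions, consider a single dot product with ternary weights (a minimal illustrative sketch, not BitNet.cpp code):

# Dot product of activations with ternary weights {-1, 0, 1}:
# every term is either +x, -x, or dropped, so no multiplications are needed
x = [0.5, -1.2, 3.0, 0.7]
w = [1, 0, -1, 1]

result = 0.0
for xi, wi in zip(x, w):
    if wi == 1:
        result += xi      # weight +1: add the activation
    elif wi == -1:
        result -= xi      # weight -1: subtract the activation
    # weight 0: skip the term entirely

print(result)  # 0.5 - 3.0 + 0.7 = -1.8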

Mathematical Foundation

1-bit quantization involves transforming weights and activations into their ternary representation through the following steps:

1. Weight Binarization

Binarizing the weights involves centering them around the mean (α) and taking the sign, leading to the binarized representation. The transformation is mathematically expressed as (a small numerical sketch follows the definitions below):

W_f = Sign(W − α)

Where:

  • W is the original weight matrix.
  • α is the mean of the weights.
  • Sign(x) returns +1 if x > 0 and -1 otherwise.
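
A minimal numerical sketch of this step, using small illustrative weight values rather than anything from a real model:

import numpy as np

# Illustrative 2x3 weight matrix (hypothetical values)
W = np.array([[0.8, -0.3, 0.1],
              [-0.6, 0.4, -0.2]])

# Center the weights around their mean, then take the sign
alpha = W.mean()            # mean of all weights, ~0.033 here
W_f = np.sign(W - alpha)    # every entry becomes +1 or -1

print(W_f)  # [[ 1. -1.  1.]
            #  [-1.  1. -1.]]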

2. Activation Quantization

Quantizing activations ensures that inputs are constrained to a specified bit width. In the BitNet formulation, this is an absmax-style quantization (a code sketch of this step follows the list below):

x̂ = Clip(x × Q_b / γ, −Q_b + ε, Q_b − ε)

Where:

  • Q_b = 2^(b−1) is the maximum quantization level for a b-bit width.
  • γ is the maximum absolute value of x (denoted as ||x||∞).
  • ε is a small number that prevents overflow during the clipping.
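
A minimal sketch of this activation quantization in PyTorch; the function and variable names here are illustrative and not taken from BitNet.cpp:

import torch

def quantize_activations(x, b=8, eps=1e-5):
    # Maximum quantization level for b-bit width: Q_b = 2^(b-1)
    Qb = 2 ** (b - 1)
    # gamma: maximum absolute value of the activations (||x||_inf)
    gamma = x.abs().max()
    # Scale into the b-bit range and clip, leaving a small eps margin
    # (in practice one would also guard against gamma being zero)
    return torch.clamp(x * Qb / gamma, -Qb + eps, Qb - eps)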

3. BitLinear Operation

The BitLinear layer replaces traditional matrix multiplications with a simplified operation:

y = W_f × x̂ × (βγ / Q_b)

Where:

  • β is a scaling factor used to reduce approximation errors.
  • γ scales the activations.
  • Q_b is the quantization factor.

This transformation enables efficient computations while preserving model performance.
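
Putting the three steps together, a simplified BitLinear forward pass might be sketched as follows; this is an illustrative PyTorch reconstruction of the formulas above, not the actual BitNet.cpp kernel:

import torch

def bitlinear(x, W, b=8, eps=1e-5):
    # 1. Weight binarization: center around the mean, take the sign
    alpha = W.mean()
    W_f = torch.sign(W - alpha)
    beta = W.abs().mean()          # scaling factor to reduce approximation error

    # 2. Activation quantization to b bits
    Qb = 2 ** (b - 1)
    gamma = x.abs().max()
    x_q = torch.clamp(x * Qb / gamma, -Qb + eps, Qb - eps)

    # 3. BitLinear: matrix multiply with binarized weights, then rescale
    return (x_q @ W_f.t()) * (beta * gamma / Qb)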

Performance Implications

Memory Efficiency

The ternary weight system significantly reduces memory requirements:

  • Traditional LLMs: 16 bits per weight
  • BitNet.cpp: 1.58 bits per weight

This reduction translates to memory savings of roughly 90% compared with traditional 16-bit models, allowing larger models to fit within the same hardware constraints.
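
As a rough back-of-the-envelope illustration (the 7B parameter count is chosen only for the example):

# Approximate weight memory for a hypothetical 7B-parameter model
params = 7e9

fp16_gb = params * 16   / 8 / 1e9   # ~14.0 GB at 16 bits per weight
b158_gb = params * 1.58 / 8 / 1e9   # ~1.4 GB at 1.58 bits per weight

print(f"FP16: {fp16_gb:.1f} GB, 1.58-bit: {b158_gb:.1f} GB")
print(f"Savings: {100 * (1 - b158_gb / fp16_gb):.0f}%")   # ~90%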

[Figure: Inference speed and energy efficiency comparison, Apple M2 Ultra]

[Figure: Inference speed and energy efficiency comparison, Intel i7-13700H]

1. Inference Speed: Faster on Both CPUs

Inference speed is measured as the number of tokens processed per second. Here’s a breakdown of the observations:

  • On Apple M2 Ultra: BitNet.cpp achieves up to a 5.07x speedup for larger models (30B) compared with Llama.cpp, with a peak speed of 593.43 tokens per second for a 125M model, which is a 1.37x speedup. For larger models such as the 3.8B and 7B, BitNet.cpp maintains a speed above 84.77 tokens per second, showing its efficiency across scales.
  • On Intel i7-13700H: BitNet.cpp achieves even more dramatic speed improvements. At the 7B model size, BitNet.cpp delivers an impressive 5.68x speedup compared with Llama.cpp. For smaller models like the 125M, it processes 389.08 tokens per second, which is 2.37x faster than Llama.cpp.

2. Energy Efficiency: A Game-Changer for Edge Devices

The graphs also include energy cost comparisons, which show a significant reduction in energy consumption per token processed:

  • On Apple M2 Ultra: BitNet.cpp’s energy savings are substantial. For the 700M model, it consumes 55.4% less energy per token compared with Llama.cpp, dropping from 0.314 to 0.140. This trend continues for larger models, with the 70B model showing a 70.0% reduction in energy consumption.
  • On Intel i7-13700H: BitNet.cpp delivers 71.9% energy savings for the 700M model, with consumption dropping from 1.367 to 0.384. Although energy data for the 70B model in Llama.cpp is unavailable, BitNet.cpp remains efficient, with energy consumption at 17.33 for the 70B model.

3. Crossing the Human-Reading Speed Benchmark

One of the most interesting insights from these graphs is the reference to human reading speed, marked at 5-7 tokens per second. This reference line shows that both implementations, and especially BitNet.cpp, comfortably surpass human reading speed even for the largest models:

  • On Apple M2 Ultra, BitNet.cpp surpasses human reading speed for all model sizes, with the lowest speed being 8.67 tokens per second for a 70B model.
  • On Intel i7-13700H, the 100B model still achieves 1.70 tokens per second, nearly reaching the lower end of human reading speed, while all smaller models surpass this benchmark.

Training Considerations

Straight-Through Estimator (STE)

Since 1-bit quantization introduces non-differentiable functions, training relies on a specialized technique known as the Straight-Through Estimator (STE). In this approach, gradients flow unaltered through the non-differentiable points. Here’s a simplified implementation in PyTorch:

import torch
from torch.autograd import Function

class StraightThroughEstimator(Function):
    @staticmethod
    def forward(ctx, input):
        # Forward pass: non-differentiable sign quantization
        return input.sign()

    @staticmethod
    def backward(ctx, grad_output):
        # Backward pass: let gradients pass through unchanged
        return grad_output
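
In use, the estimator is applied via Function.apply, so the sign operation behaves like an identity for gradient purposes. A minimal usage sketch, continuing from the class above:

w = torch.randn(4, 4, requires_grad=True)

# Quantize in the forward pass; gradients ignore the sign() in the backward pass
w_q = StraightThroughEstimator.apply(w)

loss = w_q.sum()
loss.backward()
print(w.grad)  # all ones, passed straight through the sign()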

Mixed Precision Training

To maintain stability during training, mixed precision is employed (a minimal sketch follows the list below):

  • Weights and Activations: Quantized to 1-bit precision.
  • Gradients and Optimizer States: Stored in higher precision.
  • Latent Weights: Maintained in high precision to facilitate accurate updates during training.
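
The following sketch shows how high-precision latent weights might coexist with quantized weights during training, reusing the StraightThroughEstimator defined above; it is illustrative only and not the actual BitNet training loop:

import torch

# Latent weights kept in full precision for optimizer updates
latent_w = torch.randn(16, 16, requires_grad=True)
optimizer = torch.optim.AdamW([latent_w], lr=1e-3)

for step in range(10):
    # Forward pass uses quantized weights via the straight-through estimator
    w_q = StraightThroughEstimator.apply(latent_w)
    x = torch.randn(8, 16)
    loss = (x @ w_q).pow(2).mean()

    # Gradients and optimizer state stay in high precision,
    # updating the latent weights rather than the quantized copies
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()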

Large Learning Rate Strategy

A unique challenge with 1-bit models is that small updates may not change the binarized weights. To mitigate this, the learning rate is increased, ensuring faster convergence and better optimization compared with traditional approaches.

Group Quantization and Normalization

BitNet.cpp introduces Group Quantization and Normalization to enhance model parallelism. Instead of calculating parameters for the entire weight matrix, BitNet divides weights and activations into multiple groups (G).

This grouping allows efficient parallel processing without additional inter-group communication, enabling large-scale model training and inference.
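
A rough sketch of group-wise weight quantization; the group count and the dimension along which the matrix is split are illustrative, and the actual BitNet.cpp grouping scheme may differ:

import torch

def group_binarize(W, num_groups=4):
    # Split the weight matrix into G groups along the output dimension
    groups = W.chunk(num_groups, dim=0)
    out = []
    for g in groups:
        # Each group computes its own mean (alpha) and scale (beta),
        # so no statistics are shared across groups
        alpha = g.mean()
        beta = g.abs().mean()
        out.append(torch.sign(g - alpha) * beta)
    return torch.cat(out, dim=0)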

Implementation Notes and Optimizations

CPU Optimization

BitNet.cpp leverages several low-level optimizations to achieve peak CPU performance:

  • Vectorized Operations: Utilizes SIMD instructions to perform bit manipulations efficiently.
  • Cache-Friendly Memory Access: Structures data to reduce cache misses.
  • Parallel Processing: Distributes workload across multiple CPU cores effectively.

Here’s an example of a key function implementing quantization and inference in BitNet:

 
import torch

def quantize(x):
    # Absmax quantization: scale into [-1, 1], then restore the original scale
    scale = torch.max(torch.abs(x))
    return torch.clamp(x / scale, -1, 1) * scale

def bitlinear_forward(input, weight, scale):
    # Quantize the input using absmax quantization
    input_q = quantize(input)

    # Perform the binary matrix multiplication
    # (binary_matmul stands in for the low-level optimized kernel)
    output = binary_matmul(input_q, weight)

    # Scale the output to match the original precision
    return output * scale

Supported Models

The current release of BitNet.cpp supports the following 1-bit LLMs available on Hugging Face:

  • bitnet_b1_58-large (0.7B parameters)
  • bitnet_b1_58-3B (3.3B parameters)
  • Llama3-8B-1.58-100B-tokens (8.0B parameters)

These models are publicly available to demonstrate the framework’s inference capabilities. Although not officially trained or released by Microsoft, they illustrate the framework’s versatility.

Installation Guide

To get started with BitNet.cpp, follow the steps below:

Prerequisites

  1. Python >= 3.9
  2. CMake >= 3.22
  3. Clang >= 18
  4. Conda (highly recommended)

For Windows users, Visual Studio should be installed with the following components enabled:

  • Desktop Development with C++
  • C++-CMake Tools for Windows
  • Git for Windows
  • C++-Clang Compiler for Windows
  • MS-Build Support for LLVM Toolset (Clang)

For Debian/Ubuntu users, an automated installation script is available.

Step-by-Step Installation

  1. Clone the Repository.
  2. Install Dependencies.
  3. Build and Prepare the Project: You can download a model directly from Hugging Face and convert it to a quantized format, or manually download and convert the model (see the command sketch after this list).
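
As a rough guide, the commands for these steps typically look like the following; they are based on the BitNet repository README, and the exact flags, model names, and paths should be verified against the current repository:

# 1. Clone the repository (with submodules)
git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet

# 2. Install dependencies (a Conda environment is recommended)
conda create -n bitnet-cpp python=3.9
conda activate bitnet-cpp
pip install -r requirements.txt

# 3a. Download a model from Hugging Face and convert it to a quantized format
python setup_env.py --hf-repo HF1BitLLM/Llama3-8B-1.58-100B-tokens -q i2_s

# 3b. Alternatively, manually download the model, then convert it
huggingface-cli download HF1BitLLM/Llama3-8B-1.58-100B-tokens --local-dir models/Llama3-8B-1.58-100B-tokens
python setup_env.py -md models/Llama3-8B-1.58-100B-tokens -q i2_s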

Running Inference with BitNet.cpp

To run inference using the framework, use the following command:
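
A representative invocation might look like this; the run_inference.py entry point and the model path follow the BitNet repository’s conventions and should be adjusted to your local setup:

python run_inference.py \
  -m models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf \
  -p "Once upon a time" \
  -n 64 \
  -temp 0.7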

Explanation:

  • -m specifies the model file path.
  • -p defines the prompt text.
  • -n sets the number of tokens to predict.
  • -temp adjusts the sampling randomness (temperature) during inference.

Output Example

Technical Details of BitNet.cpp

BitLinear Layer

BitNet.cpp implements a modified Transformer architecture, substituting standard matrix multiplications with BitLinear operations. This approach centers weights around zero before quantization and scales them to reduce approximation errors. The key transformation function looks like this:

import numpy as np

# Binarization function for 1-bit weights
def binarize_weights(W):
    # Center the weights around their mean, then take the sign
    alpha = W.mean()
    W_binarized = np.sign(W - alpha)
    return W_binarized

The combination of centered weights and scaling keeps the quantization error minimal, thus preserving performance.

Industry Impact

BitNet.cpp could have far-reaching implications for the deployment of LLMs:

  • Accessibility: Allows LLMs to run on standard devices, democratizing access to powerful AI.
  • Cost-Efficiency: Reduces the need for expensive GPUs, lowering the barrier to adoption.
  • Energy Efficiency: Saves energy by leveraging standard CPU-based inference.
  • Innovation: Opens new possibilities for on-device AI, such as real-time language translation, voice assistants, and privacy-focused applications without cloud dependencies.

Challenges and Future Directions

While 1-bit LLMs hold promise, several challenges remain. These include developing robust 1-bit models for diverse tasks, optimizing hardware for 1-bit computation, and encouraging developers to adopt this new paradigm. Additionally, exploring 1-bit quantization for computer vision or audio tasks represents an exciting future direction.

Conclusion

Microsoft’s launch of BitNet.cpp is a significant advancement. By enabling efficient 1-bit inference on standard CPUs, BitNet.cpp improves the accessibility and sustainability of AI. This framework sets the stage for more portable and cost-effective LLMs, pushing the boundaries of what’s possible with on-device AI.
