Microsoft’s Inference Framework Brings 1-Bit Large Language Models to Local Devices | By The Digital Insider

On October 17, 2024, Microsoft announced BitNet.cpp, an inference framework designed to run 1-bit quantized Large Language Models (LLMs). BitNet.cpp represents a significant advance in Gen AI, enabling 1-bit LLMs to run efficiently on standard CPUs without requiring expensive GPUs. This development democratizes access to LLMs, making them available on a wide range of devices and opening up new possibilities for on-device AI applications.

Understanding 1-bit Large Language Models

Large Language Models (LLMs) have traditionally required significant computational resources due to their use of high-precision floating-point numbers (typically FP16 or BF16) for model weights. This necessity has made deploying LLMs expensive and energy-intensive.

At their core, 1-bit LLMs use extreme quantization techniques to represent model weights using only three possible values: -1, 0, and 1, hence the term “1.58-bit” (as it requires slightly more than one bit to encode three states).

Ternary Weight System

The Concept

The 1-bit quantization in BitNet.cpp is a ternary weight system. BitNet operates with only three possible values for each parameter:

  • -1 (negative)
  • 0 (neutral)
  • 1 (positive)

This results in a storage requirement of around 1.58 bits per parameter, hence the name BitNet b1.58. This drastic reduction in parameter bit width leads to an impressive reduction in memory usage and computational complexity, as most floating-point multiplications are replaced with simple additions and subtractions.
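To see where the 1.58 figure comes from, here is a quick back-of-the-envelope check in Python (the byte-packing shown is purely illustrative and not necessarily how BitNet.cpp lays out weights in memory):

import math

# Encoding one of three states (-1, 0, +1) takes log2(3) bits of information
print(round(math.log2(3), 2))  # 1.58

# Illustrative packing: 3**5 = 243 <= 256, so five ternary weights
# can share a single byte, i.e. about 1.6 bits per weight in storage
print(3 ** 5)  # 243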

Mathematical Foundation

1-bit quantization involves transforming weights and activations into their ternary representation through the following steps:

1. Weight Binarization

Binarizing the weights involves centralizing them around the mean (α), resulting in a ternary representation. The transformation is mathematically expressed as:

Wf = Sign(W − α)

Where:

  • W is the original weight matrix.
  • α is the mean of the weights.
  • Sign(x) returns +1 if x > 0 and -1 otherwise.

2. Activation Quantization

Quantizing activations ensures that inputs are constrained to a specified bit width. Using absmax quantization, activations are scaled into the range of the target bit width and clipped:

x̂ = Clip(x × Qb/γ, −Qb + ε, Qb − ε)

Where:

  • Qb = 2^(b−1) is the maximum quantization level for b-bit width.
  • γ is the maximum absolute value of x (denoted as ∣∣x∣∣∞).
  • ε is a small number to prevent overflow during calculations.
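A minimal PyTorch sketch of this absmax activation quantization, written directly from the formula above (an illustration of the math rather than the framework's actual kernel):

import torch

def quantize_activations(x: torch.Tensor, b: int = 8, eps: float = 1e-5) -> torch.Tensor:
    # Qb = 2^(b-1): maximum quantization level for b-bit width
    Qb = 2 ** (b - 1)
    # gamma: maximum absolute value of x (||x||_inf), floored at eps to avoid division by zero
    gamma = x.abs().max().clamp(min=eps)
    # x_hat = Clip(x * Qb / gamma, -Qb + eps, Qb - eps)
    return torch.clamp(x * Qb / gamma, -Qb + eps, Qb - eps)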

3. BitLinear Operation

The BitLinear layer replaces traditional matrix multiplications with a simplified operation:

y = Wf × x̂ × (βγ / Qb)

Where:

  • β is a scaling factor used to minimize approximation errors.
  • γ scales the activations.
  • Qb is the maximum quantization level for the chosen bit width.
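Putting the three steps together, a simplified BitLinear forward pass can be sketched as follows. This is a conceptual illustration under the definitions above, with β taken as the mean absolute weight (one common choice, assumed here); the real BitNet.cpp kernels operate on packed low-bit data rather than float tensors:

import torch

def bitlinear(x: torch.Tensor, W: torch.Tensor, b: int = 8, eps: float = 1e-5) -> torch.Tensor:
    # 1. Weight binarization: center around the mean, then take the sign
    alpha = W.mean()
    W_f = torch.sign(W - alpha)
    beta = W.abs().mean()  # scaling factor (assumed: mean absolute weight)

    # 2. Activation quantization with absmax scaling
    Qb = 2 ** (b - 1)
    gamma = x.abs().max().clamp(min=eps)
    x_hat = torch.clamp(x * Qb / gamma, -Qb + eps, Qb - eps)

    # 3. BitLinear operation: y = Wf x_hat * (beta * gamma / Qb)
    return (x_hat @ W_f.t()) * (beta * gamma / Qb)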

This transformation enables efficient computations while preserving model performance.

Performance Implications

Memory Efficiency

The ternary weight system significantly reduces memory requirements:

  • Traditional LLMs: 16 bits per weight
  • BitNet.cpp: 1.58 bits per weight

This reduction translates to memory savings of approximately 90% compared to traditional 16-bit models (1.58 bits is roughly a tenth of 16 bits), allowing larger models to fit within the same hardware constraints.

Inference Speed and Energy Efficiency

[Figure: Inference speed and energy efficiency, Apple M2 Ultra]

[Figure: Inference speed and energy efficiency, Intel i7-13700H]

1. Inference Speed: Faster on Both CPUs

Inference speed is represented as the number of tokens processed per second. Here's a breakdown of the observations:

  • On Apple M2 Ultra: BitNet.cpp achieves up to 5.07x speedup for larger models (30B) compared to Llama.cpp, with a peak speed of 593.43 tokens per second for a 125M model, which is a 1.37x speedup. For larger models like the 3.8B and 7B, BitNet.cpp maintains a speed over 84.77 tokens per second, showing its efficiency across scales.
  • On Intel i7-13700H: BitNet.cpp achieves even more dramatic speed improvements. At the 7B model size, BitNet.cpp delivers an incredible 5.68x speedup compared to Llama.cpp. For smaller models like 125M, it processes 389.08 tokens per second, which is 2.37x faster than Llama.cpp.

2. Energy Efficiency: A Game-Changer for Edge Devices

The provided graphs also include energy cost comparisons, which show a significant reduction in energy consumption per token processed:

  • On Apple M2 Ultra: BitNet.cpp’s energy savings are substantial. For the 700M model, it consumes 55.4% less energy per token compared to Llama.cpp, dropping from 0.314 to 0.140. This trend continues for larger models, with the 70B model showing a 70.0% reduction in energy consumption.
  • On Intel i7-13700H: BitNet.cpp delivers 71.9% energy savings for the 700M model, with consumption dropping from 1.367 to 0.384. Although energy data for the 70B model in Llama.cpp is unavailable, BitNet.cpp remains efficient, with energy consumption at 17.33 for the 70B model.

3. Crossing the Human-Reading Speed Benchmark

One of the most interesting insights from these graphs is the reference to human reading speed, marked at 5-7 tokens per second. A red reference line shows that both implementations, and especially BitNet.cpp, comfortably surpass human reading speed for most model sizes:

  • On Apple M2 Ultra, BitNet.cpp surpasses human reading speed for all model sizes, with the lowest speed being 8.67 tokens per second for a 70B model.
  • On Intel i7-13700H, the 100B model still achieves 1.70 tokens per second, falling below the lower end of the human reading range, while all smaller models surpass this benchmark.

Training Considerations

Straight-Through Estimator (STE)

Since 1-bit quantization introduces non-differentiable functions, training involves a specialized technique known as the Straight-Through Estimator (STE). In this approach, the gradients flow unaltered through non-differentiable points. Here’s a simplified implementation in Python:


import torch
from torch.autograd import Function

class StraightThroughEstimator(Function):
    @staticmethod
    def forward(ctx, input):
        # Forward pass: binarize the input with the sign function
        return input.sign()

    @staticmethod
    def backward(ctx, grad_output):
        # Backward pass: let gradients flow through unchanged
        return grad_output
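For illustration, the estimator is applied to full-precision latent weights in the forward pass: the sign function is used going forward, while gradients still reach the underlying weights on the backward pass (a hypothetical usage snippet, not the actual BitNet training code):

w_latent = torch.randn(256, 256, requires_grad=True)  # full-precision latent weights
w_binary = StraightThroughEstimator.apply(w_latent)   # binarized forward, identity backward
w_binary.sum().backward()                             # gradients flow to w_latent unchanged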

Mixed Precision Training

To maintain stability during training, mixed precision is employed:

  • Weights and Activations: Quantized to 1-bit precision.
  • Gradients and Optimizer States: Stored in higher precision.
  • Latent Weights: Maintained in high precision to facilitate accurate updates during training.
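A minimal sketch of how this precision split might look in practice, reusing the StraightThroughEstimator defined earlier (the layer and hyperparameters here are hypothetical; the actual training recipe comes from the BitNet papers rather than BitNet.cpp itself):

import torch

class QuantizedLinear(torch.nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        # Latent weights kept in full precision so that small updates accumulate
        self.weight = torch.nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x):
        # Weights are binarized on the fly; gradients bypass the sign via the STE
        w_q = StraightThroughEstimator.apply(self.weight - self.weight.mean())
        return x @ w_q.t()

layer = QuantizedLinear(512, 512)
# Gradients and optimizer states remain in higher precision inside the optimizer
optimizer = torch.optim.AdamW(layer.parameters(), lr=1e-2)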

Large Learning Rate Strategy

A unique challenge with 1-bit models is that small updates might not affect the binarized weights. To mitigate this, the learning rate is increased, ensuring faster convergence and better optimization compared to traditional approaches.

Group Quantization and Normalization

BitNet.cpp introduces Group Quantization and Normalization to enhance model parallelism. Instead of calculating parameters for the entire weight matrix, BitNet divides weights and activations into multiple groups (G).

This grouping allows efficient parallel processing without additional inter-group communication, enabling large-scale model training and inference.
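As a rough illustration of the idea (not the framework's actual kernel layout), group-wise absmax quantization can be sketched by splitting the weight matrix into G row groups and computing an independent scale per group, so the groups can be processed in parallel without sharing statistics:

import torch

def group_absmax_quantize(W: torch.Tensor, num_groups: int, eps: float = 1e-5):
    # Split the rows of W into G groups; each group gets its own scale
    groups = W.chunk(num_groups, dim=0)
    scales, quantized = [], []
    for g in groups:
        scale = g.abs().max().clamp(min=eps)                          # per-group absmax scale
        quantized.append(torch.clamp(torch.round(g / scale), -1, 1))  # ternary values in {-1, 0, 1}
        scales.append(scale)
    return torch.cat(quantized, dim=0), torch.stack(scales)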

Implementation Notes and Optimizations

CPU Optimization

BitNet.cpp leverages several low-level optimizations to achieve peak CPU performance:

  • Vectorized Operations: Utilizes SIMD instructions to perform bit manipulations efficiently.
  • Cache-Friendly Memory Access: Structures data to minimize cache misses.
  • Parallel Processing: Distributes workload across multiple CPU cores effectively.

Here’s an example of a key function implementing quantization and inference in BitNet:

 
import torch

def bitlinear_forward(input, weight, scale):
    # Quantize the input using absmax quantization
    input_q = quantize(input)

    # Perform the binary/ternary matrix multiplication
    # (binary_matmul is assumed to be provided by the framework)
    output = binary_matmul(input_q, weight)

    # Scale the output to match the original precision
    return output * scale

def quantize(x):
    # Perform absmax quantization: scale by the largest absolute value
    scale = torch.max(torch.abs(x))
    return torch.clamp(x / scale, -1, 1) * scale

Supported Models

The current release of BitNet.cpp supports the following 1-bit LLMs available on Hugging Face:

  • bitnet_b1_58-large (0.7B parameters)
  • bitnet_b1_58-3B (3.3B parameters)
  • Llama3-8B-1.58-100B-tokens (8.0B parameters)

These models are publicly available to demonstrate the framework’s inference capabilities. Although not officially trained or released by Microsoft, they illustrate the framework’s versatility.

Installation Guide

To get started with BitNet.cpp, follow the steps below:

Prerequisites

  1. Python >= 3.9
  2. CMake >= 3.22
  3. Clang >= 18
  4. Conda (highly recommended)

For Windows users, Visual Studio should be installed with the following components enabled:

  • Desktop Development with C++
  • C++-CMake Tools for Windows
  • Git for Windows
  • C++-Clang Compiler for Windows
  • MS-Build Support for LLVM Toolset (Clang)

For Debian/Ubuntu users, an automatic installation script is available:
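The script in question is the standard LLVM apt installer used to obtain a recent Clang; the command below follows the BitNet README at the time of writing, so double-check the repository's current instructions before running it:

bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)"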

Step-by-Step Installation

  1. Clone the Repository:
  2. Install Dependencies:
  3. Build and Prepare the Project: You can download a model directly from Hugging Face and convert it to a quantized format:

    Alternatively, manually download and convert the model:
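For reference, here is a representative command sequence for the steps above, following the conventions of the BitNet GitHub README at release time (the model name and the i2_s quantization type are examples; exact flags may have changed, so consult the current README):

# 1. Clone the repository (with submodules)
git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet

# 2. Install dependencies in a fresh conda environment
conda create -n bitnet-cpp python=3.9
conda activate bitnet-cpp
pip install -r requirements.txt

# 3. Build and prepare the project: download a model from Hugging Face
#    and convert it to a quantized format
python setup_env.py --hf-repo HF1BitLLM/Llama3-8B-1.58-100B-tokens -q i2_s

# Alternative: manually download the model, then point the setup script at it
huggingface-cli download HF1BitLLM/Llama3-8B-1.58-100B-tokens --local-dir models/Llama3-8B-1.58-100B-tokens
python setup_env.py -md models/Llama3-8B-1.58-100B-tokens -q i2_s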

Running Inference with BitNet.cpp

To run inference using the framework, use the following command:
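An illustrative invocation (the script name and flags follow the BitNet README; the model path and prompt are placeholders to adapt to your setup):

python run_inference.py -m models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf -p "Your prompt here" -n 50 -temp 0.7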

Explanation:

  • -m specifies the model file path.
  • -p defines the prompt text.
  • -n sets the number of tokens to predict.
  • -temp adjusts the sampling randomness (temperature) during inference.

Output Example

Technical Details of BitNet.cpp

BitLinear Layer

BitNet.cpp implements a modified Transformer architecture, substituting standard matrix multiplications with BitLinear operations. This approach centralizes weights to zero before quantization and scales them to reduce approximation errors. The key transformation function looks like this:


import numpy as np

# Binarization function for 1-bit weights
def binarize_weights(W):
    # Centralize weights around their mean, then take the sign
    alpha = W.mean()
    W_binarized = np.sign(W - alpha)
    return W_binarized

The combination of centralized weights and scaling ensures that the quantization error remains minimal, thus preserving performance.

Industry Impact

BitNet.cpp could have far-reaching implications for the deployment of LLMs:

  • Accessibility: Allows LLMs to run on standard devices, democratizing access to powerful AI.
  • Cost-Efficiency: Reduces the need for expensive GPUs, lowering the barrier for adoption.
  • Energy Efficiency: Saves energy by leveraging standard CPU-based inference.
  • Innovation: Opens new possibilities for on-device AI, like real-time language translation, voice assistants, and privacy-focused applications without cloud dependencies.

Challenges and Future Directions

While 1-bit LLMs hold promise, several challenges remain. These include the development of robust 1-bit models for diverse tasks, optimizing hardware for 1-bit computation, and encouraging developers to adopt this new paradigm. Additionally, exploring 1-bit quantization for computer vision or audio tasks represents an exciting future direction.

Conclusion

Microsoft’s launch of BitNet.cpp is a significant advancement. By enabling efficient 1-bit inference on standard CPUs, BitNet.cpp improves the accessibility and sustainability of AI. This framework sets the stage for more portable and cost-effective LLMs, pushing the boundaries of what’s possible with on-device AI.


Published on The Digital Insider at https://is.gd/s1Jvk6.
