Introduction:
Standard large language models are built on a foundation of high-precision math. They rely on float16 or bfloat16 weights, giving the network roughly 65,000 distinct values to represent each parameter. This works because it leaves enormous numerical headroom: the format can represent tiny calibrations, large corrections, and everything in between.
BitNet b1.58 throws away all of this expressiveness by compressing every decision a neural network can make into just three choices: push left, push right, or do nothing. It uses ternary weights with exactly three possible values: -1, 0, and +1. That equates to 1.58 bits of information per weight (log2 3 ≈ 1.58) and roughly a 10x compression in size.
How does a model with only three possible weight values not collapse?
When I started scanning the actual weight tensors of the model, what I found wasn’t a near-miss or a clever hack. It was four distinct mathematical mechanisms operating in concert, each solving a different piece of the survival problem. This is a forensic account of what those mechanisms are, what the data revealed about them, and what the model independently invented for itself that never appeared in the original paper.
Mechanism 1: Absmean Quantization
The first layer of the defense is how the model chooses to map its high-precision weights into ternary values. The problem: if you just round every weight to the nearest integer, your choice of rounding boundary determines which weights survive and which get zeroed out. Small shifts in that boundary can wipe out important features or preserve noise. BitNet solves this with absmean quantization, a formula that actually got popularized after the release of BitNet in 2023: divide each weight matrix by the mean of its absolute values, then round and clip to {-1, 0, +1}.
The elegance of the absmean scale is that it dynamically centers the distribution before rounding. This normalizes the distribution so that the rounding boundary sits at the natural center of what this particular layer’s weights actually look like.
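As a sketch, the absmean recipe takes only a few lines of NumPy. The Gaussian toy weights below are an assumption for illustration, not the actual checkpoint:

```python
import numpy as np

def absmean_quantize(w: np.ndarray):
    """Ternarize a weight matrix with the absmean scheme:
    scale by the mean absolute value, round, clip to {-1, 0, +1}."""
    gamma = np.abs(w).mean() + 1e-8                      # absmean scale
    w_q = np.clip(np.round(w / gamma), -1, 1).astype(np.int8)
    return w_q, gamma

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(256, 256))               # toy Gaussian layer
w_q, gamma = absmean_quantize(w)
zero_ratio = (w_q == 0).mean()
# ~31% of Gaussian weights land on 0; the heavier-tailed real weight
# distributions push the ratio toward the 42-51% seen in the scan
```

The division by gamma is what recenters the distribution, which is exactly why the rounding boundary adapts to each layer.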
Our weight scan of the q_proj layers confirmed this across the full model depth:
The zero ratio (the proportion of weights that become exactly 0 after quantization) sits between 42% and 51% across all layers. Roughly half of all weights vanish. This sounds alarming, but it’s the mechanism working correctly: since a zero weight contributes nothing to the matrix multiply (it can be skipped entirely), this translates directly into speed.
The outlier data is equally revealing. In layers 10–20 (the middle of the network, where representational activity is highest) we see the most extreme weights:
Layer 10: 0.01946% of weights exceed absolute value 10
Layer 20: 0.02797% exceed absolute value 10
These outliers matter because the absmean scale is set by the bulk of the distribution. A small number of extreme weights get heavily clipped, while the majority is quantized with appropriate resolution. The model is implicitly deciding that the tail of its weight distribution is less important than faithful representation of the core.
Mechanism 2: Weight Scale Tensors — Restoring What Quantization Takes Away
Quantizing to {-1, 0, +1} preserves direction and sign, but it obliterates magnitude. A weight that was 0.4 and a weight that was 0.847 can both become +1 after ternary quantization: they look identical, yet they represented different scales of influence. Without compensation, the outputs of every ternary layer would be uniformly mis-scaled.
To fix this, every linear layer is paired with a companion bfloat16 weight_scale tensor. At inference time, the model executes a restoration:
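A minimal sketch of that two-step execution, with toy tensors standing in for the checkpoint (the names mirror the tensors discussed here, but the shapes and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
w_float = rng.normal(0.0, 0.02, size=(64, 64))    # "true" float weights
gamma = np.abs(w_float).mean()
w_ternary = np.clip(np.round(w_float / gamma), -1, 1).astype(np.int8)
weight_scale = np.float32(gamma)                  # companion scale tensor

x = rng.normal(size=64).astype(np.float32)
# step 1: cheap ternary matmul (real kernels quantize activations too;
# x stays float here for simplicity)
y_int = w_ternary.astype(np.int32) @ x
# step 2: one scalar multiply restores the layer's magnitude
y = y_int * weight_scale

y_ref = w_float @ x                               # full-precision reference
```

Even in this crude form, y tracks the full-precision output closely in direction; the real checkpoint’s scales do better because they are trained end-to-end rather than derived from gamma.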
The integer matrix multiply runs fast (no floating-point required), and then a single scalar multiplication restores the correct magnitude for that layer. Our scan of the inference checkpoint found 328,775,890 bfloat16 parameters dedicated to scales, norms, and embeddings, alongside 521,011,200 uint8 parameters storing the packed ternary weights. The total model is 1.178 GB, compared to roughly 3 GB for an equivalent fp16 model.
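At 2 bits per weight, four ternary values fit in each uint8. The packing below is an assumed layout for illustration (the checkpoint’s actual bit order may differ):

```python
import numpy as np

def pack_ternary(w_q: np.ndarray) -> np.ndarray:
    """Pack ternary {-1, 0, +1} values four-per-byte (assumed layout).
    Maps -1 -> 0, 0 -> 1, +1 -> 2 so each weight fits in two bits."""
    flat = (w_q.astype(np.int8) + 1).astype(np.uint8).ravel()
    assert flat.size % 4 == 0, "pad to a multiple of four first"
    a, b, c, d = flat[0::4], flat[1::4], flat[2::4], flat[3::4]
    return a | (b << 2) | (c << 4) | (d << 6)

def unpack_ternary(packed: np.ndarray) -> np.ndarray:
    """Invert pack_ternary back to int8 values in {-1, 0, +1}."""
    parts = [(packed >> s) & 0b11 for s in (0, 2, 4, 6)]
    return np.stack(parts, axis=1).ravel().astype(np.int8) - 1

w_q = np.array([-1, 0, 1, 1, 0, 0, -1, 1], dtype=np.int8)
packed = pack_ternary(w_q)          # 8 weights -> 2 bytes
restored = unpack_ternary(packed)
```

Under a four-per-byte layout like this, the 521,011,200 packed bytes reported above would hold roughly 2.08 billion ternary weights.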
And these scales are actively working to inflate the signal: the values range from a minimum of 0.746 to a maximum of 4.594, with a mean of 2.331. The breakdown by module type is more informative:
Attention layers: average scale 2.4391
MLP layers: average scale 2.1869
Interestingly, the data showed that attention layers require significantly more magnitude restoration than the MLP layers. Because these scales are trained end-to-end, the model isn’t just learning what the ternary weights should be; it is simultaneously learning exactly how much to rescale the outputs to compensate for its own compression.
Mechanism 3: Sub_norm as Adaptive Gain
This is where the forensic analysis produced something unexpected. BitNet mostly follows the standard LLaMA architecture, but this is where it deviates.
Standard LLaMA-architecture models have two types of normalization: the pre-attention RMSNorm and the pre-FFN RMSNorm. When we mapped the full shape inventory of BitNet b1.58’s safetensors, we found two additional normalization tensors that don’t appear in any standard LLaMA variant:
ffn_sub_norm: shape [6912] — applied after the FFN ternary multiply
attn_sub_norm: shape [2560] — applied after the attention ternary multiply
When we audited the bf16 master weights layer by layer, what we found was not static normalization behavior. We found a systematic, monotonically increasing gain structure:
Both the attention and FFN streams cross a “high gain” criticality threshold simultaneously. Before layer 14, sub_norm weights cluster near 1.0 (effectively no rescaling). After layer 14, they begin escalating in lockstep, reaching means of 9.32 (FFN) and 6.14 (attention) by the final layer.
Mathematics dictates that quantization error compounds across layers. When layer N quantizes its weights, it introduces a small error: the difference between the true float weight and its ternary approximation. The output error of layer N becomes the input error of layer N+1. By the end of the network, the accumulated mathematical noise requires roughly seven times more gain than at the beginning to extract the true signal.
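A toy simulation makes the compounding visible. The stack of random Gaussian layers below is an assumption, not the real model, and it applies the raw absmean scale with no learned correction, so the quantized stream drifts steadily away from the full-precision one:

```python
import numpy as np

rng = np.random.default_rng(2)
depth, dim = 30, 256
x_ref = rng.normal(size=dim)
x_q = x_ref.copy()

errors = []
for _ in range(depth):
    w = rng.normal(0.0, 1.0 / np.sqrt(dim), size=(dim, dim))
    gamma = np.abs(w).mean()
    w_tern = np.clip(np.round(w / gamma), -1, 1)
    x_ref = w @ x_ref                 # exact path
    x_q = (w_tern * gamma) @ x_q      # quantized path, no learned rescaling
    errors.append(np.linalg.norm(x_q - x_ref) / np.linalg.norm(x_ref))
# errors[] grows with depth: later layers see far noisier inputs,
# which is the gap the learned gains have to close
```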
The sub_norm weights display the exact pattern you would expect if the model had learned to compensate for compounding quantization error: The deeper you go, the larger that correction becomes. By the final layer, we have per-channel specialization and the gain is roughly 7–9× what it was at the start. Whether this is the mechanism training actually optimized for cannot be confirmed from weight inspection alone.
Layer 29’s attention sub_norm makes this concrete. Its average gain value is 6.14, while the variance across individual channels is 48.35. That’s an enormous spread. Compare that to layer 0, where every channel gets almost the same gain (variance near zero). By layer 29, different channels inside the model have accumulated different amounts of error along the way, and the model corrects each one individually, applying a unique gain value per channel rather than one blanket fix for the whole layer.
What makes this remarkable is that applying a separate scale to each channel is a standard technique in quantization engineering called per-channel quantization. Researchers use it deliberately when they want to minimize rounding error.
When we treat each sub_norm weight as a per-channel scale and measure how well it reconstructs the original float weights, the error is consistently lower than if we treat the whole layer as having a single shared scale.
(The gap narrows at layer 29 not because the sub_norm stops working, but because by that point the variance is so high that individual channels have diverged dramatically from any shared scale.)
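The per-channel advantage is easy to demonstrate on synthetic weights. The spread of channel magnitudes below is assumed, and sign quantization stands in for the full ternary scheme:

```python
import numpy as np

rng = np.random.default_rng(3)
channels, fan_in = 64, 512
# give each output channel its own magnitude, as the deep layers appear to do
channel_mags = rng.uniform(0.5, 8.0, size=(channels, 1))
w = rng.normal(size=(channels, fan_in)) * channel_mags
w_sign = np.sign(w)                     # crude stand-in for ternary weights

# one shared scale for the whole layer (per-tensor)
scale_tensor = np.abs(w).mean()
err_tensor = np.mean((w - scale_tensor * w_sign) ** 2)

# a separate scale per output channel (what sub_norm effectively provides)
scale_channel = np.abs(w).mean(axis=1, keepdims=True)
err_channel = np.mean((w - scale_channel * w_sign) ** 2)
# err_channel comes out lower: each channel's scale matches its own magnitude
```

The mean absolute value per row happens to be the least-squares-optimal scale for sign quantization, so the per-channel variant can never do worse than the shared scale, and does strictly better whenever channel magnitudes diverge.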
The verdict from our mechanistic interpretability pass: the sub_norm layers are the scales. They are computing quantization correction on-the-fly, dynamically, per channel, with learned values that reflect the full depth of accumulated error the signal has experienced by that point.
Mechanism 4: RoPE Theta as Positional Headroom
To understand BitNet’s positional math, we first need to understand RoPE (Rotary Position Embedding). RoPE encodes position by rotating query and key vectors, before attention is calculated, at frequencies that span a geometric range. This makes the dot product naturally encode the relative distance between tokens. The rotations happen at different speeds: fast, high-frequency bands capture local context (adjacent words), while slow, low-frequency bands track long-range structure. The rope_theta parameter controls this frequency spread. A larger theta stretches the slow bands into even longer wavelengths, dramatically increasing the model’s positional resolution over long distances.
This is where the fourth mechanism operates: positional encoding. Our analysis shows that BitNet b1.58 uses rope_theta = 500000.0, while standard LLaMA 2 uses rope_theta = 10000. That's a 50× increase in the base frequency parameter.
With theta=10000 and a head dimension of 128, the highest-frequency band has a wavelength of 6.3 tokens and the lowest-frequency band tops out around 25,600 tokens. With theta=500000, the lowest band’s wavelength extends to 2,559,195 tokens — far beyond any practical context length.
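Those wavelengths follow from the standard RoPE frequency schedule; assuming band i rotates at angular frequency theta**(-2i/head_dim), a band completes one rotation every 2*pi*theta**(2i/head_dim) tokens:

```python
import numpy as np

def rope_wavelengths(theta: float, head_dim: int = 128) -> np.ndarray:
    """Wavelength in tokens of each RoPE frequency band (standard schedule)."""
    i = np.arange(head_dim // 2)
    return 2.0 * np.pi * theta ** (2.0 * i / head_dim)

lam_llama2 = rope_wavelengths(10_000.0)
lam_bitnet = rope_wavelengths(500_000.0)
# the fastest band is ~6.3 tokens under either theta; raising theta only
# stretches the slow bands, pushing the lowest one past 2.5 million tokens
```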
Our full saturation analysis at the model’s trained length of 4K tokens found:
32 bands already wrapping (saturation >100%): these rotate so fast that they cycle multiple times across the context window. Band 0, the highest-frequency band, saturates at 65,189%, wrapping 651 times. This is intentional: local relationships are encoded through periodicity, not monotonic position.
21 bands completely safe (<10% saturation): these barely rotate at all within 4K tokens. They are essentially untouched positional reserve.
Extending to 8K tokens, 35 bands wrap. At 12K, 37 bands wrap. The critical observation is that only bands 32–42 (wavelengths 4K–15K) require interpolation when extending context beyond the trained length. Everything outside that narrow window either wraps by design or remains safe at any realistic length.
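These band counts can be reproduced from the same frequency schedule (assuming the standard theta**(-2i/head_dim) layout with head dimension 128):

```python
import numpy as np

def band_saturation(theta: float, context: int, head_dim: int = 128) -> np.ndarray:
    """Fraction of a full rotation each RoPE band completes over `context` tokens."""
    i = np.arange(head_dim // 2)
    wavelengths = 2.0 * np.pi * theta ** (2.0 * i / head_dim)
    return context / wavelengths

sat_4k = band_saturation(500_000.0, context=4096)
wrapping = int((sat_4k > 1.0).sum())   # bands cycling at least once in 4K tokens
safe = int((sat_4k < 0.10).sum())      # bands still holding positional reserve
```

Running this recovers the counts above: 32 wrapping bands and 21 safe bands at 4K tokens, with the wrapping count climbing to 35 at 8K.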
Why does this matter for quantization? Because ternary weights are less expressive than float16 weights: each weight can encode less information. The model compensates by ensuring that the positional dimension of information, which quantization does not compress because it is delivered through the rotary embeddings rather than through the weights, arrives at every layer with maximum fidelity. High theta is positional headroom: it buys the model reserve encoding capacity in a dimension that quantization cannot touch.
The Elegance of Self-Correction
BitNet b1.58 does not succeed because we forced a square peg into a round hole. It succeeds because these four mechanisms form an unbroken cascade of mathematical compensation.
Absmean quantization adapts the rounding boundary to each layer’s actual weight distribution, minimizing worst-case error given the data. Weight scale tensors restore global magnitude after the ternary multiply, ensuring that the overall scale of activations stays correct. Sub_norm layers restore per-channel magnitude with a learned, depth-aware gain schedule compensating for thirty layers of compounding quantization error using correction values the model derived entirely from training signal. And high RoPE theta ensures that positional relationships, the one dimension of information that bypasses weight compression entirely, arrive with maximum resolution.
That is the real answer to why BitNet b1.58 doesn’t collapse. It’s not that 1.58 bits is somehow enough. It’s that the architecture learned to continuously correct for the fact that it isn’t — and by layer 29, the correction schedule it learned looks remarkably like what a human quantization engineer would have designed on purpose.