<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Geolm</title>
    <description>The latest articles on DEV Community by Geolm (@geolm).</description>
    <link>https://dev.to/geolm</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3593049%2F4792d773-f70f-47a4-ba97-277383a55dde.png</url>
      <title>DEV Community: Geolm</title>
      <link>https://dev.to/geolm</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/geolm"/>
    <language>en</language>
    <item>
      <title>bc_crunch : Compression Algorithm Documentation</title>
      <dc:creator>Geolm</dc:creator>
      <pubDate>Fri, 12 Dec 2025 10:09:13 +0000</pubDate>
      <link>https://dev.to/geolm/bccrunch-compression-algorithm-documentation-4fpo</link>
      <guid>https://dev.to/geolm/bccrunch-compression-algorithm-documentation-4fpo</guid>
<description>&lt;p&gt;bc_crunch is a tiny, dependency-free C99 library for the lossless compression of GPU texture blocks in the BC1, BC3, BC4, and BC5 formats.&lt;/p&gt;

&lt;p&gt;You can find it here: &lt;a href="https://github.com/Geolm/bc_crunch" rel="noopener noreferrer"&gt;https://github.com/Geolm/bc_crunch&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And you can contact me on Bluesky: &lt;a href="https://bsky.app/profile/geolm.bsky.social" rel="noopener noreferrer"&gt;https://bsky.app/profile/geolm.bsky.social&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. The Core Engine: Adaptive Arithmetic Coding (AAC)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Arithmetic Coding&lt;/strong&gt; is a form of entropy encoding that maps an entire sequence of symbols to a single fractional number, often achieving better compression than Huffman coding because it can assign fractional bits per symbol.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Modeling (range_model)&lt;/strong&gt;: The compressor maintains a probability model for every type of data it encounters (e.g., color red delta, index difference, etc.). This model tracks how often each symbol (byte value or small integer) has occurred so far.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adaptive Learning&lt;/strong&gt;: The models are adaptive. After encoding a symbol, the model's counts are immediately updated. This allows the compressor to learn the statistics of the specific texture as it is being compressed, making it highly efficient even for unique data patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Renormalization&lt;/strong&gt;: The range_codec manages a base and length interval. As symbols are encoded, this interval shrinks. When it gets too small, a process called renormalization shifts the interval to output compressed bytes, maintaining precision and efficiency.&lt;/p&gt;
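&lt;p&gt;As a rough illustration of the adaptive part, a model can be little more than a frequency table that is bumped after every symbol; the cumulative count maps each symbol to a sub-interval of the coder's current range. The names and the increment of 32 below are illustrative, not bc_crunch's actual range_model API:&lt;/p&gt;

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_SYMBOLS 256

// Adaptive symbol model: counts start uniform so no symbol has probability
// zero, then adapt to the data being compressed.
typedef struct { uint32_t freq[MODEL_SYMBOLS]; uint32_t total; } model_t;

static void model_init(model_t* m) {
    for (int i = 0; i < MODEL_SYMBOLS; ++i) m->freq[i] = 1;
    m->total = MODEL_SYMBOLS;
}

// Cumulative frequency below 'sym': this is what maps the symbol to a
// sub-interval of the coder's current [base, base+length) range.
static uint32_t model_cum(const model_t* m, int sym) {
    uint32_t c = 0;
    for (int i = 0; i < sym; ++i) c += m->freq[i];
    return c;
}

// Adaptive update: bump the count after encoding/decoding the symbol.
static void model_update(model_t* m, int sym) {
    m->freq[sym] += 32;
    m->total += 32;
}
```

&lt;p&gt;A real range coder pairs a model like this with the base/length interval arithmetic and the renormalization step described above.&lt;/p&gt;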

&lt;h2&gt;
  
  
  2. BC1 Compression (bc1_crunch)
&lt;/h2&gt;

&lt;p&gt;BC1 blocks (8 bytes) consist of two 16-bit color endpoints and a 32-bit field of 16 2-bit indices.&lt;/p&gt;
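&lt;p&gt;The 8-byte layout maps naturally to a C struct (field names are mine, not taken from bc_crunch):&lt;/p&gt;

```c
#include <assert.h>
#include <stdint.h>

// BC1 block: 8 bytes total. Two RGB 5:6:5 endpoints followed by 16 2-bit
// palette indices packed into a single 32-bit word.
typedef struct {
    uint16_t color0;   // first RGB565 endpoint
    uint16_t color1;   // second RGB565 endpoint
    uint32_t indices;  // 16 pixels x 2 bits each
} bc1_block_t;
```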

&lt;h3&gt;
  
  
  A. Index Data Compression (The 32-bit Indices)
&lt;/h3&gt;

&lt;p&gt;The 32-bit index pattern, which dictates the color of each pixel in the 4×4 block, is often repeated across a texture. The algorithm exploits this redundancy with a global dictionary.&lt;/p&gt;

&lt;h4&gt;
  
  
  Global Dictionary (Top Table) Creation:
&lt;/h4&gt;

&lt;p&gt;The compressor first scans the entire texture and builds a histogram of every unique 32-bit index pattern.&lt;/p&gt;

&lt;p&gt;It selects the Top 256 (TABLE_SIZE) most frequently occurring index patterns to form the top_table.&lt;/p&gt;
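&lt;p&gt;A minimal sketch of the top-table construction, using a naive O(n²) histogram for clarity (the helper names are hypothetical, not bc_crunch's):&lt;/p&gt;

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { uint32_t pattern; uint32_t count; } hist_entry_t;

static int by_count_desc(const void* a, const void* b) {
    const hist_entry_t* ea = (const hist_entry_t*)a;
    const hist_entry_t* eb = (const hist_entry_t*)b;
    if (ea->count != eb->count) return (eb->count > ea->count) ? 1 : -1;
    return (ea->pattern > eb->pattern) ? 1 : -1;  // deterministic tie-break
}

// Count unique 32-bit index patterns, then keep the most frequent ones.
// Returns the number of entries written to 'top' (at most table_size).
static int build_top_table(const uint32_t* patterns, int n,
                           uint32_t* top, int table_size) {
    hist_entry_t* hist = calloc((size_t)n, sizeof(hist_entry_t));
    int unique = 0;
    for (int i = 0; i < n; ++i) {
        int j = 0;
        while (j < unique && hist[j].pattern != patterns[i]) ++j;  // O(n^2), fine for a sketch
        if (j == unique) { hist[unique].pattern = patterns[i]; hist[unique++].count = 0; }
        hist[j].count++;
    }
    qsort(hist, (size_t)unique, sizeof(hist_entry_t), by_count_desc);
    int kept = unique < table_size ? unique : table_size;
    for (int i = 0; i < kept; ++i) top[i] = hist[i].pattern;
    free(hist);
    return kept;
}
```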

&lt;h4&gt;
  
  
  Encoding the Top Table:
&lt;/h4&gt;

&lt;p&gt;The list of 256 patterns itself is compressed! Since the table is sorted, the patterns are similar.&lt;/p&gt;

&lt;p&gt;The compressor encodes the difference (delta) between consecutive patterns in the table using Arithmetic Coding, which is very efficient for small, positive differences.&lt;/p&gt;
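&lt;p&gt;Assuming the table is sorted numerically, the delta transform is straightforward (illustrative code, not the library's):&lt;/p&gt;

```c
#include <assert.h>
#include <stdint.h>

// Turn a sorted table of 32-bit patterns into deltas between consecutive
// entries; small deltas compress well with an adaptive arithmetic model.
static void table_to_deltas(const uint32_t* table, int n, uint32_t* deltas) {
    uint32_t prev = 0;
    for (int i = 0; i < n; ++i) {
        deltas[i] = table[i] - prev;  // non-negative because the table is sorted
        prev = table[i];
    }
}
```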

&lt;h4&gt;
  
  
  Encoding the Block Indices:
&lt;/h4&gt;

&lt;p&gt;For each new block, the algorithm searches the top_table for the index pattern that is closest in binary representation (using Hamming distance, i.e., how many bits are different).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reference Index&lt;/strong&gt;: It encodes the index (table_index) of this nearest match in the top_table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mode Bit&lt;/strong&gt;: It encodes a single bit (block_mode):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;0 (Match)&lt;/strong&gt;: If the current block indices are exactly the same as the reference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1 (Difference)&lt;/strong&gt;: If they are different, it computes the XOR difference and encodes the resulting 32-bit value, one byte at a time, using a dedicated AAC model (table_difference).&lt;/p&gt;
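&lt;p&gt;The nearest-match search can be sketched like this (__builtin_popcount is a GCC/Clang intrinsic; the function name is mine):&lt;/p&gt;

```c
#include <assert.h>
#include <stdint.h>

// Find the table entry with the smallest Hamming distance to 'indices',
// i.e. the fewest differing bits in the 32-bit pattern.
static int find_nearest(const uint32_t* table, int n, uint32_t indices) {
    int best = 0, best_bits = 33;  // more than any 32-bit distance
    for (int i = 0; i < n; ++i) {
        int bits = __builtin_popcount(table[i] ^ indices);
        if (bits < best_bits) { best_bits = bits; best = i; }
    }
    return best;
}
```

&lt;p&gt;The mode bit then falls out naturally: encode 0 when the XOR difference with the nearest entry is zero, otherwise encode 1 followed by the difference bytes.&lt;/p&gt;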

&lt;h3&gt;
  
  
  B. Color Endpoint Compression (The two 16-bit colors)
&lt;/h3&gt;

&lt;p&gt;The color endpoints are compressed using a sophisticated form of Spatial Prediction and De-correlated Delta Encoding.&lt;/p&gt;

&lt;h4&gt;
  
  
  Spatial Prediction (Choosing a Reference):
&lt;/h4&gt;

&lt;p&gt;The blocks are processed in a zig-zag pattern across the texture to maximize spatial locality.&lt;/p&gt;

&lt;p&gt;For a block's color endpoint, the compressor calculates which of two neighbors offers the best prediction: the previous block in the row or the block immediately above it.&lt;/p&gt;

&lt;p&gt;A single bit (color_reference) is encoded to tell the decoder which neighbor (and thus which color) to use as a reference for delta encoding.&lt;/p&gt;

&lt;h4&gt;
  
  
  De-correlated Delta Encoding:
&lt;/h4&gt;

&lt;p&gt;The difference (delta) in Red, Green, and Blue components between the current color and the chosen reference color is calculated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Green First&lt;/strong&gt;: The Green component delta is encoded first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;R/B Prediction&lt;/strong&gt;: The algorithm then uses the Green delta to predict the Red and Blue deltas (dred -= dgreen/2, dblue -= dgreen/2). This removes common information (e.g., if Green increases, Red and Blue are likely to increase too), making the residuals smaller and easier to compress.&lt;/p&gt;

&lt;p&gt;The residual Red and Blue deltas are encoded next. This sequence (Green -&amp;gt; Red residual -&amp;gt; Blue residual) significantly increases compression efficiency by de-correlating the color channels.&lt;/p&gt;
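&lt;p&gt;A sketch of the de-correlation step (signed integer division truncates toward zero in C; the real code may round or wrap differently):&lt;/p&gt;

```c
#include <assert.h>

// Green-first de-correlated deltas: subtract half of the green delta from
// the red and blue deltas so correlated brightness changes mostly cancel.
typedef struct { int dr, dg, db; } rgb_delta_t;

static rgb_delta_t decorrelate(int r, int g, int b,
                               int ref_r, int ref_g, int ref_b) {
    rgb_delta_t d;
    d.dg = g - ref_g;                 // encoded first
    d.dr = (r - ref_r) - d.dg / 2;    // red residual
    d.db = (b - ref_b) - d.dg / 2;    // blue residual
    return d;
}
```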

&lt;h2&gt;
  
  
  3. BC4 Compression (bc4_crunch)
&lt;/h2&gt;

&lt;p&gt;BC4 blocks (8 bytes) consist of two 8-bit color endpoints (luminance/alpha) and a 48-bit field of 16 3-bit indices.&lt;/p&gt;

&lt;h3&gt;
  
  
  A. Color Endpoint Compression (The two 8-bit colors)
&lt;/h3&gt;

&lt;p&gt;Compression relies on strong spatial prediction for Color 0 and local prediction for Color 1.&lt;/p&gt;

&lt;h4&gt;
  
  
  Color 0 (The First Endpoint):
&lt;/h4&gt;

&lt;p&gt;It uses a Parallelogram Prediction (a common technique in image compression) to predict the current Color 0 value based on surrounding neighbors: Reference = Left + Up - UpLeft.&lt;/p&gt;

&lt;p&gt;The difference (delta) between the actual Color 0 and this spatial reference is calculated (with wrapping modulo 256) and encoded (color_delta[0]).&lt;/p&gt;
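&lt;p&gt;The predictor and the wrap-around delta fit in a few lines (illustrative names):&lt;/p&gt;

```c
#include <assert.h>
#include <stdint.h>

// Parallelogram predictor for BC4 Color 0; the cast to uint8_t gives the
// wrap-around modulo 256 for free.
static uint8_t predict_color0(uint8_t left, uint8_t up, uint8_t upleft) {
    return (uint8_t)(left + up - upleft);
}

// Delta between actual and predicted value, also modulo 256.
static uint8_t wrap_delta(uint8_t actual, uint8_t predicted) {
    return (uint8_t)(actual - predicted);
}
```

&lt;p&gt;Because both the prediction and the delta wrap modulo 256, the decoder recovers the exact value with actual = predicted + delta (mod 256).&lt;/p&gt;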

&lt;h4&gt;
  
  
  Color 1 (The Second Endpoint):
&lt;/h4&gt;

&lt;p&gt;It's encoded as a delta from the current block's Color 0. This assumes a strong relationship between the two endpoints within the same block.&lt;/p&gt;

&lt;h3&gt;
  
  
  B. Index Data Compression (The 48-bit Indices)
&lt;/h3&gt;

&lt;p&gt;This is the most complex part, combining dictionary-based encoding with powerful contextual modeling for dictionary misses.&lt;/p&gt;

&lt;h4&gt;
  
  
  Adaptive Dictionary (Move-to-Front):
&lt;/h4&gt;

&lt;p&gt;A small dictionary (DICTIONARY_SIZE = 256) of 48-bit index patterns is maintained.&lt;/p&gt;

&lt;p&gt;This dictionary is a Move-to-Front (MTF) list: when an entry is used, it is moved to the front (index 0), ensuring the most recently used patterns have the shortest code length.&lt;/p&gt;
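&lt;p&gt;A minimal MTF list over 64-bit slots (a 48-bit pattern fits in a uint64_t); DICT_SIZE is shrunk to 4 here for readability, versus 256 in bc_crunch:&lt;/p&gt;

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DICT_SIZE 4  // 256 in bc_crunch; tiny here for the example

// Move the entry at 'pos' to the front (index 0), shifting the entries
// before it one slot back: recently used patterns get the shortest codes.
static void mtf_move_to_front(uint64_t* dict, int pos) {
    uint64_t hit = dict[pos];
    memmove(dict + 1, dict, (size_t)pos * sizeof(uint64_t));
    dict[0] = hit;
}

// Insert a brand-new pattern at the front, evicting the last entry.
static void mtf_push_front(uint64_t* dict, uint64_t pattern) {
    memmove(dict + 1, dict, (DICT_SIZE - 1) * sizeof(uint64_t));
    dict[0] = pattern;
}
```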

&lt;h4&gt;
  
  
  Dictionary Lookup and Mode Bit:
&lt;/h4&gt;

&lt;p&gt;For the current block's 48-bit index pattern (bitfield), the compressor searches for the nearest match in the MTF dictionary based on Hamming distance.&lt;/p&gt;

&lt;h4&gt;
  
  
  Mode Bit (use_dict):
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;1 (Hit/Near Match)&lt;/strong&gt;: If the nearest match is very close (score &amp;lt; 4 bits different), it encodes 1. It then encodes the index of the match and the XOR difference (split into 16 separate 3-bit symbols), similar to BC1. The entry is moved to the front.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;0 (Miss/New Pattern)&lt;/strong&gt;: If the match is poor, it encodes 0. The indices are then compressed with the contextual local prediction described below, and the current pattern is pushed to the front of the MTF dictionary.&lt;/p&gt;

&lt;h4&gt;
  
  
  Contextual Local Prediction (Dictionary Miss):
&lt;/h4&gt;

&lt;p&gt;If a pattern is not found in the dictionary, it's compressed locally using two levels of context:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Index-Chain XOR&lt;/strong&gt;: The indices (processed in a zig-zag pattern within the block) are encoded as the XOR difference from the previously encoded index in the chain (block_previous ^ data).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Contextual Model Selection&lt;/strong&gt;: The key innovation is that the choice of which AAC model to use for this XOR difference is based on two things:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Endpoint Range&lt;/strong&gt;: The overall difference between the block's two endpoints (int_abs(color[0] - color[1])). The code uses multiple buckets (e.g., range &amp;lt;8, range &amp;lt;32, or max range) to select a group of models (indices[24]).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Previous Index Value&lt;/strong&gt;: Within that group, the specific model is selected by the value of the block_previous index. This means the probability of the next index is modeled based on both the current index and the block's overall contrast, making the prediction extremely precise.&lt;/p&gt;
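&lt;p&gt;Putting the two context levels together, the model selection reduces to an index computation along these lines (a sketch consistent with the 24-model figure: 3 contrast buckets × 8 possible previous indices; the function is hypothetical):&lt;/p&gt;

```c
#include <assert.h>

// Map endpoint contrast to one of three model groups, then pick the model
// within the group by the previous 3-bit index; the bucket thresholds
// (8 and 32) come from the text above.
static int select_model(int color0, int color1, int previous_index) {
    int range = color0 > color1 ? color0 - color1 : color1 - color0;
    int group = (range < 8) ? 0 : (range < 32) ? 1 : 2;
    return group * 8 + previous_index;  // 3 groups x 8 indices = 24 models
}
```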

</description>
      <category>programming</category>
      <category>dxt</category>
      <category>compression</category>
    </item>
    <item>
      <title>onedraw — a GPU-driven 2D renderer</title>
      <dc:creator>Geolm</dc:creator>
      <pubDate>Sun, 02 Nov 2025 13:24:14 +0000</pubDate>
      <link>https://dev.to/geolm/onedraw-a-gpu-driven-2d-renderer-151c</link>
      <guid>https://dev.to/geolm/onedraw-a-gpu-driven-2d-renderer-151c</guid>
<description>&lt;p&gt;Hi, here's a doc I wrote about my open-source Metal renderer; these are the first two parts. Next I'll write about the rasterization.&lt;/p&gt;

&lt;p&gt;Don't hesitate to clone the repo, ask questions, report a bug or contact me. Have a nice day, Geolm.&lt;/p&gt;

&lt;p&gt;URL: &lt;a href="https://github.com/Geolm/onedraw" rel="noopener noreferrer"&gt;https://github.com/Geolm/onedraw&lt;/a&gt;&lt;br&gt;
Contact: &lt;a href="https://bsky.app/profile/geolm.bsky.social" rel="noopener noreferrer"&gt;Geolm&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Goals and initial architecture
&lt;/h2&gt;

&lt;p&gt;I started the project with the following objectives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not triangle-based: shapes are defined using signed distance functions (SDFs).&lt;/li&gt;
&lt;li&gt;High quality: anti-aliased edges by default, perfectly smooth curves (no tessellation required), optimized for high-resolution displays.&lt;/li&gt;
&lt;li&gt;Fast and GPU-driven: offload as much work as possible to the GPU and minimize draw calls.&lt;/li&gt;
&lt;li&gt;Efficient alpha blending: designed to make extensive use of transparency without significant performance cost.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;To achieve high performance, &lt;code&gt;onedraw&lt;/code&gt; minimizes unnecessary computations during rasterization. Since it primarily uses &lt;a href="https://iquilezles.org/articles/distfunctions2d/" rel="noopener noreferrer"&gt;signed distance functions (SDFs)&lt;/a&gt; to render shapes—and these functions can be relatively expensive—efficient culling is essential.&lt;/p&gt;

&lt;p&gt;The screen is divided into &lt;strong&gt;16×16 pixel tiles&lt;/strong&gt;. A &lt;strong&gt;compute shader&lt;/strong&gt; builds a &lt;strong&gt;linked list of draw commands per tile&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
When a tile is rasterized, the fragment shader has direct access to the exact set of draw commands that affect that tile—ensuring that only relevant shapes are processed.&lt;/p&gt;
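&lt;p&gt;Serially, the per-tile linked list looks like the following; on the GPU the two marked steps become atomics so many threads can append concurrently (names are illustrative, not onedraw's):&lt;/p&gt;

```c
#include <assert.h>
#include <stdint.h>

#define INVALID_LINK 0xFFFFFFFFu  // end-of-list marker

// Each node stores a draw-command index plus a link to the previous head
// of the tile's list; the fragment shader later walks this chain.
typedef struct { uint32_t command; uint32_t next; } node_t;

static uint32_t list_push(node_t* nodes, uint32_t* node_count,
                          uint32_t* tile_head, uint32_t command) {
    uint32_t n = (*node_count)++;   // atomic_fetch_add on the GPU
    nodes[n].command = command;
    nodes[n].next = *tile_head;     // atomic exchange with the head on the GPU
    *tile_head = n;
    return n;
}
```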

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8gi9tvltsgwjwf9ifk75.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8gi9tvltsgwjwf9ifk75.png" alt="Overview" width="800" height="748"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  GPU-Driven Pipeline
&lt;/h2&gt;

&lt;p&gt;The entire linked-list generation process happens &lt;strong&gt;on the GPU&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
If a tile contains one or more draw commands, it’s automatically added to a list of tiles to be rendered.&lt;br&gt;&lt;br&gt;
Finally, an &lt;strong&gt;indirect draw call&lt;/strong&gt; is issued to rasterize only those active tiles—eliminating CPU overhead and allowing fully GPU-driven rendering.&lt;/p&gt;




&lt;h2&gt;
  
  
  Binning commands
&lt;/h2&gt;

&lt;p&gt;Binning is the process that generates the &lt;strong&gt;per-tile linked lists&lt;/strong&gt; of draw commands.&lt;br&gt;&lt;br&gt;
It works by performing intersection tests between each draw command and the bounding box of every tile.&lt;br&gt;&lt;br&gt;
Classic intersection methods are used, such as the &lt;strong&gt;Separating Axis Theorem (SAT)&lt;/strong&gt; for oriented shapes and &lt;strong&gt;distance-to-center&lt;/strong&gt; checks for circular ones.&lt;/p&gt;

&lt;p&gt;Some additional factors are also considered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;anti-aliasing
&lt;/li&gt;
&lt;li&gt;groups of shapes
&lt;/li&gt;
&lt;li&gt;smoothmin
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Anti-aliasing
&lt;/h2&gt;

&lt;p&gt;The width of the anti-aliasing region (defined by the user in &lt;code&gt;onedraw&lt;/code&gt;) must be taken into account during intersection testing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fovds0d4yoqc3q06d6vu3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fovds0d4yoqc3q06d6vu3.png" alt="anti-aliasing" width="564" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the example above, even though the disc does not mathematically intersect the tile’s bounding box, its anti-aliasing region extends into the tile.&lt;br&gt;&lt;br&gt;
To avoid visible seams or straight edges along tile borders, the disc is still added to the tile’s linked list so that edge pixels are correctly shaded.&lt;br&gt;
We simply grow the bounding box by the anti-aliasing width during the intersection test.&lt;/p&gt;

&lt;h2&gt;
  
  
  Group of shapes
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;onedraw&lt;/code&gt; supports combining multiple shapes into a &lt;strong&gt;group&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
A group behaves like a single shape: for example, it can have a global transparency even if it contains multiple layers of shapes, or it can have an outline that applies to the entire group.&lt;/p&gt;

&lt;p&gt;Groups are defined by wrapping shapes between &lt;strong&gt;begin group&lt;/strong&gt; and &lt;strong&gt;end group&lt;/strong&gt; commands.&lt;br&gt;&lt;br&gt;
The end group command triggers the color output (and, if enabled, the outline).&lt;/p&gt;

&lt;p&gt;Each begin/end group command is assigned a &lt;strong&gt;global bounding box&lt;/strong&gt; that encompasses all shapes within the group.&lt;br&gt;&lt;br&gt;
This ensures that both the begin and end commands are included whenever the group affects a tile.&lt;br&gt;&lt;br&gt;
If a group has no shapes intersecting the tile, the linked list is adjusted to skip that unused group, avoiding unnecessary processing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Smoothmin
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;smoothmin&lt;/strong&gt; operator allows multiple shapes to blend smoothly, even when they don’t intersect.&lt;br&gt;&lt;br&gt;
The &lt;em&gt;k-factor&lt;/em&gt; (as described on &lt;a href="https://iquilezles.org/articles/smin/" rel="noopener noreferrer"&gt;Inigo Quilez’s website&lt;/a&gt;) controls how far a shape can blend with another.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F93gukxjjxr5aeh3cpvw2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F93gukxjjxr5aeh3cpvw2.png" alt="smoothmin" width="689" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During binning, the tile’s bounding box is expanded by this value to ensure that shapes contributing to a smooth blend are not missed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Performance considerations
&lt;/h2&gt;

&lt;p&gt;Even though binning is performed on the GPU, it’s not free. Let’s look at the numbers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;At 1440p, the screen contains &lt;strong&gt;160 × 90 = 14,400 tiles&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;onedraw&lt;/code&gt; supports up to &lt;strong&gt;65,536 draw commands&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;In the worst case, that’s around &lt;strong&gt;943 million&lt;/strong&gt; intersection tests&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With such a naïve approach, the binning step could easily cost more than the actual rasterization. 😬&lt;/p&gt;




&lt;h2&gt;
  
  
  Quantized AABB
&lt;/h2&gt;

&lt;p&gt;The first optimization is to store an &lt;strong&gt;AABB&lt;/strong&gt; (axis-aligned bounding box) for each command and use it as a quick pre-test before running more complex intersection math.&lt;br&gt;&lt;br&gt;
The bounding boxes are &lt;strong&gt;quantized&lt;/strong&gt; to the tile resolution (16 pixels), making the test extremely fast and simple.  &lt;/p&gt;

&lt;p&gt;Additionally, the AABBs are stored in a separate buffer from the draw commands to avoid &lt;strong&gt;cache thrashing&lt;/strong&gt; and improve memory access efficiency.&lt;/p&gt;
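&lt;p&gt;A sketch of the quantized pre-test; with 16-pixel tiles, a uint8_t per coordinate covers screens up to 4096 pixels (the struct and names are mine):&lt;/p&gt;

```c
#include <assert.h>
#include <stdint.h>

// AABB quantized to 16-pixel tile coordinates: the per-tile pre-test is
// then just four integer compares.
typedef struct { uint8_t min_x, min_y, max_x, max_y; } qaabb_t;

static qaabb_t quantize_aabb(float min_x, float min_y,
                             float max_x, float max_y) {
    qaabb_t q;
    q.min_x = (uint8_t)(min_x / 16.0f);
    q.min_y = (uint8_t)(min_y / 16.0f);
    q.max_x = (uint8_t)(max_x / 16.0f);
    q.max_y = (uint8_t)(max_y / 16.0f);
    return q;
}

static int tile_overlaps(qaabb_t q, uint32_t tile_x, uint32_t tile_y) {
    return tile_x >= q.min_x && tile_x <= q.max_x &&
           tile_y >= q.min_y && tile_y <= q.max_y;
}
```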




&lt;h2&gt;
  
  
  Hierarchical binning
&lt;/h2&gt;

&lt;p&gt;Even with AABB pre-tests, having many commands means each tile still performs a significant amount of work.&lt;br&gt;&lt;br&gt;
To reduce this, we introduce &lt;strong&gt;hierarchical binning&lt;/strong&gt; — pre-building a list of commands for larger screen regions.&lt;/p&gt;

&lt;p&gt;A region covers &lt;strong&gt;16 × 16 tiles&lt;/strong&gt;, and each region keeps a list of the commands that affect it.&lt;br&gt;&lt;br&gt;
When binning individual tiles, we only consider the commands from that region’s list instead of all global commands.&lt;/p&gt;

&lt;p&gt;This sounds great, but there’s a catch:&lt;br&gt;
If we assign one thread per region when binning commands in a compute shader, it becomes very slow — there are relatively few regions but potentially many commands. GPUs perform best when running &lt;strong&gt;a large number of lightweight threads&lt;/strong&gt;, not a few heavy ones.&lt;/p&gt;

&lt;p&gt;To fix this, we &lt;strong&gt;invert the process&lt;/strong&gt;: each thread processes a single command and adds it to the region lists it intersects.&lt;br&gt;&lt;br&gt;
However, since this happens in parallel, we lose the guaranteed &lt;strong&gt;ordering of commands&lt;/strong&gt;, which is critical in 2D rendering.&lt;/p&gt;

&lt;h2&gt;
  
  
  Predicate + exclusive scan
&lt;/h2&gt;

&lt;p&gt;To keep the order of commands, we use the classic pattern:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxkgmhmnrimqgbcyguxbe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxkgmhmnrimqgbcyguxbe.png" alt="predicate" width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;predicate compute shader&lt;/strong&gt; evaluates the visibility of each command (one thread per command) and writes the result — &lt;code&gt;0&lt;/code&gt; or &lt;code&gt;1&lt;/code&gt; — to a buffer.
&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;exclusive scan compute shader&lt;/strong&gt; then processes this buffer to build a compact list of visible command indices.
&lt;/li&gt;
&lt;li&gt;Finally, another compute shader uses the predicate and index lists to write the corresponding commands into the output buffer.
This approach lets us keep a &lt;strong&gt;thread-per-command&lt;/strong&gt; model while efficiently filtering out invisible commands.&lt;/li&gt;
&lt;/ul&gt;
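&lt;p&gt;A serial CPU reference of these steps makes the ordering guarantee easy to see: the exclusive prefix sum of the 0/1 predicate is exactly the output slot of each visible command (illustrative code, not the Metal shaders):&lt;/p&gt;

```c
#include <assert.h>
#include <stdint.h>

// Predicate + exclusive-scan compaction: order is preserved because output
// slots increase monotonically with the command index.
static int compact_visible(const uint32_t* commands, const uint32_t* predicate,
                           int n, uint32_t* out) {
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        int slot = sum;              // exclusive prefix sum at position i
        sum += (int)predicate[i];
        if (predicate[i]) out[slot] = commands[i];
    }
    return sum;  // number of visible commands
}
```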

&lt;p&gt;This process is applied to all regions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Intrinsics trick
&lt;/h2&gt;

&lt;p&gt;Typically, the exclusive scan pass consists of multiple cascaded compute shaders to produce the final result. In our case, however, since we have a maximum of 65k commands and know the SIMD group size on Apple Silicon, we can perform the entire operation in a single pass using the &lt;code&gt;simd_prefix_exclusive_sum&lt;/code&gt; function and threadgroup memory.&lt;/p&gt;

&lt;p&gt;You can look at the &lt;a href="https://raw.githubusercontent.com/Geolm/onedraw/refs/heads/main/src/shaders/binning.metal" rel="noopener noreferrer"&gt;binning shader&lt;/a&gt;.&lt;/p&gt;




</description>
      <category>gpudriven</category>
      <category>metal</category>
      <category>graphics</category>
      <category>sdf2d</category>
    </item>
  </channel>
</rss>
