DEV Community

Alex Towell
Alex Towell

Posted on • Originally published at metafunctor.com

Succinct Bit Vectors and Rank/Select

The claim that drives this post: store $n$ bits and answer prefix-count queries in $O(1)$ time, using only $n + o(n)$ bits total. The auxiliary index is asymptotically negligible. That is not obvious, and it is worth understanding why it holds.

Constant-Time Queries on Bit Vectors

Posts 1 through 10 of this series focused on encoding: universal codes, arithmetic coding, Huffman, LZ77. They all ask the same question in slightly different ways: how do we turn a sequence into a compact bit string? This post shifts direction. The question here is different: once we have a bit vector, how do we query it efficiently without expanding it?

The two fundamental queries on a bit vector of $n$ bits are:

  • rank$_1(i)$: the number of 1-bits in positions $[0, i)$, i.e., strictly before position $i$.
  • select$_1(j)$: the position of the $j$-th 1-bit (0-indexed).

These appear throughout data structures: inverted indexes, compressed graphs, FM-indexes, wavelet trees. Rank tells you how many elements precede a position. Select inverts it.

Three design points exist along the space-time axis:

Approach Space rank time select time
Naive scan $n$ bits $O(n/64)$ $O(n/64)$
Full lookup table $O(n \log n)$ bits $O(1)$ $O(1)$
Succinct (this post) $n + o(n)$ bits $O(1)$ $O(\log n)$

The succinct approach hits the right trade-off: an auxiliary index that is asymptotically negligible (sublinear, so $o(n)$) while buying constant-time rank. Select costs $O(\log n)$ with a binary search. Getting $O(1)$ select requires one more index structure; that is covered briefly at the end and implemented in PFC's production version.

The Structure

The bit vector stores its $n$ bits packed into 64-bit words, LSB-first. Bit $i$ lives at position $i \bmod 64$ within word $\lfloor i / 64 \rfloor$. Unused bits in the final word are zeroed.

The auxiliary index has two levels:

  • Superblock array: one uint64_t per 4096-bit chunk. Entry $s$ holds the absolute cumulative rank from bit 0 to the start of superblock $s$.
  • Block array: one uint16_t per 64-bit word. Entry $b$ holds the rank within the enclosing superblock, from the superblock's start to the start of block $b$.
class SuccinctBitVector {
public:
    explicit SuccinctBitVector(const std::vector<bool>& bits);
    [[nodiscard]] std::size_t size()  const noexcept;
    [[nodiscard]] bool        bit(std::size_t i)  const noexcept;
    [[nodiscard]] std::size_t rank1(std::size_t i) const noexcept;   // O(1)
    [[nodiscard]] std::size_t select1(std::size_t j) const noexcept; // O(log n)

protected:
    std::size_t n_;
    std::vector<uint64_t> bits_;
    std::vector<uint64_t> superblock_ranks_;  // abs. rank at superblock boundaries
    std::vector<uint16_t> block_ranks_;       // superblock-relative rank per block

    static constexpr std::size_t SUPERBLOCK_BITS = 4096;
    static constexpr std::size_t BLOCK_BITS      = 64;
    static constexpr std::size_t BLOCKS_PER_SB   = 64;
};
Enter fullscreen mode Exit fullscreen mode

The production version lives in PFC's include/pfc/succinct.hpp. The pedagogical version in this post strips the production features to the core structure.

Why Constant Time

A rank query for position $i$ proceeds in three steps:

  1. Superblock lookup: $s = \lfloor i / 4096 \rfloor$. Load superblock_ranks_[s], giving the absolute count of 1-bits before superblock $s$.
  2. Block lookup: $b = \lfloor i / 64 \rfloor$. Load block_ranks_[b], giving the count within superblock $s$ before block $b$.
  3. Word popcount: mask the word at position $b$ to keep only bits $0 \ldots (i \bmod 64) - 1$, then call std::popcount.
std::size_t rank1(std::size_t i) const noexcept {
    if (i == 0) return 0;
    std::size_t sb    = i / SUPERBLOCK_BITS;
    std::size_t blk   = i / BLOCK_BITS;
    std::size_t w_off = i % BLOCK_BITS;
    std::size_t res   = superblock_ranks_[sb] + block_ranks_[blk];
    if (w_off > 0) {
        uint64_t mask = (uint64_t{1} << w_off) - 1;
        res += std::popcount(bits_[blk] & mask);
    }
    return res;
}
Enter fullscreen mode Exit fullscreen mode

All three steps are $O(1)$: two array indexing operations and one hardware-accelerated popcount (POPCNT on x86-64, VCNT on ARM). The key insight is that partial sums are precomputed at two granularities. The superblock array handles the large-scale prefix sum; the block array handles the within-superblock refinement. The final popcount handles the within-word residual. Three lookups, total.

Select via Binary Search

Select is harder than rank. Rank has a natural decomposition: superblock boundary, then block boundary, then word bits. Select does not. The $j$-th 1-bit could be anywhere, and without additional precomputed information, there is no shortcut.

The approach here uses binary search over superblock_ranks_ to find which superblock contains the $j$-th 1-bit (the largest $s$ such that superblock_ranks_[s] <= j). Within that superblock, a linear scan over the block array pinpoints the right block. Within that block, iterating over set bits via repeated __builtin_ctzll (count trailing zeros) finds the exact position.

Complexity: $O(\log(n / 4096))$ for the binary search over superblocks, then $O(1)$ per block within the superblock (at most 64 blocks), then $O(64)$ for the bit scan. Total: $O(\log n)$.

The $O(1)$ select extension stores one additional array: a "select samples" array holding the position of every $(\log^2 n)$-th 1-bit. That compresses the binary-search range and allows $O(1)$ select at the cost of a few extra kilobytes for typical $n$. PFC's production version implements this. For the pedagogical version here, $O(\log n)$ select is sufficient.

Where Succinct Bit Vectors Show Up

Inverted indexes in full-text search use bit vectors to represent posting lists (which documents contain a term). Rank and select over those bit vectors power gap compression: the posting list stores gaps between document IDs, and select recovers the original IDs.

The FM-index, used in short-read DNA alignment tools like BWA, builds a rank-queryable bit vector over the Burrows-Wheeler Transform. Each rank query takes $O(1)$ and drives an $O(m)$ pattern-search algorithm for a pattern of length $m$.

Compressed graphs (as in the WebGraph framework) store adjacency lists as bit vectors with succinct structures for traversal. Wavelet trees use a hierarchy of bit vectors to answer range-frequency queries in $O(\log \sigma)$ where $\sigma$ is the alphabet size.

In all of these, the $n + o(n)$ space guarantee is not academic. The index must not dwarf the data it queries.

The Space-Time Trade

The auxiliary index for a 10000-bit vector costs about 338 bytes: 24 bytes for the superblock array (3 entries at 8 bytes each) and 314 bytes for the block array (157 entries at 2 bytes each). The bit vector itself costs 1256 bytes (10000 / 8, rounded up to the nearest word). The index is roughly 27% of the bit vector, comfortably $o(n)$.

Here is the comparison table across the four main approaches:

Representation Space rank select Best for
Plain scan $n$ bits $O(n/64)$ $O(n/64)$ Very small $n$
Full table $O(n \log n)$ bits $O(1)$ $O(1)$ Tiny $n$, memory to burn
Succinct (this post) $n + o(n)$ bits $O(1)$ $O(\log n)$ Large dense bit vectors
Sparse / RLE $O(k)$ bits for $k$ ones $O(\log k)$ $O(\log k)$ Very sparse sets

The sparse / RLE row bridges to the next post. When a bit vector is extremely sparse, storing only the positions of the 1-bits beats the $n + o(n)$ dense approach. RoaringBitmap, covered in post 12, chooses dynamically between an array, a dense bit vector, and a run-length encoding based on the local density of each 64K-integer chunk.

Cross-References

This post builds on two earlier posts:

  • Post 1 established the Kraft lower bound: any lossless code must use at least $H$ bits per symbol. The succinct structure respects this by using $n + o(n)$ bits for an $n$-bit vector, asymptotically optimal.
  • Post 3 showed that a coding choice is implicitly a prior over the data. The choice of a dense bit vector (as opposed to an array or run-length encoding) reflects a prior that 1-bits are distributed roughly uniformly. When that prior is wrong, a different structure wins.

The next post shows how RoaringBitmap resolves this by refusing to commit to a single prior.


Footnote: PFC's include/pfc/succinct.hpp implements the full SuccinctBitVector with O(1) select via a select-samples index, and also includes RoaringBitmap with three container types. The pedagogical code in this post covers the core rank structure and O(log n) select.

Top comments (0)