The claim that drives this post: store $n$ bits and answer prefix-count queries in $O(1)$ time, using only $n + o(n)$ bits total. The auxiliary index is asymptotically negligible. That is not obvious, and it is worth understanding why it holds.
Constant-Time Queries on Bit Vectors
Posts 1 through 10 of this series focused on encoding: universal codes, arithmetic coding, Huffman, LZ77. They all ask the same question in slightly different ways: how do we turn a sequence into a compact bit string? This post shifts direction. The question here is different: once we have a bit vector, how do we query it efficiently without expanding it?
The two fundamental queries on a bit vector of $n$ bits are:
- rank$_1(i)$: the number of 1-bits in positions $[0, i)$, i.e., strictly before position $i$.
- select$_1(j)$: the position of the $j$-th 1-bit (0-indexed).
These appear throughout data structures: inverted indexes, compressed graphs, FM-indexes, wavelet trees. Rank tells you how many elements precede a position. Select inverts it.
Three design points exist along the space-time axis:
| Approach | Space | rank time | select time |
|---|---|---|---|
| Naive scan | $n$ bits | $O(n/64)$ | $O(n/64)$ |
| Full lookup table | $O(n \log n)$ bits | $O(1)$ | $O(1)$ |
| Succinct (this post) | $n + o(n)$ bits | $O(1)$ | $O(\log n)$ |
The succinct approach hits the right trade-off: an auxiliary index that is asymptotically negligible (sublinear, so $o(n)$) while buying constant-time rank. Select costs $O(\log n)$ with a binary search. Getting $O(1)$ select requires one more index structure; that is covered briefly at the end and implemented in PFC's production version.
The Structure
The bit vector stores its $n$ bits packed into 64-bit words, LSB-first. Bit $i$ lives at position $i \bmod 64$ within word $\lfloor i / 64 \rfloor$. Unused bits in the final word are zeroed.
The auxiliary index has two levels:
-
Superblock array: one
uint64_tper 4096-bit chunk. Entry $s$ holds the absolute cumulative rank from bit 0 to the start of superblock $s$. -
Block array: one
uint16_tper 64-bit word. Entry $b$ holds the rank within the enclosing superblock, from the superblock's start to the start of block $b$.
class SuccinctBitVector {
public:
explicit SuccinctBitVector(const std::vector<bool>& bits);
[[nodiscard]] std::size_t size() const noexcept;
[[nodiscard]] bool bit(std::size_t i) const noexcept;
[[nodiscard]] std::size_t rank1(std::size_t i) const noexcept; // O(1)
[[nodiscard]] std::size_t select1(std::size_t j) const noexcept; // O(log n)
protected:
std::size_t n_;
std::vector<uint64_t> bits_;
std::vector<uint64_t> superblock_ranks_; // abs. rank at superblock boundaries
std::vector<uint16_t> block_ranks_; // superblock-relative rank per block
static constexpr std::size_t SUPERBLOCK_BITS = 4096;
static constexpr std::size_t BLOCK_BITS = 64;
static constexpr std::size_t BLOCKS_PER_SB = 64;
};
The production version lives in PFC's include/pfc/succinct.hpp. The pedagogical version in this post strips the production features to the core structure.
Why Constant Time
A rank query for position $i$ proceeds in three steps:
-
Superblock lookup: $s = \lfloor i / 4096 \rfloor$. Load
superblock_ranks_[s], giving the absolute count of 1-bits before superblock $s$. -
Block lookup: $b = \lfloor i / 64 \rfloor$. Load
block_ranks_[b], giving the count within superblock $s$ before block $b$. -
Word popcount: mask the word at position $b$ to keep only bits $0 \ldots (i \bmod 64) - 1$, then call
std::popcount.
std::size_t rank1(std::size_t i) const noexcept {
if (i == 0) return 0;
std::size_t sb = i / SUPERBLOCK_BITS;
std::size_t blk = i / BLOCK_BITS;
std::size_t w_off = i % BLOCK_BITS;
std::size_t res = superblock_ranks_[sb] + block_ranks_[blk];
if (w_off > 0) {
uint64_t mask = (uint64_t{1} << w_off) - 1;
res += std::popcount(bits_[blk] & mask);
}
return res;
}
All three steps are $O(1)$: two array indexing operations and one hardware-accelerated popcount (POPCNT on x86-64, VCNT on ARM). The key insight is that partial sums are precomputed at two granularities. The superblock array handles the large-scale prefix sum; the block array handles the within-superblock refinement. The final popcount handles the within-word residual. Three lookups, total.
Select via Binary Search
Select is harder than rank. Rank has a natural decomposition: superblock boundary, then block boundary, then word bits. Select does not. The $j$-th 1-bit could be anywhere, and without additional precomputed information, there is no shortcut.
The approach here uses binary search over superblock_ranks_ to find which superblock contains the $j$-th 1-bit (the largest $s$ such that superblock_ranks_[s] <= j). Within that superblock, a linear scan over the block array pinpoints the right block. Within that block, iterating over set bits via repeated __builtin_ctzll (count trailing zeros) finds the exact position.
Complexity: $O(\log(n / 4096))$ for the binary search over superblocks, then $O(1)$ per block within the superblock (at most 64 blocks), then $O(64)$ for the bit scan. Total: $O(\log n)$.
The $O(1)$ select extension stores one additional array: a "select samples" array holding the position of every $(\log^2 n)$-th 1-bit. That compresses the binary-search range and allows $O(1)$ select at the cost of a few extra kilobytes for typical $n$. PFC's production version implements this. For the pedagogical version here, $O(\log n)$ select is sufficient.
Where Succinct Bit Vectors Show Up
Inverted indexes in full-text search use bit vectors to represent posting lists (which documents contain a term). Rank and select over those bit vectors power gap compression: the posting list stores gaps between document IDs, and select recovers the original IDs.
The FM-index, used in short-read DNA alignment tools like BWA, builds a rank-queryable bit vector over the Burrows-Wheeler Transform. Each rank query takes $O(1)$ and drives an $O(m)$ pattern-search algorithm for a pattern of length $m$.
Compressed graphs (as in the WebGraph framework) store adjacency lists as bit vectors with succinct structures for traversal. Wavelet trees use a hierarchy of bit vectors to answer range-frequency queries in $O(\log \sigma)$ where $\sigma$ is the alphabet size.
In all of these, the $n + o(n)$ space guarantee is not academic. The index must not dwarf the data it queries.
The Space-Time Trade
The auxiliary index for a 10000-bit vector costs about 338 bytes: 24 bytes for the superblock array (3 entries at 8 bytes each) and 314 bytes for the block array (157 entries at 2 bytes each). The bit vector itself costs 1256 bytes (10000 / 8, rounded up to the nearest word). The index is roughly 27% of the bit vector, comfortably $o(n)$.
Here is the comparison table across the four main approaches:
| Representation | Space | rank | select | Best for |
|---|---|---|---|---|
| Plain scan | $n$ bits | $O(n/64)$ | $O(n/64)$ | Very small $n$ |
| Full table | $O(n \log n)$ bits | $O(1)$ | $O(1)$ | Tiny $n$, memory to burn |
| Succinct (this post) | $n + o(n)$ bits | $O(1)$ | $O(\log n)$ | Large dense bit vectors |
| Sparse / RLE | $O(k)$ bits for $k$ ones | $O(\log k)$ | $O(\log k)$ | Very sparse sets |
The sparse / RLE row bridges to the next post. When a bit vector is extremely sparse, storing only the positions of the 1-bits beats the $n + o(n)$ dense approach. RoaringBitmap, covered in post 12, chooses dynamically between an array, a dense bit vector, and a run-length encoding based on the local density of each 64K-integer chunk.
Cross-References
This post builds on two earlier posts:
- Post 1 established the Kraft lower bound: any lossless code must use at least $H$ bits per symbol. The succinct structure respects this by using $n + o(n)$ bits for an $n$-bit vector, asymptotically optimal.
- Post 3 showed that a coding choice is implicitly a prior over the data. The choice of a dense bit vector (as opposed to an array or run-length encoding) reflects a prior that 1-bits are distributed roughly uniformly. When that prior is wrong, a different structure wins.
The next post shows how RoaringBitmap resolves this by refusing to commit to a single prior.
Footnote: PFC's include/pfc/succinct.hpp implements the full SuccinctBitVector with O(1) select via a select-samples index, and also includes RoaringBitmap with three container types. The pedagogical code in this post covers the core rank structure and O(log n) select.
Top comments (0)