nitish kumar

Posted on Apr 19

What if a rope data structure used p-adic geometry?

#rust #algorithms #markdown #webassembly

I was building a markdown editor for my platform StringTechHub. At some point I needed to decide how to represent the document internally. I could use a plain string, a gap buffer, or a standard rope. Instead I asked a different question: can number theory give us a better way to organize text than size-balanced trees?

This post is about what that means, what I built, what I measured, and what I learned.

THE PROBLEM WITH STANDARD APPROACHES

A plain string is simple but expensive. Every insert at position i requires shifting all characters after i — O(n) per edit. For a 30,000 character document that becomes noticeable.

A rope solves this by splitting the document into chunks stored in a binary tree. Insert and delete become O(log n) because you only touch the path from root to the affected leaf. Editors like Zed and Helix use rope-based structures internally for this reason.

Standard ropes balance by size — keep subtree weights roughly equal, keep the tree shallow. The size-balancing heuristic is correct but blind. It knows nothing about what the text means. A paragraph and a code block of identical byte length are indistinguishable to a standard rope. I wanted the tree to know the difference.

WHAT P-ADIC NUMBERS ARE

The 2-adic valuation v₂(n) counts how many times 2 divides n:

v₂(8) = 3, because 8 = 2³

v₂(12) = 2, because 12 = 4 × 3

v₂(7) = 0, because 7 is odd

In the 2-adic metric, two numbers are close when their difference is highly divisible by 2. This is a different notion of proximity than the real number line — one based on arithmetic structure rather than magnitude.

The question: what happens when you use this arithmetic structure to decide which nodes in a rope tree should merge together?

THE CORE IDEA

In a standard rope, rebalancing collects all leaf nodes and rebuilds by splitting arrays in half — purely by position. In a p-adic rope, rebalancing collects all leaf nodes and rebuilds by greedily fusing chunks that fit well arithmetically — merging adjacent nodes whose combined size has the highest 2-adic valuation first.

If two adjacent leaves have sizes 174 and 46, their combined size is 220. v₂(220) = 2. If two other adjacent leaves have sizes 171 and 16, their combined size is 187. v₂(187) = 0. The first pair merges first.

The rebuild uses a max-heap ordered by merge score, so the highest valuation pairs always merge first. For p=2, the valuation is a single CPU instruction:

fn valuation(n: usize) -> usize {
    if n == 0 { return usize::MAX; }
    n.trailing_zeros() as usize
}

The result: a tree where subtrees reflect arithmetic affinity between block sizes — not just positional proximity. Certain subtrees become deeper or shallower depending on number-theoretic structure, not size alone.

WHAT I BUILT

The data structure has two node types. Internal nodes are lightweight — they store weight, total size, valuation, cached depth, and two child pointers. No content, no metadata. Leaf nodes are rich — they store raw text, a parsed block from the document parser, a dirty flag, and an AST cache.

pub struct InternalNode {
    pub weight: usize,
    pub total: usize,
    pub valuation: usize,
    pub depth: usize,
    pub left: Box<Node>,
    pub right: Box<Node>,
}

pub struct LeafNode {
    pub text: String,
    pub block: RefCell<Block>,
    pub dirty: Cell<bool>,
    pub ast_cache: RefCell<Option<Block>>,
}

Each leaf corresponds to exactly one semantic block. A heading is one leaf. A paragraph is one leaf. A code block including all indented lines is one leaf. Split operations snap to block boundaries — the rope never cuts mid-paragraph.

The depth cap guarantees O(log n) worst case: max_depth = 2.0 × log₂(total). After every concat, depth is checked. If it exceeds the cap, p-adic rebuild runs. If that still violates the cap, strict size-balanced fallback runs. The guarantee always holds.

THE SEMANTIC EVOLUTION

While building I noticed something: lists and paragraphs have fundamentally different natural groupings. A paragraph is continuous prose — binary splitting fits naturally. An unordered list is a collection of discrete items — ternary grouping fits better.

So I made the prime adaptive:

fn block_prime(node: &Node) -> usize {
    match node {
        Node::Leaf(leaf) => match &leaf.block.borrow().kind {
            BlockKind::UnorderedList |
            BlockKind::OrderedList => 3,
            _ => 2,
        },
        Node::Internal(_) => 2,
    }
}

fn merge_score(left: &Node, right: &Node) -> usize {
    let total = left.total() + right.total();
    let p = block_prime(left).max(block_prime(right));
    valuation_p(total, p)
}

This is where the idea became more interesting: the prime is no longer fixed — it depends on the content. The math adapts to document structure. This is no longer purely p-adic. It is a mathematically inspired adaptive heuristic. I think that is more interesting than mathematical purity, and more honest than pretending the original idea was perfect.

THE CUSTOM DSL

The editor uses a command-based markup language. Every block starts with a dollar-sign command at column zero. Everything indented below belongs to the same block. A new block begins when a $ appears at column zero.

BLOCK COMMANDS
$H1 - heading text

$H2 - heading text

$H3 - heading text

$H4 - heading text

$P - paragraph text

$B - bold block

$I - italic block

$BI - bold italic block

$ST - strikethrough block

$HR

$BQ quote content here -- attribution

$CODE rust fn main() { println!("hello"); }

$UL item one item two item three

$OL first item second item

$IMG - alt text | https://url.com/image.png
$LINK - label text | https://url.com

$FN - footnote content

$TOC

$TB | Header 1 | Header 2 | Header 3 | |----------|----------|----------| | cell | cell | cell | | cell | cell | cell |

Tables are scaffolded automatically. Type $TB and press Enter — the editor generates the header row, separator row, and data rows. Tab moves between cells, Shift+Tab moves back. For custom tables, $TB COLS|ROWS then click enter (ex: $TB 3|3).

INLINE COMMANDS
Inline formatting uses quoted commands anywhere inside block content:

"$I - italic text"

"$B - bold text"

"$BI - bold italic"

"$ST - strikethrough"

ESCAPE
To write a literal dollar sign command without it being parsed in inline:

"!$B" - this renders as plain text, not bold

To write literal quotes inside inline content:

"!!this has literal quotes inside"

The explicit $ prefix at column zero makes block boundary detection trivial and unambiguous. No context-sensitive parsing, no lookahead, no ambiguity. This is why the rope can snap splits to block boundaries cleanly — the boundary is always a $ at column zero.

WHAT THE MEASUREMENTS SHOWED

I built a baseline using standard size-balanced rebuild and ran both on the same document.

Size-balanced baseline: depth 6, total internal nodes 27

valuation distribution: flat, stops at v₂=4

P-adic rope after operations:

depth 9-13, total internal nodes 1528 valuation distribution: geometric decay

v₂=0 → 594 nodes

v₂=1 → 505 nodes

v₂=2 → 240 nodes

v₂=3 → 114 nodes

v₂=4 → 49 nodes

v₂=5 → 19 nodes

v₂=6 → 5 nodes

v₂=7 → 2 nodes

The geometric decay is not accidental. In any distribution of natural numbers, roughly half are odd (v₂=0), a quarter divisible by 2 but not 4 (v₂=1), an eighth by 4 but not 8 (v₂=2), and so on. The histogram follows this pattern exactly — the p-adic rebuild is genuinely organizing nodes by number-theoretic properties. This suggests the structure is not arbitrary — it reflects inherent arithmetic distribution.

The size-balanced tree has no such structure. Its valuation distribution is flat and arbitrary.

Rebalance cost: 17µs for a real document.

1000 inserts: 46ms total, 0.046ms per insert.

I have not yet benchmarked against a production rope implementation like ropey. That is the honest next step. Whether p-adic balancing produces faster trees than size-balanced in real editor workloads is an open question. The tree shapes are measurably different. Whether different is better in practice I cannot yet claim.

THE EDITOR

The parser compiles to WASM via wasm-pack. The editor is a split-pane textarea with live preview. The rope sits between the parser and the renderer.

Each leaf caches its parsed block. Dirty flags mark which leaves need re-parsing after edits. Only dirty leaves get re-parsed on each update — re-parse cost is O(block size) not O(document size). For a 30,000 character document with a 200-character paragraph being edited, only those 200 characters get re-parsed.

Currently the textarea is the source of truth and the rope is rebuilt from it on each debounce. Making the rope the primary source of truth — intercepting every keystroke and updating incrementally — is the next step. The infrastructure is there. The wiring is not complete.

HONEST ASSESSMENT

What works:

P-adic rebuild produces measurably different tree structure from size-balanced
Geometric decay in valuation histogram is real and mathematically explainable
Semantic block-prime scoring connects math to document structure
Dirty flag incremental parsing is implemented and correct
Depth cap guarantee holds under adversarial input 46ms for 1000 inserts on a real document
Compiles to WASM, runs in browser, handles 30k character documents without lag

What does not work yet:

Rope is not the source of truth — full rebuild on each debounce
True incremental updates not wired end to end
No undo/redo
No benchmark against production rope implementations

WHAT I LEARNED

The most interesting discovery was not the math. It was that the prime should depend on the content type. The moment I realized lists have a natural ternary structure and paragraphs have a natural binary structure — the math stopped being a curiosity and started being a design tool.

The second thing: building something real with an idea is the only way to find out if the idea works. I could have written about p-adic ropes. Instead I built one and wired it into an editor. The measurements are more honest than the theory.

The open question: does p-adic balancing actually improve real editor workloads, or does it just produce different trees? I do not know yet. That is what makes it worth continuing.

Code: github

Note: Early-stage prototype. Core ideas are implemented and tested; editor integration and features are still in progress.

DEV Community