Building an Ultra-Metric Tree in Clojure: From Radix Filters to p-adic Distance

#clojure #datastructures #algorithms #functional

Sorry, this wasn't a p-adic distance. It's a prefix distance tree. An updated version is available.

Data structures are more than just tools; they are elegant expressions of logic. Recently, I embarked on a journey to implement a fascinating structure known as an ultra-metric tree, and I chose to do it in my favorite functional language, Clojure. The path to this implementation wasn't a straight line but a winding road of ideas, connecting concepts from radix filters to the abstract world of p-adic numbers.

The Spark of an Idea

My journey began with an inspiring paper on efficient data filtering techniques.

Reference Paper: https://arxiv.org/pdf/2504.17033

Based on the referenced papers above, I decided to build the following initial idea.

Initial Inspiration: Yoshyhyrro/how_to_create_-/radix8_filter

The core idea I latched onto was the concept of filtering data by the number of its binary digits. It's a clever way to partition a search space. I started thinking about how to generalize this. A single decimal digit can be represented by at most four binary digits (since 2⁴=16), so I wondered: what if I group bits into larger, more convenient chunks?

This line of thought led me to a natural conclusion: let's use 8 bits (a single byte) as the fundamental unit. My first intuition was to build something akin to a bucket sort or a Radix Tree, where each level of the tree branches out based on the byte value at that position in the key.

A Bridge to p-adic Worlds

This is where my personal interests took the story in a new direction. I've always been fascinated by p-adic fields—a counter-intuitive number system where concepts of "closeness" are defined by divisibility by powers of a prime p, not by the usual absolute value.

Suddenly, my idea of using 8-bit chunks clicked perfectly with a fascinating mathematical concept: p-adic numbers. While it's a deep topic, the core idea is surprisingly simple. Unlike the standard number line where distance is measured by absolute value, p-adic numbers define "closeness" differently. Here, the distance between two numbers is determined by how many powers of a prime number p divide their difference.

We can apply this same principle to our data. Imagine each byte as a "digit" in a base-256 system. We can then define a distance metric based on this idea, which is known as an ultrametric distance. In this system, the distance between two keys is determined by the length of their shared prefix.

For two keys, k1 and k2, the 256-adic distance is (1/256)^n, where n is the number of initial bytes they share. This means two strings like "apple" and "apply" are "closer" than "apple" and "banana" because they share a longer common prefix. This felt incredibly elegant. The tree structure I had imagined was a perfect physical manifestation of this abstract distance metric.

As I started implementing, an exciting possibility emerged. I began to wonder if, somewhere within this structure, a form of "p-adic signature" might naturally appear. Could the paths within the tree reveal some deeper, encoded properties of the data itself, just as p-adic numbers provide a different lens through which to view integers? This thought became a powerful motivator for the project.

The Clojure Implementation

Clojure's functional style and immutable data structures were a perfect fit for building this tree. Here are a few highlights from the code.

1. The p-adic Distance Function

The distance function is the heart of the metric space. Its implementation is beautifully simple and efficient, relying on byte-array comparison.

(defn p-adic-distance
  "Calculates the 256-adic p-adic distance."
  [key1 key2]
  (let [^bytes bytes1 (byte-array-memo key1)
        ^bytes bytes2 (byte-array-memo key2)
        min-len (min (alength bytes1) (alength bytes2))]
    (loop [i 0]
      (cond
        ;; If one key is a prefix of the other
        (>= i min-len)
        (if (= (alength bytes1) (alength bytes2)) 
          0.0 ; Identical keys
          (/ 1.0 (Math/pow 256 i)))

        ;; The first differing byte determines the distance
        (not= (aget bytes1 i) (aget bytes2 i))
        (/ 1.0 (Math/pow 256 i))

        ;; Bytes match, continue
        :else (recur (inc i))))))

2. The Tree Node and Functional Insertion

The tree itself is a simple record. Insertions are purely functional—they return a new tree, leaving the original untouched. This avoids side effects and makes the code predictable.

(defrecord UMTreeNode [key value children level metrics])

(defn um-tree-insert [tree key value]
  ;; ... returns a new, updated tree ...
)

3. Optimized k-Nearest Neighbor Search

The real power of this structure shines in search operations. For k-Nearest Neighbors (k-NN), a naive approach would compare the query key to every other key. Instead, we use a Best-First-Search algorithm with a priority queue. This allows us to intelligently explore the tree, pruning entire branches that cannot possibly contain a better result.

(defn um-tree-k-nearest [tree search-key k]
  ;; ... uses a priority map to explore the most promising nodes first ...
)

A Quick Tutorial: Building and Querying the Tree

Talk is cheap, let's see the code in action. Here’s a short tutorial demonstrating how to build a tree and perform queries.

First, we start with an empty tree.

;; (ns ultra-metric-tree.demo
;;   (:require [ultra-metric-tree :as umt]))

;; Create a new, empty ultra-metric tree.
;; It's just a simple record under the hood.
(def my-tree (umt/empty-um-tree))

Next, we'll insert some key-value pairs. Since our implementation is purely functional, um-insert returns a new tree with the data added. We use -> (thread-first macro) to chain the insertions elegantly.

;; Insert a few key-value pairs.
;; Keys can be strings, numbers, or anything convertible to a byte array.
(def populated-tree
  (-> (umt/empty-um-tree)
      (umt/um-insert "apple" {:category "fruit", :color "red"})
      (umt/um-insert "application" {:category "software"})
      (umt/um-insert "apply" {:category "verb"})
      (umt/um-insert "banana" {:category "fruit", :color "yellow"})
      (umt/um-insert "bandana" {:category "headwear"})))

Now for the fun part: querying. Let's find the 3 nearest neighbors to the key "app".

;; Find the 3 nearest neighbors to the key "app".
;; The search is highly optimized and prunes large parts of the tree.
(umt/um-k-nearest populated-tree "app" 3)

;; => [;; [value original-key distance]
;;     [{:category "verb"} "apply" 0.0000152587890625]
;;     [{:category "fruit", :color "red"} "apple" 0.0000152587890625]
;;     [{:category "software"} "application" 5.9604644775390625E-8]
;;    ]

As you can see, "application", "apply", and "apple" are correctly identified as the closest keys because they share the prefix "app". The distances reflect the length of this shared prefix.

We can also perform a range query to find all entries within a certain distance.

;; Find all entries within a distance of 0.00002 from "band".
;; This is useful for similarity searches or auto-completion.
(umt/range-query populated-tree "band" 0.00002)

;; => [{:value {:category "headwear"}, :distance 0.0000152587890625, :key "bandana"}]

This simple example demonstrates the power and simplicity of the API. The tree handles the complexity of the p-adic space, providing fast and intuitive search capabilities.

Conclusion

This project was a rewarding exploration of how abstract mathematical concepts can inspire practical, high-performance data structures. Starting from a simple idea of filtering by binary digits, I found myself building a bridge to the world of p-adic numbers and implementing it all with the clarity and power of functional Clojure.

While the "p-adic signature" I imagined remains an elusive and exciting idea for future exploration, the result is a powerful and efficient ultra-metric tree ready for applications like fuzzy searching, clustering, and nearest-neighbor analysis.

You can find the full, commented source code in my repository. Happy hacking!

👈 Part 2: 🌌 High-Performance 3D Spatial Data Sorting with Morton Codes in Clojure.