Last month, I teamed up with some brilliant folks—Elijah, Kevin, Jerome, and Godwin—for a hackathon challenge: build a fast, scalable text indexer in Go. The goal? Create a tool that could chew through massive text files, fingerprint chunks for similarity, and retrieve content lightning-fast.
The result was Jam-Text, a CLI-powered indexer with SimHash fingerprints, vector similarity via random hyperplanes, and a sprinkle of multi-threading magic. Here’s the story of how we built it, the hurdles we faced, and what I learned along the way.
The Challenge: Speed, Scale, and Similarity
The hackathon’s problem statement was deceptively simple:
- Parse a text file into chunks (say, 4KB each).
- Generate SimHash fingerprints.
- Build an in-memory index.
- Enable quick lookups.
- Bonus points for parallel processing and fuzzy matching.
But beneath that simplicity lay a beast—how do you make it fast and scalable for huge datasets while keeping memory in check?
We started with a basic question: What’s similarity anyway?
- For plagiarism detection, you want exact or near-exact matches.
- For content finding, semantic similarity matters more (e.g., "cat" vs. "feline").
This led us to a hybrid approach:
- SimHash for exact lookups (per the CLI spec).
- Vector similarity with random hyperplanes for fuzzy, scalable searches.
Diving In: The Big Idea from LSH
Our inspiration came from Moses Charikar's paper, "Similarity Estimation Techniques from Rounding Algorithms", the paper behind SimHash. It applies Locality-Sensitive Hashing (LSH), a genius trick that fingerprints complex data (like text chunks) into short, comparable sketches.
The paper offered three flavors:
- Min-Hash: Great for detecting overlap (e.g., duplicates).
- Random Hyperplanes: Treat text as vectors (word counts) and approximate cosine similarity, perfect for meaning.
- Earth Mover Distance: A fancy distribution metric (overkill for us).
We chose random hyperplanes for its semantic power, pairing it with SimHash to meet the CLI's -h <simhash_value> requirement. The property that makes hyperplanes work: two vectors land on the same side of a random hyperplane with probability 1 − θ/π, where θ is the angle between them, so counting matching sketch bits estimates cosine similarity. Here's how we stitched it together.
Building Jam-Text: The Tech Journey
Step 1: Chunking the Text
We started with a chunker package to split files into fixed-size chunks (default 4KB). Early versions were single-threaded—slow for big files. So, we added a worker pool, splitting the file into sections and processing them in parallel with goroutines.
chunker/chunker.go
func (c *Chunker) ProcessFile(filePath string) ([]Chunk, error) {
	fileSize := getFileSize(filePath) // helper elsewhere in the package
	numWorkers := 4
	chunkChan := make(chan Chunk, 100)
	var wg sync.WaitGroup
	sectionSize := fileSize / int64(numWorkers)

	// Spin up one worker per file section
	for i := 0; i < numWorkers; i++ {
		start := int64(i) * sectionSize
		end := start + sectionSize
		if i == numWorkers-1 {
			end = fileSize // last worker picks up the remainder
		}
		wg.Add(1)
		go func(start, end int64) {
			defer wg.Done() // signal completion so the collector can close the channel
			c.processSection(filePath, start, end, chunkChan)
		}(start, end)
	}

	// Close the channel once every worker finishes, so the range below terminates
	go func() { wg.Wait(); close(chunkChan) }()

	var chunks []Chunk
	for chunk := range chunkChan {
		chunks = append(chunks, chunk)
	}
	return chunks, nil
}
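processSection does the per-worker reading, and it isn't shown above, so here's a minimal sketch of what the worker body looks like. Assumptions: a c.ChunkSize field (4096 by default), a Chunk with Offset and Content fields, and simplified boundary and error handling; the real code differs in detail.
chunker/chunker.go (sketch)
func (c *Chunker) processSection(filePath string, start, end int64, out chan<- Chunk) {
	// Each worker opens its own handle, so ReadAt calls don't contend
	f, err := os.Open(filePath)
	if err != nil {
		return // a real implementation should surface this error
	}
	defer f.Close()

	buf := make([]byte, c.ChunkSize)
	for off := start; off < end; off += int64(c.ChunkSize) {
		size := int64(c.ChunkSize)
		if off+size > end {
			size = end - off // don't read into the next worker's section
		}
		n, err := f.ReadAt(buf[:size], off)
		if n > 0 {
			content := make([]byte, n)
			copy(content, buf[:n]) // copy, since buf is reused next iteration
			out <- Chunk{Offset: off, Content: content}
		}
		if err != nil { // io.EOF at the tail of the file
			return
		}
	}
}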
Step 2: Fingerprinting with a Hybrid Twist
Next, we tackled fingerprinting in the simhash package.
- SimHash hashes words into a 64-bit fingerprint—fast and compact.
- Vector similarity: Each chunk became a vector (word counts over a simple vocabulary), and we generated 10-bit sketches using random hyperplanes.
simhash/simhash.go
type Fingerprint struct {
	SimHash     uint64 // 64-bit SimHash, used for exact lookups
	VecSketches []int  // random-hyperplane sketches, used for fuzzy lookups
}

func Hash(content []byte) Fingerprint {
	vector := toVector(string(content))
	return Fingerprint{
		SimHash:     simHash(string(content)),
		VecSketches: vectorSketches(vector, 10),
	}
}
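The helpers above do the real work but aren't shown, so here's a sketch of both. simHash is the textbook construction (each token's hash votes on each of 64 bits), and vectorSketches packs the signs of ten random projections into each sketch value, matching the "10-bit sketches" above. The FNV hashing, Gaussian components, fixed seed, and bit-packing are our illustrative choices, not necessarily Jam-Text's exact code.
simhash/simhash.go (sketch of the helpers)
package simhash

import (
	"hash/fnv"
	"math/rand"
	"strings"
)

// simHash: every bit of each token's 64-bit hash votes +1 or -1, and the
// sign of each tally becomes one bit of the fingerprint. Similar chunks
// share most tokens, so most bits come out the same.
func simHash(content string) uint64 {
	var votes [64]int
	for _, word := range strings.Fields(content) {
		h := fnv.New64a()
		h.Write([]byte(word))
		wordHash := h.Sum64()
		for bit := 0; bit < 64; bit++ {
			if wordHash&(1<<uint(bit)) != 0 {
				votes[bit]++
			} else {
				votes[bit]--
			}
		}
	}
	var fp uint64
	for bit := 0; bit < 64; bit++ {
		if votes[bit] > 0 {
			fp |= 1 << uint(bit)
		}
	}
	return fp
}

// vectorSketches builds n sketches; each packs the signs of ten random
// projections into one small integer, so near-identical vectors tend to
// share whole sketch values (and, later, LSH buckets).
func vectorSketches(vector []float64, n int) []int {
	const bitsPerSketch = 10
	// Fixed seed: every chunk must be projected onto the same hyperplanes
	rng := rand.New(rand.NewSource(42))
	sketches := make([]int, n)
	for i := range sketches {
		sketch := 0
		for b := 0; b < bitsPerSketch; b++ {
			var dot float64
			for _, x := range vector {
				dot += x * rng.NormFloat64() // Gaussian hyperplane component
			}
			if dot >= 0 {
				sketch |= 1 << b
			}
		}
		sketches[i] = sketch
	}
	return sketches
}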
Step 3: Indexing with LSH
The index package stored SimHash values for exact lookups and LSH buckets for the vector sketches.
index/index.go
type Index struct {
	Entries    map[uint64]IndexEntry // exact SimHash -> entry
	LSHBuckets map[int][]IndexEntry  // first sketch value -> bucket-mates
	mu         sync.RWMutex
}

func (idx *Index) AddEntry(fp simhash.Fingerprint, offset int64) {
	idx.mu.Lock()
	defer idx.mu.Unlock()
	entry := IndexEntry{Fingerprint: fp, Offsets: []int64{offset}}
	idx.Entries[fp.SimHash] = entry
	// Bucket by the first sketch so fuzzy lookups scan one bucket, not everything
	bucketKey := fp.VecSketches[0]
	idx.LSHBuckets[bucketKey] = append(idx.LSHBuckets[bucketKey], entry)
}
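Lookup is where this pays off: exact queries hit Entries directly, while fuzzy queries scan only the query's own bucket. Here's a minimal sketch of the fuzzy path; FindSimilar and sketchSimilarity are our names for illustration, and the real signatures may differ.
index/index.go (sketch)
// FindSimilar returns bucket-mates whose sketches agree with the query's on
// at least a threshold fraction of positions. Cost scales with bucket size,
// not index size.
func (idx *Index) FindSimilar(fp simhash.Fingerprint, threshold float64) []IndexEntry {
	idx.mu.RLock()
	defer idx.mu.RUnlock()

	var matches []IndexEntry
	for _, entry := range idx.LSHBuckets[fp.VecSketches[0]] {
		if sketchSimilarity(fp.VecSketches, entry.Fingerprint.VecSketches) >= threshold {
			matches = append(matches, entry)
		}
	}
	return matches
}

// sketchSimilarity is the fraction of positions where both fingerprints
// carry the same sketch value.
func sketchSimilarity(a, b []int) float64 {
	same := 0
	for i := range a {
		if a[i] == b[i] {
			same++
		}
	}
	return float64(same) / float64(len(a))
}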
Step 4: CLI Magic
The cli package brought it to life with commands like index, lookup, compare, and fuzzy.
Example CLI usage
./textindex -c index -i large_text.txt -s 4096 -o index.idx
./textindex -c lookup -i index.idx -h 3eff1b2c98a6
./textindex -c compare -i doc1.txt -i2 doc2.txt -o report.txt
./textindex -c fuzzy -i testdata.idx -h $HASH -threshold 0.8
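Under the hood, the dispatch is mostly flag plumbing. A rough sketch, assuming the flag names from the examples above; runIndex, runLookup, runCompare, and runFuzzy stand in for the real handlers:
cli/cli.go (sketch)
package cli

import (
	"flag"
	"fmt"
)

// Run wires the CLI flags to the subcommands.
func Run() error {
	cmd := flag.String("c", "", "command: index | lookup | compare | fuzzy")
	input := flag.String("i", "", "input file (text or .idx)")
	input2 := flag.String("i2", "", "second input file (compare only)")
	size := flag.Int("s", 4096, "chunk size in bytes")
	output := flag.String("o", "", "output file")
	hash := flag.String("h", "", "SimHash value to look up")
	threshold := flag.Float64("threshold", 0.8, "fuzzy-match threshold")
	flag.Parse()

	switch *cmd {
	case "index":
		return runIndex(*input, *size, *output)
	case "lookup":
		return runLookup(*input, *hash)
	case "compare":
		return runCompare(*input, *input2, *output)
	case "fuzzy":
		return runFuzzy(*input, *hash, *threshold)
	default:
		return fmt.Errorf("unknown command %q", *cmd)
	}
}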
Challenges and Triumphs
The Vocabulary Headache
Vector similarity needed a fixed vocabulary: if it grows as new words appear, vector dimensions shift and chunks stop being comparable. Building it dynamically was also slow, so we pre-sampled from the file. Not perfect, but it worked for the demo.
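A sketch of the pre-sampling idea, with assumed names (buildVocabulary, and a toVector that takes the vocabulary explicitly; the real package presumably holds it as package state, since Hash above calls toVector with one argument):
simhash/vocab.go (sketch)
package simhash

import (
	"os"
	"strings"
)

// buildVocabulary samples the head of the file once and gives each distinct
// word a fixed vector dimension. Freezing dimensions up front is what keeps
// hyperplane projections comparable across chunks.
func buildVocabulary(filePath string, sampleBytes int) (map[string]int, error) {
	f, err := os.Open(filePath)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	buf := make([]byte, sampleBytes)
	n, _ := f.Read(buf) // a partial read is fine for sampling

	vocab := make(map[string]int)
	for _, word := range strings.Fields(string(buf[:n])) {
		if _, seen := vocab[word]; !seen {
			vocab[word] = len(vocab)
		}
	}
	return vocab, nil
}

// toVector counts occurrences of each vocabulary word in a chunk.
// Words outside the sample are dropped (the "not perfect" part).
func toVector(content string, vocab map[string]int) []float64 {
	vec := make([]float64, len(vocab))
	for _, word := range strings.Fields(content) {
		if dim, ok := vocab[word]; ok {
			vec[dim]++
		}
	}
	return vec
}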
Balancing Speed and Memory
- SimHash was lean (8 bytes/chunk).
- 10 vector sketches bumped it to ~18 bytes/chunk.
- We tuned it down from 20 sketches to 10—good enough for fuzzy matching without hogging RAM.
Scaling with LSH
Linear search was... bad. LSH buckets made fuzzy lookups fast by scanning a single bucket instead of the whole index, but tuning sketch size was tricky: bigger sketches meant fewer false positives per bucket, at the cost of missing genuine near-matches.
What’s Next?
We’re thinking about:
- Incremental Indexing for real-time updates.
- Approximate Nearest Neighbors (ANN) for even faster fuzzy searches.
- Persistent Storage (LSM trees, RocksDB?).