Last month, I teamed up with some brilliant folks—Elijah, Kevin, Jerome, and Godwin—for a hackathon challenge: build a fast, scalable text indexer in Go. The goal? Create a tool that could chew through massive text files, fingerprint chunks for similarity, and retrieve content lightning-fast.
The result was Jam-Text, a CLI-powered indexer with SimHash fingerprints, vector similarity via random hyperplanes, and a sprinkle of multi-threading magic. Here’s the story of how we built it, the hurdles we faced, and what I learned along the way.
The Challenge: Speed, Scale, and Similarity
The hackathon’s problem statement was deceptively simple:
- Parse a text file into chunks (say, 4KB each).
- Generate SimHash fingerprints.
- Build an in-memory index.
- Enable quick lookups.
- Bonus points for parallel processing and fuzzy matching.
But beneath that simplicity lay a beast—how do you make it fast and scalable for huge datasets while keeping memory in check?
We started with a basic question: What’s similarity anyway?
- For plagiarism detection, you want exact or near-exact matches.
- For content finding, semantic similarity matters more (e.g., "cat" vs. "feline").
This led us to a hybrid approach:
- SimHash for exact lookups (per the CLI spec).
- Vector similarity with random hyperplanes for fuzzy, scalable searches.
Diving In: The Big Idea from LSH
Our inspiration came from Moses Charikar's paper, "Similarity Estimation Techniques from Rounding Algorithms", the paper behind SimHash. It applies Locality-Sensitive Hashing (LSH), a genius trick that fingerprints complex data (like text chunks) into short, comparable sketches.
The paper offered three flavors:
- Min-Hash: Great for detecting overlap (e.g., duplicates).
- Random Hyperplanes: Treat text as vectors (word counts) and approximate cosine similarity, perfect for meaning.
- Earth Mover Distance: A fancy distribution metric (overkill for us).
We chose random hyperplanes for its semantic power, pairing it with SimHash to meet the CLI's -h <simhash_value> requirement. The property that makes hyperplanes work: two vectors land on the same side of a random hyperplane with probability 1 − θ/π, where θ is the angle between them, so counting matching sketch bits estimates cosine similarity. Here's how we stitched it together.
Building Jam-Text: The Tech Journey
Step 1: Chunking the Text
We started with a chunker package to split files into fixed-size chunks (default 4KB). Early versions were single-threaded—slow for big files. So, we added a worker pool, splitting the file into sections and processing them in parallel with goroutines.
chunker/chunker.go
func (c *Chunker) ProcessFile(filePath string) ([]Chunk, error) {
	fileSize := getFileSize(filePath) // helper elsewhere in the package
	numWorkers := 4
	chunkChan := make(chan Chunk, 100)
	var wg sync.WaitGroup
	sectionSize := fileSize / int64(numWorkers)

	// Spin up one worker per file section
	for i := 0; i < numWorkers; i++ {
		start := int64(i) * sectionSize
		end := start + sectionSize
		if i == numWorkers-1 {
			end = fileSize // last worker picks up the remainder
		}
		wg.Add(1)
		go func(start, end int64) {
			defer wg.Done() // signal completion so the collector can close the channel
			c.processSection(filePath, start, end, chunkChan)
		}(start, end)
	}

	// Close the channel once every worker finishes, so the range below terminates
	go func() { wg.Wait(); close(chunkChan) }()

	var chunks []Chunk
	for chunk := range chunkChan {
		chunks = append(chunks, chunk)
	}
	return chunks, nil
}
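processSection does the per-worker reading, and it isn't shown above, so here's a minimal sketch of what the worker body looks like. Assumptions: a c.ChunkSize field (4096 by default), a Chunk with Offset and Content fields, and simplified boundary and error handling; the real code differs in detail.
chunker/chunker.go (sketch)
func (c *Chunker) processSection(filePath string, start, end int64, out chan<- Chunk) {
	// Each worker opens its own handle, so ReadAt calls don't contend
	f, err := os.Open(filePath)
	if err != nil {
		return // a real implementation should surface this error
	}
	defer f.Close()

	buf := make([]byte, c.ChunkSize)
	for off := start; off < end; off += int64(c.ChunkSize) {
		size := int64(c.ChunkSize)
		if off+size > end {
			size = end - off // don't read into the next worker's section
		}
		n, err := f.ReadAt(buf[:size], off)
		if n > 0 {
			content := make([]byte, n)
			copy(content, buf[:n]) // copy, since buf is reused next iteration
			out <- Chunk{Offset: off, Content: content}
		}
		if err != nil { // io.EOF at the tail of the file
			return
		}
	}
}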
Step 2: Fingerprinting with a Hybrid Twist
Next, we tackled fingerprinting in the simhash package.
- SimHash hashes words into a 64-bit fingerprint—fast and compact.
- Vector similarity: Each chunk became a vector (word counts over a simple vocabulary), and we generated 10-bit sketches using random hyperplanes.
simhash/simhash.go
type Fingerprint struct {
	SimHash     uint64 // 64-bit SimHash, used for exact lookups
	VecSketches []int  // random-hyperplane sketches, used for fuzzy lookups
}

func Hash(content []byte) Fingerprint {
	vector := toVector(string(content))
	return Fingerprint{
		SimHash:     simHash(string(content)),
		VecSketches: vectorSketches(vector, 10),
	}
}
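The helpers above do the real work but aren't shown, so here's a sketch of both. simHash is the textbook construction (each token's hash votes on each of 64 bits), and vectorSketches packs the signs of ten random projections into each sketch value, matching the "10-bit sketches" above. The FNV hashing, Gaussian components, fixed seed, and bit-packing are our illustrative choices, not necessarily Jam-Text's exact code.
simhash/simhash.go (sketch of the helpers)
package simhash

import (
	"hash/fnv"
	"math/rand"
	"strings"
)

// simHash: every bit of each token's 64-bit hash votes +1 or -1, and the
// sign of each tally becomes one bit of the fingerprint. Similar chunks
// share most tokens, so most bits come out the same.
func simHash(content string) uint64 {
	var votes [64]int
	for _, word := range strings.Fields(content) {
		h := fnv.New64a()
		h.Write([]byte(word))
		wordHash := h.Sum64()
		for bit := 0; bit < 64; bit++ {
			if wordHash&(1<<uint(bit)) != 0 {
				votes[bit]++
			} else {
				votes[bit]--
			}
		}
	}
	var fp uint64
	for bit := 0; bit < 64; bit++ {
		if votes[bit] > 0 {
			fp |= 1 << uint(bit)
		}
	}
	return fp
}

// vectorSketches builds n sketches; each packs the signs of ten random
// projections into one small integer, so near-identical vectors tend to
// share whole sketch values (and, later, LSH buckets).
func vectorSketches(vector []float64, n int) []int {
	const bitsPerSketch = 10
	// Fixed seed: every chunk must be projected onto the same hyperplanes
	rng := rand.New(rand.NewSource(42))
	sketches := make([]int, n)
	for i := range sketches {
		sketch := 0
		for b := 0; b < bitsPerSketch; b++ {
			var dot float64
			for _, x := range vector {
				dot += x * rng.NormFloat64() // Gaussian hyperplane component
			}
			if dot >= 0 {
				sketch |= 1 << b
			}
		}
		sketches[i] = sketch
	}
	return sketches
}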
Step 3: Indexing with LSH
The index package stored SimHash values for exact lookups and LSH buckets for the vector sketches.
index/index.go
type Index struct {
	Entries    map[uint64]IndexEntry // exact SimHash -> entry
	LSHBuckets map[int][]IndexEntry  // first sketch value -> bucket-mates
	mu         sync.RWMutex
}

func (idx *Index) AddEntry(fp simhash.Fingerprint, offset int64) {
	idx.mu.Lock()
	defer idx.mu.Unlock()
	entry := IndexEntry{Fingerprint: fp, Offsets: []int64{offset}}
	idx.Entries[fp.SimHash] = entry
	// Bucket by the first sketch so fuzzy lookups scan one bucket, not everything
	bucketKey := fp.VecSketches[0]
	idx.LSHBuckets[bucketKey] = append(idx.LSHBuckets[bucketKey], entry)
}
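Lookup is where this pays off: exact queries hit Entries directly, while fuzzy queries scan only the query's own bucket. Here's a minimal sketch of the fuzzy path; FindSimilar and sketchSimilarity are our names for illustration, and the real signatures may differ.
index/index.go (sketch)
// FindSimilar returns bucket-mates whose sketches agree with the query's on
// at least a threshold fraction of positions. Cost scales with bucket size,
// not index size.
func (idx *Index) FindSimilar(fp simhash.Fingerprint, threshold float64) []IndexEntry {
	idx.mu.RLock()
	defer idx.mu.RUnlock()

	var matches []IndexEntry
	for _, entry := range idx.LSHBuckets[fp.VecSketches[0]] {
		if sketchSimilarity(fp.VecSketches, entry.Fingerprint.VecSketches) >= threshold {
			matches = append(matches, entry)
		}
	}
	return matches
}

// sketchSimilarity is the fraction of positions where both fingerprints
// carry the same sketch value.
func sketchSimilarity(a, b []int) float64 {
	same := 0
	for i := range a {
		if a[i] == b[i] {
			same++
		}
	}
	return float64(same) / float64(len(a))
}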
Step 4: CLI Magic
The cli package brought it to life with commands like index, lookup, compare, and fuzzy.
Example CLI usage
./textindex -c index -i large_text.txt -s 4096 -o index.idx
./textindex -c lookup -i index.idx -h 3eff1b2c98a6
./textindex -c compare -i doc1.txt -i2 doc2.txt -o report.txt
./textindex -c fuzzy -i testdata.idx -h $HASH -threshold 0.8
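Under the hood, the dispatch is mostly flag plumbing. A rough sketch, assuming the flag names from the examples above; runIndex, runLookup, runCompare, and runFuzzy stand in for the real handlers:
cli/cli.go (sketch)
package cli

import (
	"flag"
	"fmt"
)

// Run wires the CLI flags to the subcommands.
func Run() error {
	cmd := flag.String("c", "", "command: index | lookup | compare | fuzzy")
	input := flag.String("i", "", "input file (text or .idx)")
	input2 := flag.String("i2", "", "second input file (compare only)")
	size := flag.Int("s", 4096, "chunk size in bytes")
	output := flag.String("o", "", "output file")
	hash := flag.String("h", "", "SimHash value to look up")
	threshold := flag.Float64("threshold", 0.8, "fuzzy-match threshold")
	flag.Parse()

	switch *cmd {
	case "index":
		return runIndex(*input, *size, *output)
	case "lookup":
		return runLookup(*input, *hash)
	case "compare":
		return runCompare(*input, *input2, *output)
	case "fuzzy":
		return runFuzzy(*input, *hash, *threshold)
	default:
		return fmt.Errorf("unknown command %q", *cmd)
	}
}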
Challenges and Triumphs
The Vocabulary Headache
Vector similarity needed a fixed vocabulary: if it grows as new words appear, vector dimensions shift and chunks stop being comparable. Building it dynamically was also slow, so we pre-sampled from the file. Not perfect, but it worked for the demo.
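A sketch of the pre-sampling idea, with assumed names (buildVocabulary, and a toVector that takes the vocabulary explicitly; the real package presumably holds it as package state, since Hash above calls toVector with one argument):
simhash/vocab.go (sketch)
package simhash

import (
	"os"
	"strings"
)

// buildVocabulary samples the head of the file once and gives each distinct
// word a fixed vector dimension. Freezing dimensions up front is what keeps
// hyperplane projections comparable across chunks.
func buildVocabulary(filePath string, sampleBytes int) (map[string]int, error) {
	f, err := os.Open(filePath)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	buf := make([]byte, sampleBytes)
	n, _ := f.Read(buf) // a partial read is fine for sampling

	vocab := make(map[string]int)
	for _, word := range strings.Fields(string(buf[:n])) {
		if _, seen := vocab[word]; !seen {
			vocab[word] = len(vocab)
		}
	}
	return vocab, nil
}

// toVector counts occurrences of each vocabulary word in a chunk.
// Words outside the sample are dropped (the "not perfect" part).
func toVector(content string, vocab map[string]int) []float64 {
	vec := make([]float64, len(vocab))
	for _, word := range strings.Fields(content) {
		if dim, ok := vocab[word]; ok {
			vec[dim]++
		}
	}
	return vec
}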
Balancing Speed and Memory
- SimHash was lean (8 bytes/chunk).
- 10 vector sketches bumped it to ~18 bytes/chunk.
- We tuned it down from 20 sketches to 10—good enough for fuzzy matching without hogging RAM.
Scaling with LSH
Linear search was... bad. LSH buckets made fuzzy lookups fast by scanning a single bucket instead of the whole index, but tuning sketch size was tricky: bigger sketches meant fewer false positives per bucket, at the cost of missing genuine near-matches.
What’s Next?
We’re thinking about:
- Incremental Indexing for real-time updates.
- Approximate Nearest Neighbors (ANN) for even faster fuzzy searches.
- Persistent Storage (LSM trees, RocksDB?).