DEV Community

Alex Towell
Alex Towell

Posted on • Originally published at metafunctor.com

Infinigram: Variable-Length N-grams via Suffix Arrays

Infinigram (pip install py-infinigram) is a corpus-based language model that uses suffix arrays for variable-length n-gram pattern matching. Unlike neural language models, there is no training step. The corpus is the model.

The problem with fixed n-grams

Traditional n-gram models use fixed context lengths and blow up exponentially. A 5-gram model over a 50,000-word vocabulary needs to store up to (50000^5) possible patterns. That is roughly 312 petabytes. Nobody does this.

Infinigram uses suffix arrays instead:

  • O(n) space: Linear in corpus size, not vocabulary size
  • O(m log n) queries: Fast pattern matching for any context length
  • Variable-length matching: Automatically uses the longest matching context

For a 1B token corpus, this means about 1GB instead of about 34GB for hash-based 5-grams.

How it works

Given a context, Infinigram finds the longest matching suffix in the training corpus:

from infinigram import Infinigram

corpus = [1, 2, 3, 4, 2, 3, 5, 6, 2, 3, 4]
model = Infinigram(corpus, max_length=10)

# Find longest match for context [2, 3]
position, length = model.longest_suffix([2, 3])

# Predict next token
probs = model.predict([2, 3])
# {4: 0.66, 5: 0.33, ...}
Enter fullscreen mode Exit fullscreen mode

Predictions come from counting what tokens follow the matched pattern in the corpus. Simple frequency estimation, but over arbitrarily long contexts.

LLM probability mixing

The practical application I care about most: grounding LLM outputs without retraining.

# Mix LLM with corpus-based predictions
P_final = alpha * P_llm + (1 - alpha) * P_infinigram
Enter fullscreen mode Exit fullscreen mode

This gives you:

  • Domain adaptation without fine-tuning. Load a legal corpus and you get legal-domain predictions.
  • Hallucination reduction by anchoring to actual corpus content.
  • Explainability. Every prediction traces to specific corpus evidence. You can point to the exact passages.

Projections as inductive biases

I wrote a theoretical framework viewing inductive biases as projections: transformations applied to queries or training data that enable generalization.

  • Runtime transforms: lowercase normalization, stemming, synonym expansion
  • Corpus augmentations: data augmentation, paraphrasing

This gives a principled way to think about out-of-distribution generalization in corpus-based models. The projection determines what the model treats as "the same."

Interactive REPL

Infinigram includes an interactive REPL for exploration:

infinigram-repl

infinigram> /dataset demo
infinigram [demo]> /load the cat sat on the mat
infinigram [demo]> /predict the cat
infinigram [demo]> /complete the cat --max 20
Enter fullscreen mode Exit fullscreen mode

Future: LangCalc integration

Infinigram is designed to work with LangCalc, an algebraic framework for composing language models:

# Compose models algebraically
model = 0.7 * llm + 0.2 * wiki_infinigram + 0.1 * code_infinigram
Enter fullscreen mode Exit fullscreen mode

Mix neural and corpus-based models with explicit control over domain influence.

Resources

Background

This builds on ideas from a SLUUG talk on LLMs where I demonstrated arbitrary-size n-grams using recursive dictionaries for a crude expression evaluator. Infinigram takes those ideas further with suffix arrays and the projection framework for generalization.

Top comments (0)