Infinigram: Variable-Length N-grams via Suffix Arrays

#languagemodels #ngram #suffixarrays #nlp

Infinigram (pip install py-infinigram) is a corpus-based language model that uses suffix arrays for variable-length n-gram pattern matching. Unlike neural language models, there is no training step. The corpus is the model.

The problem with fixed n-grams

Traditional n-gram models use fixed context lengths and blow up exponentially. A 5-gram model over a 50,000-word vocabulary needs to store up to (50000^5) possible patterns. That is roughly 312 petabytes. Nobody does this.

Infinigram uses suffix arrays instead:

O(n) space: Linear in corpus size, not vocabulary size
O(m log n) queries: Fast pattern matching for any context length
Variable-length matching: Automatically uses the longest matching context

For a 1B token corpus, this means about 1GB instead of about 34GB for hash-based 5-grams.

How it works

Given a context, Infinigram finds the longest matching suffix in the training corpus:

from infinigram import Infinigram

corpus = [1, 2, 3, 4, 2, 3, 5, 6, 2, 3, 4]
model = Infinigram(corpus, max_length=10)

# Find longest match for context [2, 3]
position, length = model.longest_suffix([2, 3])

# Predict next token
probs = model.predict([2, 3])
# {4: 0.66, 5: 0.33, ...}

Predictions come from counting what tokens follow the matched pattern in the corpus. Simple frequency estimation, but over arbitrarily long contexts.

LLM probability mixing

The practical application I care about most: grounding LLM outputs without retraining.

# Mix LLM with corpus-based predictions
P_final = alpha * P_llm + (1 - alpha) * P_infinigram

This gives you:

Domain adaptation without fine-tuning. Load a legal corpus and you get legal-domain predictions.
Hallucination reduction by anchoring to actual corpus content.
Explainability. Every prediction traces to specific corpus evidence. You can point to the exact passages.

Projections as inductive biases

I wrote a theoretical framework viewing inductive biases as projections: transformations applied to queries or training data that enable generalization.

Runtime transforms: lowercase normalization, stemming, synonym expansion
Corpus augmentations: data augmentation, paraphrasing

This gives a principled way to think about out-of-distribution generalization in corpus-based models. The projection determines what the model treats as "the same."

Interactive REPL

Infinigram includes an interactive REPL for exploration:

infinigram-repl

infinigram> /dataset demo
infinigram [demo]> /load the cat sat on the mat
infinigram [demo]> /predict the cat
infinigram [demo]> /complete the cat --max 20

Future: LangCalc integration

Infinigram is designed to work with LangCalc, an algebraic framework for composing language models:

# Compose models algebraically
model = 0.7 * llm + 0.2 * wiki_infinigram + 0.1 * code_infinigram

Mix neural and corpus-based models with explicit control over domain influence.

Resources

PyPI: pip install py-infinigram
GitHub: github.com/queelius/infinigram
Paper: Infinigram Technical Report
Related: LangCalc Project

Background

This builds on ideas from a SLUUG talk on LLMs where I demonstrated arbitrary-size n-grams using recursive dictionaries for a crude expression evaluator. Infinigram takes those ideas further with suffix arrays and the projection framework for generalization.