Upgrading fallback AI model entries to curated quality with a deterministic hash pool

#webdev #programming #indiehackers #opensource

The AI tools directory launched with 380 model entries where model_used = 'fallback-template'. That tier generates a bare-bones summary — "qwen2-7b is an open-source text-generation model available on HuggingFace" — which is accurate but gives users nothing to act on.

Calling Claude Haiku for all 380 at once would cost a few dollars and flood the API. The three-tier content quality ladder caps Claude upgrades at 100 per run, spread across daily GitHub Actions jobs. At that pace, clearing the full backlog takes four days.

But there's another option for entries where AI-generated prose isn't strictly necessary: deterministic template enrichment. The scripts/polish.py script upgrades fallback entries using structured metadata already present in the model's HuggingFace tags — no API call, no AI involved.

What the Enrichment Adds

The HuggingFace model registry stores several fields the basic fallback ignores:

tags: a freeform list that often includes license:apache-2.0, pytorch, en, safetensors, gguf — the full schema is in the HuggingFace Hub docs
modelId: the full path like Qwen/Qwen2-7B, whose repo name usually encodes the architecture

polish.py extracts structured facts from these fields at upgrade time:

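from __future__ import annotations  # lets `str | None` parse on Python < 3.10

def _license(tags: list) -> str | None:
    # Return a display name for the first "license:*" tag, or None if absent.
    MAP = {"apache-2.0": "Apache 2.0", "mit": "MIT", "gpl-3.0": "GPL-3.0"}  # ... more entries elided
    for t in tags:
        if t.startswith("license:"):
            key = t[len("license:"):]
            return MAP.get(key, key)  # unknown licenses pass through unchanged
    return None

def _frameworks(tags: list) -> list:
    # Keep only tags that name a known framework or weight format, in tag order.
    MAP = {"pytorch": "PyTorch", "onnx": "ONNX", "gguf": "GGUF", "safetensors": "safetensors"}  # ... more entries elided
    return [MAP[t] for t in tags if t in MAP]

def _langs(tags: list) -> list:
    # Keep only tags that are recognized language codes, in tag order.
    MAP = {"en": "English", "zh": "Chinese", "ja": "Japanese", "multilingual": "multilingual"}  # ... more entries elided
    return [MAP[t] for t in tags if t in MAP]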

Architecture gets inferred from substring matching against the model ID. meta-llama/Meta-Llama-3-8B gets labeled "Llama 3"; sentence-transformers/all-MiniLM-L6-v2 gets "MiniLM"; openai/whisper-large-v3 gets "Whisper". The match list covers about 30 common architectures and falls back to a generic label for anything unrecognized.
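
The substring matching described above can be sketched roughly as follows. This is a minimal illustration, not the code in polish.py: the names ARCH_PATTERNS, GENERIC_LABEL and infer_architecture are hypothetical, the pattern list holds only the three architectures named in this post, and the fallback label is an assumption.

```python
# Hypothetical match list of (lowercase substring, display label).
# Only the three examples from the article are included here.
ARCH_PATTERNS = [
    ("llama-3", "Llama 3"),
    ("minilm", "MiniLM"),
    ("whisper", "Whisper"),
]

GENERIC_LABEL = "Transformer"  # assumed fallback label, not confirmed by the source


def infer_architecture(model_id: str) -> str:
    """Return the label of the first pattern found in the model ID."""
    haystack = model_id.lower()
    for needle, label in ARCH_PATTERNS:
        if needle in haystack:
            return label
    return GENERIC_LABEL


print(infer_architecture("meta-llama/Meta-Llama-3-8B"))
print(infer_architecture("sentence-transformers/all-MiniLM-L6-v2"))
print(infer_architecture("openai/whisper-large-v3"))
```

Because the first hit wins, more specific patterns would need to sit ahead of broader ones in a list like this (a "llama-3" entry before a plain "llama" entry, for example).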

The output for a polished entry looks like: "Qwen2-7B is an instruction-tuned text-generation model from Qwen, available in PyTorch and GGUF formats under the Apache 2.0 license, with multilingual support for English and Chinese." That's a summary worth rendering — specific enough that a user can evaluate it, distinct enough that two different models don't return identical copy.
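
To show how extracted facts might be assembled into a sentence like that, here is a small sketch. The function names and the sentence template are assumptions modeled loosely on the sample output above; the real script picks its wording from a template pool, which this sketch does not reproduce.

```python
def join_names(items: list) -> str:
    """Join ['A', 'B', 'C'] as 'A, B and C'."""
    if len(items) <= 1:
        return "".join(items)
    return ", ".join(items[:-1]) + " and " + items[-1]


def build_summary(model_id: str, task: str, license_name, frameworks: list, langs: list) -> str:
    """Compose one sentence, skipping any clause whose data is missing."""
    org, sep, name = model_id.partition("/")
    subject = name if sep else org  # IDs without an org prefix have no "from" clause
    lead = f"{subject} is a {task} model"
    if sep:
        lead += f" from {org}"
    parts = [lead]
    if frameworks:
        parts.append(f"available in {join_names(frameworks)} formats")
    if license_name:
        parts.append(f"released under the {license_name} license")
    if langs:
        parts.append(f"with support for {join_names(langs)}")
    return ", ".join(parts) + "."


print(build_summary("Qwen/Qwen2-7B", "text-generation", "Apache 2.0",
                    ["PyTorch", "GGUF"], ["English", "Chinese"]))
```

Dropping empty clauses keeps the sentence well-formed when a model's tags lack a license or a language code, instead of printing a placeholder.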

Why Deterministic Instead of Random

Pool selection uses an MD5 hash of the model name rather than a random seed:

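import hashlib

def _seed(text: str) -> int:
    # MD5 is used only as a stable, well-distributed hash here, not for security.
    return int(hashlib.md5(text.encode()).hexdigest(), 16)

def pick(lst: list, seed_str: str):
    # Map the hash onto a pool index; the same seed_str always yields the same item.
    return lst[_seed(seed_str) % len(lst)]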

The same model always gets the same template selection. Rerunning the script produces identical output. This matters for audits: if a summary changes between two runs without any template edits, something is wrong.

Random selection would produce different outputs on every run, making it impossible to distinguish "the content changed because we edited the template pool" from "the content changed because the seed landed differently." Deterministic selection removes that ambiguity entirely.

The 100-Entry Cap

fallbacks = [e for e in entries if e.get("model_used") != "claude-haiku-4-5"]
to_upgrade = fallbacks[:100]

Each run caps at 100 entries: the first 100 non-Claude entries in list order. This keeps each GitHub Actions run under 30 seconds and avoids writing thousands of records in a single commit. The most recent run upgraded 98 entries; 282 remain. At one daily run of up to 100 entries, the backlog clears in about three days (282 / 100, rounded up), at which point pipeline-aware content variants and Claude Haiku upgrades take entries the rest of the way.
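
The batching and the backlog arithmetic can be sketched as below. Note one deliberate difference, flagged as an assumption: this sketch filters on model_used == "fallback-template" (the value quoted at the top of the post) rather than excluding only "claude-haiku-4-5", on the guess that polished entries are written back with a different model_used value and should not be reselected. The function names are hypothetical.

```python
import math


def select_batch(entries: list, cap: int = 100) -> list:
    """Return up to `cap` entries still on the bare fallback tier, in list order."""
    # Assumption: polished entries no longer carry "fallback-template".
    pending = [e for e in entries if e.get("model_used") == "fallback-template"]
    return pending[:cap]


def runs_to_clear(backlog: int, cap: int = 100) -> int:
    """Number of capped runs needed to work through `backlog` entries."""
    return math.ceil(backlog / cap)


print(runs_to_clear(380))  # 4, the full launch backlog
print(runs_to_clear(282))  # 3, what remains after the most recent run
```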

What This Doesn't Replace

Hash-pool enrichment produces summaries that are accurate and specific. It does not produce summaries that explain what a model is distinctly good at compared to alternatives, or which deployment scenarios favor it. For that you need editorial judgment — which means Claude, not template expansion.

The isTemplateContent check in the model detail pages noindexes entries still on bare fallback content. Polish-upgraded entries have specific pros/cons populated, so they pass the check and become visible to Google. That's the practical value of this bridge step: moving pages into indexable territory without requiring Claude spend on every single entry.

Part of an ongoing 6-month experiment running three AI-curated directory sites. The technical claims here are real; this article was AI-assisted.