Four filters I apply when pulling HuggingFace models into an AI tools directory

#ai #machinelearning #opensource #indiehackers

HuggingFace's model hub has over 900,000 models as of mid-2026. Surfacing all of them on aiappdex.com would produce noise, not a directory. The ETL that runs nightly to update the AI tools directory applies four filters before any model is considered for inclusion. Here's what each filter does, what it catches that the previous filter missed, and what the chain still doesn't solve.

Filter 1: pipeline tag — only end-user-facing tasks

The HuggingFace API returns a pipeline_tag field for each model that has a task assigned. The allowed set in my ETL is a subset of the full HuggingFace taxonomy:

text-generation, text-classification, token-classification,
question-answering, summarization, translation, image-classification,
image-generation, image-to-text, text-to-image, automatic-speech-recognition,
text-to-speech, audio-classification, zero-shot-classification,
feature-extraction, sentence-similarity

What this excludes: image-segmentation, object-detection, depth-estimation, tabular-regression, and a dozen other pipelines that are real ML tasks but not tools most directory visitors would search for. Also excluded: models with no pipeline_tag at all, which covers roughly 40% of the hub. These are mostly adapter weights, partial checkpoints, and fine-tune datasets uploaded alongside a model rather than the model itself.

This filter alone cuts the candidate set by roughly 80%. The resulting pool is still large enough to be useful, with text-generation alone accounting for a large share of what remains, but it's a tractable number.
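A minimal sketch of this filter, assuming each model record is a plain dict shaped like the API's JSON response. The function name passes_pipeline_filter is hypothetical; the tag list is copied from the allowed set above.

```python
# Allowed tasks, copied from the list above.
ALLOWED_PIPELINE_TAGS = frozenset({
    "text-generation", "text-classification", "token-classification",
    "question-answering", "summarization", "translation",
    "image-classification", "image-generation", "image-to-text",
    "text-to-image", "automatic-speech-recognition", "text-to-speech",
    "audio-classification", "zero-shot-classification",
    "feature-extraction", "sentence-similarity",
})


def passes_pipeline_filter(model: dict) -> bool:
    """Return True only if the record carries a pipeline_tag in the allowed set."""
    # .get() yields None for records without a tag, and None is never allowed,
    # so untagged uploads fall out in the same check.
    return model.get("pipeline_tag") in ALLOWED_PIPELINE_TAGS
```

Using a set membership test keeps the untagged case and the wrong-tag case in one code path, which is why no separate "is the tag missing" branch is needed.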

Filter 2: minimum likes threshold — filtering out abandoned and test uploads

Every model on HuggingFace has a likes count. The current threshold in the ETL is 30. Models below that get skipped entirely.

Thirty likes is a very low bar. What it actually catches: test uploads that were never intended for public use, deprecated model versions that were superseded by a renamed upload, and fine-tunes of popular base models trained on private datasets and uploaded without cleanup. These aren't useful directory entries — they don't have documentation, they often have incorrect or placeholder metadata, and they frequently return 404 on the download endpoint even though the API record exists.

The 30-like threshold isn't magic. It's the point where I stopped finding entries that were clearly accidental public uploads. Ten likes still produced a lot of noise; 30 produced much less. The threshold is a config value I can change per pipeline_tag if needed — text-generation models benefit from a higher threshold (I'd probably raise it to 50 for that pipeline if I wanted stricter quality) while rarer pipelines like text-to-speech work better with a lower cutoff because the ecosystem is smaller.
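Here is roughly what a per-pipeline threshold could look like as config. The default of 30 comes from the text; the override values (50 for text-generation, 15 for text-to-speech) are illustrative assumptions and not the live config, and the names are hypothetical.

```python
DEFAULT_MIN_LIKES = 30

# Hypothetical per-tag overrides. The numbers are illustrative only.
MIN_LIKES_BY_TAG = {
    "text-generation": 50,   # crowded pipeline, stricter bar
    "text-to-speech": 15,    # smaller ecosystem, looser bar
}


def min_likes_for(pipeline_tag):
    """Look up the threshold for a tag, falling back to the default."""
    return MIN_LIKES_BY_TAG.get(pipeline_tag, DEFAULT_MIN_LIKES)


def passes_likes_filter(model: dict) -> bool:
    """Return True if the record meets the threshold for its pipeline."""
    # Treat a missing or null likes count as zero.
    likes = model.get("likes") or 0
    return likes >= min_likes_for(model.get("pipeline_tag"))
```

Keeping the overrides in a dict means tuning one pipeline never touches the filter logic itself.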

Filter 3: last-modified recency — flagging dormant models

This filter doesn't exclude models; it sets a low_activity flag on models that haven't been modified in 14 months. That flag gets stored in the Turso database and surfaced in the directory as a "last active" label rather than a hidden exclusion.

Why flag instead of exclude? Because an old model isn't necessarily a useless model. GPT-J 6B is from 2021 and still appears in production stacks. BERT-base-uncased is from 2018 and is used in half the fine-tuning tutorials published in 2026. Excluding by recency would misrepresent the landscape.

What the low-activity flag does catch: models that were announced with fanfare, got early likes, and then quietly went dormant when the authors moved to a new architecture. Without this flag, those models appear alongside actively maintained alternatives without any signal that maintenance stopped. For someone evaluating tools for production use, that distinction matters.

The 14-month threshold is a heuristic of mine rather than a HuggingFace rule: a model page that hasn't been touched in 14 months almost certainly doesn't reflect the current state of the codebase or the upstream library it depends on.
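The flag-instead-of-exclude behavior can be sketched like this. I'm assuming the record carries an ISO 8601 lastModified string as in the API's JSON, and that "14 months" means calendar months; the helper names are hypothetical and the database write is left out.

```python
import calendar
from datetime import datetime, timezone

LOW_ACTIVITY_MONTHS = 14


def months_ago(now: datetime, months: int) -> datetime:
    """Return `now` shifted back by whole calendar months."""
    total = now.year * 12 + (now.month - 1) - months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    # Clamp the day so that e.g. 30 April minus 14 months lands on 28 February.
    day = min(now.day, calendar.monthrange(year, month)[1])
    return now.replace(year=year, month=month, day=day)


def parse_last_modified(value: str) -> datetime:
    """Parse a timestamp such as '2025-04-14T09:30:00.000Z'."""
    # Older Python versions reject a trailing 'Z', so spell out the UTC offset.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def annotate_activity(model: dict, now: datetime) -> dict:
    """Return a copy of the record with a low_activity flag; never drops it."""
    # `now` must be timezone-aware so it compares cleanly with the parsed value.
    last_modified = parse_last_modified(model["lastModified"])
    cutoff = months_ago(now, LOW_ACTIVITY_MONTHS)
    return {**model, "low_activity": last_modified < cutoff}
```

Passing `now` in explicitly, for example `datetime.now(timezone.utc)` in the nightly run, keeps the function deterministic and easy to test.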

Filter 4: gated and private models — inclusion requires public weights

Models with gating enabled on HuggingFace (the API's gated field is set rather than false) require login and a request form to download. Models marked private: true aren't visible or downloadable to anyone outside the owning account. Neither appears in the directory.
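As a sketch, the check is a single predicate over the record. I'm assuming dict records shaped like the API's JSON, where gated is either false or a mode string such as "auto" or "manual", and I treat missing fields as public, which is my assumption rather than documented behavior. The function name is hypothetical.

```python
def is_publicly_downloadable(model: dict) -> bool:
    """Return True if the record is neither gated nor private."""
    # Any truthy gated value (True, "auto", "manual") counts as gated.
    # Missing keys come back as None and are treated as public here.
    return not model.get("gated") and not model.get("private")
```

Testing for truthiness rather than `is True` matters here, because a gating mode string would slip past a strict boolean comparison.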

The rationale isn't philosophical, it's practical. A directory entry for a model that visitors can't access without a form submission is poor UX. The directory's value is "find a model, go use it." Gated models break that flow entirely for anyone who hasn't already been approved.

This filter has one real cost: it excludes some important models. Meta's Llama 3 series launched gated, as did several Mistral fine-tunes during their initial access period. The directory misses those models from public announcement for as long as gating stays in place. That's a real gap. The three-tier content quality ladder that handles model content upgrades would apply here too: if I wanted to include gated models with a "requires access request" label, I could add a tier for that. I've chosen not to for now because the "access request required" experience isn't what the directory is designed around.

What the chain still doesn't solve

These four filters produce a candidate set that's tractable and skewed toward useful entries. What they don't catch:

Duplicate fine-tunes. There are thousands of Llama-3.1-8B fine-tunes on HuggingFace, all passing the pipeline tag filter, all with enough likes, all public and actively maintained. The directory clusters them by base model but doesn't deduplicate in a meaningful way. Someone searching for an instruction-following model still faces a wall of variants.
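For illustration, clustering by base model can be as simple as a group-by. This sketch assumes the ETL has already resolved a base_model value onto each record (a hypothetical field name) and ranks each cluster by likes; it shows why this groups variants without actually deduplicating them.

```python
from collections import defaultdict


def cluster_by_base_model(models: list) -> dict:
    """Group records under their base model, most-liked variant first."""
    clusters = defaultdict(list)
    for model in models:
        # Records without a known base model form their own single-entry cluster.
        key = model.get("base_model") or model["id"]
        clusters[key].append(model)
    for members in clusters.values():
        members.sort(key=lambda m: m.get("likes") or 0, reverse=True)
    return dict(clusters)
```

Every variant survives the grouping, so a cluster for a popular base model is still a long list. Sorting only decides which variant shows first.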

Quality of the model card. A model that passes all four filters might have a model card that says "fine-tuned for [task]" with no further detail — no eval results, no intended use, no known limitations. The ETL can't infer quality from card text reliably. That's what Claude Haiku's editorial generation step handles: a prompted generation that forces structured outputs around audience fit and limitations. But it's worth naming that the ETL filters select for metadata quality, not model quality.

Pricing and deployment complexity. HuggingFace doesn't expose whether a model runs in 8GB of VRAM, requires a dedicated A100, or is practical to call via the Inference API without self-hosting. That data isn't in the API response. It's the kind of structured attribute that would make the directory genuinely useful for someone deciding between models — and it's something I'd want to add as a manual editorial field rather than an ETL-derived one.

Part of an ongoing 6-month experiment running three AI-curated directory sites. The technical claims here are real; this article was AI-assisted.