VLSiddarth
Andrej Karpathy said manual data ingest for AI agents is too slow. I built the fix.

Last week Andrej Karpathy posted about building personal knowledge
bases for LLM agents. He described his workflow: manually indexing
source documents into a raw/ directory, writing custom search tools,
building a naive search engine over his wiki.

Then he wrote this:

"I think there is room here for an incredible new product
instead of a hacky collection of scripts."

He was right. So I built it.


The Problem He Identified

Karpathy's workflow is brilliant, but it requires him to manually
curate every source: he clips articles with the Obsidian Web Clipper,
downloads images locally, and feeds them one by one to his LLM agent.

For a researcher at his level that works. For a developer building
production AI agents for clients, it doesn't scale.

Here's the specific failure mode I kept hitting:

You build a RAG pipeline. It works. A user asks about a Python library.
Your retriever finds a Stack Overflow answer with cosine similarity 0.94.
The LLM answers confidently. The user follows the advice. It breaks their project.

The Stack Overflow answer was from 2021. The library changed its API in 2023.

Your retriever did its job perfectly. Your vector store had no concept
of when that document was written. No exception was raised. No warning
was shown. The cosine similarity score told you nothing about whether
the knowledge was still true.

This is the silent failure mode of every RAG pipeline in production.
Tavily, Exa, and SerpAPI don't tell you when their results are stale.

So I built a retrieval API that does.


What I Built

Knowledge Universe is an open-source retrieval API that gives
Karpathy's LLM wiki agents something they currently don't have:
a production-grade data ingestion layer that crawls 18 knowledge
sources simultaneously, scores every result for freshness, and
returns structured documents in 3 seconds.

# Install the CLI
pip install knowledge-universe

# Get a free API key
ku signup you@email.com

# Run your first query
ku discover "transformer architecture" --difficulty 3

Output:
Found 8 sources [2980ms]

  1. 🟢 [arxiv] Learning Novel Transformer Architecture for Time-series

    https://arxiv.org/abs/2502.13721v1
    decay=0.23 (fresh) quality=8.5/10

  2. ⚪ [kaggle] In-depth guide to Transformer architecture

    https://www.kaggle.com/code/tientd95/in-depth-guide-to-transformer-arc
    decay=0.40 (unknown) quality=4.5/10

  3. 🟢 [paperswithcode] Gradient Boosting within a Single Attention Layer

    https://arxiv.org/abs/2604.03190
    decay=0.01 (fresh) quality=7.6/10

  4. 🟠 [github] An-Jhon/Hand-Drawn-Transformer

    https://github.com/An-Jhon/Hand-Drawn-Transformer
    decay=0.68 (stale) quality=2.6/10

  5. 🟡 [semantic_scholar] Transformer+transformer architecture for image captioni

    https://doi.org/10.11591/ijai.v14.i3.pp2338-2346
    decay=0.44 (aging) quality=5.1/10

  6. 🟢 [arxiv] A Survey of Graph Transformers: Architectures, Theories

    https://arxiv.org/abs/2502.16533v2
    decay=0.23 (fresh) quality=8.9/10

  7. ⚪ [kaggle] LB 0.73 single fold transformer architecture

    https://www.kaggle.com/code/hengck23/lb-0-73-single-fold-transformer-a
    decay=0.40 (unknown) quality=4.5/10

  8. 🟠 [github] tum-pbs/pde-transformer

    https://github.com/tum-pbs/pde-transformer
    decay=0.70 (stale) quality=2.5/10

Cache hit: False | Time: 2980ms

Every result tells you not just what it found, but how much to trust it.

Live API: https://vlsiddarth-knowledge-universe.hf.space

GitHub: https://github.com/VLSiddarth/Knowledge-Universe


The Architecture

The core idea: run 18 crawlers in parallel, score everything,
return the best 8-10 results with freshness metadata attached.

Your App / Agent
          │
          ▼  POST /v1/discover
┌─────────────────────────────────────────────────┐
│            Knowledge Universe API               │
│                                                 │
│  1. Cache check (Redis) ──── HIT → 200ms return │
│        │ MISS                                   │
│  2. asyncio.gather(18 crawlers, per-timeouts)   │
│        ├── arXiv          (25s timeout)         │
│        ├── CrossRef       (8s) ← [Academic]     │
│        ├── PapersWithCode (8s) ← [SOTA Models]  │
│        ├── Documentation  (3s) ← [Fast-Fail]    │
│        ├── GitHub         (8s)                  │
│        ├── StackOverflow  (6s)                  │
│        ├── HuggingFace    (8s)                  │
│        ├── Kaggle         (6s)                  │
│        ├── YouTube        (8s)                  │
│        ├── Sketchfab      (5s) ← [3D Spatial]   │
│        ├── Freesound      (5s) ← [Audio]        │
│        ├── Wikipedia      (5s)                  │
│        ├── MIT OCW        (5s)                  │
│        ├── OpenLibrary    (5s)                  │
│        ├── Podcast Index  (5s)                  │
│        ├── Libgen         (4s)                  │
│        ├── CommonCrawl    (2s) ← [Fast-Fail]    │
│        └── GH Archive     (2s) ← [Fast-Fail]    │
│        │                                        │
│  3. Semantic pre-filter (cosine sim > 0.25)     │
│  4. Quality ranker (5-dimension scoring)        │
│  5. Knowledge Decay Engine                      │
│  6. LLM reranker (all-MiniLM-L6-v2)             │
│  7. Coverage Confidence Score                   │
│  8. Cache result (Redis, 4h TTL)                │
│        │                                        │
│  Return: sources + decay_scores + confidence    │
└─────────────────────────────────────────────────┘

Why asyncio.gather and not threads?

Each crawler is an async HTTP call. asyncio.gather runs all 18
simultaneously, so the total wall time equals the slowest crawler,
not the sum. The parallel ceiling is arXiv at ~2.5s for complex queries.
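The gather-with-per-crawler-timeouts pattern can be sketched as follows. This is a minimal illustration, not the project's actual code; `run_crawlers` and the `(callable, timeout)` pairs are assumed shapes standing in for the 18 crawler classes:

```python
import asyncio

async def run_crawlers(crawler_funcs, topic):
    # crawler_funcs: (async callable, timeout_seconds) pairs, an assumed
    # shape standing in for the project's crawler classes.
    async def guarded(fn, timeout):
        try:
            return await asyncio.wait_for(fn(topic), timeout)
        except Exception:  # timeout or crawler error: contribute nothing
            return []

    batches = await asyncio.gather(*(guarded(fn, t) for fn, t in crawler_funcs))
    # Wall time is roughly the slowest surviving crawler, not the sum.
    return [doc for batch in batches for doc in batch]
```

A crawler that hangs past its timeout simply contributes zero results; it cannot stall the whole query.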

One lesson that cost me 5 seconds of latency: one blocking crawler
kills everything. My original Kaggle integration used the official
SDK, which runs synchronous urllib3 under the hood. I wrapped it in
run_in_executor, thinking that was fine. It held a thread-pool slot
for 2.5 seconds on every query and pushed cold latency from 3s to 8s.

The fix: replace the SDK with direct async HTTP calls using httpx:

# Before — blocks the thread pool
async def crawl(self, topic, difficulty):
    loop = asyncio.get_event_loop()
    return await loop.run_in_executor(None, self._crawl_sync, topic, difficulty)

# After — true async, 300ms for same results
async def crawl(self, topic, difficulty):
    async with httpx.AsyncClient() as client:
        datasets = await client.get(
            "https://www.kaggle.com/api/v1/datasets/list",
            params={"search": topic, "sortBy": "usability"},
            headers={"Authorization": f"Basic {self._encoded_creds}"}
        )
        return self._parse(datasets.json(), difficulty)

Cold latency dropped from 8.8s to 3.1s with that one change.


The Knowledge Decay Formula

This is the part that doesn't exist in any other retrieval API.

Every result gets a decay score computed from its age and source type:
decay = 1 - 0.5 ^ (age_days / half_life)
freshness = 1 - decay

Half-lives are tuned per platform based on how fast knowledge
in that domain becomes outdated:

Platform       | Half-life  | Why
---------------|------------|-----------------------------------------
HuggingFace    | 120 days   | ML model landscape changes monthly
GitHub         | 180 days   | Dependencies update constantly
YouTube        | 270 days   | Library tutorials date quickly
Stack Overflow | 365 days   | API answers age with framework versions
arXiv          | 1,095 days | Research papers have a longer shelf life
Wikipedia      | 1,460 days | Actively maintained, slow decay
Open Library   | 1,825 days | Books revised infrequently

The output on every result:

"stackoverflow:59523557": {
  "decay_score": 0.986,
  "freshness": 0.014,
  "label": "decayed",
  "age_days": 2263,
  "half_life_days": 365
}
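Worked through in code, the formula reproduces those numbers exactly. A minimal sketch:

```python
def decay_score(age_days: float, half_life_days: float) -> float:
    # decay = 1 - 0.5 ** (age_days / half_life)
    return 1 - 0.5 ** (age_days / half_life_days)

# The Stack Overflow example above: 2263 days old, 365-day half-life.
decay = decay_score(2263, 365)   # -> 0.986 (rounded)
freshness = 1 - decay            # -> 0.014 (rounded)
```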

That Stack Overflow answer from 2021 with cosine similarity 0.94?
Its freshness score is 0.014. It gets downweighted before it
reaches your LLM. The silent failure mode is no longer silent.

Fast-moving topics get an additional volatility multiplier.
Topics like "LLMs", "React", "Docker", "Claude" use a ×1.1
multiplier on the decay rate — knowledge in those domains
goes stale faster than the platform average.
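The post gives only the ×1.1 factor; one plausible way to wire it in (an assumption on my part, not confirmed by the source) is to scale the exponent, which shortens the effective half-life:

```python
VOLATILE_TOPICS = {"llms", "react", "docker", "claude"}

def decay_with_volatility(age_days, half_life_days, topic, multiplier=1.1):
    # Hypothetical wiring: a multiplier above 1 makes decay climb faster,
    # equivalent to an effective half-life of half_life / multiplier.
    m = multiplier if topic.lower() in VOLATILE_TOPICS else 1.0
    return 1 - 0.5 ** (m * age_days / half_life_days)
```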


The 5-Dimension Quality Ranker

Before decay is applied, each source gets a base quality score:

WEIGHTS = {
    "authority":            0.35,  # platform trust + content type
    "difficulty_alignment": 0.30,  # how well difficulty matches request
    "completeness":         0.20,  # metadata richness
    "social_proof":         0.10,  # stars, citations, views (log-scaled)
    "accessibility":        0.05,  # open access bonus
}

final_score = base_quality * decay_penalty_multiplier

Difficulty alignment is the most impactful dimension for
Karpathy's use case specifically. When you're feeding an LLM wiki
agent, you want sources matched to the agent's context level.
A research synthesis agent should get arXiv papers, not YouTube
explainers. A learning tool for beginners should get the opposite.

# difficulty gap penalty
# gap=0: score 10.0  (perfect match)
# gap=1: score 8.5   (acceptable)  
# gap=2: score 6.0   (marginal)
# gap=3: score 2.0   (nearly blocked)
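Turned into a function, the gap table looks like this. The score values come from the comments above; the function shape and names are mine:

```python
GAP_SCORES = {0: 10.0, 1: 8.5, 2: 6.0, 3: 2.0}

def difficulty_alignment(requested: int, source_difficulty: int) -> float:
    # Score falls off sharply as the source drifts from the requested level.
    gap = abs(requested - source_difficulty)
    return GAP_SCORES.get(gap, 0.0)  # gaps beyond 3 are effectively blocked
```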

The Coverage Confidence Score

This is the feature that surprised me most when I built it.

After reranking, the API computes the average cosine similarity
between your query and the top results. If the average falls below
0.45, it warns you and suggests better queries:

"coverage_intelligence": {
  "confidence": 0.36,
  "confidence_label": "low",
  "coverage_warning": true,
  "warning_message": "Low confidence — results may not match intent",
  "suggested_queries": [
    "attention mechanism self-attention explained",
    "transformer encoder decoder architecture tutorial",
    "attention is all you need paper walkthrough"
  ]
}

This matters for LLM wiki agents specifically. When the agent asks
an ambiguous question that doesn't match how sources are indexed,
instead of silently returning mediocre results, the API tells the
agent to rephrase. The agent can use the suggested queries directly.

Karpathy mentioned he runs "health checks" over his wiki to find
inconsistent data and impute missing data. Coverage confidence is
essentially that health check, automated and running on every query.
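The computation itself is small. A sketch with plain vectors standing in for all-MiniLM-L6-v2 embeddings (the 0.45 threshold is from the post; the function names are mine):

```python
import math

def cosine(a, b):
    # Standard cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def coverage_confidence(query_vec, result_vecs, threshold=0.45):
    # Average query-result similarity; warn when it falls below threshold.
    confidence = sum(cosine(query_vec, v) for v in result_vecs) / len(result_vecs)
    return {"confidence": round(confidence, 2),
            "coverage_warning": confidence < threshold}
```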


Performance vs Competitors

Tested against Tavily, Exa, and SerpAPI using identical queries:

Metric             | Knowledge Universe                    | Tavily   | Exa      | SerpAPI
-------------------|---------------------------------------|----------|----------|------------
Cold latency       | 3.1s                                  | 5.4s     | 1.5s     | 3.5s
Cache hit          | 220ms                                 | N/A      | N/A      | N/A
Decay scoring      | Yes                                   | No       | No       | No
Confidence score   | Yes                                   | No       | No       | No
Difficulty ranking | Yes                                   | No       | No       | No
Source diversity   | 18 typed platforms (incl. 3D & audio) | Web only | Web only | Google only

KU is faster than Tavily on cold queries despite hitting 18 typed
sources vs Tavily's general web index. The parallel architecture
is what makes this possible — the wall clock time equals the
slowest single crawler, not the sum of all crawlers.

Note on Exa: Exa is faster (1.5s) because it uses a single unified
search index rather than parallel crawling. The tradeoff is no decay
scoring and no source type diversity — you get whatever their index
decided to rank.


LangChain Integration — Drop-in Ready

import requests
from langchain_core.documents import Document

def get_knowledge_universe_docs(
    topic: str,
    difficulty: int = 3,
    formats: list | None = None,
    api_key: str = "your_key_here",
    min_freshness: float = 0.3,  # filter sources below 30% fresh
) -> list[Document]:

    formats = formats or ["pdf", "github", "html", "video", "stackoverflow"]

    resp = requests.post(
        "https://vlsiddarth-knowledge-universe.hf.space/v1/discover",
        headers={"X-API-Key": api_key},
        json={
            "topic": topic,
            "difficulty": difficulty,
            "formats": formats,
            "max_results": 10,
        },
        timeout=30,
    ).json()

    # Check coverage confidence
    cov = resp.get("coverage_intelligence", {})
    if cov.get("coverage_warning"):
        print(f"⚠️  Low confidence. Try: {cov.get('suggested_queries', [])}")

    docs = []
    decay_map = resp.get("decay_scores", {})

    for source in resp.get("sources", []):
        decay = decay_map.get(source["id"], {})
        freshness = decay.get("freshness", 0.5)

        # Filter stale sources before they reach your LLM
        if freshness < min_freshness:
            continue

        docs.append(Document(
            page_content=source.get("summary", ""),
            metadata={
                "title":         source["title"],
                "url":           source["url"],
                "platform":      source["source_platform"],
                "freshness":     freshness,
                "decay_label":   decay.get("label"),
                "quality_score": source.get("quality_score"),
                "difficulty":    source.get("difficulty"),
            }
        ))

    return docs


# Usage — drop into any existing LangChain RAG chain
docs = get_knowledge_universe_docs(
    topic="transformer architecture",
    difficulty=3,
    api_key="ku_test_...",
)

for doc in docs:
    print(f"[{doc.metadata['decay_label']}] {doc.metadata['title']}")
    print(f"  freshness={doc.metadata['freshness']:.2f}")
    print(f"  url={doc.metadata['url']}")

For Karpathy's LLM wiki use case specifically:

# Give your LLM wiki agent a tool that does the manual ingest
# he described — automatically, with freshness scoring

wiki_sources = get_knowledge_universe_docs(
    topic="mixture of experts routing algorithms",
    difficulty=5,           # researcher-level sources only
    min_freshness=0.5,      # only recent sources go into the wiki
    formats=["pdf", "github"]  # papers and implementations
)

# Feed directly to your wiki agent
for source in wiki_sources:
    agent.ingest_to_wiki(source)

The manual raw/ directory collection Karpathy describes is now
three lines of code.


Things That Didn't Work

MinHash LSH deduplication misses near-identical titles.
Wikipedia returns both "Neural network" and "Neural network
(machine learning)" as separate articles. After normalization
they differ, so both pass deduplication. Fixed with a
parenthetical-stripping step before the hash comparison.
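The parenthetical-stripping step can be as small as one regex run before hashing. A sketch (the real normalization pipeline has more steps):

```python
import re

def normalize_title(title: str) -> str:
    # Drop a trailing parenthetical so "Neural network (machine learning)"
    # collides with "Neural network" during deduplication.
    stripped = re.sub(r"\s*\([^)]*\)\s*$", "", title)
    return stripped.strip().lower()
```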

Per-crawler timeouts were the wrong abstraction initially.
I started with a global 8s timeout for all crawlers. CommonCrawl
and GH Archive always timed out at 8s with 0 results, wasting the
full 8s on every query. Setting them to a 2s fast-fail dropped the
parallel ceiling from 8s to 3s. Lesson: profile each crawler
individually before setting any global timeout.

Semantic Scholar blocked Hugging Face IP addresses.
I originally used Semantic Scholar for academic papers. In late 2024
they changed their policy and started returning 403 Forbidden for
server-to-server requests from Hugging Face Spaces free tiers.

The fix: I ripped it out and integrated CrossRef. CrossRef indexes
150M+ scholarly works, has a fully open API (CC0 metadata), and
actively encourages programmatic access via its "polite pool" (just
pass your email in the User-Agent). It gave me access to IEEE, ACM,
and Nature papers that arXiv misses, with zero rate-limit blocks.

_is_stale() was silently killing cache hit rate.
The function checked if a cached result was older than 80% of
the TTL (14,400s × 0.8 = 11,520s). Any query between 3.2 and
4 hours old triggered a full cold re-crawl even though Redis
still had the result. Cache hit rate was 25%. One-line fix:
use the full TTL. Hit rate went to 50%+ immediately.

The missing shared-model singleton took too long to notice.
Both LocalLLMReranker and CoverageConfidenceScorer were
independently loading all-MiniLM-L6-v2 (90MB, ~300MB RAM).
Loading twice pushed HuggingFace Spaces free tier (2GB limit)
near the ceiling and added ~500ms to first requests. Fixed with
a module-level singleton:

# src/integrations/shared_model.py
import threading

_model = None
_model_lock = threading.Lock()

def get_shared_model():
    global _model
    if _model is not None:
        return _model
    with _model_lock:
        if _model is None:
            from sentence_transformers import SentenceTransformer
            _model = SentenceTransformer("all-MiniLM-L6-v2")
    return _model

Both classes now call get_shared_model(). Model loads once.
Shared embeddings from the reranker pass directly to the
confidence scorer — zero extra encode() calls per request.


Try It Now

# Install the Python SDK
pip install knowledge-universe

# Or Node
npm install knowledge-universe

# Get a free API key (500 calls/month, no credit card)
ku signup you@email.com

# Query the live API directly
curl -X POST https://vlsiddarth-knowledge-universe.hf.space/v1/discover \
  -H "X-API-Key: ku_test_..." \
  -H "Content-Type: application/json" \
  -d '{
    "topic": "transformer architecture",
    "difficulty": 3,
    "formats": ["pdf", "github", "html", "video"],
    "max_results": 10
  }'

Live API + Swagger docs:

https://vlsiddarth-knowledge-universe.hf.space

GitHub (MIT licensed):

https://github.com/VLSiddarth/Knowledge-Universe


What's Next

Two things I'm actively building:

1. Streaming results — return sources as they arrive from
each crawler rather than waiting for all 18. The first 3 results
could be in your agent pipeline within 800ms. You see something
immediately; the pipeline enriches as more crawlers complete.

2. /v1/monitor webhook alerts — register a topic and a
webhook URL. Knowledge Universe checks that topic daily. When
freshness drops below a threshold or a significantly better
source appears, it pushes an update to your endpoint. Your
RAG pipeline stays current without polling.
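The streaming design in (1) maps naturally onto asyncio.as_completed. A sketch of the pattern, not shipped code:

```python
import asyncio

async def stream_results(crawler_coros):
    # Yield each crawler's batch the moment it finishes,
    # instead of blocking until all of them return.
    for finished in asyncio.as_completed(crawler_coros):
        batch = await finished
        if batch:
            yield batch
```

A consumer can start enriching its pipeline after the first yield while slower crawlers are still in flight.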

If you're building LLM agents that need external knowledge —
whether it's Karpathy's wiki pattern, a production RAG pipeline,
or something in between — I'd genuinely like to hear what breaks.

What's your current approach to handling source freshness in
retrieval? Drop it in the comments.


Tags: python, ai, machinelearning, webdev
