Isartor: pure‑Rust prompt firewall that claims to deflect 60–95% of LLM calls using semantic caching + an embedded SLM. Self‑hosted, air‑gapped. Read the repo: https://github.com/isartor-ai/Isartor (docs: https://isartor-ai.github.io/Isartor/)
How it works: it sits between your agents and the cloud LLM, computes an embedding for each prompt, and checks a semantic cache. On a hit it returns the cached answer; on a miss it can route the prompt to a tiny local SLM (run via Candle, Hugging Face's Rust ML framework). The runtime is pure Rust, no external infra required.
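The cache-or-fallthrough step can be sketched roughly like this. This is my own minimal sketch, not Isartor's actual API: the word-hashing `embed` is a toy stand-in for a real embedding model, and all names (`SemanticCache`, `lookup`, `insert`) are hypothetical.

```rust
// Toy embedding: hash each word into a fixed-size count vector.
// A stand-in for a real embedding model, just to make the flow concrete.
fn embed(text: &str) -> [f32; 64] {
    let mut v = [0f32; 64];
    for w in text.split_whitespace() {
        // FNV-1a hash of the lowercased word, bucketed into the vector.
        let mut h: u64 = 0xcbf29ce484222325;
        for b in w.to_lowercase().bytes() {
            h ^= b as u64;
            h = h.wrapping_mul(0x100000001b3);
        }
        v[(h % 64) as usize] += 1.0;
    }
    v
}

fn cosine(a: &[f32; 64], b: &[f32; 64]) -> f32 {
    let dot: f32 = a.iter().zip(b.iter()).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

struct SemanticCache {
    threshold: f32,
    entries: Vec<([f32; 64], String)>,
}

impl SemanticCache {
    /// Return the cached answer most similar to the prompt if it clears
    /// the threshold; otherwise the caller falls through to the SLM/LLM.
    fn lookup(&self, prompt: &str) -> Option<&str> {
        let q = embed(prompt);
        self.entries
            .iter()
            .map(|(e, ans)| (cosine(&q, e), ans.as_str()))
            .filter(|(score, _)| *score >= self.threshold)
            .max_by(|a, b| a.0.partial_cmp(&b.0).unwrap())
            .map(|(_, ans)| ans)
    }

    fn insert(&mut self, prompt: &str, answer: &str) {
        self.entries.push((embed(prompt), answer.to_string()));
    }
}
```

With a real embedding model, near-paraphrases ("what's the deploy status?") would also score above the threshold; the toy hash only catches reworded prompts with the same words.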
Where it wins: repetitive agentic traffic such as status checks, deterministic tool calls, and repeated retrieval prompts. The authors report 60–95% deflection. Tradeoffs: local compute for the SLM, storage for embeddings, and false positives (semantically close prompts that actually need different answers), so you need similarity thresholds, cache eviction, and hit/miss metrics.
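The operational knobs above (eviction, metrics) amount to something like this. Again a hedged sketch with hypothetical names, not Isartor's internals; exact-match lookup stands in for the embedding comparison, and eviction is simple oldest-first FIFO.

```rust
use std::collections::VecDeque;

struct Entry { prompt: String, answer: String }

/// Capacity-bounded cache with oldest-first eviction and hit/miss counters.
struct BoundedCache {
    capacity: usize,
    entries: VecDeque<Entry>,
    hits: u64,
    misses: u64,
}

impl BoundedCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, entries: VecDeque::new(), hits: 0, misses: 0 }
    }

    // Exact-match lookup as a placeholder for the semantic comparison.
    fn lookup(&mut self, prompt: &str) -> Option<String> {
        match self.entries.iter().find(|e| e.prompt == prompt) {
            Some(e) => { self.hits += 1; Some(e.answer.clone()) }
            None => { self.misses += 1; None }
        }
    }

    fn insert(&mut self, prompt: &str, answer: &str) {
        if self.entries.len() == self.capacity {
            self.entries.pop_front(); // evict the oldest entry
        }
        self.entries.push_back(Entry { prompt: prompt.into(), answer: answer.into() });
    }

    /// Fraction of calls deflected away from the cloud LLM.
    fn deflection_rate(&self) -> f64 {
        let total = self.hits + self.misses;
        if total == 0 { 0.0 } else { self.hits as f64 / total as f64 }
    }
}
```

Tracking the deflection rate in production is what lets you verify a number like 60–95% rather than take it on faith.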
Practical test I’d run: replay 30 days of agent logs, simulate cache hits offline, pick an embedding-similarity threshold that keeps false matches under 1%, and compare the projected savings against your current cloud spend. Would you deploy this at the gateway or in each per‑agent runner?
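The threshold-picking step of that replay could look like this. A hypothetical offline tuning pass, assuming you've scored replayed prompt pairs as `(similarity, same_intent)` tuples by hand-labeling or heuristics; `pick_threshold` is my name, not the project's.

```rust
/// Sweep candidate thresholds from 0.50 to 0.99 and return the lowest one
/// whose false-match rate among cache hits stays under the budget
/// (e.g. 0.01 for the <1% target), or None if no threshold qualifies.
fn pick_threshold(scored: &[(f32, bool)], max_false_rate: f64) -> Option<f32> {
    for i in 50..=99 {
        let t = i as f32 / 100.0;
        let total = scored.iter().filter(|&&(s, _)| s >= t).count();
        if total == 0 {
            continue; // nothing deflected at this threshold
        }
        let false_hits = scored.iter().filter(|&&(s, same)| s >= t && !same).count();
        if (false_hits as f64 / total as f64) < max_false_rate {
            return Some(t);
        }
    }
    None
}
```

A lower threshold deflects more traffic but admits more false matches, so "lowest threshold under the false-match budget" maximizes savings at a fixed risk level.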