Built an open-source memory layer for local LLMs — single-shot calls, auto-extracted constraints, no context degradation

Justin Joseph — Sat, 02 May 2026 18:06:07 +0000

Been running Llama 3.3 70B via Groq for coding tasks and kept losing architectural decisions across sessions. "We use PostgreSQL" — forgotten. "Auth is JWT" — re-debated. Every new chat starts from zero.
So I built steerhead — it sits between you and any OpenAI-compatible API and manages context via SQLite instead of chat history.
The trick: every message is a single-shot API call. Steerhead assembles the system prompt from stored constraints + file history, fires one clean call, then auto-extracts any decisions the model made (via a second LLM pass) and stores them for next time.
Result: 146 tokens of surgical context instead of 80K tokens of degrading conversation history. New session? The model still knows your entire project's decisions.
Works with:

Groq (free tier, tested with Llama 3.3 70B)
Ollama (local)
OpenRouter (free models)
Any OpenAI-compatible endpoint

What's there: project-scoped DBs, session persistence, auto constraint extraction, React UI
What's next: git diff capture, drift detection, memory classification (inspired by Cloudflare's Agent Memory announcement)
Stack: FastAPI + SQLite + React. Fully local, MIT licensed.
Looking for contributors — especially around constraint extraction accuracy and drift detection.
GitHub: https://github.com/josephmjustin/steerhead

DEV Community: Justin Joseph

Built an open-source memory layer for local LLMs — single-shot calls, auto-extracted constraints, no context degradation