The tool bloat problem nobody talks about
Every AI agent platform hits the same wall: tool bloat.
An agent that can send emails, manage files, query databases, search the web, post to Slack, and call APIs needs a growing pile of tool definitions in its context window. Connect it to external systems, automation platforms, or MCP servers, and you've consumed a meaningful chunk of context before the agent has even started reasoning about the user's request.
The conventional fixes all have critical flaws:
| Approach | Problem |
|---|---|
| Load all tools | Breaks down as tool count grows — schemas and descriptions crowd out actual reasoning |
| Pre-filter by keyword | Fragile. “Send a message to John” — email? Slack? SMS? WhatsApp? Telegram? |
| Category menus | Pushes routing burden onto the user |
| Static tool sets per agent | Limits what each agent can do — defeats the point of a general platform |
The fundamental issue: the system decides which tools are relevant before the LLM has understood the user's intent. That's solving the wrong problem. Only the LLM knows what the user is actually asking for.
OpenPawz solves this with the Librarian Method — a technique that inverts tool discovery entirely.
Star the repo — it's open source
The invention: let the agent ask the librarian
The metaphor is literal. A library patron (the agent) walks up to a librarian and describes what they need. The librarian finds the right books. The patron never needs to know the filing system.
Three roles make this work:
| Role | Implementation |
|---|---|
| Patron | The LLM reasoning over the user's request |
| Librarian | An embedding-powered retrieval layer that maps intent to tools |
| Library | A searchable tool index built from tool definitions and domains |
The Patron understands intent. The Librarian searches for matching tools by semantic similarity. The Library stores every tool as an embedding vector organized by capability domain.
That means the agent only sees the tools it needs after it understands the task.
How it works — round by round
Round 1: User says "Email John about the quarterly report"
- Agent has: a small core set of tools
- Agent understands intent: needs email capabilities
- Agent calls: request_tools("email sending capabilities")
- The Librarian embeds the request and runs semantic search against the tool index
- Top matches: email_send, email_read
- Domain expansion pulls in closely related tools

Round 2: Tools are hot-loaded into the current turn
- Agent now has: core tools + email tools
- Agent calls: email_send({to: "john@...", subject: "Quarterly Report", ...})
- Done ✅
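The lookup step in Round 1 can be sketched as cosine similarity over a pre-embedded tool index. This is a minimal illustration, not the OpenPawz implementation: the struct fields, the `request_tools` signature, and the toy three-dimensional embeddings are all assumptions (a real system would embed with a model and use far larger vectors).

```rust
// Minimal sketch of the Librarian lookup: rank indexed tools by cosine
// similarity to the embedded query. All names and vectors are illustrative.

struct IndexedTool {
    name: &'static str,
    domain: &'static str, // used later for domain expansion
    embedding: Vec<f32>,  // produced by an embedding model in practice
}

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Return tool names ranked by similarity to the query embedding.
fn request_tools(query_embedding: &[f32], index: &[IndexedTool], top_k: usize) -> Vec<String> {
    let mut scored: Vec<(f32, &IndexedTool)> = index
        .iter()
        .map(|t| (cosine(query_embedding, &t.embedding), t))
        .collect();
    scored.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap()); // highest similarity first
    scored.into_iter().take(top_k).map(|(_, t)| t.name.to_string()).collect()
}

fn main() {
    let index = vec![
        IndexedTool { name: "email_send", domain: "email", embedding: vec![0.9, 0.1, 0.0] },
        IndexedTool { name: "email_read", domain: "email", embedding: vec![0.8, 0.2, 0.0] },
        IndexedTool { name: "slack_post", domain: "chat",  embedding: vec![0.1, 0.9, 0.0] },
    ];
    // "email sending capabilities", embedded by the same model in practice:
    let query = [0.95, 0.05, 0.0];
    println!("{:?}", request_tools(&query, &index, 2));
}
```

The key property is that the query vector comes from the agent's interpreted intent, so an ambiguous user message still produces a sharp search.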
The agent used only the tools it needed instead of dragging every available tool definition into the prompt.
Five design decisions that make it work
1. Agent-driven discovery
The LLM forms the search query — not a brittle pre-filter guessing from the raw user message.
When a user says "Can you check if the deployment went through?", a keyword filter might match deploy, check, or container. The agent understands the real intent is monitoring and calls something closer to request_tools("deployment status monitoring CI/CD").
That is a far better search query because it comes after reasoning, not before it.
2. Domain expansion
When the Librarian finds one strong match, it can also bring along closely related tools from the same domain.
If the agent finds email_send, it probably also needs email_read, contact lookup, or attachment handling. Related capabilities travel together so the agent doesn't need to repeatedly rediscover the same cluster.
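Domain expansion can be sketched as a lookup from matched tools back to their domains, then out to every sibling tool. The data-structure shapes here are illustrative assumptions, not Engram or OpenPawz internals.

```rust
// Sketch of domain expansion: once a strong match is found, bring along the
// other tools tagged with the same domain, deduplicated. Names are illustrative.

use std::collections::HashMap;

fn expand_domains(
    matches: &[&str],
    domains: &HashMap<&str, Vec<&str>>, // domain -> tools in that domain
    tool_domain: &HashMap<&str, &str>,  // tool -> its domain
) -> Vec<String> {
    let mut out: Vec<String> = Vec::new();
    for &tool in matches {
        if let Some(domain) = tool_domain.get(tool) {
            for &related in domains.get(domain).into_iter().flatten() {
                // Related capabilities travel together, without duplicates.
                if !out.iter().any(|t| t == related) {
                    out.push(related.to_string());
                }
            }
        }
    }
    out
}

fn main() {
    let mut domains = HashMap::new();
    domains.insert("email", vec!["email_send", "email_read", "contact_lookup"]);
    let mut tool_domain = HashMap::new();
    tool_domain.insert("email_send", "email");
    // One strong match pulls in the whole email cluster:
    println!("{:?}", expand_domains(&["email_send"], &domains, &tool_domain));
}
```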
3. Round carryover
Tools loaded in one reasoning round remain available in the next round of the same turn.
The agent doesn't lose access to the tools it just discovered, but the set also doesn't accumulate forever across unrelated turns.
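One way to picture that policy is a per-turn tool set that accumulates within a turn and resets between turns. The struct and method names below are illustrative, not OpenPawz's actual types.

```rust
// Sketch of round carryover: tools discovered in one reasoning round stay
// available for later rounds of the same turn, then reset on the next turn.

#[derive(Default)]
struct TurnToolSet {
    core: Vec<String>,       // always available (e.g. request_tools itself)
    discovered: Vec<String>, // hot-loaded during this turn
}

impl TurnToolSet {
    fn begin_turn(&mut self) {
        // Discovered tools do not accumulate across unrelated turns.
        self.discovered.clear();
    }
    fn hot_load(&mut self, tools: &[&str]) {
        for &t in tools {
            if !self.discovered.iter().any(|d| d == t) {
                self.discovered.push(t.to_string());
            }
        }
    }
    fn available(&self) -> Vec<String> {
        self.core.iter().chain(self.discovered.iter()).cloned().collect()
    }
}

fn main() {
    let mut set = TurnToolSet { core: vec!["request_tools".into()], ..Default::default() };
    set.begin_turn();
    set.hot_load(&["email_send", "email_read"]); // round 1 discovery
    println!("round 2 sees: {:?}", set.available());
    set.begin_turn(); // next turn: discovered tools are dropped again
    println!("next turn sees: {:?}", set.available());
}
```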
4. Fallback layers
If semantic search is weak, the system still has multiple ways to recover:
- Exact name match
- Domain match
- Domain list return so the agent can refine its own request
The agent always gets something actionable back.
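The fallback chain can be sketched as a sequence of cheaper matchers behind the semantic search. The similarity threshold, substring-based domain match, and result shapes are illustrative assumptions about how such layering could work.

```rust
// Sketch of fallback layers: strong semantic hits first, then exact tool-name
// match, then domain match, then a domain list so the agent can refine.

use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum LookupResult {
    Tools(Vec<String>),      // concrete tools to hot-load
    DomainList(Vec<String>), // let the agent refine its own request
}

fn lookup(
    query: &str,
    semantic_hits: &[(&str, f32)],         // (tool, similarity) from vector search
    domains: &HashMap<&str, Vec<&str>>,    // domain -> tools
) -> LookupResult {
    // 1. Strong semantic matches win outright.
    let strong: Vec<String> = semantic_hits
        .iter()
        .filter(|h| h.1 >= 0.7) // illustrative threshold
        .map(|h| h.0.to_string())
        .collect();
    if !strong.is_empty() {
        return LookupResult::Tools(strong);
    }
    // 2. Exact tool-name match anywhere in the index.
    for tools in domains.values() {
        if let Some(t) = tools.iter().find(|t| **t == query) {
            return LookupResult::Tools(vec![t.to_string()]);
        }
    }
    // 3. Domain match: the query names a known domain.
    for (domain, tools) in domains {
        if query.contains(domain) {
            return LookupResult::Tools(tools.iter().map(|t| t.to_string()).collect());
        }
    }
    // 4. Last resort: return the domain list so the agent can try again.
    LookupResult::DomainList(domains.keys().map(|d| d.to_string()).collect())
}

fn main() {
    let mut domains = HashMap::new();
    domains.insert("email", vec!["email_send", "email_read"]);
    // Weak semantic hits, but the query names the "email" domain: layer 3 fires.
    println!("{:?}", lookup("email capabilities", &[("slack_post", 0.2)], &domains));
}
```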
5. Memory-aware execution
Tool discovery alone is not enough.
Once an agent finds and uses the right tool, it still needs to remember what happened, what worked, what failed, and what should be reused later. That is where Engram enters the picture.
The Librarian answers: Which tool should I use?
Engram answers: What should I remember from using it?
Tool discovery alone is not enough
Most agent systems stop too early.
They focus on tool routing, but real agent behavior has three distinct problems:
| Problem | What it asks |
|---|---|
| Tool discovery | Which capability should the agent use right now? |
| Memory | What should the agent retain across turns and sessions? |
| Expertise | How does repeated success become something better than a prompt? |
OpenPawz treats these as separate layers:
| Layer | Purpose |
|---|---|
| The Librarian Method | Discover the right tools on demand |
| Project Engram | Give the agent structured, persistent memory |
| The Forge | Turn repeated procedural success into earned expertise |
That stack is the real idea.
Engram: memory that behaves like cognition, not a key-value dump
Most AI memory systems are still basically: store blobs, search blobs, inject blobs.
That works up to a point, but it has obvious failure modes:
| Flat memory model | Problem |
|---|---|
| Store everything the same way | No difference between facts, episodes, and procedures |
| Always retrieve | Wastes latency and pollutes context |
| Never forget | Outdated information lingers forever |
| No structure | Repeated experiences never become organized knowledge |
| No budget awareness | Memory recall competes blindly with the context window |
Project Engram is OpenPawz’s memory architecture for persistent agents.
Instead of treating memory like a bag of documents, Engram models memory as a living system with multiple layers:
- Sensory input for what just happened
- Working memory for what the agent is actively thinking about
- Long-term memory for what should persist across sessions
- Graph relationships so memories are connected, not isolated
- Consolidation and decay so the memory store improves over time instead of just growing forever
The memory model: from raw input to durable knowledge
At a high level, Engram works like this:
A user message enters a sensory buffer. Important items get promoted into working memory. Useful outcomes are captured into long-term memory. Later, when the agent needs context again, Engram decides what to retrieve and what to ignore.
That sounds simple, but the key is that memory is not just being stored — it is being ranked, consolidated, and filtered.
Three tiers, three jobs
Engram uses a three-tier memory architecture:
| Tier | Role | Purpose |
|---|---|---|
| Sensory Buffer | What just happened | Holds raw turn-level input before selection |
| Working Memory | What the agent is actively thinking about | Maintains a priority-limited attention set |
| Long-Term Memory | What should survive across sessions | Stores episodic, semantic, and procedural memory |
That separation matters because not every piece of information deserves the same lifespan.
A tool result from a minute ago may belong in working memory.
A learned user preference may belong in long-term semantic memory.
A successful multi-step workflow may belong in procedural memory.
Flat memory systems blur all of that together. Engram keeps the tiers distinct.
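The tier separation can be sketched as a promotion pipeline: everything lands in the sensory buffer, only important items reach a capacity-limited working memory, and durable outcomes go to long-term storage. The capacity, the 0.5 importance threshold, and the method names are illustrative assumptions, not Engram's actual values.

```rust
// Sketch of the three-tier flow: sensory buffer -> working memory -> long-term.

struct Memory {
    content: String,
    importance: f32, // 0.0..=1.0, assigned by some upstream scoring step
}

#[derive(Default)]
struct Tiers {
    sensory: Vec<Memory>,   // raw turn-level input before selection
    working: Vec<Memory>,   // priority-limited attention set
    long_term: Vec<Memory>, // survives across sessions
}

impl Tiers {
    const WORKING_CAPACITY: usize = 7; // illustrative limit

    fn ingest(&mut self, content: &str, importance: f32) {
        self.sensory.push(Memory { content: content.to_string(), importance });
    }

    /// Promote important sensory items into working memory, evicting the
    /// lowest-priority items when over capacity.
    fn promote(&mut self) {
        for m in self.sensory.drain(..) {
            if m.importance >= 0.5 {
                self.working.push(m);
            }
        }
        self.working.sort_by(|a, b| b.importance.partial_cmp(&a.importance).unwrap());
        self.working.truncate(Self::WORKING_CAPACITY);
    }

    /// Capture a durable outcome straight into long-term memory.
    fn capture(&mut self, content: &str, importance: f32) {
        self.long_term.push(Memory { content: content.to_string(), importance });
    }
}

fn main() {
    let mut tiers = Tiers::default();
    tiers.ingest("user greeted the agent", 0.1);       // never leaves sensory
    tiers.ingest("user prefers short summaries", 0.9); // promoted
    tiers.promote();
    tiers.capture("email_send worked for the quarterly report task", 0.8);
    println!("working: {}, long-term: {}", tiers.working.len(), tiers.long_term.len());
}
```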
Retrieval should be gated, not automatic
Another failure mode in agent memory systems is that they retrieve memory for everything.
But not every query needs memory.
If the user asks for a calculation, a greeting, or something already covered in the active conversation, memory search is wasteful. It adds latency and pollutes the prompt with irrelevant context.
So Engram adds a retrieval gate.
The gate decides whether retrieval is needed at all. That means the system is not just asking:
“What memories match this query?”
It is first asking:
“Should I even search memory right now?”
That distinction matters more than it sounds.
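A gate like that can be sketched as a cheap check that runs before any search. The heuristics below (arithmetic detection, greeting list, active-context check) are illustrative stand-ins for whatever signal Engram actually uses.

```rust
// Sketch of a retrieval gate: decide whether to search memory at all
// before asking what matches.

fn needs_memory(query: &str, already_in_context: &[&str]) -> bool {
    let q = query.to_lowercase();
    // Pure arithmetic or a greeting: no recall needed.
    let trivial = q.chars().all(|c| c.is_ascii_digit() || "+-*/ =.".contains(c))
        || ["hi", "hello", "thanks"].contains(&q.trim());
    if trivial {
        return false;
    }
    // Already covered in the active conversation: skip the search.
    if already_in_context.iter().any(|c| c.to_lowercase().contains(&q)) {
        return false;
    }
    true
}

fn main() {
    assert!(!needs_memory("2 + 2", &[]));
    assert!(!needs_memory("hello", &[]));
    assert!(needs_memory("what did we decide about the deploy pipeline", &[]));
    println!("gate behaves as expected");
}
```

The payoff is that trivial turns never pay retrieval latency and never pollute the prompt with irrelevant recall.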
Search is hybrid, not naive
Engram does not rely on a single retrieval method.
It combines multiple signals:
| Signal | Strength |
|---|---|
| Full-text search | Good for exact terms, identifiers, names, phrases |
| Vector similarity | Good for meaning, paraphrase, conceptual recall |
| Graph traversal | Good for connected ideas, related facts, causal links |
That lets the system answer different kinds of questions more intelligently.
- A factual query may weight lexical matches more heavily.
- A conceptual query may weight semantic similarity more heavily.
- A broader exploratory query may benefit from graph expansion.
This is what makes Engram more than “RAG but local.” It is memory retrieval shaped by query intent.
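One simple way to implement intent-shaped retrieval is weighted score fusion: the same lexical and vector scores get combined with different weights depending on the query kind. The weights and the two-way query classification below are illustrative assumptions, not Engram's actual tuning.

```rust
// Sketch of hybrid retrieval: fuse a lexical score and a vector-similarity
// score with weights chosen per query kind.

#[derive(Clone, Copy)]
enum QueryKind {
    Factual,    // identifiers, names, exact phrases
    Conceptual, // paraphrase, meaning
}

fn fused_score(kind: QueryKind, lexical: f32, semantic: f32) -> f32 {
    let (w_lex, w_sem) = match kind {
        QueryKind::Factual => (0.7, 0.3),    // weight exact matches heavily
        QueryKind::Conceptual => (0.3, 0.7), // weight meaning heavily
    };
    w_lex * lexical + w_sem * semantic
}

fn main() {
    // Same raw scores, different ranking depending on query intent:
    let lexical_hit = (0.9_f32, 0.2_f32);  // strong exact match, weak meaning
    let semantic_hit = (0.2_f32, 0.9_f32); // weak exact match, strong meaning
    let factual_prefers_lexical =
        fused_score(QueryKind::Factual, lexical_hit.0, lexical_hit.1)
            > fused_score(QueryKind::Factual, semantic_hit.0, semantic_hit.1);
    println!("factual query prefers the lexical hit: {}", factual_prefers_lexical);
}
```

Graph expansion would then run as a third stage on top of the fused ranking, pulling in connected memories rather than rescoring text.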
Memory should not just grow forever
A real memory system needs a theory of forgetting.
Without that, every stored fact competes forever for retrieval and context budget. Quality degrades because stale, duplicate, or low-value memories remain in circulation.
That is why Engram treats forgetting as a feature.
What forgetting means here
Forgetting in Engram is not random deletion. It is controlled memory maintenance:
- duplicates can be merged,
- contradictions can be resolved,
- stale low-value memories can fade,
- important memories can persist longer,
- and quality can be measured before and after cleanup.
This is one of the most important differences between a memory architecture and a document pile.
A pile only gets larger.
A memory system should get cleaner.
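A maintenance pass of that kind can be sketched as decay plus merge plus a value floor. The decay rate, the floor, and the duplicate-detection rule are illustrative assumptions; a real consolidation step would also resolve contradictions and measure quality before and after.

```rust
// Sketch of controlled forgetting: each memory carries a value score that
// decays with age, duplicates are merged keeping the strongest copy, and
// anything below a floor fades out entirely.

struct Memory {
    content: String,
    value: f32,     // combined importance / trust signal
    age_days: f32,  // time since last reinforcement
}

/// One maintenance pass: decay, merge duplicates, drop low-value residue.
fn consolidate(mut memories: Vec<Memory>) -> Vec<Memory> {
    // Exponential decay: older, unreinforced memories lose value.
    for m in &mut memories {
        m.value *= (-0.05 * m.age_days).exp();
    }
    // Merge duplicates, keeping the strongest copy of each content string.
    memories.sort_by(|a, b| b.value.partial_cmp(&a.value).unwrap());
    let mut out: Vec<Memory> = Vec::new();
    for m in memories {
        if !out.iter().any(|kept| kept.content == m.content) {
            out.push(m);
        }
    }
    // Stale low-value memories fade entirely.
    out.retain(|m| m.value >= 0.1);
    out
}

fn main() {
    let memories = vec![
        Memory { content: "user prefers short summaries".into(), value: 0.9, age_days: 2.0 },
        Memory { content: "user prefers short summaries".into(), value: 0.4, age_days: 30.0 },
        Memory { content: "one-off typo correction".into(),      value: 0.2, age_days: 60.0 },
    ];
    println!("{} memories survive", consolidate(memories).len());
}
```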
Memory is a graph, not a folder
Long-term memory in Engram is not just a list of rows. It is a graph.
| Edge Type | Meaning |
|---|---|
| RelatedTo | General association |
| CausedBy | Causal relationship |
| Supports | Supporting evidence |
| Contradicts | Conflicting knowledge |
| PartOf | Component or hierarchy relationship |
| FollowedBy | Temporal sequence |
| DerivedFrom | Origin or lineage |
| SimilarTo | Semantic similarity |
That structure matters because recall should not always stop at direct matches.
Sometimes the most useful thing is not the first memory you find — it is the memory connected to the first memory.
That is where graph-based retrieval becomes meaningful. The agent can move from direct hits to adjacent context instead of pretending every useful insight must be textually similar to the exact query.
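The move from direct hits to adjacent context can be sketched as a one-hop walk over typed edges. The edge variants mirror the table above; the triple-list graph representation and the traversal itself are an illustrative sketch, not Engram's storage format.

```rust
// Sketch of graph-based recall: start from direct hits, then walk typed
// edges one hop to pull in adjacent context.

#[derive(Debug, Clone, Copy, PartialEq)]
enum Edge {
    RelatedTo,
    CausedBy,
    Supports,
    Contradicts,
}

struct Graph {
    edges: Vec<(u32, Edge, u32)>, // (from, edge, to) triples over memory ids
}

impl Graph {
    /// Expand direct hits by one hop, optionally restricted to one edge type.
    fn expand(&self, hits: &[u32], filter: Option<Edge>) -> Vec<u32> {
        let mut out: Vec<u32> = hits.to_vec();
        for &(from, edge, to) in &self.edges {
            let type_ok = filter.map_or(true, |f| f == edge);
            if type_ok && hits.contains(&from) && !out.contains(&to) {
                out.push(to);
            }
        }
        out
    }
}

fn main() {
    let graph = Graph {
        edges: vec![
            (1, Edge::CausedBy, 2),  // the failure memory points at its cause
            (1, Edge::RelatedTo, 3),
            (4, Edge::Supports, 5),  // unreachable from the hit set
        ],
    };
    // A direct hit on memory 1 also surfaces its cause and its associate:
    println!("{:?}", graph.expand(&[1], None));
}
```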
Procedural memory is where things get interesting
Most memory systems focus on facts.
Engram also stores procedures — not just what is true, but how to do things.
That means OpenPawz can remember:
- how a deployment was fixed,
- how a file transformation worked,
- how an API issue was resolved,
- how a workflow was built successfully.
This turns memory from passive recall into active reuse.
A fact helps the agent answer.
A procedure helps the agent act.
That is the bridge from memory into expertise.
THE FORGE: specialists should earn expertise
Most AI platforms create “specialists” by stuffing a domain document into the prompt and calling it expertise.
That is not expertise. It is a cheat sheet.
A prompt-based specialist has obvious problems:
| Prompt specialist | Problem |
|---|---|
| Claims expertise | But has never been tested |
| Answers confidently | Even when the knowledge is stale |
| Looks specialized | But has no measurable boundary |
| Can be copied instantly | The whole “specialist” is often just a file |
FORGE is OpenPawz’s answer.
It extends Engram’s procedural memory so that repeatable workflows can move through a lifecycle, from raw memory to validated skill.
That means procedures are not all equal.
Some are just memories.
Some are developing skills.
Some are skills the system can treat as trusted and reusable.
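That lifecycle can be sketched as a staged classification driven by observed outcomes. The stage names, run counts, and success thresholds below are illustrative assumptions about how promotion could be gated, not FORGE's actual certification rules.

```rust
// Sketch of the procedure lifecycle: memory -> developing skill -> trusted
// skill, driven by accumulated evidence rather than declared expertise.

#[derive(Debug, PartialEq)]
enum Stage {
    Memory,          // just a recorded procedure
    DevelopingSkill, // some evidence it works
    TrustedSkill,    // repeatedly validated, reusable
}

struct Procedure {
    runs: u32,
    successes: u32,
}

impl Procedure {
    fn stage(&self) -> Stage {
        let rate = if self.runs == 0 {
            0.0
        } else {
            self.successes as f32 / self.runs as f32
        };
        match (self.runs, rate) {
            (r, s) if r >= 5 && s >= 0.9 => Stage::TrustedSkill,
            (r, s) if r >= 2 && s >= 0.5 => Stage::DevelopingSkill,
            _ => Stage::Memory,
        }
    }
}

fn main() {
    let tried_once = Procedure { runs: 1, successes: 1 };
    let proven = Procedure { runs: 8, successes: 8 };
    println!("{:?} -> {:?}", tried_once.stage(), proven.stage());
}
```

One success is still just a memory; trust has to be earned across repeated runs, which is the point of the section that follows.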
The moat is not the prompt
This is the deeper idea behind FORGE:
You can copy a prompt file.
You cannot copy accumulated, verified training cycles overnight.
That is a very different kind of defensibility.
If a system has gone through repeated tasks, retained successful procedures, linked them into skill relationships, measured confidence, and re-trained when things drift, then its expertise is not just text in a system prompt anymore.
It is embedded into the behavior of the system through memory, validation, and reuse.
That is much harder to fake.
How FORGE fits into Engram
FORGE does not create a separate storage system. It extends the memory system already there.
| Engram capability | FORGE uses it for |
|---|---|
| Procedural memory | Stores the procedures that can be trained and certified |
| Memory edges | Builds skill trees and prerequisite relationships |
| Trust / confidence signals | Distinguishes stronger skills from weaker ones |
| Decay and consolidation | Detects drift, staleness, and retraining candidates |
| Meta-cognition | Helps the agent know what it knows and what it does not |
That is an important design choice.
FORGE is not “yet another layer with duplicated storage.” It is training logic built on top of the same memory substrate.
What this changes in practice
Once you combine these pieces, the agent stops behaving like a thin wrapper around a prompt.
Without this stack
- Tools are either overloaded or under-available
- Memory retrieval is noisy or missing
- Learned workflows disappear between sessions
- Specialists are mostly branding
With this stack
- The agent discovers capabilities at the moment of need
- Useful outcomes persist across sessions
- Memory becomes cleaner instead of just larger
- Procedures can compound into reusable skills
- Specialization can be measured instead of merely declared
That is the real architecture shift.
This is the stack, not just a trick
The Librarian Method is useful on its own. But the bigger story is not just “better tool retrieval.”
It is this:
| Layer | Question it answers |
|---|---|
| The Librarian Method | Which tool should the agent use? |
| Project Engram | What should the agent remember? |
| The Forge | Which remembered procedures count as real expertise? |
That progression matters.
Tool retrieval solves capability access.
Memory solves continuity.
FORGE solves compounding competence.
That is what makes OpenPawz more interesting than a standard tools-plus-prompt system.
A concrete example
Imagine a user says:
“Check the GitHub issue, figure out why the workflow failed, and send me a summary.”
A conventional agent might:
- load too many tools,
- search memory poorly,
- and start from zero every time.
An OpenPawz agent can do something more structured:
- Use the Librarian Method to discover GitHub and messaging tools
- Execute the workflow investigation
- Store the findings in Engram
- Recall related failures later through hybrid search and graph links
- Reuse a previously successful troubleshooting procedure
- Eventually treat that procedure as validated expertise through FORGE
That is not just calling tools.
That is finding, remembering, and learning.
Why this is different from current approaches
A lot of systems optimize one piece in isolation:
| Approach | What it gets right | What it misses |
|---|---|---|
| Tool retrieval only | Better capability routing | No persistent memory, no compounding expertise |
| Basic RAG memory | Better recall than no memory | Flat storage, no forgetting, weak procedural learning |
| Prompt specialists | Fast to ship | No verification, no boundaries, no moat |
| Fine-tuning alone | Compresses behavior into weights | Harder to inspect, slower to update, weaker explicit skill tracking |
| OpenPawz stack | Tool discovery + memory + earned expertise | Treats agents as systems that should improve over time |
That is the deeper thesis:
The future agent is not just one that can call tools.
It is one that can find, remember, and earn.
Implementation
At a high level, these ideas show up across the engine like this:
| Area | Purpose |
|---|---|
| tool_index | Semantic tool retrieval and domain expansion |
| request_tools | Agent-facing meta-tool for hot-loading capabilities |
| chat / agent loop | Carry discovered tools across reasoning rounds |
| engram/* | Persistent memory, recall, consolidation, graph traversal |
| procedural memory + FORGE metadata | Verified skills, certification state, lineage, and re-training hooks |
The conceptual flow looks like this:
```rust
// Agent requests the capabilities it actually needs
let tools = request_tools("workflow troubleshooting + GitHub + message follow-up", state);

// Agent uses the discovered tools
let result = run_with_tools(tools, user_request).await?;

// Engram stores the useful outcome as memory
engram.capture(result).await?;

// Repeated successful procedures can later be evaluated by FORGE
forge.evaluate_procedure_history().await?;
```
Different layers. One system.
The bigger vision
The OpenPawz thesis is not that tools are enough.
It is that useful agents need three properties at the same time:
- Dynamic capability access: the agent should not carry every tool all the time.
- Structured long-term memory: the agent should not forget everything between tasks.
- Compounding skill formation: the agent should not repeat the same learning curve forever.
That is why the Librarian Method, Engram, and FORGE belong together.
Try it
The Librarian Method is part of OpenPawz. The bigger idea is to pair it with memory and skill growth instead of treating tools as the whole system.
Ask an agent to do something capability-heavy:
“Check my GitHub notifications, summarize anything important, and message me if there’s a failing workflow.”
The agent can discover the right tools for the task.
Then, if the same pattern happens again later, Engram can help it start with memory instead of amnesia.
And if that procedure becomes well-tested and repeatable, FORGE is the layer that can eventually treat it as earned expertise rather than one-off luck.
Read the full specs
The technical references live in the repo:
Star the repo if you want to track progress. 🙏



