The tool bloat problem nobody talks about
Every AI agent platform hits the same wall: tool bloat.
An agent that can send emails, manage files, query databases, search the web, post to Slack, and call APIs needs a growing pile of tool definitions in its context window. Connect it to external systems, automation platforms, or MCP servers, and you've consumed a meaningful chunk of context before the agent has even started reasoning about the user's request.
The conventional fixes all have critical flaws:
| Approach | Problem |
|---|---|
| Load all tools | Breaks down as tool count grows — schemas and descriptions crowd out actual reasoning |
| Pre-filter by keyword | Fragile. “Send a message to John” — email? Slack? SMS? WhatsApp? Telegram? |
| Category menus | Pushes routing burden onto the user |
| Static tool sets per agent | Limits what each agent can do — defeats the point of a general platform |
The fundamental issue: the system decides which tools are relevant before the LLM has understood the user's intent. That's solving the wrong problem. Only the LLM knows what the user is actually asking for.
OpenPawz solves this with the Librarian Method — a technique that inverts tool discovery entirely.
Star the repo — it's open source
The invention: let the agent ask the librarian
The metaphor is literal. A library patron (the agent) walks up to a librarian and describes what they need. The librarian finds the right books. The patron never needs to know the filing system.
Three roles make this work:
| Role | Implementation |
|---|---|
| Patron | The LLM reasoning over the user's request |
| Librarian | An embedding-powered retrieval layer that maps intent to tools |
| Library | A searchable tool index built from tool definitions and domains |
The Patron understands intent. The Librarian searches for matching tools by semantic similarity. The Library stores every tool as an embedding vector organized by capability domain.
That means the agent only sees the tools it needs after it understands the task.
How it works — round by round
Round 1: User says "Email John about the quarterly report"
- Agent has: a small core set of tools
- Agent understands intent: needs email capabilities
- Agent calls: request_tools("email sending capabilities")
- The Librarian embeds the request and runs semantic search against the tool index
- Top matches: email_send, email_read
- Domain expansion pulls in closely related tools

Round 2: Tools are hot-loaded into the current turn
- Agent now has: core tools + email tools
- Agent calls: email_send({to: "john@...", subject: "Quarterly Report", ...})
- Done ✅
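The lookup step in Round 1 can be sketched as cosine similarity over a pre-embedded tool index. This is a minimal illustration, not the OpenPawz implementation: the struct fields, the `request_tools` signature, and the toy three-dimensional embeddings are all assumptions (a real system would embed with a model and use far larger vectors).

```rust
// Minimal sketch of the Librarian lookup: rank indexed tools by cosine
// similarity to the embedded query. All names and vectors are illustrative.

struct IndexedTool {
    name: &'static str,
    domain: &'static str, // used later for domain expansion
    embedding: Vec<f32>,  // produced by an embedding model in practice
}

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Return tool names ranked by similarity to the query embedding.
fn request_tools(query_embedding: &[f32], index: &[IndexedTool], top_k: usize) -> Vec<String> {
    let mut scored: Vec<(f32, &IndexedTool)> = index
        .iter()
        .map(|t| (cosine(query_embedding, &t.embedding), t))
        .collect();
    scored.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap()); // highest similarity first
    scored.into_iter().take(top_k).map(|(_, t)| t.name.to_string()).collect()
}

fn main() {
    let index = vec![
        IndexedTool { name: "email_send", domain: "email", embedding: vec![0.9, 0.1, 0.0] },
        IndexedTool { name: "email_read", domain: "email", embedding: vec![0.8, 0.2, 0.0] },
        IndexedTool { name: "slack_post", domain: "chat",  embedding: vec![0.1, 0.9, 0.0] },
    ];
    // "email sending capabilities", embedded by the same model in practice:
    let query = [0.95, 0.05, 0.0];
    println!("{:?}", request_tools(&query, &index, 2));
}
```

The key property is that the query vector comes from the agent's interpreted intent, so an ambiguous user message still produces a sharp search.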
The agent used only the tools it needed instead of dragging every available tool definition into the prompt.
Five design decisions that make it work
1. Agent-driven discovery
The LLM forms the search query — not a brittle pre-filter guessing from the raw user message.
When a user says "Can you check if the deployment went through?", a keyword filter might match deploy, check, or container. The agent understands the real intent is monitoring and calls something closer to request_tools("deployment status monitoring CI/CD").
That is a far better search query because it comes after reasoning, not before it.
2. Domain expansion
When the Librarian finds one strong match, it can also bring along closely related tools from the same domain.
If the agent finds email_send, it probably also needs email_read, contact lookup, or attachment handling. Related capabilities travel together so the agent doesn't need to repeatedly rediscover the same cluster.
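Domain expansion can be sketched as a lookup from matched tools back to their domains, then out to every sibling tool. The data-structure shapes here are illustrative assumptions, not Engram or OpenPawz internals.

```rust
// Sketch of domain expansion: once a strong match is found, bring along the
// other tools tagged with the same domain, deduplicated. Names are illustrative.

use std::collections::HashMap;

fn expand_domains(
    matches: &[&str],
    domains: &HashMap<&str, Vec<&str>>, // domain -> tools in that domain
    tool_domain: &HashMap<&str, &str>,  // tool -> its domain
) -> Vec<String> {
    let mut out: Vec<String> = Vec::new();
    for &tool in matches {
        if let Some(domain) = tool_domain.get(tool) {
            for &related in domains.get(domain).into_iter().flatten() {
                // Related capabilities travel together, without duplicates.
                if !out.iter().any(|t| t == related) {
                    out.push(related.to_string());
                }
            }
        }
    }
    out
}

fn main() {
    let mut domains = HashMap::new();
    domains.insert("email", vec!["email_send", "email_read", "contact_lookup"]);
    let mut tool_domain = HashMap::new();
    tool_domain.insert("email_send", "email");
    // One strong match pulls in the whole email cluster:
    println!("{:?}", expand_domains(&["email_send"], &domains, &tool_domain));
}
```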
3. Round carryover
Tools loaded in one reasoning round remain available in the next round of the same turn.
The agent doesn't lose access to the tools it just discovered, but the set also doesn't accumulate forever across unrelated turns.
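One way to picture that policy is a per-turn tool set that accumulates within a turn and resets between turns. The struct and method names below are illustrative, not OpenPawz's actual types.

```rust
// Sketch of round carryover: tools discovered in one reasoning round stay
// available for later rounds of the same turn, then reset on the next turn.

#[derive(Default)]
struct TurnToolSet {
    core: Vec<String>,       // always available (e.g. request_tools itself)
    discovered: Vec<String>, // hot-loaded during this turn
}

impl TurnToolSet {
    fn begin_turn(&mut self) {
        // Discovered tools do not accumulate across unrelated turns.
        self.discovered.clear();
    }
    fn hot_load(&mut self, tools: &[&str]) {
        for &t in tools {
            if !self.discovered.iter().any(|d| d == t) {
                self.discovered.push(t.to_string());
            }
        }
    }
    fn available(&self) -> Vec<String> {
        self.core.iter().chain(self.discovered.iter()).cloned().collect()
    }
}

fn main() {
    let mut set = TurnToolSet { core: vec!["request_tools".into()], ..Default::default() };
    set.begin_turn();
    set.hot_load(&["email_send", "email_read"]); // round 1 discovery
    println!("round 2 sees: {:?}", set.available());
    set.begin_turn(); // next turn: discovered tools are dropped again
    println!("next turn sees: {:?}", set.available());
}
```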
4. Fallback layers
If semantic search is weak, the system still has multiple ways to recover:
- Exact name match
- Domain match
- Domain list return so the agent can refine its own request
The agent always gets something actionable back.
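The fallback chain can be sketched as a sequence of cheaper matchers behind the semantic search. The similarity threshold, substring-based domain match, and result shapes are illustrative assumptions about how such layering could work.

```rust
// Sketch of fallback layers: strong semantic hits first, then exact tool-name
// match, then domain match, then a domain list so the agent can refine.

use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum LookupResult {
    Tools(Vec<String>),      // concrete tools to hot-load
    DomainList(Vec<String>), // let the agent refine its own request
}

fn lookup(
    query: &str,
    semantic_hits: &[(&str, f32)],         // (tool, similarity) from vector search
    domains: &HashMap<&str, Vec<&str>>,    // domain -> tools
) -> LookupResult {
    // 1. Strong semantic matches win outright.
    let strong: Vec<String> = semantic_hits
        .iter()
        .filter(|h| h.1 >= 0.7) // illustrative threshold
        .map(|h| h.0.to_string())
        .collect();
    if !strong.is_empty() {
        return LookupResult::Tools(strong);
    }
    // 2. Exact tool-name match anywhere in the index.
    for tools in domains.values() {
        if let Some(t) = tools.iter().find(|t| **t == query) {
            return LookupResult::Tools(vec![t.to_string()]);
        }
    }
    // 3. Domain match: the query names a known domain.
    for (domain, tools) in domains {
        if query.contains(domain) {
            return LookupResult::Tools(tools.iter().map(|t| t.to_string()).collect());
        }
    }
    // 4. Last resort: return the domain list so the agent can try again.
    LookupResult::DomainList(domains.keys().map(|d| d.to_string()).collect())
}

fn main() {
    let mut domains = HashMap::new();
    domains.insert("email", vec!["email_send", "email_read"]);
    // Weak semantic hits, but the query names the "email" domain: layer 3 fires.
    println!("{:?}", lookup("email capabilities", &[("slack_post", 0.2)], &domains));
}
```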
5. Memory-aware execution
Tool discovery alone is not enough.
Once an agent finds and uses the right tool, it still needs to remember what happened, what worked, what failed, and what should be reused later. That is where Engram enters the picture.
The Librarian answers: Which tool should I use?
Engram answers: What should I remember from using it?
Tool discovery alone is not enough
Most agent systems stop too early.
They focus on tool routing, but real agent behavior has three distinct problems:
| Problem | What it asks |
|---|---|
| Tool discovery | Which capability should the agent use right now? |
| Memory | What should the agent retain across turns and sessions? |
| Expertise | How does repeated success become something better than a prompt? |
OpenPawz treats these as separate layers:
| Layer | Purpose |
|---|---|
| The Librarian Method | Discover the right tools on demand |
| Project Engram | Give the agent structured, persistent memory |
| The Forge | Turn repeated procedural success into earned expertise |
That stack is the real idea.
Engram: memory that behaves like cognition, not a key-value dump
Most AI memory systems are still basically: store blobs, search blobs, inject blobs.
That works up to a point, but it has obvious failure modes:
| Flat memory model | Problem |
|---|---|
| Store everything the same way | No difference between facts, episodes, and procedures |
| Always retrieve | Wastes latency and pollutes context |
| Never forget | Outdated information lingers forever |
| No structure | Repeated experiences never become organized knowledge |
| No budget awareness | Memory recall competes blindly with the context window |
Project Engram is OpenPawz’s memory architecture for persistent agents.
Instead of treating memory like a bag of documents, Engram models memory as a living system with multiple layers:
- Sensory input for what just happened
- Working memory for what the agent is actively thinking about
- Long-term memory for what should persist across sessions
- Graph relationships so memories are connected, not isolated
- Consolidation and decay so the memory store improves over time instead of just growing forever
The memory model: from raw input to durable knowledge
At a high level, Engram works like this:
A user message enters a sensory buffer. Important items get promoted into working memory. Useful outcomes are captured into long-term memory. Later, when the agent needs context again, Engram decides what to retrieve and what to ignore.
That sounds simple, but the key is that memory is not just being stored — it is being ranked, consolidated, and filtered.
Three tiers, three jobs
Engram uses a three-tier memory architecture:
| Tier | Role | Purpose |
|---|---|---|
| Sensory Buffer | What just happened | Holds raw turn-level input before selection |
| Working Memory | What the agent is actively thinking about | Maintains a priority-limited attention set |
| Long-Term Memory | What should survive across sessions | Stores episodic, semantic, and procedural memory |
That separation matters because not every piece of information deserves the same lifespan.
A tool result from a minute ago may belong in working memory.
A learned user preference may belong in long-term semantic memory.
A successful multi-step workflow may belong in procedural memory.
Flat memory systems blur all of that together. Engram keeps the tiers distinct.
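The tier separation can be sketched as a promotion pipeline: everything lands in the sensory buffer, only important items reach a capacity-limited working memory, and durable outcomes go to long-term storage. The capacity, the 0.5 importance threshold, and the method names are illustrative assumptions, not Engram's actual values.

```rust
// Sketch of the three-tier flow: sensory buffer -> working memory -> long-term.

struct Memory {
    content: String,
    importance: f32, // 0.0..=1.0, assigned by some upstream scoring step
}

#[derive(Default)]
struct Tiers {
    sensory: Vec<Memory>,   // raw turn-level input before selection
    working: Vec<Memory>,   // priority-limited attention set
    long_term: Vec<Memory>, // survives across sessions
}

impl Tiers {
    const WORKING_CAPACITY: usize = 7; // illustrative limit

    fn ingest(&mut self, content: &str, importance: f32) {
        self.sensory.push(Memory { content: content.to_string(), importance });
    }

    /// Promote important sensory items into working memory, evicting the
    /// lowest-priority items when over capacity.
    fn promote(&mut self) {
        for m in self.sensory.drain(..) {
            if m.importance >= 0.5 {
                self.working.push(m);
            }
        }
        self.working.sort_by(|a, b| b.importance.partial_cmp(&a.importance).unwrap());
        self.working.truncate(Self::WORKING_CAPACITY);
    }

    /// Capture a durable outcome straight into long-term memory.
    fn capture(&mut self, content: &str, importance: f32) {
        self.long_term.push(Memory { content: content.to_string(), importance });
    }
}

fn main() {
    let mut tiers = Tiers::default();
    tiers.ingest("user greeted the agent", 0.1);       // never leaves sensory
    tiers.ingest("user prefers short summaries", 0.9); // promoted
    tiers.promote();
    tiers.capture("email_send worked for the quarterly report task", 0.8);
    println!("working: {}, long-term: {}", tiers.working.len(), tiers.long_term.len());
}
```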
Retrieval should be gated, not automatic
Another failure mode in agent memory systems is that they retrieve memory for everything.
But not every query needs memory.
If the user asks for a calculation, a greeting, or something already covered in the active conversation, memory search is wasteful. It adds latency and pollutes the prompt with irrelevant context.
So Engram adds a retrieval gate.
The gate decides whether retrieval is needed at all. That means the system is not just asking:
“What memories match this query?”
It is first asking:
“Should I even search memory right now?”
That distinction matters more than it sounds.
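A gate like that can be sketched as a cheap check that runs before any search. The heuristics below (arithmetic detection, greeting list, active-context check) are illustrative stand-ins for whatever signal Engram actually uses.

```rust
// Sketch of a retrieval gate: decide whether to search memory at all
// before asking what matches.

fn needs_memory(query: &str, already_in_context: &[&str]) -> bool {
    let q = query.to_lowercase();
    // Pure arithmetic or a greeting: no recall needed.
    let trivial = q.chars().all(|c| c.is_ascii_digit() || "+-*/ =.".contains(c))
        || ["hi", "hello", "thanks"].contains(&q.trim());
    if trivial {
        return false;
    }
    // Already covered in the active conversation: skip the search.
    if already_in_context.iter().any(|c| c.to_lowercase().contains(&q)) {
        return false;
    }
    true
}

fn main() {
    assert!(!needs_memory("2 + 2", &[]));
    assert!(!needs_memory("hello", &[]));
    assert!(needs_memory("what did we decide about the deploy pipeline", &[]));
    println!("gate behaves as expected");
}
```

The payoff is that trivial turns never pay retrieval latency and never pollute the prompt with irrelevant recall.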
Search is hybrid, not naive
Engram does not rely on a single retrieval method.
It combines multiple signals:
| Signal | Strength |
|---|---|
| Full-text search | Good for exact terms, identifiers, names, phrases |
| Vector similarity | Good for meaning, paraphrase, conceptual recall |
| Graph traversal | Good for connected ideas, related facts, causal links |
That lets the system answer different kinds of questions more intelligently.
- A factual query may weight lexical matches more heavily.
- A conceptual query may weight semantic similarity more heavily.
- A broader exploratory query may benefit from graph expansion.
This is what makes Engram more than “RAG but local.” It is memory retrieval shaped by query intent.
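One simple way to implement intent-shaped retrieval is weighted score fusion: the same lexical and vector scores get combined with different weights depending on the query kind. The weights and the two-way query classification below are illustrative assumptions, not Engram's actual tuning.

```rust
// Sketch of hybrid retrieval: fuse a lexical score and a vector-similarity
// score with weights chosen per query kind.

#[derive(Clone, Copy)]
enum QueryKind {
    Factual,    // identifiers, names, exact phrases
    Conceptual, // paraphrase, meaning
}

fn fused_score(kind: QueryKind, lexical: f32, semantic: f32) -> f32 {
    let (w_lex, w_sem) = match kind {
        QueryKind::Factual => (0.7, 0.3),    // weight exact matches heavily
        QueryKind::Conceptual => (0.3, 0.7), // weight meaning heavily
    };
    w_lex * lexical + w_sem * semantic
}

fn main() {
    // Same raw scores, different ranking depending on query intent:
    let lexical_hit = (0.9_f32, 0.2_f32);  // strong exact match, weak meaning
    let semantic_hit = (0.2_f32, 0.9_f32); // weak exact match, strong meaning
    let factual_prefers_lexical =
        fused_score(QueryKind::Factual, lexical_hit.0, lexical_hit.1)
            > fused_score(QueryKind::Factual, semantic_hit.0, semantic_hit.1);
    println!("factual query prefers the lexical hit: {}", factual_prefers_lexical);
}
```

Graph expansion would then run as a third stage on top of the fused ranking, pulling in connected memories rather than rescoring text.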
Memory should not just grow forever
A real memory system needs a theory of forgetting.
Without that, every stored fact competes forever for retrieval and context budget. Quality degrades because stale, duplicate, or low-value memories remain in circulation.
That is why Engram treats forgetting as a feature.
What forgetting means here
Forgetting in Engram is not random deletion. It is controlled memory maintenance:
- duplicates can be merged,
- contradictions can be resolved,
- stale low-value memories can fade,
- important memories can persist longer,
- and quality can be measured before and after cleanup.
This is one of the most important differences between a memory architecture and a document pile.
A pile only gets larger.
A memory system should get cleaner.
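A maintenance pass of that kind can be sketched as decay plus merge plus a value floor. The decay rate, the floor, and the duplicate-detection rule are illustrative assumptions; a real consolidation step would also resolve contradictions and measure quality before and after.

```rust
// Sketch of controlled forgetting: each memory carries a value score that
// decays with age, duplicates are merged keeping the strongest copy, and
// anything below a floor fades out entirely.

struct Memory {
    content: String,
    value: f32,     // combined importance / trust signal
    age_days: f32,  // time since last reinforcement
}

/// One maintenance pass: decay, merge duplicates, drop low-value residue.
fn consolidate(mut memories: Vec<Memory>) -> Vec<Memory> {
    // Exponential decay: older, unreinforced memories lose value.
    for m in &mut memories {
        m.value *= (-0.05 * m.age_days).exp();
    }
    // Merge duplicates, keeping the strongest copy of each content string.
    memories.sort_by(|a, b| b.value.partial_cmp(&a.value).unwrap());
    let mut out: Vec<Memory> = Vec::new();
    for m in memories {
        if !out.iter().any(|kept| kept.content == m.content) {
            out.push(m);
        }
    }
    // Stale low-value memories fade entirely.
    out.retain(|m| m.value >= 0.1);
    out
}

fn main() {
    let memories = vec![
        Memory { content: "user prefers short summaries".into(), value: 0.9, age_days: 2.0 },
        Memory { content: "user prefers short summaries".into(), value: 0.4, age_days: 30.0 },
        Memory { content: "one-off typo correction".into(),      value: 0.2, age_days: 60.0 },
    ];
    println!("{} memories survive", consolidate(memories).len());
}
```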
Memory is a graph, not a folder
Long-term memory in Engram is not just a list of rows. It is a graph.
| Edge Type | Meaning |
|---|---|
| RelatedTo | General association |
| CausedBy | Causal relationship |
| Supports | Supporting evidence |
| Contradicts | Conflicting knowledge |
| PartOf | Component or hierarchy relationship |
| FollowedBy | Temporal sequence |
| DerivedFrom | Origin or lineage |
| SimilarTo | Semantic similarity |
That structure matters because recall should not always stop at direct matches.
Sometimes the most useful thing is not the first memory you find — it is the memory connected to the first memory.
That is where graph-based retrieval becomes meaningful. The agent can move from direct hits to adjacent context instead of pretending every useful insight must be textually similar to the exact query.
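The move from direct hits to adjacent context can be sketched as a one-hop walk over typed edges. The edge variants mirror the table above; the triple-list graph representation and the traversal itself are an illustrative sketch, not Engram's storage format.

```rust
// Sketch of graph-based recall: start from direct hits, then walk typed
// edges one hop to pull in adjacent context.

#[derive(Debug, Clone, Copy, PartialEq)]
enum Edge {
    RelatedTo,
    CausedBy,
    Supports,
    Contradicts,
}

struct Graph {
    edges: Vec<(u32, Edge, u32)>, // (from, edge, to) triples over memory ids
}

impl Graph {
    /// Expand direct hits by one hop, optionally restricted to one edge type.
    fn expand(&self, hits: &[u32], filter: Option<Edge>) -> Vec<u32> {
        let mut out: Vec<u32> = hits.to_vec();
        for &(from, edge, to) in &self.edges {
            let type_ok = filter.map_or(true, |f| f == edge);
            if type_ok && hits.contains(&from) && !out.contains(&to) {
                out.push(to);
            }
        }
        out
    }
}

fn main() {
    let graph = Graph {
        edges: vec![
            (1, Edge::CausedBy, 2),  // the failure memory points at its cause
            (1, Edge::RelatedTo, 3),
            (4, Edge::Supports, 5),  // unreachable from the hit set
        ],
    };
    // A direct hit on memory 1 also surfaces its cause and its associate:
    println!("{:?}", graph.expand(&[1], None));
}
```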
Procedural memory is where things get interesting
Most memory systems focus on facts.
Engram also stores procedures — not just what is true, but how to do things.
That means OpenPawz can remember:
- how a deployment was fixed,
- how a file transformation worked,
- how an API issue was resolved,
- how a workflow was built successfully.
This turns memory from passive recall into active reuse.
A fact helps the agent answer.
A procedure helps the agent act.
That is the bridge from memory into expertise.
THE FORGE: specialists should earn expertise
Most AI platforms create “specialists” by stuffing a domain document into the prompt and calling it expertise.
That is not expertise. It is a cheat sheet.
A prompt-based specialist has obvious problems:
| Prompt specialist | Problem |
|---|---|
| Claims expertise | But has never been tested |
| Answers confidently | Even when the knowledge is stale |
| Looks specialized | But has no measurable boundary |
| Can be copied instantly | The whole “specialist” is often just a file |
FORGE is OpenPawz’s answer.
It extends Engram’s procedural memory so that repeatable workflows can move through a lifecycle, from raw memory to validated skill.
That means procedures are not all equal.
Some are just memories.
Some are developing skills.
Some are skills the system can treat as trusted and reusable.
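That lifecycle can be sketched as a staged classification driven by observed outcomes. The stage names, run counts, and success thresholds below are illustrative assumptions about how promotion could be gated, not FORGE's actual certification rules.

```rust
// Sketch of the procedure lifecycle: memory -> developing skill -> trusted
// skill, driven by accumulated evidence rather than declared expertise.

#[derive(Debug, PartialEq)]
enum Stage {
    Memory,          // just a recorded procedure
    DevelopingSkill, // some evidence it works
    TrustedSkill,    // repeatedly validated, reusable
}

struct Procedure {
    runs: u32,
    successes: u32,
}

impl Procedure {
    fn stage(&self) -> Stage {
        let rate = if self.runs == 0 {
            0.0
        } else {
            self.successes as f32 / self.runs as f32
        };
        match (self.runs, rate) {
            (r, s) if r >= 5 && s >= 0.9 => Stage::TrustedSkill,
            (r, s) if r >= 2 && s >= 0.5 => Stage::DevelopingSkill,
            _ => Stage::Memory,
        }
    }
}

fn main() {
    let tried_once = Procedure { runs: 1, successes: 1 };
    let proven = Procedure { runs: 8, successes: 8 };
    println!("{:?} -> {:?}", tried_once.stage(), proven.stage());
}
```

One success is still just a memory; trust has to be earned across repeated runs, which is the point of the section that follows.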
The moat is not the prompt
This is the deeper idea behind FORGE:
You can copy a prompt file.
You cannot copy accumulated, verified training cycles overnight.
That is a very different kind of defensibility.
If a system has gone through repeated tasks, retained successful procedures, linked them into skill relationships, measured confidence, and re-trained when things drift, then its expertise is not just text in a system prompt anymore.
It is embedded into the behavior of the system through memory, validation, and reuse.
That is much harder to fake.
How FORGE fits into Engram
FORGE does not create a separate storage system. It extends the memory system already there.
| Engram capability | FORGE uses it for |
|---|---|
| Procedural memory | Stores the procedures that can be trained and certified |
| Memory edges | Builds skill trees and prerequisite relationships |
| Trust / confidence signals | Distinguishes stronger skills from weaker ones |
| Decay and consolidation | Detects drift, staleness, and retraining candidates |
| Meta-cognition | Helps the agent know what it knows and what it does not |
That is an important design choice.
FORGE is not “yet another layer with duplicated storage.” It is training logic built on top of the same memory substrate.
What this changes in practice
Once you combine these pieces, the agent stops behaving like a thin wrapper around a prompt.
Without this stack
- Tools are either overloaded or under-available
- Memory retrieval is noisy or missing
- Learned workflows disappear between sessions
- Specialists are mostly branding
With this stack
- The agent discovers capabilities at the moment of need
- Useful outcomes persist across sessions
- Memory becomes cleaner instead of just larger
- Procedures can compound into reusable skills
- Specialization can be measured instead of merely declared
That is the real architecture shift.
This is the stack, not just a trick
The Librarian Method is useful on its own. But the bigger story is not just “better tool retrieval.”
It is this:
| Layer | Question it answers |
|---|---|
| The Librarian Method | Which tool should the agent use? |
| Project Engram | What should the agent remember? |
| The Forge | Which remembered procedures count as real expertise? |
That progression matters.
Tool retrieval solves capability access.
Memory solves continuity.
FORGE solves compounding competence.
That is what makes OpenPawz more interesting than a standard tools-plus-prompt system.
A concrete example
Imagine a user says:
“Check the GitHub issue, figure out why the workflow failed, and send me a summary.”
A conventional agent might:
- load too many tools,
- search memory poorly,
- and start from zero every time.
An OpenPawz agent can do something more structured:
- Use the Librarian Method to discover GitHub and messaging tools
- Execute the workflow investigation
- Store the findings in Engram
- Recall related failures later through hybrid search and graph links
- Reuse a previously successful troubleshooting procedure
- Eventually treat that procedure as validated expertise through FORGE
That is not just calling tools.
That is finding, remembering, and learning.
Why this is different from current approaches
A lot of systems optimize one piece in isolation:
| Approach | What it gets right | What it misses |
|---|---|---|
| Tool retrieval only | Better capability routing | No persistent memory, no compounding expertise |
| Basic RAG memory | Better recall than no memory | Flat storage, no forgetting, weak procedural learning |
| Prompt specialists | Fast to ship | No verification, no boundaries, no moat |
| Fine-tuning alone | Compresses behavior into weights | Harder to inspect, slower to update, weaker explicit skill tracking |
| OpenPawz stack | Tool discovery + memory + earned expertise | Treats agents as systems that should improve over time |
That is the deeper thesis:
The future agent is not just one that can call tools.
It is one that can find, remember, and earn.
Implementation
At a high level, these ideas show up across the engine like this:
| Area | Purpose |
|---|---|
| tool_index | Semantic tool retrieval and domain expansion |
| request_tools | Agent-facing meta-tool for hot-loading capabilities |
| chat / agent loop | Carry discovered tools across reasoning rounds |
| engram/* | Persistent memory, recall, consolidation, graph traversal |
| procedural memory + FORGE metadata | Verified skills, certification state, lineage, and re-training hooks |
The conceptual flow looks like this:
```rust
// Agent requests the capabilities it actually needs
let tools = request_tools("workflow troubleshooting + GitHub + message follow-up", state);

// Agent uses the discovered tools
let result = run_with_tools(tools, user_request).await?;

// Engram stores the useful outcome as memory
engram.capture(result).await?;

// Repeated successful procedures can later be evaluated by FORGE
forge.evaluate_procedure_history().await?;
```
Different layers. One system.
The bigger vision
The OpenPawz thesis is not that tools are enough.
It is that useful agents need three properties at the same time:
- Dynamic capability access: the agent should not carry every tool all the time.
- Structured long-term memory: the agent should not forget everything between tasks.
- Compounding skill formation: the agent should not repeat the same learning curve forever.
That is why the Librarian Method, Engram, and FORGE belong together.
Try it
The Librarian Method is part of OpenPawz. The bigger idea is to pair it with memory and skill growth instead of treating tools as the whole system.
Ask an agent to do something capability-heavy:
“Check my GitHub notifications, summarize anything important, and message me if there’s a failing workflow.”
The agent can discover the right tools for the task.
Then, if the same pattern happens again later, Engram can help it start with memory instead of amnesia.
And if that procedure becomes well-tested and repeatable, FORGE is the layer that can eventually treat it as earned expertise rather than one-off luck.
Read the full specs
The technical references live in the repo:
Star the repo if you want to track progress. 🙏



