DEV Community

Gotham64
The Librarian Method: How OpenPawz solves tool bloat — and why memory matters

The tool bloat problem nobody talks about

Every AI agent platform hits the same wall: tool bloat.

An agent that can send emails, manage files, query databases, search the web, post to Slack, and call APIs needs a growing pile of tool definitions in its context window. Connect it to external systems, automation platforms, or MCP servers, and you've consumed a meaningful chunk of context before the agent has even started reasoning about the user's request.

The conventional fixes all have critical flaws:

| Approach | Problem |
| --- | --- |
| Load all tools | Breaks down as tool count grows — schemas and descriptions crowd out actual reasoning |
| Pre-filter by keyword | Fragile. “Send a message to John” — email? Slack? SMS? WhatsApp? Telegram? |
| Category menus | Pushes routing burden onto the user |
| Static tool sets per agent | Limits what each agent can do — defeats the point of a general platform |

The fundamental issue: the system decides which tools are relevant before the LLM has understood the user's intent. That's solving the wrong problem. Only the LLM knows what the user is actually asking for.

OpenPawz solves this with the Librarian Method — a technique that inverts tool discovery entirely.

OpenPawz

Star the repo — it's open source


The invention: let the agent ask the librarian

The metaphor is literal. A library patron (the agent) walks up to a librarian and describes what they need. The librarian finds the right books. The patron never needs to know the filing system.

Three roles make this work:

| Role | Implementation |
| --- | --- |
| Patron | The LLM reasoning over the user's request |
| Librarian | An embedding-powered retrieval layer that maps intent to tools |
| Library | A searchable tool index built from tool definitions and domains |

The Patron understands intent. The Librarian searches for matching tools by semantic similarity. The Library stores every tool as an embedding vector organized by capability domain.

That means the agent only sees the tools it needs after it understands the task.
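The core loop can be sketched in a few lines of Python. This is an illustrative toy, not the OpenPawz implementation: real systems use learned embeddings, while a bag-of-words vector and cosine similarity stand in here, and the tool names and descriptions are invented for the example.

```python
import math

def embed(text):
    """Toy bag-of-words 'embedding' (a real system would use a learned model)."""
    vec = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# The "Library": every tool indexed by a description of its capability.
TOOL_INDEX = {
    "email_send": "send an outgoing email message",
    "email_read": "read incoming email messages from the inbox",
    "slack_post": "post a chat message to a slack channel",
    "file_write": "write data to a file on disk",
}

def request_tools(query, top_k=2):
    """The 'Librarian': map an intent query to the best-matching tools."""
    q = embed(query)
    ranked = sorted(TOOL_INDEX,
                    key=lambda t: cosine(q, embed(TOOL_INDEX[t])),
                    reverse=True)
    return ranked[:top_k]
```

The key property is that the query comes from the agent's understanding of intent ("email sending capabilities"), not from keywords in the raw user message.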


How it works — round by round

Round 1: User says "Email John about the quarterly report"

  Agent has: a small core set of tools
  Agent understands intent: needs email capabilities
  Agent calls: request_tools("email sending capabilities")

  Librarian embeds the request
  Semantic search runs against the tool index
  Top matches: email_send, email_read
  Domain expansion pulls in closely related tools

Round 2: Tools are hot-loaded into the current turn

  Agent now has: core tools + email tools
  Agent calls: email_send({to: "john@...", subject: "Quarterly Report", ...})
  Done ✅

The agent used only the tools it needed instead of dragging every available tool definition into the prompt.


Five design decisions that make it work

1. Agent-driven discovery

The LLM forms the search query — not a brittle pre-filter guessing from the raw user message.

When a user says "Can you check if the deployment went through?", a keyword filter might match deploy, check, or container. The agent understands the real intent is monitoring and calls something closer to request_tools("deployment status monitoring CI/CD").

That is a far better search query because it comes after reasoning, not before it.

2. Domain expansion

When the Librarian finds one strong match, it can also bring along closely related tools from the same domain.

If the agent finds email_send, it probably also needs email_read, contact lookup, or attachment handling. Related capabilities travel together so the agent doesn't need to repeatedly rediscover the same cluster.
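A minimal sketch of that expansion step, assuming a hypothetical tool-to-domain map (the tool names here are invented for illustration):

```python
# Hypothetical domain map: each tool belongs to one capability domain.
TOOL_DOMAINS = {
    "email_send": "email",
    "email_read": "email",
    "contact_lookup": "email",
    "slack_post": "chat",
}

def expand_by_domain(matches):
    """Given direct matches, also pull in every tool sharing a domain with one."""
    domains = {TOOL_DOMAINS[t] for t in matches if t in TOOL_DOMAINS}
    expanded = set(matches)
    for tool, domain in TOOL_DOMAINS.items():
        if domain in domains:
            expanded.add(tool)
    return sorted(expanded)
```

A single strong hit on email_send now brings the whole email cluster along, while unrelated domains stay out of the context window.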

3. Round carryover

Tools loaded in one reasoning round remain available in the next round of the same turn.

The agent doesn't lose access to the tools it just discovered, but the set also doesn't accumulate forever across unrelated turns.
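That policy is easy to state in code. A sketch, under the assumption that the available set is rebuilt fresh at the start of each turn and only accumulates within it:

```python
def run_turn(core_tools, rounds):
    """Carry discovered tools across rounds of one turn; reset per turn.

    `rounds` is a list of the tool names discovered in each reasoning round.
    Returns the available tool set after each round.
    """
    available = set(core_tools)          # fresh set at the start of every turn
    history = []
    for discovered in rounds:
        available |= set(discovered)     # carryover: keep what was just found
        history.append(sorted(available))
    return history
```

Within a turn the set only grows; across turns it starts over, so unrelated tasks never inherit each other's tool clutter.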

4. Fallback layers

If semantic search is weak, the system still has multiple ways to recover:

  1. Exact name match
  2. Domain match
  3. Domain list return so the agent can refine its own request

The agent always gets something actionable back.
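The fallback chain itself is simple. A sketch with the semantic-search stage omitted and hypothetical index contents:

```python
# Hypothetical index contents for illustration.
TOOL_INDEX = {"email_send": "send an outgoing email message"}
DOMAINS = {"email": {"email_send", "email_read"}, "files": {"file_write"}}

def resolve(query, index=TOOL_INDEX, domains=DOMAINS):
    """Fallback chain, tried when semantic search scores too low."""
    if query in index:                    # 1. exact tool-name match
        return ("tools", [query])
    if query in domains:                  # 2. domain-name match
        return ("tools", sorted(domains[query]))
    return ("domains", sorted(domains))  # 3. return the domain list to refine
```

Even the worst case returns the list of known domains, which the agent can use to reformulate its request instead of failing silently.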

5. Memory-aware execution

Tool discovery alone is not enough.

Once an agent finds and uses the right tool, it still needs to remember what happened, what worked, what failed, and what should be reused later. That is where Engram enters the picture.

The Librarian answers: Which tool should I use?
Engram answers: What should I remember from using it?


Tool discovery alone is not enough

Most agent systems stop too early.

They focus on tool routing, but real agent behavior has three distinct problems:

| Problem | What it asks |
| --- | --- |
| Tool discovery | Which capability should the agent use right now? |
| Memory | What should the agent retain across turns and sessions? |
| Expertise | How does repeated success become something better than a prompt? |

OpenPawz treats these as separate layers:

| Layer | Purpose |
| --- | --- |
| The Librarian Method | Discover the right tools on demand |
| Project Engram | Give the agent structured, persistent memory |
| The Forge | Turn repeated procedural success into earned expertise |

That stack is the real idea.


Engram: memory that behaves like cognition, not a key-value dump

Most AI memory systems are still basically: store blobs, search blobs, inject blobs.

That works up to a point, but it has obvious failure modes:

| Flat memory model | Problem |
| --- | --- |
| Store everything the same way | No difference between facts, episodes, and procedures |
| Always retrieve | Wastes latency and pollutes context |
| Never forget | Outdated information lingers forever |
| No structure | Repeated experiences never become organized knowledge |
| No budget awareness | Memory recall competes blindly with the context window |

Project Engram is OpenPawz’s memory architecture for persistent agents.

Instead of treating memory like a bag of documents, Engram models memory as a living system with multiple layers:

  • Sensory input for what just happened
  • Working memory for what the agent is actively thinking about
  • Long-term memory for what should persist across sessions
  • Graph relationships so memories are connected, not isolated
  • Consolidation and decay so the memory store improves over time instead of just growing forever

The memory model: from raw input to durable knowledge

At a high level, Engram works like this:

(diagram: Engram's memory flow)

A user message enters a sensory buffer. Important items get promoted into working memory. Useful outcomes are captured into long-term memory. Later, when the agent needs context again, Engram decides what to retrieve and what to ignore.

That sounds simple, but the key is that memory is not just being stored — it is being ranked, consolidated, and filtered.


Three tiers, three jobs

Engram uses a three-tier memory architecture:

| Tier | Role | Purpose |
| --- | --- | --- |
| Sensory Buffer | What just happened | Holds raw turn-level input before selection |
| Working Memory | What the agent is actively thinking about | Maintains a priority-limited attention set |
| Long-Term Memory | What should survive across sessions | Stores episodic, semantic, and procedural memory |

That separation matters because not every piece of information deserves the same lifespan.

A tool result from a minute ago may belong in working memory.
A learned user preference may belong in long-term semantic memory.
A successful multi-step workflow may belong in procedural memory.

Flat memory systems blur all of that together. Engram does not.
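The middle tier is the most mechanical of the three: a priority-limited attention set. A sketch of that one piece, assuming a fixed capacity and a numeric priority per item (both invented for the example):

```python
import heapq

class WorkingMemory:
    """Priority-limited attention set: only the top-N items survive."""

    def __init__(self, capacity=3):
        self.capacity = capacity
        self.items = []  # min-heap of (priority, item)

    def promote(self, item, priority):
        """Promote an item from the sensory buffer; evict the weakest if full."""
        heapq.heappush(self.items, (priority, item))
        if len(self.items) > self.capacity:
            heapq.heappop(self.items)  # drop the lowest-priority item

    def contents(self):
        return {item for _, item in self.items}
```

Low-priority noise (a greeting, a transient status line) gets pushed out as soon as something more important arrives, instead of competing for the context window forever.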


Retrieval should be gated, not automatic

Another failure mode in agent memory systems is that they retrieve memory for everything.

But not every query needs memory.

If the user asks for a calculation, a greeting, or something already covered in the active conversation, memory search is wasteful. It adds latency and pollutes the prompt with irrelevant context.

So Engram adds a retrieval gate.

(diagram: the retrieval gate)

The gate decides whether retrieval is needed at all. That means the system is not just asking:

“What memories match this query?”

It is first asking:

“Should I even search memory right now?”

That distinction matters more than it sounds.
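A gate like that can start out as plain heuristics before any model is involved. A sketch, with the specific rules (greeting list, arithmetic check, already-in-context check) invented for illustration:

```python
GREETINGS = {"hi", "hello", "hey", "thanks"}

def should_retrieve(query, active_context):
    """Decide whether long-term memory search is worth running at all.

    `active_context` is the set of terms already covered in the conversation.
    """
    words = set(query.lower().strip("?!. ").split())
    if words and words <= GREETINGS:
        return False          # pure greeting: no memory needed
    if any(ch.isdigit() for ch in query) and any(op in query for op in "+-*/"):
        return False          # looks like arithmetic: compute, don't recall
    if words and words <= active_context:
        return False          # already covered in the active conversation
    return True
```

Every `False` here is latency saved and prompt space not polluted; the searches that do run are the ones that can actually change the answer.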


Search is hybrid, not naive

Engram does not rely on a single retrieval method.

It combines multiple signals:

| Signal | Strength |
| --- | --- |
| Full-text search | Good for exact terms, identifiers, names, phrases |
| Vector similarity | Good for meaning, paraphrase, conceptual recall |
| Graph traversal | Good for connected ideas, related facts, causal links |

That lets the system answer different kinds of questions more intelligently.

  • A factual query may weight lexical matches more heavily.
  • A conceptual query may weight semantic similarity more heavily.
  • A broader exploratory query may benefit from graph expansion.

This is what makes Engram more than “RAG but local.” It is memory retrieval shaped by query intent.
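The weighting idea reduces to a blend parameter set by query intent. A sketch where precomputed scores stand in for real vector similarity, and the alpha values are illustrative, not Engram's actual tuning:

```python
def lexical_score(query, doc):
    """Fraction of query terms found verbatim in the document."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def hybrid_rank(query, docs, vec_scores, alpha):
    """Blend lexical and vector evidence; alpha is chosen per query intent."""
    best, best_score = None, -1.0
    for doc in docs:
        score = alpha * lexical_score(query, doc) + (1 - alpha) * vec_scores[doc]
        if score > best_score:
            best, best_score = doc, score
    return best

memories = [
    "the deploy failed because the token expired",
    "our release process was unreliable last quarter",
]
# Stand-in vector-similarity scores (a real system computes these from embeddings).
VEC = {memories[0]: 0.3, memories[1]: 0.9}
```

A factual query (high alpha) lands on the memory with exact term overlap; a conceptual paraphrase (low alpha) lands on the semantically closest one, even with zero shared words.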


Memory should not just grow forever

A real memory system needs a theory of forgetting.

Without that, every stored fact competes forever for retrieval and context budget. Quality degrades because stale, duplicate, or low-value memories remain in circulation.

That is why Engram treats forgetting as a feature.

What forgetting means here

Forgetting in Engram is not random deletion. It is controlled memory maintenance:

  • duplicates can be merged,
  • contradictions can be resolved,
  • stale low-value memories can fade,
  • important memories can persist longer,
  • and quality can be measured before and after cleanup.

This is one of the most important differences between a memory architecture and a document pile.

A pile only gets larger.
A memory system should get cleaner.
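One common shape for controlled forgetting is exponential decay slowed by importance. This is a sketch of that general technique, not Engram's actual retention formula; the half-life, blend weights, and threshold are all invented for the example:

```python
def retention(importance, age_days, half_life_days=30.0):
    """Exponential decay, slowed for important memories (importance in [0, 1])."""
    decay = 0.5 ** (age_days / (half_life_days * (1.0 + importance)))
    return 0.5 * importance + 0.5 * decay  # blend stable value with recency

def sweep(memories, threshold=0.3):
    """Maintenance pass: stale low-value memories fade out, the rest persist."""
    return [m for m in memories
            if retention(m["importance"], m["age_days"]) >= threshold]

store = [
    {"text": "user prefers dark mode", "importance": 0.9, "age_days": 60},
    {"text": "tool call took 120ms",  "importance": 0.1, "age_days": 60},
    {"text": "tool call took 95ms",   "importance": 0.1, "age_days": 1},
]
```

After a sweep, an old high-importance preference survives while an equally old throwaway metric fades, which is exactly the "cleaner, not larger" property described above.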


Memory is a graph, not a folder

Long-term memory in Engram is not just a list of rows. It is a graph.

| Edge Type | Meaning |
| --- | --- |
| RelatedTo | General association |
| CausedBy | Causal relationship |
| Supports | Supporting evidence |
| Contradicts | Conflicting knowledge |
| PartOf | Component or hierarchy relationship |
| FollowedBy | Temporal sequence |
| DerivedFrom | Origin or lineage |
| SimilarTo | Semantic similarity |

That structure matters because recall should not always stop at direct matches.

Sometimes the most useful thing is not the first memory you find — it is the memory connected to the first memory.

That is where graph-based retrieval becomes meaningful. The agent can move from direct hits to adjacent context instead of pretending every useful insight must be textually similar to the exact query.
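A minimal sketch of that hop outward from direct hits, using a hypothetical edge list in the shape of the table above (the memory texts are invented):

```python
# Hypothetical memory graph: (source, edge_type, target) triples.
EDGES = [
    ("deploy failed at 2pm", "CausedBy", "expired API token"),
    ("expired API token", "RelatedTo", "token rotation policy"),
    ("lunch order", "RelatedTo", "office party"),
]

def expand(hits, hops=1):
    """Follow graph edges outward from direct hits to adjacent memories."""
    frontier, seen = set(hits), set(hits)
    for _ in range(hops):
        nxt = set()
        for src, _edge_type, dst in EDGES:
            if src in frontier and dst not in seen:
                nxt.add(dst)
            if dst in frontier and src not in seen:
                nxt.add(src)
        seen |= nxt
        frontier = nxt
    return seen
```

A query that only matches the deploy failure still surfaces the expired token one hop away, while unconnected memories stay out of the result set.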


Procedural memory is where things get interesting

Most memory systems focus on facts.

Engram also stores procedures — not just what is true, but how to do things.

That means OpenPawz can remember:

  • how a deployment was fixed,
  • how a file transformation worked,
  • how an API issue was resolved,
  • how a workflow was built successfully.

This turns memory from passive recall into active reuse.

A fact helps the agent answer.
A procedure helps the agent act.

That is the bridge from memory into expertise.
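The structural difference is visible in the records themselves. A sketch with hypothetical field names, contrasting a fact with a replayable procedure:

```python
# A semantic memory: something that is true.
fact = {
    "kind": "semantic",
    "text": "staging deploys run from the release branch",
}

# A procedural memory: something the agent knows how to do.
procedure = {
    "kind": "procedural",
    "goal": "recover from a failed deploy",
    "steps": [
        "check CI logs for the failing step",
        "verify credentials have not expired",
        "re-run the deploy job",
    ],
    "success_count": 4,
}

def replay(proc):
    """Active reuse: a procedure yields an action plan, not just recalled text."""
    return list(proc["steps"])
```

Recalling the fact changes what the agent says; replaying the procedure changes what the agent does.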


THE FORGE: specialists should earn expertise

Most AI platforms create “specialists” by stuffing a domain document into the prompt and calling it expertise.

That is not expertise. It is a cheat sheet.

A prompt-based specialist has obvious problems:

| Prompt specialist | Problem |
| --- | --- |
| Claims expertise | But has never been tested |
| Answers confidently | Even when the knowledge is stale |
| Looks specialized | But has no measurable boundary |
| Can be copied instantly | The whole “specialist” is often just a file |

FORGE is OpenPawz’s answer.

It extends Engram’s procedural memory so that repeatable workflows can move through a lifecycle:

(diagram: workflows moving through lifecycle stages)

That means procedures are not all equal.

Some are just memories.
Some are developing skills.
Some are skills the system can treat as trusted and reusable.
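One way to picture that lifecycle is as a small state machine over a procedure's track record. This is a sketch under invented stage names and thresholds, not FORGE's actual criteria:

```python
# Hypothetical lifecycle stages for a stored procedure.
STAGES = ["memory", "developing", "trusted"]

def next_stage(stage, successes, success_rate):
    """Promote on repeated verified success; demote when performance drifts."""
    if stage == "memory" and successes >= 3 and success_rate >= 0.6:
        return "developing"
    if stage == "developing" and successes >= 10 and success_rate >= 0.9:
        return "trusted"
    if stage == "trusted" and success_rate < 0.7:
        return "developing"  # drift detected: send back for re-training
    return stage
```

The important property is that "trusted" is earned through accumulated evidence and can be revoked, which is exactly what a pasted prompt file cannot model.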


The moat is not the prompt

This is the deeper idea behind FORGE:

You can copy a prompt file.
You cannot copy accumulated, verified training cycles overnight.

That is a very different kind of defensibility.

If a system has gone through repeated tasks, retained successful procedures, linked them into skill relationships, measured confidence, and re-trained when things drift, then its expertise is not just text in a system prompt anymore.

It is embedded into the behavior of the system through memory, validation, and reuse.

That is much harder to fake.


How FORGE fits into Engram

FORGE does not create a separate storage system. It extends the memory system already there.

| Engram capability | FORGE uses it for |
| --- | --- |
| Procedural memory | Stores the procedures that can be trained and certified |
| Memory edges | Builds skill trees and prerequisite relationships |
| Trust / confidence signals | Distinguishes stronger skills from weaker ones |
| Decay and consolidation | Detects drift, staleness, and retraining candidates |
| Meta-cognition | Helps the agent know what it knows and what it does not |

That is an important design choice.

FORGE is not “yet another layer with duplicated storage.” It is training logic built on top of the same memory substrate.


What this changes in practice

Once you combine these pieces, the agent stops behaving like a thin wrapper around a prompt.

Without this stack

  • Tools are either overloaded or under-available
  • Memory retrieval is noisy or missing
  • Learned workflows disappear between sessions
  • Specialists are mostly branding

With this stack

  • The agent discovers capabilities at the moment of need
  • Useful outcomes persist across sessions
  • Memory becomes cleaner instead of just larger
  • Procedures can compound into reusable skills
  • Specialization can be measured instead of merely declared

That is the real architecture shift.


This is the stack, not just a trick

The Librarian Method is useful on its own. But the bigger story is not just “better tool retrieval.”

It is this:

| Layer | Question it answers |
| --- | --- |
| The Librarian Method | Which tool should the agent use? |
| Project Engram | What should the agent remember? |
| The Forge | Which remembered procedures count as real expertise? |

That progression matters.

Tool retrieval solves capability access.
Memory solves continuity.
FORGE solves compounding competence.

That is what makes OpenPawz more interesting than a standard tools-plus-prompt system.


A concrete example

Imagine a user says:

“Check the GitHub issue, figure out why the workflow failed, and send me a summary.”

A conventional agent might:

  • load too many tools,
  • search memory poorly,
  • and start from zero every time.

An OpenPawz agent can do something more structured:

  1. Use the Librarian Method to discover GitHub and messaging tools
  2. Execute the workflow investigation
  3. Store the findings in Engram
  4. Recall related failures later through hybrid search and graph links
  5. Reuse a previously successful troubleshooting procedure
  6. Eventually treat that procedure as validated expertise through FORGE

That is not just calling tools.
That is finding, remembering, and learning.


Why this is different from current approaches

A lot of systems optimize one piece in isolation:

| Approach | What it gets right | What it misses |
| --- | --- | --- |
| Tool retrieval only | Better capability routing | No persistent memory, no compounding expertise |
| Basic RAG memory | Better recall than no memory | Flat storage, no forgetting, weak procedural learning |
| Prompt specialists | Fast to ship | No verification, no boundaries, no moat |
| Fine-tuning alone | Compresses behavior into weights | Harder to inspect, slower to update, weaker explicit skill tracking |
| OpenPawz stack | Tool discovery + memory + earned expertise | Treats agents as systems that should improve over time |

That is the deeper thesis:

The future agent is not just one that can call tools.
It is one that can find, remember, and earn.


Implementation

At a high level, these ideas show up across the engine like this:

| Area | Purpose |
| --- | --- |
| tool_index | Semantic tool retrieval and domain expansion |
| request_tools | Agent-facing meta-tool for hot-loading capabilities |
| chat / agent loop | Carry discovered tools across reasoning rounds |
| engram/* | Persistent memory, recall, consolidation, graph traversal |
| procedural memory + FORGE metadata | Verified skills, certification state, lineage, and re-training hooks |

The conceptual flow looks like this:

  1. Agent requests the capabilities it actually needs
    let tools = request_tools("workflow troubleshooting + GitHub + message follow-up", state);

  2. Agent uses the discovered tools
    let result = run_with_tools(tools, user_request).await?;

  3. Engram stores the useful outcome as memory
    engram.capture(result).await?;

  4. Repeated successful procedures can later be evaluated by FORGE
    forge.evaluate_procedure_history().await?;

Different layers. One system.


The bigger vision

The OpenPawz thesis is not that tools are enough.

It is that useful agents need three properties at the same time:

  1. Dynamic capability access
    The agent should not carry every tool all the time.

  2. Structured long-term memory
    The agent should not forget everything between tasks.

  3. Compounding skill formation
    The agent should not repeat the same learning curve forever.

That is why the Librarian Method, Engram, and FORGE belong together.


Try it

The Librarian Method is part of OpenPawz. The bigger idea is to pair it with memory and skill growth instead of treating tools as the whole system.

Ask an agent to do something capability-heavy:

“Check my GitHub notifications, summarize anything important, and message me if there’s a failing workflow.”

The agent can discover the right tools for the task.

Then, if the same pattern happens again later, Engram can help it start with memory instead of amnesia.

And if that procedure becomes well-tested and repeatable, FORGE is the layer that can eventually treat it as earned expertise rather than one-off luck.


Read the full specs

The technical references live in the repo.

Star the repo if you want to track progress. 🙏

OpenPawz — Your AI, Your Rules

A native desktop AI platform that runs fully offline, connects to any provider, and puts you in control. Private by default. Powerful by design.

openpawz.ai
