DEV Community: Shilpa Mitra

3 Open-Source Repos That Each Kill a Different AI Bill(token, infra, creative)

Shilpa Mitra — Thu, 25 Jun 2026 15:40:00 +0000

Your AI spend is not one number. It is three: the tokens you feed the model, the infrastructure to run agents, and the paid tools you bolt on around them. Most cost-cutting advice optimizes one and ignores the other two.

This is a short, honest roundup of three popular open-source repos that each attack a different bill. Free, self-hosted, and verified before I wrote a word. I will give you the real numbers and the real catch on each, because "free" is only a deal if the thing actually works.

TL;DR

Repo	Cuts your...	License	Stars
codebase-memory-mcp	token bill	MIT	~13.8k
flue	agent-infra bill	Apache-2.0	~6.6k
OpenMontage	creative-tool bill	AGPL-3.0	~18k

Cut your token bill: codebase-memory-mcp

codebase-memory-mcp is the one with receipts. It indexes your repo into a persistent knowledge graph (functions, classes, call chains, HTTP routes) across 158 languages, ships as a single static binary with zero dependencies, and exposes it to your agent over MCP.

The point: your coding agent stops re-reading the same files into context on every question and queries the map instead. Re-feeding your codebase is the single biggest source of wasted token spend in agentic coding, and a knowledge graph is the clean fix.

It is not just a marketing claim. The project's preprint (arXiv:2603.27277) reports, across 31 real repositories, roughly 10x fewer tokens and about 2x fewer tool calls versus file-by-file exploration, while keeping answer quality high. It plugs into 11 coding agents and persists to a local SQLite cache.

# Index once, then your MCP-compatible agent queries the graph
codebase-memory-mcp index .

The honest catch: it is a structural backend, not an LLM. The savings come from feeding your agent less, not from it being smarter. Index your actual codebase and measure the token drop on your own tasks rather than taking the headline number on faith. Here is the verified setup with the savings measured.

Cut your agent-infrastructure bill: flue

flue (from the Astro team) is a TypeScript framework for building headless agents that deploy anywhere: Node, Cloudflare, CI.

The money lever is its default sandbox. Instead of spinning up a full container for every agent, flue defaults to a lightweight virtual sandbox, which its docs pitch as far cheaper and more scalable than a container per agent. You can still opt into a local or remote container when a job genuinely needs one. At any real volume, that is the difference between paying for one box and paying for a fleet.

// A whole agent: a prompt and a typed result, no container required
import { createAgent } from '@flue/runtime';

const translator = createAgent(() => ({ model: 'anthropic/claude-sonnet-4-6' }));
const harness = await init(translator);
const session = await harness.session();
const { data } = await session.prompt('Translate "hello" to French');

The honest catch: it is explicitly experimental and the API may still change, so pin your version and expect some churn before you build something load-bearing on it. Here is the verified deploy setup.

Cut your creative-tool bill: OpenMontage

OpenMontage turns a coding assistant into a full video production system: 12 pipelines, 52 tools, and 500+ agent skills spanning scripting, asset generation, editing, and final composition with FFmpeg and Remotion. The pitch is replacing a stack of paid AI-video and editing subscriptions with one open pipeline you run yourself.

Two honest notes:

There is a genuinely free path. You can run it end to end with zero paid APIs using free local text-to-speech (Piper) and public-domain footage (Archive.org, NASA, Wikimedia). Wire in paid AI models only when you want generated assets, so paid generation is an upgrade, not a requirement.
The bigger watch-item is the license. AGPL-3.0 is copyleft: fine for personal and internal use, but it carries real obligations if you build a commercial product on top, so read it first.

Here is the verified free-pipeline setup.

How to choose

Pick by the bill that hurts most:

Tokens are your problem? Start with codebase-memory-mcp. It is the most direct win and the only one here with published numbers behind it.
Running agents at scale? flue's default sandbox is the infrastructure win.
Paying for a pile of creative subscriptions? OpenMontage replaces the pipeline, and there is a fully free way to run it.

All three are free to try, so measure the saving on your own usage rather than trusting the README. That habit, verify before you trust, matters more than any single tool.

Where this leaves you

These three cut the tokens, the infrastructure, and the tools. The fourth bill is the model itself. If that is the part that hurts, I just wrote up the two cheapest ways to get top-tier results without the premium price (one open model you can own, one API that runs a team of models for you): Two new ways to get top-tier AI without paying top-tier prices.

If you have a cost-cutting open-source repo I missed, drop it in the comments. I verify and add the good ones.

How to Set Up Persistent Memory for Codex Using Obsidian (3 Approaches)

Shilpa Mitra — Mon, 25 May 2026 18:04:46 +0000

Codex has no long-term memory. Every session starts clean. You explain your project structure, your naming conventions, your testing preferences, the thing you decided last Tuesday about the API design. Then you close the terminal and do it all over again tomorrow.

The fix is giving Codex a memory layer that persists between sessions. And the best place to store that memory is Obsidian, because it's just markdown files on disk. No proprietary database. No sync service you don't control. Every note is a plain text file you can read, edit, search, and version control yourself.

I tested three different approaches to wiring Codex memory into Obsidian. Each one solves the problem differently, and the right pick depends on how much setup you want to deal with and how deep you want the integration to go.

Here's every approach, what it actually does, and how to set it up from scratch.

First: Understanding How Codex Memory Actually Works

Before wiring anything to Obsidian, you need to understand the two memory layers Codex already has built in.

Layer 1: AGENTS.md (Static Instructions)

This is a markdown file you place at the root of your repo. Codex reads it at the start of every session before doing any work. Think of it as a briefing document. You put your project conventions, testing commands, directory layout, and anything the agent needs to know every single time.

AGENTS.md is checked into version control. It's shared across the team. It's the right place for rules that should always apply.

Quick example of what goes in here:

# AGENTS.md

## Project
- Next.js 14 app with TypeScript
- Tailwind for styling, no CSS modules
- All API routes live in /app/api/

## Testing
- Run `pnpm test` before committing
- All new functions need at least one unit test
- Test files go next to the source file, named *.test.ts

## Conventions
- Use kebab-case for file names
- Commit messages follow Conventional Commits: feat(scope): description
- Never modify files in /config/production/

Codex loads this automatically. No config needed. Just create the file.

You can also run /init inside a Codex session, and it will scaffold an AGENTS.md based on your project's detected tech stack, directory structure, and config files. Good starting point if you don't want to write it from scratch.

One thing to watch: Codex concatenates AGENTS.md files from the repo root down to your current directory, and stops at 32 KiB combined size. If your instructions are being ignored, you might be hitting the size limit. You can verify by asking Codex: "Summarize the instructions you have loaded for this session."

Layer 2: Native Memories (Auto-Generated)

This is the newer system. When enabled, Codex automatically summarizes your sessions in the background and writes those summaries to ~/.codex/memories/. The next time you start a session, it reads those summaries back in. You don't paste anything. You don't reference anything. The context just shows up.

The memory pipeline works in two phases. Phase 1 runs after a session has been idle long enough (six hours by default, configurable via min_rollout_idle_hours from 1 to 48). It extracts key context from the conversation, redacts any secrets it finds, and stores a structured summary. Phase 2 periodically consolidates all those individual summaries into a unified memory file that gets injected into future sessions.

The storage layout under ~/.codex/memories/:

~/.codex/memories/
├── memory_summary.md      # High-level summary injected into every session
├── MEMORY.md              # Searchable registry of aggregated insights
├── raw_memories.md        # Temporary merge used during consolidation
├── rollout_summaries/     # Per-thread recaps with lessons learned
└── skills/                # Reusable procedures the agent discovered

To enable it:

# ~/.codex/config.toml
[features]
memories = true

Or as a one-time CLI override: codex -c features.memories=true

Or in the Codex app: Settings > Memories > Enable.

Once it's on, you can fine-tune the behavior:

[memories]
generate_memories = true    # Let new threads create memory entries
use_memories = true         # Inject existing memories into new sessions

You can also run these independently. Want Codex to read old memories but not generate new ones? Set generate_memories = false and use_memories = true. Useful for debugging or when you want to freeze the memory state.

Inside a running session, type /memories to control whether that specific thread can use or generate memories. This doesn't touch your global settings.

Important caveat: Native memories are off by default and currently unavailable in the EEA, UK, or Switzerland. Also, memories are per-user. If your team shares a Codex environment, individual memories don't pool across teammates. Team-wide context belongs in AGENTS.md.

Where Obsidian Comes In

The two layers above work, but they have limits. AGENTS.md is static and manual. Native memories are auto-generated but opaque and not easily searchable. Obsidian gives you a visual, organized, searchable knowledge base that your agent can read from and write to. And because Obsidian is just a folder of markdown files, it plays nicely with every tool in the chain.

Approach 1: Basic Memory + MCP (Easiest Setup, Cross-Tool Compatible)

This is the fastest path to persistent Codex memory that syncs with Obsidian.

Basic Memory is an MCP server that gives any AI tool (Codex, Claude Code, Cursor, Claude Desktop) persistent context through plain markdown files. You store notes in a folder. Basic Memory indexes them. Codex queries them through MCP. And because the storage format is just markdown, you point Obsidian at the same folder and everything shows up in your vault with full graph view, backlinks, and search.

What this looks like in practice

You're three weeks into building an API. You've made decisions about auth strategy, database schema, rate limiting approach, error handling patterns. All of that context lives in Basic Memory notes.

You open a new Codex session and say: "What decisions have we made about the API design? Check my notes."

Codex uses semantic search through MCP, finds the relevant notes across your project, and answers grounded in your actual project history. No re-explaining. No pasting old conversations.

You switch to Claude Code for a different task on the same project. Same notes. Same context. Zero re-setup.

Setup (Local)

codex mcp add basic-memory bash -c "uvx basic-memory mcp"

That's the entire install. One command. The uvx approach handles dependency resolution automatically and runs Basic Memory as a child process.

To scope it to a specific project:

codex mcp add basic-memory bash -c "uvx basic-memory mcp --project your-project-name"

Verify it's connected:

codex mcp list

You should see basic-memory listed.

Setup (Cloud, for remote access)

If you want cloud-hosted memory:

1. Create an API key at app.basicmemory.com under Settings > API Keys

2. Add it to your shell profile:

echo 'export BASIC_MEMORY_API_KEY=your-key-here' >> ~/.zshrc
source ~/.zshrc

3. Add to your Codex config:

# ~/.codex/config.toml
[mcp_servers.basic-memory]
url = "https://cloud.basicmemory.com/mcp"
bearer_token_env_var = "BASIC_MEMORY_API_KEY"

Connecting Obsidian

Open Obsidian. Create a new vault. Point it at your Basic Memory directory (~/basic-memory by default, or your project folder). That's it. The same markdown files your AI writes show up in Obsidian with graph view, backlinks, and rich editing. No import or export step.

Notes you create in Obsidian are immediately available to Codex. Notes Codex creates show up in Obsidian. Same files, two interfaces.

When to use this approach

You want the fastest setup, you use multiple AI tools (not just Codex), and you want your memory notes to be plain markdown you can browse and edit in Obsidian.

Approach 2: Structured Obsidian Vault with AGENTS.md + Codex Hooks (Deepest Integration)

This is the power-user option. Instead of a third-party memory layer, you build a structured Obsidian vault that Codex reads directly through AGENTS.md and lifecycle hooks.

The idea is simple: your Obsidian vault becomes your project's knowledge base. AGENTS.md tells Codex how the vault is organized, what the naming conventions are, and where to find things. Codex hooks automatically inject context from the vault at session start so you never have to re-explain what's going on.

Where Basic Memory gives you a shared note store through MCP, this approach gives you full control with zero external dependencies. Everything stays in your vault, everything is plain markdown, and Codex reads it natively.

What this looks like in practice

You open the terminal in your vault directory and run Codex. The SessionStart hook fires automatically, reads your vault's index file, and injects a summary of active projects, recent decisions, and open tasks into the session. Codex knows what's going on before you type a single word.

You say: "What did we decide about the caching strategy last week?" Codex reads the decision records in your vault and pulls the answer from your own notes.

During the day, every note you create gets filed with YAML frontmatter, tagged, and linked. Decision records, project notes, architecture docs. Codex follows the structure defined in AGENTS.md and files things consistently.

The vault structure

projects/          # One folder per active project
decisions/         # Architecture and design decision records
memory/            # Persistent context Codex reads across sessions
memory/goals.md    # Current priorities and focus areas
memory/index.md    # Map of everything in the vault
templates/         # Note templates with YAML frontmatter
reference/         # Codebase knowledge, API docs, architecture maps

Step 1: Create AGENTS.md at the vault root

This is Codex's operating manual for your vault. Here's a practical example:

# AGENTS.md

## Vault Structure
- /projects/ contains one folder per active project
- /decisions/ contains architecture decision records (ADR format)
- /memory/goals.md has current priorities. Read this first every session.
- /memory/index.md is the vault map. Scan it to know what exists.

## Note Conventions
- All notes use YAML frontmatter with: title, date, status, tags
- Status values: active, completed, archived, deprecated
- File names use kebab-case: my-decision-about-caching.md
- Link related notes using [[wikilinks]]

## When Creating Notes
- Decision records go in /decisions/ with ADR format
- Project notes go in /projects/{project-name}/
- Always update /memory/index.md when creating new notes

## When Starting a Session
- Read /memory/goals.md for current priorities
- Check /memory/index.md for vault overview
- Look at recent git commits to see what changed since last session

Step 2: Set up the SessionStart hook

Codex hooks let you run scripts at specific lifecycle events. The SessionStart event fires when a session begins and can inject context automatically.

Hooks are experimental and currently disabled on Windows. You need to enable the feature flag first:

# ~/.codex/config.toml
[features]
codex_hooks = true

Then create .codex/hooks.json in your vault:

{
  "hooks": {
    "SessionStart": [
      {
        "matcher": "startup|resume",
        "hooks": [
          {
            "type": "command",
            "command": "cat memory/goals.md memory/index.md",
            "timeout": 10
          }
        ]
      }
    ]
  }
}

The structure is: event name as a key, then an array of matcher groups, each containing a matcher regex and a hooks array. For SessionStart, the matcher filters on how the session started (startup or resume). The timeout is in seconds (default is 600 if omitted). Any plain text the command writes to stdout gets injected as developer context into the session.

This reads your goals and vault index, then injects them as context at the start of every Codex session. Codex sees this before you type a single word.

You can make the hook smarter. A script that pulls recent git changes, scans for notes modified in the last 48 hours, and builds a compact briefing:

#!/bin/bash
# .codex/session-start.sh
echo "## Current Goals"
cat memory/goals.md
echo ""
echo "## Recently Modified Notes"
find . -name "*.md" -mtime -2 -not -path "./.codex/*" | head -20
echo ""
echo "## Recent Changes"
git log --oneline -10 2>/dev/null || echo "No git history"

Then update the hook to point to the script:

{
  "hooks": {
    "SessionStart": [
      {
        "matcher": "startup|resume",
        "hooks": [
          {
            "type": "command",
            "command": "bash .codex/session-start.sh",
            "timeout": 10,
            "statusMessage": "Loading vault context"
          }
        ]
      }
    ]
  }
}

Step 3: Open the vault in Obsidian

Open the same folder as an Obsidian vault. You get graph view across all your notes, backlinks between decisions and projects, full-text search, and a visual interface for browsing everything Codex writes. Same files, two interfaces.

Step 4: Run Codex from the vault directory

cd ~/your-vault && codex

Codex loads AGENTS.md, the SessionStart hook fires and injects your goals and index, and you're working with full context from the first prompt.

What sets this apart from the other approaches

No external tools. No MCP servers. No API keys. Everything is AGENTS.md (instructions), hooks (automation), and markdown files (knowledge). The vault is fully portable. You can version control it with git, sync it however you want, and switch to a different agent later without rebuilding anything. A well-documented vault in markdown is not locked to any single AI tool.

When to use this approach

You want Obsidian as the center of your workflow with zero external dependencies. You want to control exactly what Codex sees and how the vault is organized. You're comfortable writing an AGENTS.md and a simple hook script.

Approach 3: Native Memories + Manual Obsidian Sync (Minimal Setup, Good Enough for Most)

If you don't want to install anything extra, you can use Codex's built-in memory system and just point Obsidian at the memory folder.

This is the simplest approach. Enable native memories, let Codex auto-generate summaries, and open ~/.codex/memories/ as an Obsidian vault (or add it as a folder inside an existing vault). You get visual browsing and search over everything Codex remembers.

The tradeoff: this is read-only from Obsidian's perspective. You can look at the memory files, but hand-editing them isn't the supported path. Codex treats ~/.codex/memories/ as generated state that it manages itself. If you want to give Codex persistent instructions, put them in AGENTS.md instead.

Setup

1. Enable memories:

# ~/.codex/config.toml
[features]
memories = true

2. Open ~/.codex/memories/ as an Obsidian vault, or symlink it into an existing vault:

ln -s ~/.codex/memories/ ~/your-vault/codex-memories

3. Work normally. After sessions go idle (six hours by default), Codex processes them in the background. Summaries appear in the folder. Obsidian picks them up automatically.

What this looks like in practice

You've been using Codex on a project for two weeks. You open the codex-memories folder in Obsidian and see rollout summaries for each session, a consolidated memory summary, and any skills the agent discovered. You can search across all of them, see patterns in your workflow, and spot context that Codex is carrying forward.

When you start a new Codex session, the agent reads memory_summary.md (capped at 5,000 tokens to preserve context window) and has access to the rest of the memories folder if it needs deeper context.

When to use this approach

You just want Codex to remember things between sessions and you want a way to browse what it remembers. You don't need cross-tool memory sharing or a structured vault system.

Which Approach Should You Pick?

Start with Approach 3 if you just want Codex to stop forgetting things. It takes 30 seconds to enable native memories and one more minute to point Obsidian at the folder.

Move to Approach 1 when you start using multiple AI tools or want your notes to be the source of truth (not Codex's auto-generated summaries). Basic Memory gives you clean two-way sync between your vault and every MCP-compatible tool.

Go with Approach 2 when you want full control with zero external dependencies. AGENTS.md plus a SessionStart hook gives you a self-contained vault that Codex reads natively. No MCP servers, no API keys, just markdown and a hook script.

You can also combine them. I run native memories for auto-capture plus Basic Memory for structured project knowledge that I want searchable across tools. The native layer catches things I forget to document. Basic Memory holds the deliberate notes I want to persist long-term.

Quick Gotchas

A few things that tripped me up during setup.

AGENTS.md vs Memories: Don't rely on memories for rules that must always apply. Memories are a personal recall layer. Put team-wide conventions and project rules in AGENTS.md where they're version controlled and shared.

Memory timing: Codex doesn't generate memories immediately when you close a session. It waits until the thread has been idle long enough (six hours by default). Don't panic when the memories folder doesn't update right away.

Size limits: The memory summary injected into each session is capped at 5,000 tokens. If you're building a massive knowledge base, not everything will make it into every session. The agent can still read deeper into the memories folder when it needs to, but the automatic injection has a ceiling.

Secret redaction: Codex redacts secrets from generated memories, but review your memory files before sharing your ~/.codex directory or committing any memory artifacts. The redaction is good but not perfect.

EEA/UK/Switzerland: Native memories aren't available in these regions yet. Use Approach 1 or 2 instead.

If you found the breakdown of Hermes and how it scales from one agent to a full team useful, the architecture behind that post pairs well with this one: I Put an Autonomous AI Agent on My Laptop. It Saved Me 7 Hours in Week One.

We cover practical AI agent setups like this twice a week at Web After AI. No hype, just what works and how to set it up.

How Claude Code Achieves a 92% Cache Hit Rate: A Deep Dive Into Prompt Caching for AI Agents

Shilpa Mitra — Sun, 24 May 2026 17:08:31 +0000

If you're running AI agents in production, there's a cost you're probably not thinking about.

Every turn in an agentic conversation sends the full prompt to the model. That includes the system instructions, all the tool definitions, any project context that was loaded earlier, and the entire conversation history. The model processes all of it. From the top. Every single time.

For a quick two-turn interaction, this doesn't matter much. But for a 50-turn coding session where the system prompt alone is 20,000 tokens? That's 1 million tokens of repeated computation across the session, all billed at full input price, all producing zero new insight. The model already processed that system prompt 49 turns ago. It's just doing it again because nothing told it not to.

This is the problem prompt caching solves. And Claude Code is probably the best case study of how to do it right.

The Two Parts of Every Prompt

The first thing to understand is that not all tokens in a prompt are created equal.

Look at any agentic API call and you'll see two distinct layers:

The foundation. This is everything that stays the same from turn to turn. System instructions, tool schemas, project-level context like a CLAUDE.md file, behavioral rules. If you looked at turn 1 and turn 47 side by side, this part would be identical.

The conversation. This is everything that's different each turn. The user's latest message, tool call results, file contents that were just read, terminal output. This grows with every interaction and is genuinely new information the model needs to process.

The entire trick behind prompt caching is recognizing that the foundation doesn't need to be reprocessed. You compute it once, store the result, and reuse it on every subsequent turn. The model only does fresh work on the conversation layer.

What's Actually Being Cached: The Transformer Angle

This isn't just skipping a string comparison. To understand why caching cuts costs so dramatically, you need to know what the model does when it reads a prompt.

LLM inference has two stages:

Prefill: the model takes your entire input and runs it through dense matrix multiplications, token by token, building an internal representation. This is computationally expensive and it's where most of the time and cost goes.
Decode: the model generates its response one token at a time, mostly just reading from the state it already built during prefill.

During prefill, the model computes three vectors for every token: Query, Key, and Value. These are the building blocks of the attention mechanism, how the model figures out which parts of the input matter for which other parts.

The important property: Key and Value vectors for any given token only depend on the tokens before it. They're deterministic. If the input is the same, the output is the same.

So once you've computed the Key-Value pairs for a 20,000-token system prompt, you can store them. Next time a request comes in with that same prefix, you skip the entire prefill computation for those 20,000 tokens and go straight to processing the new content.

Anthropic's infrastructure does this by hashing the input prefix. Same hash, same cached tensors, no recomputation. Different hash (even one byte different), full recomputation.

The Economics

Here's where this gets concrete. Anthropic's caching pricing has three tiers:

Operation	Multiplier	What it means
Cache reads	0.1x base input price	90% discount on every cached token
5-minute cache writes	1.25x base input price	Small premium to store the KV tensors
1-hour cache writes	2x base input price	Extended TTL for longer sessions

For Claude Sonnet 4.6 ($3/MTok base input), here's what that looks like in practice:

Standard input:     $3.00 / MTok
Cache read:         $0.30 / MTok   (90% savings)
5-min cache write:  $3.75 / MTok   (25% premium, one-time)
1-hour cache write: $6.00 / MTok   (2x premium, one-time)

A cache hit costs 10% of standard input. That means caching pays for itself after just one subsequent read for the 5-minute duration. For a 50-turn session reusing a 20,000-token prefix, the savings compound on every single turn.

Tracking a Real Claude Code Session

Theory is nice. Let's trace the actual token economics of a single debugging session to see where the money goes.

You open Claude Code in a Next.js project. The moment the session starts, it loads the system prompt, all available tool definitions (file read, file write, bash, grep, glob, and others), and your project's CLAUDE.md. That initial payload lands somewhere around 20,000 tokens. Every single one of those tokens is processed fresh. This is the only time you pay full price for them.

You type:

"There's a race condition in the checkout flow. Orders are occasionally duplicating when users double-click the submit button."

Claude Code doesn't just start editing files. First, it spins up an Explore subagent to understand the codebase. That subagent reads your API routes, checks your database schema, looks at your order processing logic, and examines the frontend form handler. All of those file reads and grep results get appended to the growing conversation as tool outputs.

Here's the key: none of that new content touches the 20,000-token prefix. The system prompt, the tool definitions, the CLAUDE.md, all of that is still sitting in cache from turn one. Every subsequent API call reads those 20,000 tokens at $0.30/MTok instead of $3.00/MTok. You're only paying full price for the new stuff: your message and the tool outputs.

The Explore subagent finishes and hands its findings back to the main agent. But it doesn't dump 15,000 tokens of raw file contents into the conversation. It passes a condensed summary: which files are relevant, what the current logic does, where the race condition likely lives. This is a deliberate design choice. Keeping the dynamic tail compact means the cache ratio stays high.

Now the Plan subagent kicks in. It takes the summary, reasons through the fix (idempotency key on the frontend, deduplication check on the API, database unique constraint as a safety net), and produces a step-by-step implementation plan. You approve it. Claude Code starts writing code.

Over the next 15 minutes, you go back and forth. It writes the idempotency logic, you ask it to also handle the case where the page refreshes mid-checkout, it adjusts. Each of these turns adds new content to the dynamic tail. But the foundation, those 20,000 tokens, is read from cache every single time. Each cache hit also resets the TTL, so the cache never expires as long as you keep working.

By the end of the session, you've gone through maybe 25 turns. The total tokens processed easily exceeds 1.5 million. But if you run /cost, the bill tells a very different story than 1.5M tokens at full price. The vast majority were cache reads at a 90% discount.

That's the difference between a $4.50 session and a $0.90 session. For one debugging task.

The Production Numbers

This isn't theoretical. Claude Code's production metrics:

Metric	Value
Cache hit rate	92%
Cost reduction	81%
First-token latency reduction	79%

In active sessions, 95%+ of input tokens are typically cache hits, billed at 0.1x the base price. Out of 400K tokens in a session, maybe 20K to 40K are billed at full price.

Without prompt caching, a long Opus coding session (100 turns with compaction cycles) can cost $50 to $100 in input tokens. With it, $10 to $19.

The One Thing That Will Tank Your Cache Hit Rate

Prompt caching has a gotcha that trips up almost everyone the first time.

The cache key is a hash of the exact byte sequence of your prompt prefix. Not the meaning. Not the content. The exact bytes, in the exact order. If you rearrange two paragraphs in your system prompt, the hash changes. Full cache miss. Everything recomputed at full price.

This has three practical consequences:

1. Don't change your tool set mid-session

Tool definitions are part of the cached prefix. If you add a tool on turn 12 that wasn't there on turn 1, every token after the change point is a cache miss. Load everything you might need at the start.

2. Don't switch models mid-conversation

Each model has its own cache. Moving from Opus to Sonnet to save money on a later turn means rebuilding the cache from zero for the new model. You'll spend more on the rebuild than you saved on the cheaper rate.

3. Don't edit the system prompt to update state

If your agent needs to track something (like "user is now authenticated"), don't inject that into the system prompt. Append it as a note in the next user message instead. The system prompt stays byte-identical, the cache stays valid.

Claude Code follows all three of these rules religiously. That's how it maintains a 92% hit rate across millions of sessions.

Applying This to Your Own Agents

If you're building on the Anthropic API, the same principles apply. Here's the practical playbook.

Prompt structure matters

Put the most stable content at the top:

1. System instructions and rules        (most stable, cached first)
2. Tool definitions                      (stable for session duration)
3. Reference documents / retrieved context
4. Conversation history + tool outputs   (dynamic, grows each turn)

The cache works from the top down. Everything above the first change point stays cached. Everything below it gets recomputed.

Use auto-caching

Anthropic's API now supports automatic cache management. You add a single cache_control field to your request and the system handles breakpoint placement for you:

{
  "model": "claude-sonnet-4-6-20260514",
  "max_tokens": 1024,
  "cache_control": { "type": "ephemeral" },
  "system": "Your system prompt here...",
  "messages": [...]
}

It moves the cache boundary forward as the conversation grows and more content becomes stable. Before this existed, you had to manually calculate token boundaries. Getting it wrong meant missing the cache entirely.

Compact without breaking the cache

When your conversation hits the context limit and you need to summarize it down, keep the system prompt and tool definitions identical. Add the compaction instruction as a new user message. The cached prefix stays valid. You only pay fresh tokens for the compaction prompt itself.

Monitor your hit rate

Every API response includes three fields you should be tracking:

{
  "usage": {
    "cache_creation_input_tokens": 15200,
    "cache_read_input_tokens": 184800,
    "input_tokens": 3400
  }
}

cache_creation_input_tokens: tokens written to cache (first time processing)
cache_read_input_tokens: tokens read from cache (the cheap ones)
input_tokens: tokens processed at full price (no cache available)

The ratio of cache_read_input_tokens to total input tokens is your cache efficiency score. Track it like you'd track uptime. A sudden drop means something in your prompt structure changed and invalidated the cache.

Key Takeaways

Prompt caching isn't a setting you flip on and forget about. It's an architectural pattern that has to be baked into how your agent constructs its prompts, manages its tools, and handles long conversations.

Claude Code shows what this looks like when it's done well: 92% cache hit rate, 81% cost reduction, built on stable prefixes, subagent summarization, and cache-aware context management.

If you're building agents and not thinking about your cache architecture, you're leaving most of your budget on the table.

We break down AI infrastructure and tooling like this regularly at Web After AI. Practical, no hype, explained so it actually makes sense.

The 4 Levels of Hermes Agent Scaling Framework: From One Hermes Agent to a Fully Automated Team

Shilpa Mitra — Fri, 22 May 2026 11:56:54 +0000

Most people set up an AI agent and immediately start thinking about multi-agent architectures. Orchestrators, specialist swarms, automated pipelines. That's Level 4 thinking applied to a Level 1 setup, and it's how you end up with a fleet of agents shipping garbage at scale.

Hermes Agent by Nous Research (160K+ stars, fastest-growing open-source agent of 2026) is built for exactly this kind of progressive scaling. It's self-hosted, self-improving, stores everything locally in SQLite, and supports multi-agent orchestration out of the box as of v0.6.0.

But the framework below isn't Hermes-specific. It applies to any agent system. The tool doesn't matter as much as the progression.

Here are the four levels, what each one looks like in practice, and how to know when you're actually ready to move up.

First: What Hermes Agent Is

Hermes is an autonomous AI agent that runs on your machine or VPS. It takes a goal, breaks it into steps, picks from 47 built-in tools to execute, and iterates until the task is done. Everything stays local.

What sets it apart: after each task, Hermes writes a structured record of what worked and what didn't into episodic memory. On future tasks with similar patterns, it retrieves those records and adjusts its approach before starting. It also creates reusable "skills" from experience, essentially building procedural memory that improves over time.

It connects to 20+ messaging platforms (Telegram, Discord, Slack, WhatsApp, Signal, and more), supports MCP servers, and runs across 6 terminal backends (local, Docker, SSH, Daytona, Singularity, Modal).

Install:

curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash

Or via pip:

pip install hermes-agent
hermes postinstall

Then configure:

hermes doctor      # check your environment
hermes model       # pick a model
hermes config set  # add API keys
hermes             # start the agent

Takes about 60 seconds on Linux, macOS, or WSL2.

Level 1: The Main Agent

You → Your Soul Hermes Agent

This is where everyone starts, and where most people should stay for weeks, not days.

Your single Hermes instance is your prototype area. You test workflows here. You refine prompts. You figure out which tasks the agent handles well and which ones it fumbles. You build up its memory and skills on your specific work.

At this level, Hermes doubles as your orchestrator by default. You give it a complex task, it breaks it down, it executes. The self-improving loop is already running: every completed task makes it slightly better at similar tasks next time.

What to do at Level 1

Run real work through it daily. Not toy examples. Actual tasks from your workflow. The memory system only gets useful with real data.
Manage its memory actively. Use /recall to search what it remembers and /remember to manually save important context. Correct it when it gets things wrong.
Install skills or let it create them. Skills are procedural memory. Hermes can build them from experience, or you can install community-contributed ones from the Skills Hub.
Connect one messaging platform. Telegram is the easiest. Run hermes gateway setup to get always-on access from your phone. This changes the dynamic from "sitting at my terminal to use AI" to "texting my agent whenever I need something."

When to move on

When you have at least 2-3 workflows that are consistently producing good output. Not acceptable output. Not "close enough." Good output that you'd be comfortable shipping without heavy editing.

This is the most important checkpoint in the entire framework. Everything that comes after multiplies the quality you establish here.

Level 2: Specialized Agents

You → SEO Agent
You → Content Pipeline Agent
You → DevOps Agent

Once a workflow is solid and repeatable, break it out into its own Hermes instance with its own credentials, memory, and scope.

Why separate instances?

Context pollution. An agent that handles your SEO research, your email drafting, and your code reviews is juggling three different domains in one memory space. Its SEO skills get diluted by code review patterns. Its writing voice gets contaminated by technical documentation habits.

Specialized agents have cleaner memory, more focused skills, and better output because they only learn from one domain.

How to do this practically

Each Hermes instance runs independently. Use different configuration profiles, or spin each one up in its own Docker container or VPS.

# Different profiles for different agents
HERMES_PROFILE=seo hermes
HERMES_PROFILE=contentpipeline hermes
HERMES_PROFILE=devops hermes

Each profile gets its own SQLite database, its own memory, its own skill library. You talk to each one directly. You're still the orchestrator at this stage, manually deciding which agent handles which task.

What to do at Level 2

Write a scope document for each agent. What it does, what it doesn't do, what tools it has access to. This isn't bureaucracy. It's how you prevent scope creep across agents.
Let each agent build its own skill library within its domain. The SEO agent's skills should be about keyword research and competitor analysis, not email copywriting.
Keep the count low. 2-3 specialists is plenty to start. The temptation to spin up a new agent for every task is strong. Resist it.

When to move on

When you're spending more time routing tasks between agents than actually reviewing their output.

Level 3: Orchestrated Team

You → Orchestrator Agent
           ↓
     Your Specialized Agents

Now you bring the orchestrator agent back. But this time it's not your prototype agent wearing multiple hats. It's a dedicated Hermes instance whose only job is routing tasks to your specialists and synthesizing their outputs.

Hermes v0.6.0 added multi-agent orchestration. The orchestrator analyzes a complex task, identifies the optimal work breakdown, and spawns specialist worker agents with tailored context. Each worker gets its own scope and tools, returns a verifiable artifact, and records the handoff.

Example workflow

You tell the orchestrator: "Research competitors in the CRM space and draft a blog post about our differentiators."

The orchestrator:

Routes the research task to your Research agent
Takes the research output and routes the writing task to your Content agent
Synthesizes the outputs into a final deliverable
Returns it to you for review

You still review the final output. You're not out of the loop. You're just not manually routing between agents anymore.

What to do at Level 3

Set up task tracking. Kanban-style works well. You need visibility into what each agent is working on, what's queued, and what's done.
Define handoff protocols. What does the research agent pass to the content agent? What format? What level of detail? Ambiguous handoffs create ambiguous output.
Review regularly. Quality issues compound fast in multi-agent setups. A small drift in the research agent's output becomes a big problem by the time it's been through two more agents.

When to move on

When the orchestrator's routing decisions are consistently correct and the specialist outputs consistently meet your quality bar without heavy editing.

Level 4: Automated Team

Cron Job / Trigger Events → Orchestrator Agent
                       ↓
                 Full Agent Team

This is where you step out of the loop for routine work. Cron jobs and event triggers fire tasks into the orchestrator. The orchestrator routes them to the team. The team handles the work asynchronously.

What this looks like in practice

Every Monday at 8am, the orchestrator triggers your SEO agent to pull keyword rankings, your content agent to draft the weekly newsletter outline, and your ops agent to generate a metrics report.
When a new competitor blog post is published (event trigger), the research agent analyzes it and the content agent drafts a response piece.
When a support ticket hits a specific tag, the ops agent drafts a response for your review queue.

The task bus handles queuing and routing. Agents pick up work, complete it, and log results. You check in when you want to, not because you have to.

What to do at Level 4

Start with one automated workflow, not ten. Get one cron job running reliably before adding more. Debugging a broken automation is harder when you have twelve of them running simultaneously.
Build in quality gates. Not every output needs your review, but have the orchestrator flag anything that falls below a confidence threshold for human review.
Monitor closely at first. The trust you build here is earned, not assumed. Look at outputs daily for the first two weeks, then taper to spot-checks.

The Part That Matters More Than Any of This

Take small steps. You do NOT want to automate slop.

If your output at Level 1 is mediocre, you are about to scale mediocrity. 20 agents shipping low-quality work at speed is worse than 3 shipping great work slowly. Every level multiplies whatever quality you've established at the level before it.

I'd rather run fewer agents with better output than max the agent count and spit out more of the same.

The progression isn't about moving fast. It's about moving when you're ready. Level 1 might take you a month. Level 2 might take another month. That's fine. The agents aren't going anywhere. Your quality bar is what matters.

Resources

I write about practical AI agent workflows, open-source tools, and the infrastructure behind them at Web After AI. No hype, just stuff you can actually use.

4 GitHub Repos That Prove AI Agents Aren't Just for Coding Anymore

Shilpa Mitra — Thu, 21 May 2026 17:08:15 +0000

Six months ago, "AI agent" basically meant "coding assistant." Claude Code, Copilot, Cursor. All doing the same thing: helping you write code.

That's changing. The most interesting open-source projects right now aren't building yet another coding agent. They're building agents that specialize: agents that trade stocks, agents that run your entire content marketing operation, agents that make your coding agent actually follow engineering discipline.

The model is the same underneath. The harness around it is what makes it useful for a specific job.

Here are four repos that show where this is heading, with setup instructions for each.

1. mattpocock/skills (91.7K stars) — Make Your Coding Agent an Actual Engineer

Repo: github.com/mattpocock/skills

Matt Pocock (the TypeScript educator behind Total TypeScript) open-sourced his personal .claude directory. It's a collection of skills that fix the most common failure modes of AI coding agents: building the wrong thing, skipping tests, producing code that works but is impossible to maintain, and declaring "done" when nothing actually compiles.

Most people treat their coding agent like an intern with no process. Matt's skills give it the process.

The standout: `/grill-me`

This skill forces the agent to interrogate you about what you actually want before writing a single line of code. It's a structured interview that catches misalignment before it becomes a wasted hour. There's also /grill-with-docs, which does the same thing but additionally builds a shared vocabulary between you and the agent in a CONTEXT.md file.

The CONTEXT.md approach is quietly brilliant. Instead of the agent using 20 words to describe something, you teach it your project's jargon. Over time, the agent's outputs get shorter, more precise, and the variables and functions it creates use consistent naming. It also reduces token usage, because concise terminology means shorter prompts and responses.

Other skills worth knowing

/tdd — Test-driven development with red-green-refactor. The agent writes a failing test first, then fixes it. Far better code quality than "write the feature, then maybe add tests."
/diagnose — Disciplined debugging loop: reproduce, minimise, hypothesise, instrument, fix, regression-test.
/improve-codebase-architecture — Finds structural improvements using your project's domain language from CONTEXT.md.
/handoff — Compacts the current conversation into a handoff document so another agent (or a new session) can continue the work without losing context.
/caveman — Ultra-compressed communication mode. Cuts token usage by roughly 75% while keeping full technical accuracy. Useful when you're burning through credits.

Setup

npx skills@latest add mattpocock/skills

Pick the skills you want and which coding agents to install them on. Make sure you select /setup-matt-pocock-skills during install. Then run that command in your agent, and it'll configure your issue tracker (GitHub, Linear, or local files), triage labels, and docs location. Works with Claude Code, Cursor, Codex, and others.

How it compares to Addy Osmani's agent-skills

If you've seen addyosmani/agent-skills, you might wonder how these differ. Addy's skills focus on the full development lifecycle with slash commands like /spec, /plan, /build, /ship. Matt's skills focus more on engineering fundamentals: alignment, testing discipline, debugging, and architecture quality. They're complementary, not competing. You can run both in the same project.

2. AI-Trader (13.7K stars) — Let AI Agents Trade for You

Repo: github.com/HKUDS/AI-Trader

AI-Trader is an agent-native trading platform built by researchers at the University of Hong Kong. The core idea: just like humans have their trading platforms, AI agents need their own.

You connect your AI agent (Claude Code, Cursor, OpenClaw, Codex, whatever), and it can publish trading signals, copy trades from top-performing agents, participate in strategy discussions, and access real-time market data across stocks, crypto, forex, options, and futures.

Why it's interesting

This isn't just one agent making trades. It's a platform where multiple agents collaborate, debate strategies, and learn from each other. They call it "collective intelligence trading."

Agents publish three types of signals:

Strategies — for discussion and analysis
Operations — for direct copy trading
Discussions — for collaborative reasoning

There's a reward system where agents earn points for successful predictions, and a $100K paper trading mode so you can test without risk.

Setup

The simplest way to connect an agent:

Read https://ai4trade.ai/SKILL.md and register.

Send that message to your AI agent. It reads the integration guide, installs the necessary components, and registers itself on the platform. For human traders, visit ai4trade.ai and sign up directly.

For developers who want to self-host:

git clone https://github.com/HKUDS/AI-Trader.git
cd AI-Trader
npm install

The backend is FastAPI (Python), frontend is React. Full OpenAPI docs are in docs/api/openapi.yaml.

A word of caution

Automated trading carries real financial risk. AI-Trader includes paper trading mode for a reason. Start there. The fact that it comes from a university research group rather than a fintech startup trying to sell you something is a point in its favor, but treat any trading system with healthy skepticism.

3. AiToEarn (12.2K stars) — AI Agent for Content Marketing Across 14 Platforms

Repo: github.com/yikart/AiToEarn

AiToEarn is an open-source content marketing platform with an AI agent built in. You create content once, and it publishes across 14 platforms simultaneously: TikTok, YouTube, Instagram, Twitter/X, LinkedIn, Pinterest, Facebook, Threads, plus Chinese platforms like Douyin, Xiaohongshu (Rednote), Bilibili, WeChat, and Kuaishou.

The "All In Agent"

This is the interesting part. It's an AI agent that can automatically generate content, publish it, and manage your accounts across all platforms. Beyond publishing, the platform includes:

Trend radar — what's going viral right now across platforms
Case library — how posts with 10K+ likes were structured
Smart comment search — finds high-conversion signals like "link please" or "how to buy" across your accounts
Cross-platform analytics — unified dashboard for all your channels

The comment search feature is particularly useful for anyone doing content-driven sales. It surfaces purchase-intent comments so you can reply fast and convert.

Setup

Docker (recommended):

git clone https://github.com/yikart/AiToEarn.git
cd AiToEarn
docker compose up -d

This starts the frontend, backend, MongoDB, and Redis in one command. Access the web interface at http://localhost:8080. There's also an Electron desktop app available from the GitHub releases page.

Note on documentation

The project originated in China. The English README and Docker deployment guide are solid, but some deeper configuration docs are still in Chinese. AI video model integrations (Kling, Sora, Runway, etc.) are listed as coming soon.

4. DeepSeek-TUI (Trending) — Claude Code, but for DeepSeek

Repo: github.com/Hmbown/DeepSeek-TUI

A terminal-based coding agent built specifically for DeepSeek models. If you've used Claude Code, the experience is similar: you type prompts in your terminal, the agent reads your files, edits code, runs shell commands, does git operations, and browses the web. The difference is it's built from the ground up for DeepSeek's API, which is significantly cheaper than Claude or GPT-4.

Three modes

Mode	What it does
Plan	Review a plan before the agent makes changes
Agent	Default interactive mode with multi-step tool use
YOLO	Auto-approve everything in a trusted workspace

Tab to cycle between them. It also supports MCP servers, session resume, and can run as an HTTP/SSE API server.

Built in Rust, so it's fast and lightweight.

Setup

npm install -g deepseek-tui
deepseek-tui

On first launch it'll ask for your DeepSeek API key. You can also set it beforehand:

deepseek-tui login

Or via environment variable:

DEEPSEEK_API_KEY="your-key" deepseek-tui

Configuration lives in ~/.deepseek/config.toml. Useful commands: deepseek-tui doctor (check setup), deepseek-tui models (list available models).

Also available via Rust:

cargo install deepseek-tui --locked

The Pattern

What connects all four of these: the model isn't the product anymore. The harness is.

Matt Pocock's skills don't change what Claude can do. They change how disciplined it is. AI-Trader doesn't invent a new trading algorithm. It builds a platform where existing agents collaborate. AiToEarn doesn't create a new content AI. It builds distribution infrastructure around existing ones. DeepSeek-TUI takes the Claude Code interaction pattern and wraps it around a different, cheaper model.

Every one of these is the same insight applied to a different domain: wrap the right structure around a capable model, and you get something genuinely useful. The structure is where the value is.

This is what the industry is starting to call harness engineering, the practice of building the environment, constraints, and feedback loops around an AI agent so it produces reliable results. It's not prompt engineering. It's not fine-tuning. It's designing the system the model operates inside.

If you want to go deeper on this and see how to actually chain free tools into a working setup, I wrote a step-by-step breakdown of building a zero-cost AI coding stack (9router + agentmemory + agent-skills) in my newsletter: Web After AI.

What specialized AI agents are you seeing in your domain? Drop a comment. I'm collecting examples for a follow-up piece.

DEV Community: Shilpa Mitra

3 Open-Source Repos That Each Kill a Different AI Bill(token, infra, creative)

TL;DR

Cut your token bill: codebase-memory-mcp

Cut your agent-infrastructure bill: flue

Cut your creative-tool bill: OpenMontage

How to choose

Where this leaves you

How to Set Up Persistent Memory for Codex Using Obsidian (3 Approaches)

First: Understanding How Codex Memory Actually Works

Layer 1: AGENTS.md (Static Instructions)

Layer 2: Native Memories (Auto-Generated)

Where Obsidian Comes In

Approach 1: Basic Memory + MCP (Easiest Setup, Cross-Tool Compatible)

What this looks like in practice

Setup (Local)

Setup (Cloud, for remote access)

Connecting Obsidian

When to use this approach

Approach 2: Structured Obsidian Vault with AGENTS.md + Codex Hooks (Deepest Integration)

What this looks like in practice

The vault structure

Step 1: Create AGENTS.md at the vault root

Step 2: Set up the SessionStart hook

Step 3: Open the vault in Obsidian

Step 4: Run Codex from the vault directory

What sets this apart from the other approaches

When to use this approach

Approach 3: Native Memories + Manual Obsidian Sync (Minimal Setup, Good Enough for Most)

Setup

What this looks like in practice

When to use this approach

Which Approach Should You Pick?

Quick Gotchas

How Claude Code Achieves a 92% Cache Hit Rate: A Deep Dive Into Prompt Caching for AI Agents

The Two Parts of Every Prompt

What's Actually Being Cached: The Transformer Angle

The Economics

Tracking a Real Claude Code Session

The Production Numbers

The One Thing That Will Tank Your Cache Hit Rate

1. Don't change your tool set mid-session

2. Don't switch models mid-conversation

3. Don't edit the system prompt to update state

Applying This to Your Own Agents

Prompt structure matters

Use auto-caching

Compact without breaking the cache

Monitor your hit rate

Key Takeaways

The 4 Levels of Hermes Agent Scaling Framework: From One Hermes Agent to a Fully Automated Team

First: What Hermes Agent Is

Level 1: The Main Agent

What to do at Level 1

When to move on

Level 2: Specialized Agents

Why separate instances?

How to do this practically

What to do at Level 2

When to move on

Level 3: Orchestrated Team

Example workflow

What to do at Level 3

When to move on

Level 4: Automated Team

What this looks like in practice

What to do at Level 4

The Part That Matters More Than Any of This

Resources

4 GitHub Repos That Prove AI Agents Aren't Just for Coding Anymore

The standout: /grill-me

Other skills worth knowing

Setup

How it compares to Addy Osmani's agent-skills

2. AI-Trader (13.7K stars) — Let AI Agents Trade for You

Why it's interesting

Setup

A word of caution

3. AiToEarn (12.2K stars) — AI Agent for Content Marketing Across 14 Platforms

The "All In Agent"

The standout: `/grill-me`