How I Cut Token Costs 60% During Agent Development with Dependency-Aware Replay

Domenico — Fri, 29 May 2026 08:41:44 +0000

When building agentic workflows with multiple LLM calls, I kept hitting the same frustrating pattern during development:

I'd tweak one small downstream step (like a validator or formatter)
Every single run would re-execute the entire pipeline
I'd burn hundreds of tokens re-running the same upstream LLM calls and waiting for them to complete
After 10 iterations, I'd spent $5+ on API calls just for debugging

The problem wasn't just slow iteration — it was that I was rerunning work that hadn't changed at all.

The Root Problem: No Dependency Awareness

Most caching solutions for LLM applications focus on the model-call boundary. They cache responses for identical prompts. But that misses the bigger picture:

If your retrieval step didn't change, why rerun the downstream synthesis step?
If your prompt construction is deterministic, why re-execute it every time?
If you only changed the final formatter, why pay for 3 upstream LLM calls again?

The missing piece is dependency-aware replay: knowing which steps actually need to rerun based on what changed upstream, not just blindly re-executing everything.

Enter Musubito

I built musubito to solve this exact problem. It's a lightweight Python runtime that tracks execution lineage and decides per-step whether to replay based on dependency identity, not just input/output matching.

pip install musubito

The key idea: Musubito stores a Directed Acyclic Graph (DAG) of your execution in SQLite. Each node in the DAG represents a decorated function call, identified by a deterministic hash of its operation name, input content, and upstream producer IDs. When you rerun a pipeline, it checks the dependency graph and skips everything that is provably unchanged.
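To make the identity idea concrete, here is a minimal sketch of a deterministic node ID. It assumes SHA-256 over a canonical JSON encoding; the function name `node_id` and the encoding are illustrative assumptions, not Musubito's actual internal format.

```python
import hashlib
import json


def node_id(op_name: str, inputs: dict, parent_ids: list[str]) -> str:
    """Hypothetical node identity: hash of operation name, inputs, and parent IDs."""
    payload = json.dumps(
        {
            "op": op_name,
            "inputs": inputs,
            # Sort parents so the ID does not depend on recording order.
            "parents": sorted(parent_ids),
        },
        sort_keys=True,  # canonical key order keeps the encoding deterministic
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


retrieval = node_id("retrieve_context", {"query": "What is quantum computing?"}, [])
draft_v1 = node_id("draft_answer", {"style": "brief"}, [retrieval])

# Same operation and inputs, but a different upstream producer: the ID changes.
other_retrieval = node_id("retrieve_context", {"query": "What is AI?"}, [])
draft_v2 = node_id("draft_answer", {"style": "brief"}, [other_retrieval])
```

Because the parent IDs feed into the hash, a change anywhere upstream propagates into every downstream identity, which is what lets a rerun tell "same call, same lineage" apart from "same call, different lineage".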

How It Works

Step Types

Musubito separates operations into three categories, each with different replay expectations:

import openai
import requests

from musubito import step, StepType, StepConfiguration

# DETERMINISTIC: pure functions — cache indefinitely
@step(step_type=StepType.DETERMINISTIC)
def construct_prompt(context: str, constraints: dict) -> str:
    return f"Context: {context}\nConstraints: {constraints}"

# STOCHASTIC: LLM calls — cache with a TTL
@step(step_type=StepType.STOCHASTIC, ttl_seconds=3600)
def call_llm(prompt: str) -> str:
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# EXTERNAL_EFFECT: side effects — store artifact, don't repeat the call
@step(step_type=StepType.EXTERNAL_EFFECT)
def send_to_api(payload: dict) -> dict:
    return requests.post("https://api.example.com/submit", json=payload).json()
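The replay expectations behind the three step types can be sketched as a small policy function. This is a simplified illustration under my own assumptions: `StepKind` and `can_replay` are hypothetical names, and it presumes a stored result with a matching identity already exists. Musubito's real decision also takes lineage and staleness into account.

```python
import time
from enum import Enum
from typing import Optional


class StepKind(Enum):  # hypothetical stand-in for musubito.StepType
    DETERMINISTIC = "deterministic"
    STOCHASTIC = "stochastic"
    EXTERNAL_EFFECT = "external_effect"


def can_replay(
    kind: StepKind,
    stored_at: float,
    ttl_seconds: Optional[float] = None,
    now: Optional[float] = None,
) -> bool:
    """Return True if a stored result may be reused instead of re-executing."""
    now = time.time() if now is None else now
    if kind is StepKind.STOCHASTIC and ttl_seconds is not None:
        # LLM outputs go stale: reuse only while the TTL has not elapsed.
        return (now - stored_at) < ttl_seconds
    # Deterministic results never expire; external effects reuse the stored
    # artifact so the side effect is not triggered a second time.
    return True
```

Passing `now` explicitly keeps the function easy to test without waiting for a real TTL to expire.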

Fan-in Support

Agent workflows frequently aggregate multiple upstream branches. Musubito makes this explicit with musubito_merge():

from musubito import step, StepType, musubito_merge

@step(step_type=StepType.STOCHASTIC, ttl_seconds=3600)
def retrieve_context(query: str) -> str:
    # ... retrieval logic
    pass

@step(step_type=StepType.STOCHASTIC, ttl_seconds=3600)
def extract_constraints(query: str) -> dict:
    # ... constraint extraction
    pass

@step(step_type=StepType.STOCHASTIC, ttl_seconds=3600)
def draft_answer(context: str, constraints: dict) -> str:
    # ... LLM drafting — depends on BOTH upstream branches
    pass

# Explicit fan-in: draft_answer depends on both retrieve_context and extract_constraints
context_result = retrieve_context("What is quantum computing?")
constraints_result = extract_constraints("What is quantum computing?")

with musubito_merge(context_result, constraints_result):
    draft = draft_answer(context_result.value, constraints_result.value)

Without musubito_merge, Musubito cannot know which producer nodes are structural parents of the aggregate step. With it, the full fan-in DAG is preserved in the lineage store.

Downstream Invalidation

When any upstream step changes (different input, expired TTL, forced re-execution), Musubito automatically marks all reachable descendants as STALE using a recursive CTE:

Node A (changed) → Node B (STALE) → Node C (STALE)
Node D (stable, different branch, NOT invalidated) → Node C

Only the nodes reachable from the changed node are invalidated. Stable branches in a fan-in DAG remain valid, which is the key difference from a simple top-down cache flush.
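The invalidation pass can be sketched with Python's built-in `sqlite3` module. The `nodes` and `edges` tables below are a hypothetical schema chosen for illustration, not Musubito's real one. The graph has A feeding B feeding C, plus D feeding C from a separate branch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE nodes (id TEXT PRIMARY KEY, status TEXT NOT NULL DEFAULT 'VALID');
    CREATE TABLE edges (parent TEXT NOT NULL, child TEXT NOT NULL);
    INSERT INTO nodes (id) VALUES ('A'), ('B'), ('C'), ('D');
    INSERT INTO edges (parent, child) VALUES ('A', 'B'), ('B', 'C'), ('D', 'C');
    """
)


def mark_descendants_stale(conn: sqlite3.Connection, changed_id: str) -> None:
    """Mark every node reachable from changed_id as STALE.

    UNION (rather than UNION ALL) de-duplicates rows, so nodes reached by
    more than one path are visited only once.
    """
    conn.execute(
        """
        WITH RECURSIVE descendants(id) AS (
            SELECT child FROM edges WHERE parent = ?
            UNION
            SELECT e.child FROM edges AS e JOIN descendants AS d ON e.parent = d.id
        )
        UPDATE nodes SET status = 'STALE'
        WHERE id IN (SELECT id FROM descendants)
        """,
        (changed_id,),
    )
    conn.commit()


mark_descendants_stale(conn, "A")
statuses = dict(conn.execute("SELECT id, status FROM nodes ORDER BY id"))
print(statuses)  # {'A': 'VALID', 'B': 'STALE', 'C': 'STALE', 'D': 'VALID'}
```

D is a parent of C but not a descendant of A, so it keeps its stored result. When C reruns, it can reuse D's artifact and only needs fresh output from B.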

The Impact on a Typical Pipeline

Consider a standard 5-step agentic pipeline:

| Step | Type | Cold Run | Warm Run |
| --- | --- | --- | --- |
| Retrieve context | STOCHASTIC | 300ms + tokens | ~1.5ms (replayed) |
| Extract constraints | STOCHASTIC | 300ms + tokens | ~1.5ms (replayed) |
| Draft answer (LLM) | STOCHASTIC (TTL 1h) | 700ms + tokens | ~1.5ms (replayed) |
| Validate output | DETERMINISTIC | 50ms | ~0.5ms (replayed) |
| Format result | DETERMINISTIC | 20ms | ~0.5ms (replayed) |
| Total | | ~1370ms | ~5.5ms |

During iterative development, after the first cold run:

Zero tokens burned for steps whose inputs didn't change
~1.5ms cross-session replay from SQLite (survives process restarts)
Only the changed step and its descendants rerun

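In practice, if you're tweaking the validator (step 4), only steps 4 and 5 rerun, and steps 1–3 replay from lineage, so that iteration costs no tokens at all. Over a full development session, which also includes cold runs and upstream changes, token savings come to 60–80% depending on pipeline depth.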
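To see how a session-level figure in that range can arise even though a validator-only iteration spends nothing, here is a small worked calculation. The numbers are made up for illustration and are not measurements; `token_savings` is a hypothetical helper.

```python
def token_savings(cold_cost: int, spent_per_run: list[int]) -> float:
    """Fraction of tokens saved versus re-executing every step on every run."""
    baseline = cold_cost * len(spent_per_run)
    return 1 - sum(spent_per_run) / baseline


# Hypothetical session: 10 runs of a pipeline that costs 1,000 tokens cold.
# Run 1 is cold, three later runs change an upstream prompt (full cost),
# and six runs only touch the validator or formatter (zero tokens).
session = [1000, 0, 0, 1000, 0, 1000, 0, 0, 1000, 0]
savings = token_savings(1000, session)
print(f"{savings:.0%}")  # 60%
```

The more of your iterations land on downstream steps, the closer the savings get to the upper end of the range.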

Why Not Just Use lru_cache?

Python's functools.lru_cache is great for same-process memoization. But it has real limitations for agentic workflows:

| Feature | `lru_cache` | `musubito` |
| --- | --- | --- |
| Cross-session persistence | ❌ Lost on restart | ✅ SQLite-backed |
| Dependency tracking | ❌ None | ✅ Full DAG with CTE |
| Fan-in support | ❌ None | ✅ `musubito_merge()` |
| Step-type semantics | ❌ None | ✅ Deterministic / Stochastic / External |
| TTL-based expiry | ❌ None | ✅ Per-step TTL |
| Artifact inspection | ❌ None | ✅ SQL-queryable SQLite |
| Downstream invalidation | ❌ None | ✅ Recursive CTE |

Musubito doesn't replace lru_cache — it addresses a different layer. lru_cache is for same-process memoization of pure functions. Musubito is for persistent, dependency-aware lineage management across an entire agentic pipeline.
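The same-process limitation is easy to see with the standard library alone. In this sketch, `expensive` is a stand-in for a costly call, and `cache_clear()` approximates what a process restart does to the in-memory cache.

```python
from functools import lru_cache

calls = 0


@lru_cache(maxsize=None)
def expensive(prompt: str) -> str:
    global calls
    calls += 1  # count real executions
    return prompt.upper()


expensive("hello")
expensive("hello")  # served from the in-memory cache
info = expensive.cache_info()

expensive.cache_clear()  # roughly what a restart does to the cache
expensive("hello")  # executes again
```

Within one process the second call is a cache hit, but once the cache is gone the function has to run again, which is exactly the cost you pay on every fresh debugging session.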

What About MLflow, DVC, or LangSmith?

These are great tools, but they solve different problems:

MLflow / DVC: Experiment tracking, model versioning, dataset management — they track runs, not per-step dependencies within a run
LangSmith / Langfuse: Observability and tracing — they tell you what happened, but don't replay it

Musubito fills the gap: incremental execution of a multi-step pipeline where you only rerun what changed.

Getting Started

pip install musubito

Minimal working example:

from musubito import step, StepType

@step(step_type=StepType.DETERMINISTIC)
def preprocess(data: dict) -> str:
    return str(sorted(data.items()))

@step(step_type=StepType.STOCHASTIC, ttl_seconds=3600)
def call_model(prompt: str) -> str:
    # your LLM call here
    return f"response to: {prompt}"

@step(step_type=StepType.DETERMINISTIC)
def postprocess(response: str) -> dict:
    return {"result": response, "length": len(response)}

# First run: cold execution
raw = preprocess({"query": "what is AI?"})
response = call_model(raw.value)
final = postprocess(response.value)

# Second run: everything replays from SQLite (~5ms total)
raw2 = preprocess({"query": "what is AI?"})
response2 = call_model(raw2.value)
final2 = postprocess(response2.value)
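The second run is cheap because results are looked up by node ID in a file that outlives the process. A minimal sketch of that idea, assuming a single hypothetical `artifacts` table and JSON-serializable values (Musubito's real schema may differ):

```python
import json
import os
import sqlite3
import tempfile
from typing import Any, Optional


def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) a hypothetical artifact store at the given path."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS artifacts "
        "(node_id TEXT PRIMARY KEY, value TEXT NOT NULL)"
    )
    return conn


def put(conn: sqlite3.Connection, node_id: str, value: Any) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO artifacts (node_id, value) VALUES (?, ?)",
        (node_id, json.dumps(value)),
    )
    conn.commit()


def get(conn: sqlite3.Connection, node_id: str) -> Optional[Any]:
    row = conn.execute(
        "SELECT value FROM artifacts WHERE node_id = ?", (node_id,)
    ).fetchone()
    return None if row is None else json.loads(row[0])


db_path = os.path.join(tempfile.mkdtemp(), "lineage.db")

first = open_store(db_path)
put(first, "node-123", {"result": "cached response", "length": 15})
first.close()  # simulate the process exiting

second = open_store(db_path)  # a later session reopens the same file
replayed = get(second, "node-123")
missing = get(second, "node-999")
second.close()
```

Here `replayed` comes back with the stored dictionary and `missing` is `None`, which is the signal to execute the step for real.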

Current Limitations (Honest Take)

Musubito is early-stage and not yet suitable for every use case:

Single-writer SQLite: not designed for multi-process concurrent writes
No semantic caching: identity is exact hash-based, not semantic similarity
Local storage only: no remote backend yet (planned for future versions)

These are known tradeoffs, documented in the paper currently under review.

Resources

GitHub: github.com/DAltieri86/musubito
PyPI: pip install musubito
Paper: under review

If you're building agentic workflows and struggling with iteration speed or token costs during development, give musubito a try and let me know what you think. Feedback on the fan-in API and step-type semantics is especially welcome.

DEV Community: Domenico
