Domenico

Posted on May 29

How I Cut Token Costs 60% During Agent Development with Dependency-Aware Replay

#ai #agents #llm #productivity

When building agentic workflows with multiple LLM calls, I kept hitting the same frustrating pattern during development:

I'd tweak one small downstream step (like a validator or formatter)
Every single run would re-execute the entire pipeline
I'd burn hundreds of tokens waiting for the same upstream LLM calls to complete
After 10 iterations, I'd spent $5+ on API calls just for debugging

The problem wasn't just slow iteration — it was that I was rerunning work that hadn't changed at all.

The Root Problem: No Dependency Awareness

Most caching solutions for LLM applications focus on the model-call boundary. They cache responses for identical prompts. But that misses the bigger picture:

If your retrieval step didn't change, why rerun the downstream synthesis step?
If your prompt construction is deterministic, why re-execute it every time?
If you only changed the final formatter, why pay for 3 upstream LLM calls again?

The missing piece is dependency-aware replay: knowing which steps actually need to rerun based on what changed upstream, not just blindly re-executing everything.

Enter Musubito

I built musubito to solve this exact problem. It's a lightweight Python runtime that tracks execution lineage and decides per-step whether to replay based on dependency identity, not just input/output matching.

pip install musubito

The key idea: Musubito stores a Directed Acyclic Graph (DAG) of your execution in SQLite. Each node in the DAG represents a decorated function call, identified by a deterministic hash of its operation name, input content, and upstream producer IDs. When you rerun a pipeline, it checks the dependency graph and skips everything that is provably unchanged.
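
To make the identity scheme concrete, here is a minimal sketch of how such a node ID could be computed. This is not Musubito's actual code: the function name `node_id`, the JSON-based canonicalization, and the example operation names are assumptions for illustration. Only the three ingredients (operation name, input content, upstream producer IDs) come from the description above.

```python
import hashlib
import json


def node_id(op_name: str, inputs: dict, upstream_ids: list) -> str:
    """Hypothetical node identity: SHA-256 over op name, inputs, and parent IDs."""
    payload = {
        "op": op_name,
        # sort_keys gives a canonical serialization, so dict ordering is irrelevant
        "inputs": json.dumps(inputs, sort_keys=True, default=str),
        # sorting makes the ID independent of the order parents were registered in
        "upstream": sorted(upstream_ids),
    }
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


# Two executions of the same upstream step with different inputs.
retrieval_v1 = node_id("pipeline.retrieve_context", {"query": "q1"}, [])
retrieval_v2 = node_id("pipeline.retrieve_context", {"query": "q2"}, [])

# The downstream step receives the same input value but from different producers.
draft_v1 = node_id("pipeline.draft_answer", {"context": "same text"}, [retrieval_v1])
draft_v2 = node_id("pipeline.draft_answer", {"context": "same text"}, [retrieval_v2])

# Different producer means different identity, so the old cached draft is not reused.
print(draft_v1 != draft_v2)
```

Under this scheme a lookup for the downstream node simply misses when any parent's ID changes, even if the value passed in happens to be identical.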

How It Works

Step Types

Musubito separates operations into three categories, each with different replay expectations:

import openai
import requests

from musubito import step, StepType

# DETERMINISTIC: pure functions — cache indefinitely
@step(step_type=StepType.DETERMINISTIC)
def construct_prompt(context: str, constraints: dict) -> str:
    return f"Context: {context}\nConstraints: {constraints}"

# STOCHASTIC: LLM calls — cache with a TTL
@step(step_type=StepType.STOCHASTIC, ttl_seconds=3600)
def call_llm(prompt: str) -> str:
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# EXTERNAL_EFFECT: side effects — store artifact, don't repeat the call
@step(step_type=StepType.EXTERNAL_EFFECT)
def send_to_api(payload: dict) -> dict:
    return requests.post("https://api.example.com/submit", json=payload).json()
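
The replay expectations for the three step types can be sketched as a single decision function. This is an illustrative model, not Musubito's internals: the local `StepType` stand-in, the `StoredResult` record, and `can_replay` are all hypothetical names, and treating a stochastic step without a TTL as never expiring is an assumption.

```python
import time
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class StepType(Enum):  # local stand-in for illustration, not imported from musubito
    DETERMINISTIC = "deterministic"
    STOCHASTIC = "stochastic"
    EXTERNAL_EFFECT = "external_effect"


@dataclass
class StoredResult:
    step_type: StepType
    created_at: float                    # UNIX timestamp when the artifact was stored
    ttl_seconds: Optional[float] = None  # only meaningful for STOCHASTIC steps


def can_replay(stored: Optional[StoredResult], now: Optional[float] = None) -> bool:
    """Return True if the stored artifact may be returned instead of re-executing."""
    if stored is None:
        return False  # nothing recorded yet: this is a cold run
    now = time.time() if now is None else now
    if stored.step_type is StepType.STOCHASTIC and stored.ttl_seconds is not None:
        # LLM outputs are only trusted for a limited time window.
        return (now - stored.created_at) < stored.ttl_seconds
    # DETERMINISTIC results never expire; EXTERNAL_EFFECT artifacts are replayed
    # so that the side effect is not triggered a second time.
    return True


llm_result = StoredResult(StepType.STOCHASTIC, created_at=1000.0, ttl_seconds=3600)
print(can_replay(llm_result, now=2000.0))  # True: 1000 s old, inside the 1 h TTL
print(can_replay(llm_result, now=5000.0))  # False: 4000 s old, TTL expired
```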

Fan-in Support

Agent workflows frequently aggregate multiple upstream branches. Musubito makes this explicit with musubito_merge():

from musubito import step, StepType, musubito_merge

@step(step_type=StepType.STOCHASTIC, ttl_seconds=3600)
def retrieve_context(query: str) -> str:
    # ... retrieval logic
    pass

@step(step_type=StepType.STOCHASTIC, ttl_seconds=3600)
def extract_constraints(query: str) -> dict:
    # ... constraint extraction
    pass

@step(step_type=StepType.STOCHASTIC, ttl_seconds=3600)
def draft_answer(context: str, constraints: dict) -> str:
    # ... LLM drafting — depends on BOTH upstream branches
    pass

# Explicit fan-in: draft_answer depends on both retrieve_context and extract_constraints
context_result = retrieve_context("What is quantum computing?")
constraints_result = extract_constraints("What is quantum computing?")

with musubito_merge(context_result, constraints_result):
    draft = draft_answer(context_result.value, constraints_result.value)

Without musubito_merge, Musubito cannot know which producer nodes are structural parents of the aggregate step. With it, the full fan-in DAG is preserved in the lineage store.

Downstream Invalidation

When any upstream step changes (different input, expired TTL, forced re-execution), Musubito automatically marks all reachable descendants as STALE using a recursive CTE:

Node A (changed) → Node B (STALE) → Node C (STALE)
                ↘ Node D (stable, different branch — NOT invalidated)

Only the nodes reachable from the changed node are invalidated. Stable branches in a fan-in DAG remain valid, which is the key difference from a simple top-down cache flush.
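
As a rough illustration of invalidation with a recursive CTE, the sketch below uses Python's built-in `sqlite3` module. The `nodes` and `edges` tables, their columns, and the `mark_descendants_stale` helper are assumptions made for this example and are not Musubito's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (id TEXT PRIMARY KEY, status TEXT NOT NULL DEFAULT 'VALID');
    CREATE TABLE edges (parent TEXT NOT NULL, child TEXT NOT NULL);
""")
conn.executemany("INSERT INTO nodes (id) VALUES (?)", [("A",), ("B",), ("C",), ("D",)])
# A -> B -> C, plus D -> C: node C is a fan-in of two branches.
conn.executemany("INSERT INTO edges VALUES (?, ?)", [("A", "B"), ("B", "C"), ("D", "C")])


def mark_descendants_stale(conn: sqlite3.Connection, changed_id: str) -> None:
    """Mark every node reachable from changed_id as STALE (hypothetical helper)."""
    conn.execute(
        """
        WITH RECURSIVE descendants(id) AS (
            SELECT child FROM edges WHERE parent = ?
            UNION  -- UNION (not UNION ALL) avoids revisiting nodes in diamond shapes
            SELECT e.child FROM edges e JOIN descendants d ON e.parent = d.id
        )
        UPDATE nodes SET status = 'STALE' WHERE id IN (SELECT id FROM descendants)
        """,
        (changed_id,),
    )
    conn.commit()


mark_descendants_stale(conn, "A")
statuses = dict(conn.execute("SELECT id, status FROM nodes ORDER BY id"))
print(statuses)  # B and C become STALE; A and D keep their status
```

Node D is an ancestor of C but not a descendant of A, so it stays valid even though the fan-in node C has to rerun.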

The Impact on a Typical Pipeline

Consider a standard 5-step agentic pipeline:

Step	Type	Cold Run	Warm Run
Retrieve context	STOCHASTIC	300ms + tokens	~1.5ms (replayed)
Extract constraints	STOCHASTIC	300ms + tokens	~1.5ms (replayed)
Draft answer (LLM)	STOCHASTIC (TTL 1h)	700ms + tokens	~1.5ms (replayed)
Validate output	DETERMINISTIC	50ms	~0.5ms (replayed)
Format result	DETERMINISTIC	20ms	~0.5ms (replayed)
Total		~1370ms	~5.5ms

During iterative development, after the first cold run:

Zero tokens burned for steps whose inputs didn't change
~1.5ms cross-session replay from SQLite (survives process restarts)
Only the changed step and its descendants rerun

In practice, if you're tweaking the validator (step 4), only steps 4 and 5 rerun. Steps 1–3 replay from lineage, so that iteration makes no LLM calls. Overall token savings: 60–80%, depending on pipeline depth.
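
The arithmetic behind the table can be checked with a few lines of Python. The step names are shorthand invented for this snippet, the timings are copied from the table above, and the rerun set is the validator scenario just described.

```python
# (cold_ms, warm_ms, uses_tokens) per step, copied from the table above
steps = {
    "retrieve_context":    (300, 1.5, True),
    "extract_constraints": (300, 1.5, True),
    "draft_answer":        (700, 1.5, True),
    "validate_output":     (50, 0.5, False),
    "format_result":       (20, 0.5, False),
}

cold_total = sum(cold for cold, _, _ in steps.values())
warm_total = sum(warm for _, warm, _ in steps.values())

# Scenario: the validator was edited, so it and the formatter rerun.
rerun = {"validate_output", "format_result"}
iteration_ms = sum(
    cold if name in rerun else warm
    for name, (cold, warm, _) in steps.items()
)
token_steps_rerun = sum(1 for name in rerun if steps[name][2])

print(cold_total)         # 1370
print(warm_total)         # 5.5
print(iteration_ms)       # 74.5
print(token_steps_rerun)  # 0
```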

Why Not Just Use lru_cache?

Python's functools.lru_cache is great for same-process memoization. But it has real limitations for agentic workflows:

Feature	`lru_cache`	`musubito`
Cross-session persistence	❌ Lost on restart	✅ SQLite-backed
Dependency tracking	❌ None	✅ Full DAG with CTE
Fan-in support	❌ None	✅ `musubito_merge()`
Step-type semantics	❌ None	✅ Deterministic / Stochastic / External
TTL-based expiry	❌ None	✅ Per-step TTL
Artifact inspection	❌ None	✅ SQL-queryable SQLite
Downstream invalidation	❌ None	✅ Recursive CTE

Musubito doesn't replace lru_cache — it addresses a different layer. lru_cache is for same-process memoization of pure functions. Musubito is for persistent, dependency-aware lineage management across an entire agentic pipeline.
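
The persistence row of the comparison can be demonstrated with the standard library alone. The `persistent_cache` decorator below is a hypothetical, simplified stand-in written for this example (it is not part of Musubito and does no dependency tracking), and `cache_clear()` is used to imitate what a process restart does to `lru_cache`.

```python
import functools
import hashlib
import json
import os
import sqlite3
import tempfile


def persistent_cache(db_path):
    """Hypothetical decorator: memoize JSON-serializable results in a SQLite file."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            key_src = json.dumps([fn.__qualname__, args, kwargs], sort_keys=True)
            key = hashlib.sha256(key_src.encode("utf-8")).hexdigest()
            conn = sqlite3.connect(db_path)  # a fresh connection on every call
            try:
                conn.execute(
                    "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT)"
                )
                row = conn.execute(
                    "SELECT value FROM cache WHERE key = ?", (key,)
                ).fetchone()
                if row is not None:
                    return json.loads(row[0])  # hit: the function body is skipped
                result = fn(*args, **kwargs)
                conn.execute("INSERT INTO cache VALUES (?, ?)", (key, json.dumps(result)))
                conn.commit()
                return result
            finally:
                conn.close()
        return wrapper
    return decorator


calls = {"lru": 0, "sqlite": 0}
db_path = os.path.join(tempfile.mkdtemp(), "cache.db")


@functools.lru_cache(maxsize=None)
def summarize_lru(text):
    calls["lru"] += 1
    return text.upper()


@persistent_cache(db_path)
def summarize_sqlite(text):
    calls["sqlite"] += 1
    return text.upper()


summarize_lru("hello")
summarize_sqlite("hello")
summarize_lru.cache_clear()  # stands in for a process restart
summarize_lru("hello")
summarize_sqlite("hello")
print(calls)  # {'lru': 2, 'sqlite': 1}
```

The in-memory cache has to recompute after the simulated restart, while the SQLite-backed one answers from disk.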

What About MLflow, DVC, or LangSmith?

These are great tools, but they solve different problems:

MLflow / DVC: Experiment tracking, model versioning, dataset management — they track runs, not per-step dependencies within a run
LangSmith / Langfuse: Observability and tracing — they tell you what happened, but don't replay it

Musubito fills the gap: incremental execution of a multi-step pipeline where you only rerun what changed.

Getting Started

pip install musubito

Minimal working example:

from musubito import step, StepType

@step(step_type=StepType.DETERMINISTIC)
def preprocess(data: dict) -> str:
    return str(sorted(data.items()))

@step(step_type=StepType.STOCHASTIC, ttl_seconds=3600)
def call_model(prompt: str) -> str:
    # your LLM call here
    return f"response to: {prompt}"

@step(step_type=StepType.DETERMINISTIC)
def postprocess(response: str) -> dict:
    return {"result": response, "length": len(response)}

# First run: cold execution
raw = preprocess({"query": "what is AI?"})
response = call_model(raw.value)
final = postprocess(response.value)

# Second run: everything replays from SQLite (~5ms total)
raw2 = preprocess({"query": "what is AI?"})
response2 = call_model(raw2.value)
final2 = postprocess(response2.value)

Current Limitations (Honest Take)

Musubito is early-stage and not yet suitable for every use case:

Single-writer SQLite: not designed for multi-process concurrent writes
No semantic caching: identity is exact hash-based, not semantic similarity
Local storage only: no remote backend yet (planned for future versions)

These are known tradeoffs, documented in the paper currently under review.

Resources

GitHub: github.com/DAltieri86/musubito
PyPI: pip install musubito
Paper: under review

If you're building agentic workflows and struggling with iteration speed or token costs during development, give musubito a try and let me know what you think. Feedback on the fan-in API and step-type semantics especially welcome.

Top comments (2)

Harjot Singh • May 31

This is basically incremental builds applied to agent pipelines, and it's a genuinely underused idea. The dev loop re-calling the LLM for steps that didn't change is the same waste a compiler would have if it rebuilt every file on every keystroke, you're paying real tokens to regenerate byte-identical outputs just because the step downstream changed. Caching step outputs and replaying only what actually changed is the Make/memoization move, and 60% sounds right because during development the vast majority of a run is unchanged prefix while you iterate on one step near the end. The hard and interesting part, and where this lives or dies, is the dependency-aware bit: correct invalidation. Cache too aggressively and you replay a stale cached output after an upstream input quietly changed, so now you're iterating against a wrong baseline and the bug you're chasing isn't even real, the classic cache-invalidation footgun. So the dependency graph (what each step actually depends on, including prompt, model, and upstream outputs) is the whole ballgame, get the keys right and it's free speed, get them wrong and it's silent staleness. Cache what didn't change, but be ruthless about what counts as changed. That spend-tokens-only-on-what-actually-changed instinct is core to how I think about cost in Moonshift. How are you keying the cache, hashing the full input set including prompt and model version, or tracking dependencies more structurally?

Domenico • Jun 1

Thanks, this is a really good read of it.
You nailed the hard part. The whole bet is on getting the keys right, and here’s exactly how Musubito does it.
Every node ID is a SHA-256 over three things combined: the full canonical hash of all function inputs (prompt, model string, every parameter; anything passed to the decorated function gets serialized and hashed), the operation name (the fully qualified module.qualname of the function), and the sorted set of upstream node IDs. That last part is the structural dependency tracking you’re asking about: it’s not just “what did I receive as input values”, it’s “which specific prior execution produced those values”. So if an upstream step reruns and produces a different artifact, its node ID changes, which cascades and changes the child node ID too. The cached child result is then simply never found; no explicit invalidation is needed, because the identity itself breaks.
The silent staleness footgun you described, iterating against a wrong baseline, is exactly what the upstream node ID in the key is meant to prevent. If the upstream output changed, the downstream node is a different node by definition.
Where it can still go wrong is if you pass an upstream .value directly as a plain argument without going through musubito_merge(). In that case Musubito sees the value but not the lineage, so it won’t know the provenance changed if the value happened to be identical. That’s the main footgun to watch for right now.
Curious about Moonshift, are you doing something similar structurally or more at the runtime/scheduler level?