I thought plugging in LangSmith would solve agentic AI monitoring
In my last post, I shared the "costs are a black box" problem.
Some issues used 10,000 tokens, others 1,000,000. A 100x difference with no explanation.
I figured switching to LangGraph and adding LangSmith would fix it.
Wrong.
First wall: CLI calls can't be traced
The original setup ran Claude Code CLI as a subprocess from GitHub Actions.
GitHub Actions → npx claude-code → (black box) → result
Even with LangSmith connected, LLM calls inside the CLI were invisible. The problem wasn't the tooling—it was that the architecture wasn't observable in the first place.
So I switched from CLI to the Anthropic SDK.
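In concrete terms, the change looks something like this. A minimal sketch: the model name, prompt, and variable names are placeholders, not the actual pipeline code.

```python
# Before: the LLM calls happened inside the CLI subprocess, out of reach.
# subprocess.run(["npx", "claude-code", ...])

# After: call the model through the Anthropic SDK, in our own process.
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model name
    max_tokens=1024,
    messages=[{"role": "user", "content": "Triage this GitHub issue: ..."}],
)

# Every response now carries its own token counts, something the CLI never exposed.
print(response.usage.input_tokens, response.usage.output_tokens)
```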
Second wall: tokens don't show up in LangSmith
Switched to the SDK. Still no token counts.
After debugging, I learned two things (sketched in the code below):
- You need run_type=llm for LangSmith to track token counts
- input_cost and output_cost must be added as metadata manually
I assumed an "observability tool" would show everything automatically. Turns out, defining what you want to see is your job.
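Roughly what the fix ends up looking like with the langsmith SDK. Treat it as a sketch: the per-token prices are made-up numbers, the exact mechanism for attaching metadata to the current run can differ between langsmith versions, and input_cost / output_cost are just the key names described above.

```python
from anthropic import Anthropic
from langsmith import traceable
from langsmith.run_helpers import get_current_run_tree

client = Anthropic()

# Hypothetical prices in USD per million tokens, only to make the example concrete.
PRICE_IN, PRICE_OUT = 3.00, 15.00

@traceable(run_type="llm", name="analyze")  # run_type="llm" is what makes tokens show up
def analyze(prompt: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model name
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    usage = response.usage

    # Costs are not computed for you: attach input_cost and output_cost
    # to the current run's metadata yourself.
    run = get_current_run_tree()
    if run is not None:
        run.extra = run.extra or {}
        run.extra.setdefault("metadata", {}).update({
            "input_cost": usage.input_tokens * PRICE_IN / 1_000_000,
            "output_cost": usage.output_tokens * PRICE_OUT / 1_000_000,
        })
    return response.content[0].text
```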
Third wall: the trap of over-engineering
I tried to build for scale and include LangGraph V1's new features:
- Durable State & Built-in Persistence
- Scoring, observability, model config
- Fallback API logic
In the end, I deleted all of it.
Once I actually ran it, none of that was needed. Most settings were better off hardcoded, and the extra logic only added complexity.
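For reference, what survived is essentially a plain three-node graph: triage → analyze → fix, the same steps as in the numbers below. The state fields and node bodies here are stand-ins, not the real pipeline:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class IssueState(TypedDict, total=False):
    issue: str      # raw issue text
    analysis: str   # output of the analyze step
    fix: str        # output of the fix step

# Each node is an ordinary function; the real ones wrap the traceable LLM calls above.
def triage(state: IssueState) -> IssueState:
    return {"issue": state["issue"].strip()}

def analyze(state: IssueState) -> IssueState:
    return {"analysis": f"analysis of: {state['issue']}"}

def fix(state: IssueState) -> IssueState:
    return {"fix": f"patch for: {state['analysis']}"}

builder = StateGraph(IssueState)
builder.add_node("triage", triage)
builder.add_node("analyze", analyze)
builder.add_node("fix", fix)
builder.add_edge(START, "triage")
builder.add_edge("triage", "analyze")
builder.add_edge("analyze", "fix")
builder.add_edge("fix", END)

graph = builder.compile()  # no checkpointer, no fallbacks: plain and hardcoded

result = graph.invoke({"issue": "example issue text ..."})
```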
Result: now I can see the costs
The cause of that 100x difference finally became visible:
Triage → input 1.5K / output 102 tokens → $0.002
Analyze → input 18K / output 703K tokens → $0.105
Fix → input 50K / output 3,760K tokens → $0.512
The shift from "running on gut feel" to "running on numbers."
Plugging in an observability tool is easy. Building an observable architecture is the real work.
Next: a dashboard that shows whether these numbers actually deliver value—productivity metrics in real time.