In my last post, I shared that agentic AI cut issue handling time by 66%.
The numbers looked great on paper. But the moment I asked "can we actually keep running this?"—unexpected problems started piling up.
## Costs are a black box
Agents make multiple LLM calls internally: reading code, analyzing, deciding, sometimes double-checking.
But I had no way to track how many tokens any of this used.
Some issues consumed 10,000 tokens. Others burned 1,000,000.
A 100x difference—and I couldn't tell why.
The current setup runs Claude Code CLI inside GitHub Actions, and there's no mechanism in place to track and analyze token usage systematically.
I discovered that tools like LangSmith are built to solve exactly this. Next step: either parse CLI logs into structured data, or integrate a dedicated monitoring tool.
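As a starting point for the "parse CLI logs into structured data" option, here's a minimal sketch of the aggregation step. It assumes each agent run leaves behind one JSON file per issue (e.g. `issue-123.json`) containing a list of LLM call records with a `usage` object; those file and field names are assumptions about the log format, not a documented Claude Code interface.

```python
import json
from pathlib import Path


def summarize_token_usage(log_dir: str) -> list[dict]:
    """Aggregate per-issue token usage from JSON logs.

    Assumes one JSON file per issue run, each holding a list of LLM
    call records with a `usage` object. The field names below are
    placeholders for whatever the actual CLI output contains.
    """
    summaries = []
    for log_file in sorted(Path(log_dir).glob("issue-*.json")):
        records = json.loads(log_file.read_text())
        input_tokens = sum(r.get("usage", {}).get("input_tokens", 0) for r in records)
        output_tokens = sum(r.get("usage", {}).get("output_tokens", 0) for r in records)
        summaries.append({
            "issue": log_file.stem,
            "calls": len(records),
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "total_tokens": input_tokens + output_tokens,
        })
    return summaries


if __name__ == "__main__":
    for row in summarize_token_usage("agent-logs"):
        print(f"{row['issue']}: {row['calls']} calls, {row['total_tokens']:,} tokens")
```

Dumping this into a CSV on every run would already make the 10,000-vs-1,000,000 outliers visible and attributable to a specific issue.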
## The workflow only exists as code
Right now the workflow is 4 YAML files. Each one is 100-300 lines, and the conditional branches keep getting more complex. When a teammate asks "what exactly is the AI doing right now?"—I have to open the code and explain line by line.
Trust is everything for agentic AI.
To explain "why did the AI make this decision?", you need to see the intermediate steps visually.
I'm looking into low-code platforms like LangFlow and n8n. If the visual workflow actually reflects what's running, with logs at each node, that would help a lot.
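Until one of those platforms is in place, part of that visibility could come from structured step logs: one machine-readable line per decision point, which a teammate (or a future dashboard) can read without opening the YAML. The sketch below is purely illustrative; the step names and fields are invented, not taken from the real workflow.

```python
import json
import sys
import time


def log_step(step: str, decision: str, **details) -> None:
    """Emit one JSON line per workflow step so intermediate decisions
    can be reviewed (or rendered by a log viewer) after the run."""
    record = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "step": step,
        "decision": decision,
        **details,
    }
    print(json.dumps(record), file=sys.stderr)


# Example: trace the branch points the YAML currently hides.
log_step("triage", "needs_code_change", issue=123, confidence="high")
log_step("analysis", "root_cause_found", files=["src/auth.py"])
log_step("fix", "opened_pr", pr_branch="ai/issue-123")
```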
## No real-time KPIs
"AI saved 66% of developer time" came from manually analyzing 10 issues.
But I have no idea how well the AI is performing right now.
Metrics I need:
- AI filtering rate: % of issues AI resolved without developer involvement
- Average first response time: how quickly we reply after an issue is created
- Auto-resolution rate: % where AI fixed code and opened a PR
- Daily/weekly/monthly trends: is performance improving or degrading
Without a dashboard, I'm left judging by feel—"seems to be working fine."
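To make those metrics concrete, here's a rough sketch of how the first two could be computed from the GitHub REST API, assuming the workflow labels the issues it handles. The label names (`ai-resolved`, `ai-pr-opened`) are hypothetical and would have to be applied by the workflow itself.

```python
import requests

# Hypothetical labels the workflow would need to apply; they are
# assumptions for illustration, not labels that exist today.
AI_RESOLVED_LABEL = "ai-resolved"
AI_PR_LABEL = "ai-pr-opened"


def fetch_issues(owner: str, repo: str, token: str) -> list[dict]:
    """Fetch recent issues (excluding pull requests) via the GitHub REST API."""
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/issues",
        headers={"Authorization": f"Bearer {token}"},
        params={"state": "all", "per_page": 100},
        timeout=30,
    )
    resp.raise_for_status()
    return [i for i in resp.json() if "pull_request" not in i]


def kpi_snapshot(issues: list[dict]) -> dict:
    """Compute the AI filtering rate and auto-resolution rate from labels."""
    total = len(issues) or 1
    label_sets = [{l["name"] for l in i["labels"]} for i in issues]
    filtered = sum(AI_RESOLVED_LABEL in ls for ls in label_sets)
    auto_fixed = sum(AI_PR_LABEL in ls for ls in label_sets)
    return {
        "issues": len(issues),
        "ai_filtering_rate": filtered / total,
        "auto_resolution_rate": auto_fixed / total,
    }
```

First response time would take one extra call per issue to the comments endpoint to find the earliest bot reply, and the trend lines are just this snapshot run on a schedule and appended to a CSV.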
Agentic AI isn't "build and done"—it's "the beginning of operations."
Results came fast, but making this a sustainable system is a completely different challenge.
Next time I'll share how I'm tackling these problems one by one, starting with cost monitoring and real-time dashboards.