Stephen Phillips

Posted on Jul 4

Your AI agent needs receipts, not vibes: tracing MCP workflows for small businesses

#ai #mcp #observability #automation

A small-business AI agent is easy to demo and surprisingly hard to trust.

The demo looks clean: connect the agent to email, invoices, a CRM, maybe a few n8n workflows, then ask it to chase unpaid invoices or triage customer messages. It calls the right tools. It writes a neat summary. Everyone nods.

Then Monday happens.

A customer asks why they got the wrong follow-up. The owner wants to know whether the agent actually checked the CRM before it emailed them. The developer opens a log file and finds a pile of model prompts, HTTP requests, half-useful timestamps, and no obvious story.

That is the line between an AI automation toy and an AI automation system: can you reconstruct what happened after the agent did something real?

For HappyMonkey-style small-business automation, this is becoming the next practical problem after "can the agent call tools?" MCP gives agents a standard way to connect to external systems. n8n and similar workflow tools give teams a place to run repeatable business processes. OpenTelemetry and agent observability patterns give you the missing receipts.

MCP gives the agent hands

The Model Context Protocol describes itself as an open standard for connecting AI applications to external systems: local files, databases, tools, search engines, workflows, prompts, and more. The docs use the USB-C metaphor, which is fair enough. The useful part is not the metaphor. It is the separation between the agent and the integrations it uses.

Instead of hardcoding every integration directly into an agent, you expose capabilities through MCP servers. The agent can discover a tool, call it with structured inputs, and get structured results back.
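To make "structured inputs" concrete: MCP messages are JSON-RPC 2.0, and a tool invocation is a tools/call request carrying the tool name and an arguments object. The sketch below only builds that request as a dictionary so you can see its shape. The tool name and the customer_id argument are illustrative assumptions, not a real server's schema.

```python
import json
from itertools import count

# Simple incrementing request IDs; a real client manages these per session.
_request_ids = count(1)


def build_tool_call(tool_name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request for MCP's tools/call method."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical tool name and argument, used only for illustration.
request = build_tool_call("lookup_customer_balance", {"customer_id": "cust_123"})
print(json.dumps(request, indent=2))
```

Because every call has the same envelope, the name and the argument keys are easy to capture for a receipt without parsing free-form text.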

That matters for small businesses because their software stack is usually messy. The accountant lives in one system. Leads come from a website form. Bookings land in a calendar. The owner still forwards important emails manually. If every integration needs bespoke agent code, the project dies under maintenance.

MCP helps by making tool access more regular. But it does not magically make the work safe. Once an agent can touch real systems, you need to know what it touched, why it touched it, how long it took, what it cost, and what it returned.

Tool access without observability is just a more confident black box.

One practical pattern here is a dynamic MCP gateway rather than a static pile of MCP servers. We use DynamicMCPProxy locally for this: the IDE connects to one proxy, sends project/task context, and the proxy lazily activates the relevant MCP servers while keeping the active tool count under control. That solves the "tool soup" problem, but it also creates a useful control point for receipts: every server activation, tool call, latency, and outcome can pass through one place before it reaches the agent.
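As a rough illustration of why one control point helps, here is a minimal sketch of a gateway that routes every tool call through a single method and records a receipt for it. This is not DynamicMCPProxy's code. The class name, the receipt fields, and the in-memory tool registry are all assumptions made for the example.

```python
import time
import uuid
from typing import Any, Callable


class ReceiptGateway:
    """Routes tool calls through one place and records a receipt per call."""

    def __init__(self) -> None:
        self._tools: dict[str, Callable[..., Any]] = {}
        self.receipts: list[dict] = []

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def call(self, name: str, **arguments: Any) -> Any:
        receipt = {
            "span_id": uuid.uuid4().hex[:16],
            "tool": name,
            # Record argument names only, never the values.
            "argument_keys": sorted(arguments),
            "status": "ok",
        }
        started = time.perf_counter()
        try:
            return self._tools[name](**arguments)
        except Exception as exc:
            # Unknown tools (KeyError) and tool failures both land here.
            receipt["status"] = "error"
            receipt["error_type"] = type(exc).__name__
            raise
        finally:
            elapsed = time.perf_counter() - started
            receipt["latency_ms"] = round(elapsed * 1000, 2)
            self.receipts.append(receipt)


gateway = ReceiptGateway()
# Stand-in tool that returns fake order IDs.
gateway.register("get_recent_orders", lambda customer_id: ["o_1", "o_2", "o_3"])
orders = gateway.call("get_recent_orders", customer_id="cust_123")
print(gateway.receipts[0]["tool"], gateway.receipts[0]["status"])
```

The point of the finally block is that a receipt is written whether the call succeeds or fails, so a crash does not leave a gap in the record.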

Workflows give the agent rails

This is where tools like n8n fit. n8n's own 2026 guide frames AI workflow automation as different from traditional app-to-app automation: the AI layer can interpret, decide, generate, and adapt as it runs, while the workflow layer still provides structure.

That split is useful. Let the agent decide which business action is needed, but put the actual action behind a workflow that has retries, validation, credentials, and predictable side effects.

For example:

lookup_customer_balance
send_payment_reminder_for_invoice
create_follow_up_task
summarize_new_leads_from_website
draft_wordpress_post_from_source_notes

Each of those can be an MCP-exposed tool or a workflow behind an MCP server. The agent chooses. The workflow executes.
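One way to sketch "the agent chooses, the workflow executes" is an allowlist: the agent returns a tool name and arguments, and a small dispatcher refuses anything that is not a known workflow with the expected argument names. The handler bodies below are placeholders, and the argument names are assumptions; in practice each handler would trigger the real workflow.

```python
from typing import Callable


def send_payment_reminder_for_invoice(invoice_id: str) -> dict:
    # Placeholder: a real handler would trigger the workflow engine here.
    return {"workflow": "send_payment_reminder_for_invoice",
            "invoice_id": invoice_id, "status": "queued"}


def create_follow_up_task(customer_id: str, note: str) -> dict:
    # Placeholder: a real handler would create the task in the CRM.
    return {"workflow": "create_follow_up_task",
            "customer_id": customer_id, "status": "queued"}


# Each entry: workflow name -> (handler, exact set of required argument names).
WORKFLOWS: dict[str, tuple[Callable[..., dict], set]] = {
    "send_payment_reminder_for_invoice":
        (send_payment_reminder_for_invoice, {"invoice_id"}),
    "create_follow_up_task":
        (create_follow_up_task, {"customer_id", "note"}),
}


def execute_choice(choice: dict) -> dict:
    """Run the workflow the agent picked, rejecting anything off the rails."""
    name = choice.get("tool")
    if name not in WORKFLOWS:
        raise ValueError(f"unknown workflow: {name!r}")
    handler, required = WORKFLOWS[name]
    arguments = choice.get("arguments", {})
    if set(arguments) != required:
        raise ValueError(f"{name} expects arguments {sorted(required)}")
    return handler(**arguments)


# The dict below stands in for a model's structured tool choice.
result = execute_choice({
    "tool": "send_payment_reminder_for_invoice",
    "arguments": {"invoice_id": "inv_789"},
})
print(result["status"])
```

The model never gets to invent an action. It can only pick from the list, which keeps side effects predictable even when the reasoning is not.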

But again, the question after a failure is not "did we use MCP?" It is "what exactly happened?"

If an invoice reminder went to the wrong person, you need a trace that answers boring questions quickly:

Which user request started this run?
Which model answered?
Which tools did it call?
What arguments did it pass?
Which workflow ran?
Did the workflow retry?
What did the external API return?
Did a human approve the final action?

The boring questions are the business-critical ones.

Traces are better than giant logs

A normal application log says, "this thing happened at this time." That is useful, but agent runs are nested. A single request might include planning, retrieval, model calls, tool calls, workflow calls, retries, and a final response.

A trace gives you the tree.

OpenTelemetry is already the common language for tracing normal distributed systems. Its GenAI semantic conventions now include concepts for model requests, token usage, and related AI operations. The OpenTelemetry docs list GenAI semantic conventions alongside other standard instrumentation areas, and the ecosystem is moving toward treating model calls and agent steps as first-class spans rather than random log lines.

The CNCF has also been talking about Jaeger evolving for AI-agent traces with OpenTelemetry. That is a strong signal: agent observability is not just an LLM tooling niche. It is getting pulled into the same operational world as services, queues, databases, and APIs.

For small-business automation, you probably do not need a huge observability platform on day one. You do need the shape of the data to be right.

A practical trace might look like this:

customer_email_triage run
  model.plan
  mcp.tool.search_customer_by_email
  mcp.tool.get_recent_orders
  workflow.n8n.create_support_ticket
  model.draft_reply
  human.approval.requested
  email.send

Each span should carry just enough metadata:

span name: mcp.tool.get_recent_orders
customer_id: cust_123
workflow_run_id: n8n_456
latency_ms: 820
status: ok
records_returned: 3
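If you want that tree before adopting a tracing library, a small context manager is enough to get the shape right. The sketch below uses only the standard library and is not the OpenTelemetry API. The Tracer class and its field names are assumptions for illustration, chosen so the nodes could be mapped onto real spans later.

```python
import time
from contextlib import contextmanager


class Tracer:
    """Collects nested spans as a tree of plain dictionaries."""

    def __init__(self):
        self.root = None
        self._stack = []

    @contextmanager
    def span(self, name, **attributes):
        node = {"name": name, "attributes": attributes,
                "children": [], "status": "ok"}
        if self._stack:
            # Attach to whichever span is currently open.
            self._stack[-1]["children"].append(node)
        else:
            self.root = node
        self._stack.append(node)
        started = time.perf_counter()
        try:
            yield node
        except Exception:
            node["status"] = "error"
            raise
        finally:
            elapsed = time.perf_counter() - started
            node["latency_ms"] = round(elapsed * 1000, 2)
            self._stack.pop()


def render(node, depth=0):
    """Return the span tree as indented lines, one per span."""
    lines = ["  " * depth + node["name"]]
    for child in node["children"]:
        lines.extend(render(child, depth + 1))
    return lines


tracer = Tracer()
with tracer.span("customer_email_triage run"):
    with tracer.span("model.plan"):
        pass
    with tracer.span("mcp.tool.get_recent_orders", customer_id="cust_123") as span:
        span["attributes"]["records_returned"] = 3
    with tracer.span("email.send"):
        pass

print("\n".join(render(tracer.root)))
```

Nesting comes for free from the with blocks, which is the main thing a flat log cannot give you.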

Do not log private customer content by default. Log IDs, counts, status, latency, cost, tool names, model names, and approval state. Keep the sensitive payload somewhere controlled, if you need to keep it at all.

That one design choice matters. Small businesses often want local AI or self-hosted workflows because they care about privacy, cost, or control. Observability should not undo that by spraying customer emails into a third-party logging account.
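A simple way to enforce "IDs and counts, not content" is an allowlist applied before anything is written to a span or log. In this sketch, the set of safe keys is an assumption based on the fields listed above; adjust it to your own schema. Anything not on the list is dropped, and only the names of the dropped keys are kept.

```python
# Keys considered safe to log. This list is an assumption for illustration.
SAFE_KEYS = {
    "customer_id", "workflow_run_id", "latency_ms", "status",
    "records_returned", "tool", "model", "approval_state", "cost_usd",
}


def safe_attributes(raw: dict) -> dict:
    """Keep only allowlisted keys; note which keys were dropped, not their values."""
    safe = {key: value for key, value in raw.items() if key in SAFE_KEYS}
    dropped = sorted(set(raw) - SAFE_KEYS)
    if dropped:
        safe["dropped_keys"] = dropped
    return safe


attributes = safe_attributes({
    "customer_id": "cust_123",
    "status": "ok",
    "latency_ms": 820,
    "email_body": "Hi, my order arrived damaged and I would like a refund.",
})
print(attributes)
```

An allowlist fails safe: a new field added by a tool stays out of the logs until someone decides it belongs there.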

The minimum viable agent receipt

If you are building this stack for a client, start smaller than you think.

For every agent run, save a receipt with:

Trigger: user message, cron job, webhook, or incoming email.
Decision: the short reason the agent chose a tool or workflow.
Tool calls: tool name, arguments with sensitive fields redacted, status, duration.
Workflow calls: workflow ID, run ID, status, retry count.
Model usage: model name, latency, token count or local runtime estimate.
Human gate: whether a human approved, edited, or blocked the action.
Outcome: what changed in the real world.

That receipt can be a JSON file at first. It can become OpenTelemetry spans when the system grows. The important thing is to design as if someone will ask, "why did the agent do that?" because someone will.

Here is a simple JSON shape:

{
  "run_id": "run_2026_07_04_001",
  "trigger": "inbound_customer_email",
  "agent": "support_triage_agent",
  "model": "local-llm-via-ollama",
  "tool_calls": [
    {
      "name": "mcp.tool.search_customer_by_email",
      "status": "ok",
      "duration_ms": 210,
      "redacted_args": { "email_hash": "..." }
    }
  ],
  "workflow_calls": [
    {
      "name": "n8n.create_support_ticket",
      "run_id": "n8n_456",
      "status": "ok"
    }
  ],
  "human_approval": "required_before_send",
  "outcome": "drafted_reply_and_created_ticket"
}
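Writing that shape to disk takes very little code. The sketch below appends one receipt per line to a JSONL file and reads it back. The Receipt dataclass mirrors the JSON above, but the default values and the file name are assumptions for the example.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class Receipt:
    run_id: str
    trigger: str
    agent: str
    model: str
    tool_calls: list = field(default_factory=list)
    workflow_calls: list = field(default_factory=list)
    human_approval: str = "not_required"  # assumed default, not from any spec
    outcome: str = ""


def append_receipt(path: Path, receipt: Receipt) -> None:
    """Append one receipt as a single JSON line."""
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(asdict(receipt), sort_keys=True) + "\n")


def load_receipts(path: Path) -> list[dict]:
    """Read every receipt back, skipping blank lines."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# A temporary directory keeps the example from touching real files.
with tempfile.TemporaryDirectory() as folder:
    receipts_path = Path(folder) / "receipts.jsonl"
    append_receipt(receipts_path, Receipt(
        run_id="run_2026_07_04_001",
        trigger="inbound_customer_email",
        agent="support_triage_agent",
        model="local-llm-via-ollama",
        workflow_calls=[{"name": "n8n.create_support_ticket",
                         "run_id": "n8n_456", "status": "ok"}],
        human_approval="required_before_send",
        outcome="drafted_reply_and_created_ticket",
    ))
    loaded = load_receipts(receipts_path)

print(loaded[0]["outcome"])
```

One line per run means you can search receipts with ordinary text tools long before you need a tracing backend.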

This is not glamorous. It is the stuff that keeps the owner from losing confidence the first time a workflow behaves oddly.

A concrete gateway example

In our own stack, this is the direction we have taken with DynamicMCPProxy. It started as a way to stop MCP "tool soup": connect the IDE to one proxy, let the proxy choose the right MCP servers for the current project, and keep the active tool list within a sensible budget.

The same gateway is also the right place to add receipts. The latest version records JSONL receipt events for handshakes, server activation, lazy materialization, and child MCP tool calls. Those records include a run_id, span_id, event type, caller identity, status, latency, server name, runtime, and argument keys.

The important security detail is what it does not record. HMAC-authenticated sidecar calls are logged as a caller such as service:hmac, but the HMAC key itself is never written. Tool arguments are summarized as keys, types, lengths, and hashes rather than raw customer content. Optional OpenTelemetry export can mirror the trace shape later, but the local JSONL receipt still works without shipping sensitive payloads to an observability vendor.
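To show what "keys, types, lengths, and hashes" can look like in practice, here is a minimal sketch. It is not taken from DynamicMCPProxy. The function name, the output field names, and the choice of a keyed HMAC-SHA256 digest are my assumptions. A keyed hash is used because a plain hash of a low-entropy value such as an email address can be reversed by guessing.

```python
import hashlib
import hmac
import json


def summarize_arguments(arguments: dict, hash_key: bytes) -> dict:
    """Describe each argument by type, length, and keyed hash, never by value."""
    summary = {}
    for key, value in arguments.items():
        # Serialize deterministically so equal values give equal digests.
        encoded = json.dumps(value, sort_keys=True, default=str).encode("utf-8")
        digest = hmac.new(hash_key, encoded, hashlib.sha256).hexdigest()
        entry = {"type": type(value).__name__, "hmac_sha256": digest[:16]}
        if hasattr(value, "__len__"):
            entry["length"] = len(value)
        summary[key] = entry
    return summary


# The key here is a placeholder; load a real one from a secret store.
summary = summarize_arguments(
    {"email": "jo@example.com", "limit": 5},
    hash_key=b"example-key-not-for-production",
)
print(json.dumps(summary, indent=2))
```

The same input always produces the same digest, so you can still tell whether two runs used the same customer without storing who the customer was.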

That is the pattern I would use for most small-business agent systems: local receipts first, OpenTelemetry when the workflow has enough traffic or enough risk to justify distributed tracing.

What I would instrument first

If you are working with a small business, do not start by instrumenting every prompt token and edge case. Start with the places that create support pain or financial risk.

For an invoice workflow, trace customer lookup, invoice lookup, payment status, reminder generation, approval, and send.

For a lead workflow, trace source, deduplication, enrichment, CRM write, notification, and follow-up task creation.

For a WordPress/content workflow, trace source URLs, summarization, draft creation, image generation, human review, and publish state. Especially publish state. Nobody wants an agent accidentally publishing a draft because a boolean default was wrong.

For a local AI setup with Ollama, trace runtime and fallback behavior. If a local model fails and the system falls back to a cloud model, that should be visible. If a workflow silently switches models, your privacy story has a hole in it.
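Making fallback visible can be as simple as returning a record alongside the model output. In the sketch below, the two model callables are stand-ins rather than real Ollama or cloud clients, and the record fields are assumptions. The cloud path is off by default, so a privacy-sensitive workflow fails loudly instead of switching models quietly.

```python
def generate_with_fallback(prompt, local_model, cloud_model, allow_cloud=False):
    """Try the local model first; record exactly which model was used."""
    record = {"attempted": ["local"], "used": None, "fallback": False}
    try:
        text = local_model(prompt)
        record["used"] = "local"
    except Exception as exc:
        record["local_error"] = type(exc).__name__
        if not allow_cloud:
            # No silent switch: report the failure and stop.
            record["used"] = "none"
            return None, record
        record["attempted"].append("cloud")
        record["fallback"] = True
        text = cloud_model(prompt)
        record["used"] = "cloud"
    return text, record


def failing_local(prompt):
    # Stand-in for a local runtime that is down or too slow.
    raise TimeoutError("local model did not respond")


def fake_cloud(prompt):
    # Stand-in for a hosted model client.
    return "cloud reply"


blocked_text, blocked = generate_with_fallback("hello", failing_local, fake_cloud)
allowed_text, allowed = generate_with_fallback(
    "hello", failing_local, fake_cloud, allow_cloud=True
)
print(blocked["used"], allowed["used"])
```

Attach the record to the run's receipt and the question "did customer data leave the building?" has a yes-or-no answer.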

The sales angle is reliability, not magic

A lot of small-business AI pitches still sound like magic: "we will automate your operations with agents." Owners have heard enough of that.

A better pitch is more concrete:

"We will automate one repetitive workflow. You will be able to see every tool the agent used, every workflow it triggered, whether a human approved it, and what changed. If something goes wrong, we can replay the receipt."

That is less flashy, but it is more believable.

MCP makes the integrations less brittle. Workflow tools make the actions repeatable. Observability makes the whole thing accountable.

That combination is the difference between a clever prototype and a service you can charge for every month.

Source notes

Model Context Protocol documentation: MCP is an open-source standard for connecting AI applications to external systems such as data sources, tools, and workflows. https://modelcontextprotocol.io/docs/getting-started/intro
OpenTelemetry GenAI semantic conventions: OpenTelemetry maintains GenAI conventions alongside standard tracing conventions, including GenAI spans and MCP-related areas in the semantic convention registry. https://opentelemetry.io/docs/specs/semconv/gen-ai/
CNCF / Jaeger: Jaeger is evolving to trace AI agents with OpenTelemetry. https://www.cncf.io/blog/2026/05/26/how-jaeger-is-evolving-to-trace-ai-agents-with-opentelemetry/
n8n: AI workflow automation combines AI decision/generation with standardized workflows, with tools like n8n positioned for flexible automation. https://blog.n8n.io/best-ai-workflow-automation-tools/
DynamicMCPProxy: a dynamic MCP gateway that now includes security-first JSONL receipts and optional OpenTelemetry export. https://github.com/HappyMonkeyAI/DynamicMCPProxy
Social trend signal: X search on 2026-07-04 showed live discussion around MCP, n8n, Ollama, and tool routing for practical small-business automation.