Sachin Magon

Posted on Jun 29 • Originally published at sachin.magonus.com

Things I learned building my first multi-agent AI system on Azure + NVIDIA

#ai #azure #nvidia #python

I recently built a multi-agent customer support system on Azure AI Foundry and NVIDIA NIM. First time doing anything like this. Made four predictions upfront about what would happen. Three of them were wrong.
Here is what I actually learned.

1. "Tokens" is not a unit of cost

It is a unit of work. The price per unit of work varies by 5-10x depending on which model did the work. I was tracking total token count across both the small 9B model and the large 49B model as if they cost the same. They do not. Total tokens went up in the optimized version. Cost in dollars probably went down. I was measuring the wrong thing the whole time.
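
A minimal sketch of the difference between counting tokens and counting dollars. The model names, per-million-token prices, and token counts below are made up for illustration; substitute your provider's real rates.

```python
# Hypothetical per-million-token prices in USD (not real rates).
PRICE_PER_MILLION_TOKENS = {
    "small-9b": 0.10,
    "large-49b": 0.80,
}


def cost_usd(tokens_by_model):
    """Return total cost in dollars for a {model_name: token_count} mapping."""
    return sum(
        tokens * PRICE_PER_MILLION_TOKENS[model] / 1_000_000
        for model, tokens in tokens_by_model.items()
    )


# Made-up token counts: the optimized run uses more tokens overall,
# but shifts most of them to the cheaper model.
baseline = {"small-9b": 0, "large-49b": 1_000_000}
optimized = {"small-9b": 1_200_000, "large-49b": 200_000}

for name, run in (("baseline", baseline), ("optimized", optimized)):
    print(name, "tokens:", sum(run.values()), "cost:", round(cost_usd(run), 2))
```

With these illustrative numbers the optimized run has more total tokens and a lower dollar cost, which is the pattern a single token counter hides.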

2. A verbatim hash cache on natural language traffic deflects ~0% of queries

I predicted 25-40% cache deflection. The actual number was 0%. Every query in my test set was a unique string, so the hash-based cache never had a single chance to fire. A verbatim cache is not a simpler version of a semantic cache. It is a different thing entirely. If your workload is natural language, build semantic similarity caching from day one, not as an upgrade later.
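
To see why, here is a small sketch of an exact-match cache keyed on a hash of the query string. The queries are hypothetical, and this is not the cache from my system, just the same idea in a few lines.

```python
import hashlib


def cache_key(query):
    """Exact-match key: any change in wording produces a different key."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()


def verbatim_hit_rate(queries):
    """Fraction of queries whose exact string was already seen."""
    seen = set()
    hits = 0
    for q in queries:
        key = cache_key(q)
        if key in seen:
            hits += 1
        seen.add(key)
    return hits / len(queries)


# Hypothetical queries: three ways of asking the same thing.
queries = [
    "How do I reset my password?",
    "how can I reset my password",
    "I forgot my password, how do I reset it?",
]
print(verbatim_hit_rate(queries))  # 0.0
```

All three queries mean the same thing, but none of them share a key. A semantic cache would compare embeddings of the queries instead of their bytes.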

3. configure_azure_monitor() does not capture OpenAI SDK calls by default

The OpenAI Python SDK sends its requests through httpx, so you need to install and initialize opentelemetry-instrumentation-httpx explicitly:

pip install opentelemetry-instrumentation-httpx==0.61b0

from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
HTTPXClientInstrumentor().instrument()

Without this, your App Insights Logs will show customMetric and
performanceCounter entries (CPU, memory) but nothing about what your
agent actually did.

4. Pin your OpenTelemetry versions or everything breaks

Installing opentelemetry-instrumentation-httpx without version pinning pulled in opentelemetry-api 1.42.1. But azure-monitor-opentelemetry-exporter needs opentelemetry-api==1.40. The conflict is silent until things start misbehaving. Pin everything to the 0.61b0 / 1.40.0 line:

pip install \
"opentelemetry-api==1.40.0" \
"opentelemetry-instrumentation==0.61b0" \
"opentelemetry-instrumentation-httpx==0.61b0" \
"opentelemetry-semantic-conventions==0.61b0" \
"opentelemetry-util-http==0.61b0"

Then run pip check to confirm no broken requirements.

5. Short-lived Python scripts exit before OTel's batch exporter fires

OpenTelemetry batches traces and sends them every few seconds in the
background. If your script finishes before that timer fires, the traces are dropped silently. Not delayed. Gone. Add an atexit flush:

import atexit
from opentelemetry import trace

def _flush():
    provider = trace.get_tracer_provider()
    if hasattr(provider, "force_flush"):
        provider.force_flush()

atexit.register(_flush)

This pushes buffered traces out before the process exits normally. It will not run if the process is killed outright.

6. Nemotron Nano and Super put output in reasoning_content, not content

On short prompts, both models can spend their whole token budget on internal
reasoning and never fill in the content field, which comes back as None.
msg.content.strip() then crashes with AttributeError.

Always extract text like this:

text = (
    msg.content
    or getattr(msg, "reasoning_content", None)
    or "(no response)"
)

This applies everywhere you read model output: classifiers, answer
functions, test scripts, all of it.
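
Since the same fallback is needed in every call site, one option is to wrap it in a helper. This is a sketch: extract_text is a name I am making up here, and the SimpleNamespace objects only stand in for the SDK's message objects.

```python
from types import SimpleNamespace


def extract_text(msg, default="(no response)"):
    """Return content, else reasoning_content, else a default placeholder."""
    text = msg.content or getattr(msg, "reasoning_content", None) or default
    return text.strip()


# Stand-ins for SDK message objects, for illustration only.
normal = SimpleNamespace(content=" billing ", reasoning_content=None)
reasoning_only = SimpleNamespace(
    content=None, reasoning_content="The user asks about billing."
)
empty = SimpleNamespace(content=None)  # no reasoning_content attribute at all

print(extract_text(normal))
print(extract_text(reasoning_only))
print(extract_text(empty))
```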

7. The NVIDIA model name in the catalog is not the API model string

nvidia/nemotron-nano-9b-v2 returns 404. The actual API string has a
double prefix:

nvidia/nvidia-nemotron-nano-9b-v2

Go to build.nvidia.com, open the model card, click the Python tab, and copy the model= value directly. Do not guess from the catalog name.

8. max_tokens=10 does not work for reasoning models doing classification

I set max_tokens=10 for my classifier call, expecting a one-word label back. Nemotron Nano spent all 10 tokens on its reasoning trace and never produced a label. content came back as None. Set at least max_tokens=100 for any call to a reasoning model, even for simple classification tasks.
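
One way to catch this early is to check whether the response was cut off before trusting the label. The sketch below assumes an OpenAI-style chat completion choice, where finish_reason is "length" when the token budget ran out. The label set and the parse_label helper are hypothetical, and the SimpleNamespace objects only stand in for real responses.

```python
from types import SimpleNamespace

ALLOWED_LABELS = {"billing", "technical", "general"}  # hypothetical label set


def parse_label(choice):
    """Return (label, truncated); label is None if no valid label came back."""
    truncated = choice.finish_reason == "length"
    content = choice.message.content
    label = content.strip().lower() if content else None
    if label not in ALLOWED_LABELS:
        label = None
    return label, truncated


# Stand-ins for response choices, for illustration only.
cut_off = SimpleNamespace(
    finish_reason="length",
    message=SimpleNamespace(content=None),
)
complete = SimpleNamespace(
    finish_reason="stop",
    message=SimpleNamespace(content="Billing\n"),
)

print(parse_label(cut_off))
print(parse_label(complete))
```

A (None, True) result is the signal to raise max_tokens rather than to treat the query as unclassifiable.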

9. Routing decisions need their own log line

I built a router, ran 81 queries through it, got a full benchmark result, and still cannot tell you with confidence what it routed where. The per-category tables in my benchmark were grouped by ground-truth label, not by what the router actually decided. These are not the same thing.
Log the routing decision explicitly on every query, separate from
everything else.
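
A minimal sketch of what that log line could look like, using the standard logging module and one JSON record per decision. The field names, the query ID, and the route and model names are all assumptions for illustration, not what my router emits.

```python
import json
import logging

logger = logging.getLogger("router")


def log_routing_decision(query_id, predicted_route, model, ground_truth=None):
    """Emit one structured log line per routing decision and return the record."""
    record = {
        "event": "routing_decision",
        "query_id": query_id,
        "predicted_route": predicted_route,  # what the router decided
        "model": model,                      # which model served the query
        "ground_truth": ground_truth,        # eval label, if one exists
    }
    logger.info(json.dumps(record))
    return record


logging.basicConfig(level=logging.INFO)
rec = log_routing_decision("q-017", "reasoning", "large-49b", ground_truth="simple")
```

Keeping predicted_route and ground_truth as separate fields means the benchmark tables can be grouped by either one afterwards.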

10. Graceful degradation cannot be tested in a sequential benchmark

I built a downshift mechanism that triggers when the reasoning model's rolling p95 latency exceeds 4000ms. It requires 20 samples in the window before it can activate. My entire eval set had 12 reasoning queries. Before I ran a single query, it was already guaranteed that the mechanism would never trigger.
To test saturation behavior you need either a much larger dataset or a dedicated concurrent load test, not a sequential single-pass benchmark.
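
The arithmetic is easy to reproduce. Below is a sketch of a rolling p95 check with a minimum sample count, using the 4000ms threshold and 20-sample minimum from above. The class name, the window size of 50, the nearest-rank percentile method, and the 9000ms latencies are assumptions for illustration, not my actual implementation.

```python
import math
from collections import deque


class DownshiftMonitor:
    """Rolling p95 latency check that stays inactive until min_samples arrive."""

    def __init__(self, threshold_ms=4000, min_samples=20, window=50):
        self.threshold_ms = threshold_ms
        self.min_samples = min_samples
        self.latencies = deque(maxlen=window)  # window size is an assumption

    def record(self, latency_ms):
        self.latencies.append(latency_ms)

    def p95(self):
        """Nearest-rank 95th percentile of the current window."""
        ordered = sorted(self.latencies)
        rank = math.ceil(0.95 * len(ordered))
        return ordered[rank - 1]

    def should_downshift(self):
        if len(self.latencies) < self.min_samples:
            return False  # not enough data yet, regardless of latency
        return self.p95() > self.threshold_ms


monitor = DownshiftMonitor()
for _ in range(12):  # 12 reasoning queries, all far above the threshold
    monitor.record(9000)
print(monitor.should_downshift())  # False: only 12 of the required 20 samples
```

Even with every query at more than double the threshold, 12 samples cannot activate a mechanism that waits for 20.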

11. Homebrew Python 3.12 has a libexpat conflict on some macOS versions

python3.12 -m venv .venv fails with:

ImportError: Symbol not found: _XML_SetAllocTrackerActivationThreshold

Homebrew's Python was compiled against a newer libexpat than what macOS ships. The fix is pyenv:

brew install pyenv
pyenv install 3.12.13
pyenv local 3.12.13
python -m venv .venv

Everything works after that.

12. The operational layer is harder than the model layer

Every mistake in this list is something I built, measured, or configured wrong. None of them are problems with Azure AI Foundry or NVIDIA NIM. The models worked. The platforms worked. The gaps were all in how I instrumented, tested, and measured the system around them.

That is probably the most useful thing I learned.

Full benchmark results and write-up on my blog:
https://sachin.magonus.com/2026/06/17/multi-agent-poc-benchmark-foundry-nvdia-nim-results/

Framework post (the architecture and predictions before I ran anything):
https://sachin.magonus.com/2026/01/16/multi-agent-framework-foundry-nvidia/

Top comments (1)

Tae Kim • Jul 1

The verbatim cache observation is the one I've seen trip up the most first-time multi-agent builders — natural language queries have near-zero hash collision rate, so an exact-match cache on that layer is essentially dead weight from the first request. Your point about OTel version pinning is also something I've hit hard in production: the silent failure mode where traces get dropped because the batch exporter fires after process exit is especially nasty because you get no error, just missing data in your observability backend. From running LangGraph agents in production I'd add one more gotcha: schema-validate tool call inputs at the agent boundary before they hit the tool, not after, so mismatches surface in your trace as agent errors rather than as silent wrong results downstream.