DEV Community

Ivo Brett

I Gave AI Agents a Telecom Job Interview. Most Failed Without a Cheat Sheet

Introduction

Telecoms is one of the most API-driven industries on the planet. TM Forum has standardised hundreds of operations workflows across product catalogs, customer management, incident response, network topology, billing, and performance monitoring. If AI agents are going to automate telecom operations, they need to work reliably across all of them.

I wanted to find out: can today's open-weight LLMs actually do this? And if not — what closes the gap?

The answer led me to build something I'm calling SKILLS: a benchmark framework and a set of portable domain skill documents that give AI agents the operational knowledge they need to execute real telecom workflows. The name is obviously a play on Agent Skills, an emerging standard that https://agentskills.io describes as

folders of instructions, scripts, and resources that agents can discover and use to do things more accurately and efficiently.

This article covers what I built, what I found, and why the results matter for anyone building agentic AI in a regulated, API-heavy industry.

The Problem: Generalist Agents Hit a Wall

Figure 1: Generalist AI agents need domain-specific skills to handle telecom operations workflows.

Ask a general-purpose LLM agent to handle a task like this:

"Identify any cells with traffic anomalies greater than 3 standard deviations from their baseline, specifically looking at overnight patterns. We need to rule out unauthorized usage or configuration errors."

A capable agent will understand the problem. It will know what standard deviation means. It might even write reasonable analysis logic.

But it won't know that the TMF628 Performance Management API expects g_5mn for 5-minute granularity — not the ISO 8601 PT5M. It won't know the job creation lifecycle. It won't know which fields to filter, or in which order to call the APIs to get the data it needs.

It will hallucinate a plausible-looking answer, or fail validation, or call the wrong endpoint. Not because it's incapable — because it lacks domain knowledge that isn't in any training data.
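To be concrete about which half of the task is hard: the statistical part is exactly the kind of analysis logic a generalist agent can write unaided. A minimal sketch (illustrative only; the data shape and function name are my assumptions, not TMF628 schema):

```python
import statistics

def overnight_anomalies(cell_kpis, threshold=3.0):
    """Flag cells whose latest overnight traffic sample deviates more
    than `threshold` standard deviations from that cell's baseline.

    `cell_kpis` maps cell_id -> list of overnight traffic samples,
    where the last sample is the observation under test (an assumed
    shape for illustration)."""
    anomalies = []
    for cell_id, samples in cell_kpis.items():
        baseline, latest = samples[:-1], samples[-1]
        mean = statistics.mean(baseline)
        stdev = statistics.stdev(baseline)
        if stdev and abs(latest - mean) / stdev > threshold:
            anomalies.append(cell_id)
    return anomalies
```

The hard part is everything this sketch skips: fetching those samples through the correct TMF628 job lifecycle with the correct granularity strings.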

That's the gap skills are designed to close.

What I Built: The SKILLS Benchmark

I built a benchmark of 37 telecom operations scenarios across 8 TM Forum Open API domains:

| TMF API Domain | Scenarios |
| --- | --- |
| TMF628 Performance Management | 4 |
| TMF629 Customer Management | 11 |
| TMF637 Product Inventory | 2 |
| TMF639 Resource Topology | 4 |
| TMF724 Incident Management | 11 |
| TMF620/621/622 Catalog, Tickets, Orders | 5 |

Each scenario runs against live mock API servers backed by MongoDB, with seeded production-representative data. The agent has access to MCP tool interfaces for each server. Evaluation is three-layer: programmatic tool-call verification, LLM judge for response content, and database state assertions.

Scenarios span four complexity tiers from simple single-API lookups to Complex scenarios that require proprietary business logic the model cannot infer from schema alone — SLA weighting formulas, maintenance exclusion rules, specific TMF enumeration formats.

For each scenario I ran the agent twice:

  • Baseline — the agent has tools but no domain guidance
  • With-Skill — the agent receives a portable SKILL.md document encoding workflow steps, API patterns, and business rules

The delta is the skill lift.
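The scoring arithmetic is simple, but worth pinning down so the headline numbers below are unambiguous. A sketch (my illustration, not the actual benchmark harness):

```python
def pass_rate(results):
    """Percentage of scenarios passed, rounded to one decimal.
    `results` is a list of 1 (pass) / 0 (fail) per scenario."""
    return round(100 * sum(results) / len(results), 1)

def skill_lift(baseline, with_skill):
    """Skill lift in percentage points (pp): with-skill pass rate
    minus baseline pass rate for the same model on the same scenarios."""
    return round(pass_rate(with_skill) - pass_rate(baseline), 1)
```

For example, 22/37 scenarios passed at baseline versus 29/37 with skills gives 59.5% → 78.4%, a +18.9pp lift — the Nemotron 120B (std) row below.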

What Are Skills?

A Skill is a portable Markdown document that gives an agent the operational knowledge for a specific telecom workflow. It encodes:

  • Which MCP servers and tools are required
  • The exact sequence of API calls and their parameters
  • Business rules and decision logic (e.g. SLA priority weights)
  • Domain-specific enumeration formats
  • Error handling patterns
  • Required output format

Skills are model-agnostic. They contain no code — only structured natural language instructions that any agent platform can load as system context.
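To make that concrete, here is a skeleton of what such a document might look like. This is a hypothetical outline I've assembled from the bullet list above — not one of the actual SKILL.md files from the benchmark, and the tool names are invented for illustration:

```markdown
# SKILL: tmf628-performance-manager

## Required MCP servers and tools
- tmf628-server: createMeasurementJob, getMeasurementJob

## Workflow
1. Create a measurement collection job with granularity `g_5mn`.
2. Poll the job until its state is completed.
3. Retrieve the KPI payload and analyse it.

## Business rules
- Granularity values use TMF format strings (`g_5mn`, `g_15mn`),
  never ISO 8601 durations such as `PT5M`.

## Output format
- A ranked list of anomalous cells with their deviation scores.
```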

The Evaluation Setup

Figure 2: The Contextware skills evaluation workbench running baseline and with-skill conditions for each scenario.

I evaluated the following open-weight and open-access models via OpenRouter:

  • Nemotron 120B (NVIDIA) — standard and minimal reasoning conditions
  • MiniMax M2.5 (MiniMax)
  • GLM-5 Turbo (Z.AI)
  • Seed 2.0 Lite (ByteDance)
  • Healer Alpha and Hunter Alpha (OpenRouter)

Results

Here's the headline table across all models:

| Model | Baseline | With Skills | Lift |
| --- | --- | --- | --- |
| Healer Alpha | 70.3% | 83.8% | +13.5pp |
| MiniMax M2.5 | 67.6% | 81.1% | +13.5pp |
| Nemotron 120B (std) | 59.5% | 78.4% | +18.9pp |
| Nemotron 120B (min) | 67.6% | 78.4% | +10.8pp |
| GLM-5 Turbo | 73.0% | 78.4% | +5.4pp |
| Seed 2.0 Lite | 56.8% | 75.7% | +18.9pp |
| Hunter Alpha | 43.2% | 62.2% | +18.9pp |

Every single model improved with skills. The lift ranges from +5pp to +19pp overall, and on the hardest Complex scenarios the gains are even larger: +33–44pp.

Model scale alone did not achieve these results; the knowledge had to be injected.

Finding 1: Skills Matter Most Where Models Are Blind

The Complex scenario tier is the most diagnostic part of the benchmark. These scenarios require logic that genuinely isn't in any training data:

  • An SLA risk score calculated as Σ(WEIGHT × BREACH_MINUTES) where Platinum=10, Gold=7, Silver=4
  • A topology traversal that must exclude resources with administrativeState=locked (planned maintenance) to find the true root cause
  • TMF measurement job creation using g_15mn and r_1h format strings
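The first of these is trivial to apply once stated and impossible to infer. As a sketch (the input shape is my assumption for illustration; the formula and weights are as described above):

```python
# SLA risk score from the Complex-tier scenario:
# risk = Σ(WEIGHT × BREACH_MINUTES), weighted by service tier.
SLA_WEIGHTS = {"Platinum": 10, "Gold": 7, "Silver": 4}

def sla_risk_score(breaches):
    """`breaches`: iterable of (sla_tier, breach_minutes) pairs."""
    return sum(SLA_WEIGHTS[tier] * minutes for tier, minutes in breaches)
```

A model cannot derive the 10/7/4 weighting from any schema — it either has the skill or it guesses.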

Without the skill: models either hallucinate a plausible answer, or get the API call wrong at the parameter level. With the skill: they apply the exact logic and pass.

Complex scenario lift across models: +33pp to +44pp. This is where skills earn their keep.

Finding 2: More Reasoning Isn't Always Better

This one surprised me.

I ran Nemotron 120B under two conditions: full reasoning and minimal reasoning (a guardrail preamble instructing it to prefer direct tool calls, skip re-verification steps, and use exact enum values from the skill).

Both conditions scored exactly 78.4% overall with skills.

Identical ceiling. But minimal reasoning scored 88.9% on Complex scenarios vs 77.8% for standard. Reducing reasoning depth improved performance on the hardest tasks.

Why? Because the full reasoning model was burning its budget on the wrong problem.

Finding 3: The Sandbox Discrimination Failure

This is the most significant finding from the Nemotron evaluation.

I traced every tool call across all TMF639 topology analysis scenarios:

| Tool | Calls | Category |
| --- | --- | --- |
| connect_to_mcp_server | 33 | Infrastructure |
| run_command | 28 | Infrastructure |
| get_environment_variable | 24 | Infrastructure |
| list_mcp_servers | 18 | Infrastructure |
| write_file | 15 | Infrastructure |
| create_sandbox | 7 | Infrastructure |
| execute_mcp_tool | 5 | Domain work |

97% of tool calls were infrastructure overhead. 3% was actual domain work.

The model wrote a Python script to call an MCP API tool — when that tool was sitting directly available in its tool palette. Then the ephemeral sandbox expired mid-run. Scenario failed. Not because Nemotron couldn't reason about topology analysis. Because it couldn't decide whether to retrieve data from an API or compute something.

I call this Sandbox Discrimination Failure: the inability to distinguish between "I need to retrieve data from an API" (use the MCP tool directly) and "I need to compute something" (a sandbox is appropriate). Nemotron defaults to sandbox as a general-purpose execution layer regardless of task type.
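One mitigation is to make that retrieve-vs-compute decision explicit before the agent acts. A crude routing sketch — entirely my illustration, not anything Nemotron or the benchmark implements, with keyword lists invented for the example:

```python
def needs_sandbox(task: str) -> bool:
    """Route retrieval-shaped tasks straight to the MCP tool;
    reserve sandbox creation for tasks that genuinely require
    local computation. Marker lists are illustrative heuristics."""
    retrieval_markers = ("fetch", "query", "list", "lookup", "retrieve")
    compute_markers = ("script", "compute", "simulate", "transform")
    text = task.lower()
    if any(m in text for m in retrieval_markers):
        return False  # call the MCP tool directly, no sandbox
    return any(m in text for m in compute_markers)
```

A guard this simple would have kept 97% of those tool calls from ever happening — the point is not the heuristic, but that the discrimination has to live somewhere.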

The cascade looks like this:

Step 3: connect_to_mcp_server (30s)
Step 7: create_sandbox (30s)
Steps 9–15: run_command / write_file cycles (180s)
Step 15: Sandbox expires → recovery attempts (90s)
Step 17+: Scenario timeout (360s total) → FAIL

The agent never reached the actual topology analysis.

Figure 3: Nemotron 120B evaluation results showing the impact of sandbox fixation across TMF domains.

Finding 4: The Reasoning-Prescription Paradox

Reasoning models treat skill instructions as suggestions to evaluate against their general knowledge — not as directives to follow.

When the TMF628 skill says "use g_5mn for 5-minute granularity," Nemotron overrides it with PT5M (ISO 8601) because its training data says ISO formats are more correct. It's internally logical. It's externally wrong.

The API returns a validation error. The scenario fails.

Here's a sample of what we observed:

| Skill says | Model used | Model's reasoning |
| --- | --- | --- |
| r_1h | PT1H | "ISO 8601 standard" |
| g_5mn | r_5mn | Confused prefix semantics |
| unlocked | UNLOCKED | "Enum constants are uppercase" |

This leads to a counterintuitive design principle: skills for reasoning models must be more prescriptive than skills for non-reasoning models. You have to explicitly prohibit the substitutions the model will otherwise make. The more capable the model's reasoning, the more guardrails the skill needs.
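In practice, "more prescriptive" can also mean enforcing the enums outside the prompt entirely. A validation shim like this — the field names keying the dict are my illustration, the enum values are the ones quoted in this article — turns a silent substitution into an immediate, recoverable error:

```python
# Allowed TMF enum values as quoted in this article; the dict keys
# are illustrative field names, not exact TMF628 schema.
ALLOWED_ENUMS = {
    "granularity": {"g_5mn", "g_15mn"},
    "reportingPeriod": {"r_5mn", "r_1h"},
    "administrativeState": {"locked", "unlocked"},
}

def check_enum(field, value):
    """Raise before the API call if the model substituted its own
    'more correct' value (e.g. PT5M, UNLOCKED) for the skill's."""
    allowed = ALLOWED_ENUMS[field]
    if value not in allowed:
        raise ValueError(
            f"{field}={value!r} is invalid; use one of {sorted(allowed)}"
        )
    return value
```

The skill then only has to say what the values are; the shim makes overriding them impossible rather than merely discouraged.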

Finding 5: Baseline-Lift Compression

GLM-5 Turbo (Z.AI) is purpose-built for agent workflows — complex instruction decomposition, multi-step tool chains, execution stability. It shows the clearest example of what I'm calling baseline-lift compression.

GLM-5 achieves the highest baseline of any non-reasoning model (73.0%). Skills add only +5.4pp overall lift. But it still converges on the same 78.4% with-skill ceiling as Nemotron — and reaches 88.9% on Complex scenarios.

The implication: aggregate skill lift is an unreliable quality signal for capable models. If a model already handles most tasks correctly, skills appear not to help much. But drill into the Complex tier — where proprietary logic is required — and skills deliver the same +33pp regardless.

Also observed: on domains where GLM-5 already achieves 100% baseline, injecting a skill can hurt performance. Service assurance: 100% drops to 75% with skill. Domain guidance adds noise where the model already knows the answer.

What This Means for Architects Building Agentic Telco Systems

1. Pretrained models are not enough. Every model tested — regardless of capability tier — improved with skills. The TMF-specific knowledge (enumeration formats, API sequences, business logic) simply isn't in training data at sufficient depth. You cannot engineer your way around this with a bigger model.

2. Skills are a practical, model-agnostic layer. The same SKILL.md document improved performance across every model tested. They're portable, version-controllable, and maintainable by domain experts without ML expertise.

3. Model selection, skill design, and infrastructure are a three-way interaction. A reasoning-heavy model that takes 30 seconds per step will hit sandbox idle timeouts that a 2-second model never encounters. The right model for a TMF724 incident workflow may be the wrong model for a TMF639 topology traversal. Evaluate them together.

4. Complex scenarios are the real test. Overall pass rate flatters every model. The Complex tier — scenarios requiring proprietary logic — is where the real gap opens up, and where skills deliver their highest returns (+33–44pp).

5. Watch for skill interference. High-baseline models can be hurt by skills on domains they already handle correctly. Design skills to add value at the capability boundary, not to duplicate what the model already knows.

The Skills Pack

The 8 TM Forum skills I used in this evaluation are portable SKILL.md documents covering:

  • billing-inquiry — cross-referencing orders, catalog pricing, and inventory records
  • customer-onboarding — multi-API activation across TMF629/620/622
  • incident-management — SLA-weighted triage and dispatch ranking
  • network-incident-assessment — situational analysis across active incidents
  • product-management — catalog and order orchestration
  • service-assurance — troubleshooting across customer and inventory APIs
  • tmf628-performance-manager — KPI job creation and anomaly detection
  • tmf639-topology-analysis — root cause analysis with maintenance exclusion and SLA priority weighting

If you want the full skills pack, connect with me on LinkedIn and I'll DM you the documents.

Conclusion

Generalist AI agents are capable. They can reason about telecom problems, use API tools, and produce operational outputs. But they lack the domain-specific knowledge — the exact API sequences, enumeration formats, and business rules — that production telecom operations require.

Structured skills close that gap reliably and cost-effectively across every model tested. The hardest tasks show the biggest returns. And the findings around reasoning models — the sandbox fixation, the enumeration substitution, the prescription paradox — have direct implications for anyone deploying agentic AI in an API-heavy regulated environment.

The question isn't whether your model can reason. It's whether it's reasoning too much.


Full research paper and benchmark results: https://arxiv.org/abs/2603.15372
LinkedIn: https://www.linkedin.com/in/ivobrett
GitHub: https://github.com/oidebrett
