
Phil Whittaker


Choosing the Right LLM for the Umbraco CMS Developer MCP: A Quick Cost and Performance Analysis

The Early Days of Umbraco MCP

When Matt Wise and I first started building the Umbraco CMS Developer MCP Server, our focus was purely on functionality. Can we expose the Umbraco Management API through MCP? Can an AI assistant create content, manage media, configure document types? The answer was yes, and we got excited building out tool after tool.

What we weren't thinking about was efficiency. Token usage? Cost per operation? Time taken? Sustainability of running AI-powered workflows at scale? These weren't on our radar. We were in "make it work" mode, not "make it efficient" mode.

But as we moved beyond proof-of-concept late last year, these questions became real considerations.


Why Efficiency Matters

The Hidden Costs

With subscription-based services like Claude Pro or ChatGPT Plus, inefficiencies are often hidden. You pay a flat fee and never see the true cost of each operation. It's easy to ignore efficiency when the bill doesn't change—until you hit usage limits or get rate-limited mid-workflow.

But efficiency matters, whether you see it or not. There are three factors at play here:

Time

A workflow that takes 40 seconds instead of 20 seconds isn't just slower—it's friction. Developers waiting for AI operations lose focus, context-switch, or abandon the tool entirely. Speed matters for adoption.

Tokens

More tokens means more computation, more latency, and faster consumption of usage limits:

  • Higher latency - Each token adds processing time
  • Faster limit consumption - Subscriptions and APIs both have token caps
  • Compounding inefficiency - Wasteful prompts multiply across every operation

Cost

Hidden behind subscriptions—until reality hits:

  • You scale up - Subscription limits get hit, rate limiting kicks in
  • You need multiple seats - What works for one developer becomes expensive across a team
  • You switch to API pricing - Pay-per-token models expose every inefficiency immediately

The difference between $3 and $13 per 100 operations is the difference between a sustainable tool and an expensive experiment.

Efficient prompts and capable models that reason in fewer tokens compound savings across every operation.

Sustainability

Beyond these direct concerns, there's the broader question of computational sustainability. More efficient models that complete tasks in fewer tokens and less time have a smaller environmental footprint. When you're running thousands of AI operations, choosing a model that's 30% faster isn't just about saving seconds—it's about responsible resource usage.


Enter the Claude Agent SDK

Recently, we integrated the Claude Agent SDK into our evaluation test suite (similar to acceptance tests for websites). This gave us something we didn't have before: visibility into what was actually happening during AI-powered workflows.

For each test run, we could now track:

  • Execution time - How long does the workflow take?
  • Conversation turns - How many back-and-forth exchanges with the LLM?
  • Token usage - Input and output tokens consumed
  • Cost - Actual USD spent per operation

This data transformed our understanding of how different models perform with Umbraco MCP.
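
The harness that collects these numbers is small. Here's a minimal sketch, assuming the TypeScript Agent SDK's query() async iterator and the usage fields it reports on its final result message; the MCP server command and package name are placeholders, not our actual configuration:

// A minimal sketch of per-run metric collection, assuming the Agent SDK's query()
// async iterator and the fields on its final "result" message.
import { query } from "@anthropic-ai/claude-agent-sdk";

export async function runEval(prompt: string, model: string) {
  for await (const message of query({
    prompt,
    options: {
      model,
      // Hypothetical stdio launch of the Umbraco MCP server; adjust to your own setup.
      mcpServers: {
        umbraco: { type: "stdio", command: "npx", args: ["<your-umbraco-mcp-package>"] },
      },
    },
  })) {
    // The final "result" message carries the per-run metrics we track.
    if (message.type === "result") {
      return {
        durationMs: message.duration_ms,
        turns: message.num_turns,
        costUsd: message.total_cost_usd,
        usage: message.usage, // input and output token counts
      };
    }
  }
  throw new Error("No result message received");
}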

Prompt Engineering for Smaller Models

An important caveat: we're not just throwing prompts at these models and hoping for the best. Our evaluation prompts are deliberately optimised for smaller, faster models.

This means:

  • Explicit task lists - Numbered steps rather than open-ended instructions
  • Clear variable tracking - "Save the folder ID for later use" rather than assuming the model will infer this
  • Specific tool guidance - "Use the image ID from step 3, NOT the folder ID" to prevent confusion
  • Defined success criteria - Exact strings to output on completion

We're reducing the cognitive load on the models: instead of asking them to infer what needs to happen, we give them structured, unambiguous instructions that even smaller models can follow reliably.

This is a deliberate trade-off: more verbose prompts, but consistent results across model tiers. And it's working—Umbraco MCP performs well even with smaller, faster models when the prompts are clear.
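
To make that trade-off concrete, here's an illustrative contrast (not taken from our evaluation suite) between an open-ended instruction and the explicit style described above:

// Illustrative only: the same clean-up task phrased two ways. The explicit version
// mirrors the style our evaluation prompts use; the vague version leaves the model
// to infer IDs, ordering, and success criteria.
const vaguePrompt = `Create a test media folder with an image in it, then clean everything up.`;

const explicitPrompt = `Complete these tasks in order:
1. Create a media folder called "_Test Folder" at the root
   - IMPORTANT: Save the folder ID returned from this call for later use
2. Create an image named "_Test Image" INSIDE the folder (use the folder ID from step 1 as the parentId)
3. Delete the image, then delete the folder
4. When complete, say 'The clean-up workflow has completed successfully'`;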


Our Evaluation Approach

Our test suite is still limited—we're in early stages. Consider this an interesting experiment rather than rigorous benchmarking. That said, we designed two representative scenarios:

Simple Workflow

A basic 3-step operation: create a data type folder, verify it exists, delete it. This tests fundamental CRUD operations and tool calling.
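
For illustration, a simple-workflow prompt in the same explicit style might look like this (a paraphrase, not the exact prompt from our suite):

// A paraphrased sketch of the simple 3-step workflow prompt; wording is illustrative.
const SIMPLE_TEST_PROMPT = `Complete these tasks in order:
1. Create a data type folder called "_Test Data Type Folder" at the root
   - IMPORTANT: Save the folder ID returned from this call
2. Verify the folder exists by fetching it with the ID from step 1
3. Delete the folder using the ID from step 1
4. When complete, say 'The data type folder workflow has completed successfully'`;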

Complex Workflow

A 10-step media lifecycle: get the media root, create a folder, upload an image, update its metadata, check its references, move it to the recycle bin, restore it, permanently delete the image, delete the folder, and confirm completion. This tests state management, ID tracking across operations, and multi-step reasoning.

Here's what the complex workflow test looks like:

const TEST_PROMPT = `Complete these tasks in order:
1. Get the media root to see the current structure
2. Create a media folder called "_Test Media Folder" at the root
   - IMPORTANT: Save the folder ID returned from this call for later use
3. Create a test image media item INSIDE the new folder with name "_Test Image"
   - Use the folder ID from step 2 as the parentId
   - IMPORTANT: Save the image ID returned from this call
4. Update the IMAGE to change its name to "_Test Image Updated"
   - Use the image ID from step 3, NOT the folder ID
5. Check if the IMAGE is referenced anywhere
6. Move the IMAGE to the recycle bin
   - Use the image ID from step 3, NOT the folder ID
7. Restore the IMAGE from the recycle bin
8. Delete the IMAGE permanently
9. Delete the FOLDER
10. When complete, say 'The media lifecycle workflow has completed successfully'`;

Notice how explicit the prompt is—we're telling the model exactly what to do, which IDs to track, and what to avoid confusing. This is what allows smaller models to succeed.

We ran each workflow multiple times across five Claude models:

  • Claude 3.5 Haiku (our baseline)
  • Claude Haiku 4.5
  • Claude Sonnet 4
  • Claude Sonnet 4.5
  • Claude Opus 4.5
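
A comparison run then just loops the same prompt over the model list, reusing a helper like the runEval sketch above (assumed to be exported from a local module; the alias-style model IDs are illustrative and may not match the exact snapshots we tested):

// Illustrative comparison loop: one eval run per model, each logging its own metrics.
import { runEval } from "./run-eval"; // the sketch shown earlier, assumed exported locally
import { TEST_PROMPT } from "./prompts"; // the complex workflow prompt shown above

const MODELS = [
  "claude-3-5-haiku-latest",
  "claude-haiku-4-5",
  "claude-sonnet-4-0",
  "claude-sonnet-4-5",
  "claude-opus-4-5",
];

for (const model of MODELS) {
  const metrics = await runEval(TEST_PROMPT, model);
  console.log(model, metrics); // duration, turns, cost, and token usage per run
}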

Results: Simple Workflow

Model        Avg Time   Avg Turns   Avg Cost
Haiku 3.5    12.4s      4.0         $0.017
Haiku 4.5    8.6s       3.7         $0.019
Sonnet 4     13.9s      4.0         $0.025
Sonnet 4.5   11.8s      3.0         $0.021
Opus 4.5     26.4s      8.0         $0.123

Key finding: Haiku 4.5 completed simple tasks ~40% faster than Haiku 3.5 at nearly the same cost.

Results: Complex Workflow (Media Lifecycle)

Model        Time     Turns   Cost
Haiku 3.5    31.1s    11      $0.029
Haiku 4.5    21.5s    11      $0.036
Sonnet 4     37.9s    11      $0.081
Sonnet 4.5   40.4s    11      $0.084
Opus 4.5     42.5s    11      $0.134

Key finding: All models completed the complex workflow in exactly 11 turns; the task structure normalised behaviour across models. But execution time and cost varied dramatically.


Analysis

A few important caveats before we draw conclusions:

  • These results come from a small number of test runs—not statistically significant
  • Our prompts are heavily optimised for smaller models; less explicit prompts may favour larger models
  • This is an area worth exploring further, not a definitive recommendation

In Our Tests, Haiku 4.5 Performed Best

For our specific Umbraco MCP workloads with well-structured prompts, Claude Haiku 4.5 (claude-haiku-4-5-20251001) delivered:

  • 31% faster execution than Haiku 3.5 on complex workflows
  • 44-49% faster than Sonnet and Opus models
  • Best cost/performance ratio across all tests

In Some Cases, More Expensive Models Didn't Help

This surprised us. We expected Sonnet or Opus to complete tasks more efficiently—fewer turns, smarter tool usage. In our tests, we saw:

  • Same turn count - Complex workflows took 11 turns regardless of model
  • Slower execution - Larger models have higher latency per turn
  • 2-4x higher cost - With no corresponding benefit

For structured MCP tool-calling tasks with explicit prompts, the additional reasoning capability of larger models didn't translate to better performance in our testing. The task was well-defined, the tools were documented, and Haiku handled it well.

Cost Projection at Scale

Model          Cost per 100 operations
Haiku 3.5      ~$2.90
Haiku 4.5      ~$3.60
Sonnet 4/4.5   ~$8.00
Opus 4.5       ~$13.40

For a team running 1,000 AI-assisted operations per month:

  • Haiku 4.5: ~$36/month
  • Opus 4.5: ~$134/month

That's nearly 4x the cost for slower performance.
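
The arithmetic behind that projection is nothing more than per-operation cost times volume, as in this throwaway sketch using the complex-workflow averages:

// Back-of-envelope projection: average per-operation cost multiplied by an assumed
// monthly operation count of 1,000.
const perOperationCostUsd: Record<string, number> = {
  "Haiku 4.5": 0.036,
  "Opus 4.5": 0.134,
};
const operationsPerMonth = 1_000;

for (const [model, cost] of Object.entries(perOperationCostUsd)) {
  console.log(`${model}: ~$${(cost * operationsPerMonth).toFixed(0)}/month`);
}
// Haiku 4.5: ~$36/month
// Opus 4.5: ~$134/month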

Based on this analysis, we've updated Umbraco MCP's default evaluation model to Claude Haiku 4.5.

If you're building MCP-based workflows for Umbraco (or similar structured API interactions), consider:

  1. Start with Haiku 4.5 - It's fast, capable, and cost-effective
  2. Invest in prompt engineering - Upfront effort on explicit, well-structured prompts can reduce the need for more intelligent models. Let your prompts do some of the reasoning, not just the model
  3. Measure before upgrading - Don't assume bigger models are better for your use case
  4. Track your metrics - Use the Agent SDK or similar tools to understand actual cost and performance

What's Next

This is just the beginning of our optimisation journey. Our evaluation suite is growing, and we plan to:

  • Add more complex multi-entity workflows
  • Test edge cases and error recovery
  • Evaluate performance as our tool coverage grows
  • Continue refining prompts for maximum efficiency with smaller models

The key takeaway: Umbraco MCP works well even with smaller, faster models if you are explicit about the process. You don't need the most expensive LLM to manage your CMS effectively; you need clear prompts alongside our well-designed tools.


This analysis was conducted in January 2026 using the Claude Agent SDK against a local Umbraco 17 instance. Results may vary based on network latency, Umbraco configuration, and specific workflow complexity.
