The Early Days of Umbraco MCP
When Matt Wise and I first started building the Umbraco CMS Developer MCP Server, our focus was purely on functionality. Can we expose the Umbraco Management API through MCP? Can an AI assistant create content, manage media, configure document types? The answer was yes, and we got excited building out tool after tool.
What we weren't thinking about was efficiency. Token usage? Cost per operation? Time taken? Sustainability of running AI-powered workflows at scale? These weren't on our radar. We were in "make it work" mode, not "make it efficient" mode.
But as we moved beyond proof of concept late last year, these questions started to matter.
Why Efficiency Matters
The Hidden Costs
With subscription-based services like Claude Pro or ChatGPT Plus, inefficiencies are often hidden. You pay a flat fee and never see the true cost of each operation. It's easy to ignore efficiency when the bill doesn't change—until you hit usage limits or get rate-limited mid-workflow.
But efficiency matters, whether you see it or not. There are three factors at play here:
Time
A workflow that takes 40 seconds instead of 20 seconds isn't just slower—it's friction. Developers waiting for AI operations lose focus, context-switch, or abandon the tool entirely. Speed matters for adoption.
Tokens
More tokens mean more computation, and the cost shows up in three ways:
- Higher latency - Each token adds processing time
- Faster limit consumption - Subscriptions and APIs both have token caps
- Compounding inefficiency - Wasteful prompts multiply across every operation
Cost
Hidden behind subscriptions—until reality hits:
- You scale up - Subscription limits get hit, rate limiting kicks in
- You need multiple seats - What works for one developer becomes expensive across a team
- You switch to API pricing - Pay-per-token models expose every inefficiency immediately
The difference between $3 and $13 per 100 operations is the difference between a sustainable tool and an expensive experiment.
Efficient prompts and capable models that reason in fewer tokens compound savings across every operation.
Sustainability
Beyond these direct concerns, there's the broader question of computational sustainability. More efficient models that complete tasks in fewer tokens and less time have a smaller environmental footprint. When you're running thousands of AI operations, choosing a model that's 30% faster isn't just about saving seconds—it's about responsible resource usage.
Enter the Claude Agent SDK
Recently, we integrated the Claude Agent SDK into our evaluation test suite (similar to acceptance tests for websites). This gave us something we didn't have before: visibility into what was actually happening during AI-powered workflows.
For each test run, we could now track:
- Execution time - How long does the workflow take?
- Conversation turns - How many back-and-forth exchanges with the LLM?
- Token usage - Input and output tokens consumed
- Cost - Actual USD spent per operation
This data transformed our understanding of how different models perform with Umbraco MCP.
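For reference, here is a simplified sketch of how those numbers can be pulled from the Agent SDK. It is not our full harness (MCP server configuration and assertions are omitted), and the field and option names reflect the TypeScript SDK at the time of writing, so check them against the current docs:

```typescript
// Simplified sketch: run one workflow prompt and read the metrics we track
// from the final "result" message. MCP server configuration is omitted here.
import { query } from "@anthropic-ai/claude-agent-sdk";

export async function runWorkflow(prompt: string, model: string) {
  const started = Date.now();

  for await (const message of query({ prompt, options: { model } })) {
    if (message.type === "result") {
      return {
        model,
        durationMs: Date.now() - started,          // execution time
        turns: message.num_turns,                  // conversation turns
        inputTokens: message.usage.input_tokens,   // token usage
        outputTokens: message.usage.output_tokens,
        costUsd: message.total_cost_usd,           // actual USD spent
        output: message.subtype === "success" ? message.result : undefined,
      };
    }
  }

  throw new Error("No result message received from the Agent SDK");
}
```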
Prompt Engineering for Smaller Models
An important caveat: we're not just throwing prompts at these models and hoping for the best. Our evaluation prompts are deliberately optimised for smaller, faster models.
This means:
- Explicit task lists - Numbered steps rather than open-ended instructions
- Clear variable tracking - "Save the folder ID for later use" rather than assuming the model will infer this
- Specific tool guidance - "Use the image ID from step 3, NOT the folder ID" to prevent confusion
- Defined success criteria - Exact strings to output on completion
We're reducing the cognitive load on the models so they don't have to infer what needs to happen. Instead, we give them structured, unambiguous instructions that even smaller models can follow reliably.
This is a deliberate trade-off: more verbose prompts, but consistent results across model tiers. And it's working—Umbraco MCP performs well even with smaller, faster models when the prompts are clear.
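As a small illustration of the difference (the vague version is hypothetical; the explicit version is lifted from the workflow prompt shown in the next section):

```typescript
// Hypothetical contrast between the style we avoid and the style we write.

// Vague: leaves the model to infer which entity and which ID to act on.
const vagueStep = "Clean up the test media.";

// Explicit: names the entity, the ID to use, and the mistake to avoid.
const explicitStep = `6. Move the IMAGE to the recycle bin
   - Use the image ID from step 3, NOT the folder ID`;
```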
Our Evaluation Approach
Our test suite is still limited—we're in the early stages. Consider this an interesting experiment rather than rigorous benchmarking. That said, we designed two representative scenarios:
Simple Workflow
A basic 3-step operation: create a data type folder, verify it exists, delete it. This tests fundamental CRUD operations and tool calling.
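For illustration, a simple-workflow prompt in the same explicit style looks roughly like this (the wording is indicative, not the exact prompt from our suite):

```typescript
// Indicative only: the simple workflow written in the same explicit style.
const SIMPLE_TEST_PROMPT = `Complete these tasks in order:
1. Create a data type folder called "_Test DataType Folder" at the root
   - IMPORTANT: Save the folder ID returned from this call for later use
2. Verify the folder exists by fetching it using the ID from step 1
3. Delete the folder using the ID from step 1
   - When complete, say 'The data type folder workflow has completed successfully'`;
```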
Complex Workflow
A 10-step media lifecycle: inspect the media root, create a folder, upload an image, update its metadata, check references, move the image to the recycle bin, restore it, permanently delete the image, delete the folder, and confirm completion. This tests state management, ID tracking across operations, and multi-step reasoning.
Here's what the complex workflow test looks like:
const TEST_PROMPT = `Complete these tasks in order:
1. Get the media root to see the current structure
2. Create a media folder called "_Test Media Folder" at the root
- IMPORTANT: Save the folder ID returned from this call for later use
3. Create a test image media item INSIDE the new folder with name "_Test Image"
- Use the folder ID from step 2 as the parentId
- IMPORTANT: Save the image ID returned from this call
4. Update the IMAGE to change its name to "_Test Image Updated"
- Use the image ID from step 3, NOT the folder ID
5. Check if the IMAGE is referenced anywhere
6. Move the IMAGE to the recycle bin
- Use the image ID from step 3, NOT the folder ID
7. Restore the IMAGE from the recycle bin
8. Delete the IMAGE permanently
9. Delete the FOLDER
10. When complete, say 'The media lifecycle workflow has completed successfully'`;
Notice how explicit the prompt is—we're telling the model exactly what to do, which IDs to track, and which IDs not to confuse. This is what allows smaller models to succeed.
We ran each workflow multiple times across five Claude models:
- Claude 3.5 Haiku (our baseline)
- Claude Haiku 4.5
- Claude Sonnet 4
- Claude Sonnet 4.5
- Claude Opus 4.5
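The comparison itself is just a loop: run the same prompt against each model and record the metrics. Here is a sketch that reuses the runWorkflow() helper from the earlier snippet; the model IDs should be verified against Anthropic's current documentation:

```typescript
// Sketch of the comparison loop: same workflow prompt, different models.
// Reuses runWorkflow() from the earlier sketch and TEST_PROMPT from above.
// Verify model IDs against the current Anthropic documentation.
const MODELS = [
  "claude-3-5-haiku-20241022", // baseline
  "claude-haiku-4-5-20251001",
  "claude-sonnet-4-20250514",
  "claude-sonnet-4-5-20250929",
  "claude-opus-4-5-20251101",
];

const SUCCESS_MARKER =
  "The media lifecycle workflow has completed successfully";

for (const model of MODELS) {
  const metrics = await runWorkflow(TEST_PROMPT, model);
  const passed = metrics.output?.includes(SUCCESS_MARKER) ?? false;

  console.log(
    `${model}: ${passed ? "PASS" : "FAIL"} | ` +
      `${(metrics.durationMs / 1000).toFixed(1)}s | ` +
      `${metrics.turns} turns | $${metrics.costUsd.toFixed(3)}`,
  );
}
```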
Results: Simple Workflow
| Model | Avg Time | Avg Turns | Avg Cost |
|---|---|---|---|
| Haiku 3.5 | 12.4s | 4.0 | $0.017 |
| Haiku 4.5 | 8.6s | 3.7 | $0.019 |
| Sonnet 4 | 13.9s | 4.0 | $0.025 |
| Sonnet 4.5 | 11.8s | 3.0 | $0.021 |
| Opus 4.5 | 26.4s | 8.0 | $0.123 |
Key finding: Haiku 4.5 completed simple tasks ~40% faster than Haiku 3.5 at nearly the same cost.
Results: Complex Workflow (Media Lifecycle)
| Model | Time | Turns | Cost |
|---|---|---|---|
| Haiku 3.5 | 31.1s | 11 | $0.029 |
| Haiku 4.5 | 21.5s | 11 | $0.036 |
| Sonnet 4 | 37.9s | 11 | $0.081 |
| Sonnet 4.5 | 40.4s | 11 | $0.084 |
| Opus 4.5 | 42.5s | 11 | $0.134 |
Key finding: All models completed the complex workflow in exactly 11 turns—the task complexity normalised behaviour. But execution time and cost varied dramatically.
Analysis
A few important caveats before we draw conclusions:
- These results come from a small number of test runs—not statistically significant
- Our prompts are heavily optimised for smaller models; less explicit prompts may favour larger models
- This is an area worth exploring further, not a definitive recommendation
In Our Tests, Haiku 4.5 Performed Best
For our specific Umbraco MCP workloads with well-structured prompts, Claude Haiku 4.5 (claude-haiku-4-5-20251001) delivered:
- 31% faster execution than Haiku 3.5 on complex workflows
- 44-49% faster than Sonnet and Opus models
- Best cost/performance ratio across all tests
In Some Cases, More Expensive Models Didn't Help
This surprised us. We expected Sonnet or Opus to complete tasks more efficiently—fewer turns, smarter tool usage. In our tests, we saw:
- Same turn count - Complex workflows took 11 turns regardless of model
- Slower execution - Larger models have higher latency per turn
- 2-4x higher cost - With no corresponding benefit
For structured MCP tool-calling tasks with explicit prompts, the additional reasoning capability of larger models didn't translate to better performance in our testing. The task was well-defined, the tools were documented, and Haiku handled it well.
Cost Projection at Scale
| Model | Cost per 100 operations |
|---|---|
| Haiku 3.5 | ~$2.90 |
| Haiku 4.5 | ~$3.60 |
| Sonnet 4/4.5 | ~$8.00 |
| Opus 4.5 | ~$13.40 |
For a team running 1,000 AI-assisted operations per month:
- Haiku 4.5: ~$36/month
- Opus 4.5: ~$134/month
That's nearly 4x the cost for slower performance.
Based on this analysis, we've updated Umbraco MCP's default evaluation model to Claude Haiku 4.5.
If you're building MCP-based workflows for Umbraco (or similar structured API interactions), consider:
- Start with Haiku 4.5 - It's fast, capable, and cost-effective
- Invest in prompt engineering - Upfront effort on explicit, well-structured prompts can reduce the need for more intelligent models. Let your prompts do some of the reasoning, not just the model
- Measure before upgrading - Don't assume bigger models are better for your use case
- Track your metrics - Use the Agent SDK or similar tools to understand actual cost and performance
What's Next
This is just the beginning of our optimisation journey. Our evaluation suite is growing, and we plan to:
- Add more complex multi-entity workflows
- Test edge cases and error recovery
- Evaluate performance as our tool set grows
- Continue refining prompts for maximum efficiency with smaller models
The key takeaway: Umbraco MCP works well even with smaller, faster models if you're explicit about the process. You don't need the most expensive LLM to manage your CMS effectively—you need clear prompts alongside our well-designed tools.
This analysis was conducted in January 2026 using the Claude Agent SDK against a local Umbraco 17 instance. Results may vary based on network latency, Umbraco configuration, and specific workflow complexity.