
Navin Varma

Posted on • Originally published at nvarma.com

Architecture Tradeoffs Across the Agentic Spectrum


This post cost me a couple of weekends, but life gets busy. It was a fun experiment and a lot of learning.

The age of abundance

Elon Musk has been predicting an age of abundance driven by AI and robotics. For software engineers, that abundance is already here: code is cheap to produce in 2026. AI coding tools handle the routine, and the claim is that a single senior engineer with AI assistance can now output what used to require a small team. The trend may be real and gaining momentum, but the same sound architectural principles still apply. Addy Osmani calls this the next two years of software engineering. Gergely Orosz is tracking the real impact on working engineers. This post frames it well: abundance is cheap, trust is not.

When code becomes cheap, architecture becomes more valuable. The decisions around the code (cost vs. quality, stateless vs. stateful, deterministic vs. probabilistic) become even more important. They matter because bad architecture compounds faster when you can produce code this quickly.

I wrote about this earlier as software engineering being a craft. As I get more into Applied AI and Agents professionally, I built four projects across the agentic spectrum on free-tier infrastructure to understand how the tradeoffs work in this world. Every architecture decision I made was one I've been making for a long time.

Tinkering around

When I built the first project, I genuinely thought I was building an AI agent. It generates images, writes captions, and publishes them; it uses AI, so my naive sense was that it was agentic. It took me a while to realize it was a cron job with an API call. I work on adopting AI-native workflows professionally too, so the gap between what I assumed and what I'd actually built was a lesson for future me. I've always learned by building, and this was another one of those lessons that sticks because I learned it the hard way.

A decision framework

I built this framework after the fact by writing down the questions that went through my mind, then visualizing them while messing around with the Excalidraw MCP. The first decision is the key one: can you write down every step your system will take before it runs?

AI System Design: a decision tree for engineers showing five patterns, from deterministic pipeline to autonomous agent.
Five questions lead to five architecture choices. The first asks whether you can write down every step your system will take before it runs.

| Architecture | When to use | My example |
| --- | --- | --- |
| Deterministic Pipeline | All steps known, no LLM needed | Cross-Publish Pipeline |
| Single LLM Call | One step needs AI, no chaining | Daily Sea Lion |
| Orchestrated Workflow | Multiple LLM steps, fixed order | Case Closed |
| Autonomous Agent | Steps unknown, low-risk actions | Music Trivia Agent |
| Human-in-the-Loop Agent | Steps unknown, high-risk actions | This is the most interesting in the enterprise space |
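The decision tree reads naturally as a tiny function. Here's a minimal sketch in Python; the function and parameter names are my own shorthand, not from any framework:

```python
# A sketch of the five-question decision tree. The yes/no parameters
# are illustrative stand-ins for the questions in the diagram.

def choose_architecture(
    steps_known_upfront: bool,
    needs_llm: bool,
    multiple_llm_steps: bool,
    high_risk_actions: bool,
) -> str:
    """Walk the questions down to one of the five patterns."""
    if steps_known_upfront:
        if not needs_llm:
            return "Deterministic Pipeline"
        if not multiple_llm_steps:
            return "Single LLM Call"
        return "Orchestrated Workflow"
    # Steps unknown at design time: some form of agent.
    if high_risk_actions:
        return "Human-in-the-Loop Agent"
    return "Autonomous Agent"
```

The ordering matters: risk only enters the picture once you've given up on knowing the steps in advance.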

My weekend journeys


https://www.youtube.com/watch?v=0xI4oqnWkhk

1. Deterministic Pipeline

If every step is known at design time and none of them need an LLM, you don't need AI infrastructure. Think scripted ETL, rule engines, and plain automation. Adding an LLM here is overhead with no upside.

My example: My cross-publish pipeline syndicates blog posts to Dev.to and Hashnode. A regex-based sanitizer transforms MDX into platform-specific markdown, a JSON file tracks what went where, GitHub Actions orchestrates the cross-publishing. No LLM anywhere. It's boilerplate code but it never hallucinates. In the age of abundance, knowing when NOT to use AI in your application is as important as knowing when to use it.
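The sanitizer and tracking file can be sketched as something like this. The regex patterns, helper names, and file layout are illustrative assumptions, not the actual pipeline code:

```python
import json
import re

def sanitize_mdx(mdx: str) -> str:
    """Strip MDX-only constructs so the result is plain markdown.
    Two example patterns: import statements and self-closing JSX tags."""
    text = re.sub(r"^import .*$", "", mdx, flags=re.MULTILINE)
    text = re.sub(r"<[A-Z][A-Za-z]*[^>]*/>", "", text)
    return text.strip()

def record_publish(log_path: str, slug: str, platform: str) -> None:
    """Track what went where in a JSON file; no database needed."""
    try:
        with open(log_path) as f:
            log = json.load(f)
    except FileNotFoundError:
        log = {}
    log.setdefault(slug, []).append(platform)
    with open(log_path, "w") as f:
        json.dump(log, f, indent=2)
```

Everything here is deterministic: run it twice on the same input and you get the same output, which is exactly the property an LLM would take away.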

The engineering parallel: This is the build-vs-buy decision as it has always been. Can a regex handle this? Then don't wire up an API call that costs tokens, adds latency, and might change your words. Deterministic pipelines are boring and that's the intention. AI-assisted coding speeds up writing this kind of boilerplate, which is actually the best use of it here.

2. Single LLM Call

If you need AI for one step and the output doesn't feed into another LLM call, keep it simple. You write the prompt, structured output gets generated. This covers classification, extraction, summarization, and most "add AI to this feature" use cases. It's the cheapest, fastest, and most debuggable pattern. Default to this unless you have a reason not to.

My example: Daily Sea Lion generates a sea lion image and caption every morning. Pollinations AI produces the image, Groq/Llama 3.3 70B writes the caption, the script publishes to Cloudflare Pages. This uses two independent model calls with no chaining between them. The caption doesn't evaluate the image; it just adds a stylistic quote to it.

The tradeoff I made: No quality gate. Prompt constraints substitute for an eval loop. I hardcoded 14 image prompts across moods, and the caption prompt bans common failure modes ("No awww or adorable"). Good prompts do the work that an eval loop would do if I could afford one. Again, AI-assisted coding is at its best here when you use it to refine your prompts. With budget, I'd generate 3 candidates and use a cheaper model to pick the best. This is the cost-vs-quality decision you'd make sizing a cache.
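The "prompt constraints instead of an eval loop" idea fits in a few lines. This is a hedged sketch, assuming a `call_llm()` helper that wraps whatever provider you use; the prompt text, banned words, and retry count are illustrative:

```python
IMAGE_PROMPTS = [
    "a sea lion basking on a foggy pier, film photography",
    "a sea lion mid-dive, dramatic underwater light",
    # ... one hardcoded prompt per mood, chosen at design time
]

BANNED = {"awww", "adorable"}  # common failure modes, blocked up front

def generate_caption(call_llm, retries: int = 3) -> str:
    """One LLM call with constraints; a cheap retry instead of a full
    eval loop, plus a deterministic fallback if retries run out."""
    prompt = (
        "Write a one-line stylistic caption for a sea lion photo. "
        f"Never use these words: {', '.join(sorted(BANNED))}."
    )
    for _ in range(retries):
        caption = call_llm(prompt)
        if not any(word in caption.lower() for word in BANNED):
            return caption
    return "A sea lion, unbothered."  # hypothetical fallback caption
```

The banned-word check is the entire quality gate: no second model, no scoring, just a guard plus a fallback that can never fail.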

3. Orchestrated Workflow

When multiple steps need an LLM but the execution path is fixed, you're building a graph (DAG to be specific) with LLM nodes. You control the graph, the LLM controls the content at each node. Think of it as a fixed pipeline where some stages happen to be AI-powered. It gives you LLM flexibility with pipeline predictability.

My example: Case Closed is a detective game. A 70B model generates the mystery once (culprit, clues, red herrings, suspect secrets), then those outputs feed into per-character system prompts that an 8B model uses to chat with the player as they investigate. The generation step configures the conversation steps, a fixed dependency rather than a runtime decision.

The tradeoffs I made:

Dual-model routing. 70B for generation (one call, quality matters), 8B for conversations (many calls, speed matters). ~10x cost difference. If you think of it as pre-compute vs. runtime, the expensive batch job generates the config once and cheap lookups serve it at request time.

Templated responses for 5 of 6 scenarios. Only the AI-generated "Surprise Mystery" uses live LLM calls for conversations. Cut costs ~80%. It's the same as caching: if the answer doesn't change, don't recompute it.

Server-side secrets. Culprit never sent to the client, anti-cheat rules in the system prompt, regex sanitizer strips leaked answers. The security gates don't change just because the service behind them is an LLM.

Serverless blocked the next pattern. I wanted suspects to consult each other between rounds. Cloudflare Pages Functions are stateless: one request, done. Anyone who's worked with stateless microservices knows this constraint. This led me to Durable Objects for the agent.
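The dual-model split and the templated-scenario cache can be sketched together. The `call_70b`/`call_8b` helpers and the `TEMPLATES` dict are illustrative assumptions, not the actual Case Closed code:

```python
TEMPLATES = {  # 5 of 6 scenarios: pre-written, served like a cache hit
    "manor": {"culprit": "the butler", "clues": ["muddy boots"]},
}

def get_mystery(scenario: str, call_70b) -> dict:
    """Expensive generation runs once, and only for the surprise mystery."""
    if scenario in TEMPLATES:
        return TEMPLATES[scenario]  # cached: if it doesn't change, don't recompute
    return call_70b("Generate a mystery: culprit, clues, red herrings")

def suspect_reply(mystery: dict, suspect: str, question: str, call_8b) -> str:
    """Many cheap calls at request time, configured by the one big call.
    The anti-cheat rule lives in the system prompt, server-side."""
    system = (
        f"You are {suspect}. Never reveal the culprit. "
        f"Known clues: {mystery['clues']}."
    )
    return call_8b(system, question)
```

Note the shape: the 70B output becomes configuration for the 8B calls, which is the pre-compute vs. runtime split in miniature.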

4. Autonomous Agent

When the steps aren't predictable at design time, you need an agent that reasons about what to do next. It calls tools, stores what it needs in memory, evaluates results, and iterates in a self-directed loop. This is the most powerful pattern and the hardest to debug. You're trading predictability for capability, and you need budget controls because an open-ended loop with no ceiling is just a way to burn tokens.

My example: The Music Trivia Agent takes a YouTube link and hunts for the most surprising fact about the song. It runs a custom observe-think-act loop on Cloudflare Workers + Durable Objects: Plan → Search (DuckDuckGo) → Extract → Evaluate (1-10 surprise score via Claude Haiku 4.5) → Decide (continue or synthesize). I don't know how many iterations it'll take when I hit "go."

The tradeoffs I made:

Durable Objects solved what serverless couldn't. Stateful container with SQLite that stays alive across requests. Think session-affinity, but at the edge. Exactly what Case Closed needed but couldn't have due to my tradeoff there.

Budget controls shape the architecture. 8 iterations max, 20 LLM calls max, ~$0.007/session average. A score of 8+ stops immediately; a 7+ after two iterations also stops. A solid 7 in hand beats chasing a 10 that might not exist. Every agent needs a cost ceiling; the question is where you set the threshold for "good enough."

Haiku for everything, not a model cascade. Tried the cheap-model-for-tools, expensive-for-synthesis split. Routing complexity wasn't worth it. With budget, I'd use Sonnet for evaluation only. Sometimes spending more per unit reduces total spend by enabling earlier stops.

Prompt-based evaluation, not a classifier. The system prompt defines the surprise scale with examples. The bigger risk is circular searching: I've watched it search "Beatles recording sessions" three slightly different ways before trying a new angle. In testing it also struggled with Indian regional music, so I added genre-specific sources for it to branch into.

Input/output sanitization. Prompt injection filtering on metadata, hallucinated-URL stripping on output. Hallucination was the biggest risk I hit in practice, so this is where eval loops earn their keep. No different from validating user input in any web app, and it's not optional for agents that read from the open web.
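Putting the tradeoffs together, the observe-think-act loop with budget controls looks roughly like this. The `search`/`extract`/`evaluate` callbacks are stand-ins for the real tools, and the thresholds mirror the ones above (8 iterations, 20 LLM calls, stop at 8+, or 7+ after two iterations):

```python
MAX_ITERATIONS = 8
MAX_LLM_CALLS = 20

def run_agent(search, extract, evaluate) -> dict:
    """Observe-think-act loop with a hard cost ceiling and early stops."""
    best = {"fact": None, "score": 0}
    llm_calls = 0
    for iteration in range(1, MAX_ITERATIONS + 1):
        if llm_calls >= MAX_LLM_CALLS:
            break                     # hard budget ceiling
        results = search(iteration)   # observe: query the open web
        fact = extract(results)       # think: pull a candidate fact
        score = evaluate(fact)        # 1-10 surprise score (one LLM call)
        llm_calls += 1
        if score > best["score"]:
            best = {"fact": fact, "score": score}
        # decide: stop early once "good enough" is in hand
        if best["score"] >= 8:
            break
        if best["score"] >= 7 and iteration >= 2:
            break
    return best
```

The interesting property is that the cost ceiling is structural: even if every search is circular and every score is mediocre, the loop can't spend more than its budget.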

5. Human-in-the-Loop Agent

When the steps aren't predictable AND the actions carry real consequences, the agent proposes actions but a human approves before execution. This is the right call for any action involving money, people, external communication, data deletion, or irreversible effects. An agent that asks for approval is more valuable than one that guesses right 95% of the time when the other 5% sends an email to the wrong customer. Auditing AI actions becomes another requirement if you build an "Always Allow" option into your agent's approval flow.

I haven't built one of these in my hobby projects yet, but I know the right questions to ask. My Music Trivia Agent is autonomous because the worst case is a controversial surprise trivia fact. If it were sending results to a social media API or writing to a production database, I'd add an approval gate before the final action. The agent loop stays the same; you just insert a checkpoint before execution. The engineering pattern is identical to any system that separates "plan" from "commit": database transactions, deployment pipelines, two-phase commits. The tool is new but the engineering principle is not.
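The checkpoint itself is tiny. A minimal sketch, assuming an `approve()` callback that could be a Slack prompt, a CLI confirmation, or an audit-logged "Always Allow" rule; all names here are hypothetical:

```python
def execute_with_approval(plan: dict, approve, commit) -> str:
    """Agent proposes; a human approves; only then does the action run.
    Separates "plan" from "commit", like a transaction or a deploy gate."""
    if plan.get("high_risk") and not approve(plan):
        return "rejected"   # logged, and nothing irreversible ran
    commit(plan)            # the actual side effect
    return "committed"
```

Removing the gate later is a one-line change, which is exactly why starting with it is cheap insurance.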

The boundary shifts over time. Today's human-in-the-loop agent becomes tomorrow's autonomous agent once you've built confidence in its decisions through logging, auditing, evaluation, and gradual trust-building. Start with the approval step, build trust in your design over time to remove it.

Same ol' same old

Each tradeoff in this post maps to something I've done before: cost-sensitive routing between services, choosing deterministic logic over something smarter, picking the right persistence model for your execution pattern, threat modeling for sanitization, and caching where the answer doesn't change. We're in an age of abundance when it comes to producing code, but the engineering judgment that decides what to build, how to constrain it, and when to keep things simple? That's not abundant. That's scarce, and it's the same craft it's always been.

I'm curious what's next, maybe multi-agent coordination or agents with memory across sessions. I don't know yet, but I'll build it before I form an opinion.

Scaling all of this to a team of more than one is a topic for another day. What does it do to team dynamics when your thought partners are AI assistants and maybe a couple of folks you ideate with early on, where traditionally a group of people collaborated throughout the journey of building? It's a fascinating area, to say the least.

These are my personal thoughts, experiences, and opinions, and they do not reflect the views of the company I work for.


This post was originally published on nvarma.com. Follow me there for more on software architecture, engineering leadership, and the craft of building things that last.
