The Inference Trap: Why “Thinking” Is Technical Debt
By Imran Siddique
There is a misconception in our industry right now that AI and Search are independent — or worse, that AI is a replacement for Search.
I realized very early in my career, well before the current hype cycle, that this is false. The reality is that better Search must precede better AI. If you want an agent to summarize a documentation page, you don’t ask it to “figure it out” or browse the entire web. You give it the exact page. You don’t let the model run wild; you give it context.
We are currently seeing engineers fall into what I call the Inference Trap: throwing massive reasoning models at problems that are actually just retrieval problems.
Reasoning Must Have a “Reason”
We need to be honest about the internals of these systems. “Reasoning” (like chain-of-thought processing) is expensive, in both compute and latency.
If I ask a model, “What is 2 + 2?” and it initiates a deep research agent to verify the axioms of mathematics, that is a failure of architecture. It is a waste of time. Similarly, if I need to know how to call a specific API, I don’t need the AI to “reason” about the philosophy of REST endpoints. I just need it to find the documentation page and extract the curl command.
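To make that concrete, here is a minimal sketch of what “just find the page and extract the curl command” looks like without a model in the loop. The URL and the regex are illustrative placeholders, not a real API, and a production version would parse the page properly instead of pattern-matching raw HTML.

```python
import re
import urllib.request

def extract_curl_examples(doc_url: str) -> list[str]:
    """Pull curl snippets straight out of a known documentation page.

    No model call, no chain-of-thought: fetch the exact page and grab the
    command with a plain regex. The URL and pattern are placeholders.
    """
    html = urllib.request.urlopen(doc_url).read().decode("utf-8", errors="replace")
    # Grab anything that looks like a curl invocation in the page source.
    return re.findall(r"curl\s+[^<\n]+", html)

# Hypothetical docs page for the API in question.
commands = extract_curl_examples("https://example.com/docs/payments-api")
print(commands[0] if commands else "No curl example found; only now consider the model.")
```

The point is not the regex; it is that the lookup path costs milliseconds and zero tokens.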
Reasoning has to have a reason.
If you are using a screwdriver to hammer a nail, getting a bigger screwdriver (a larger model) doesn’t solve the problem. You need a hammer (Search/Lookup).
Scale by Subtraction: The “Don’t Care” Philosophy
My philosophy has always been “Scale by Subtraction.” When applied to AI, this means explicitly removing capabilities.
AI is a massive Black Box. It can do everything — write poetry, code Java, or hallucinate a conspiracy theory. As an architect, I don’t care.
If I am building a tool for brainstorming architecture, I am not interested in its ability to write poetry. In fact, I view that capability as a liability. To build reliable systems, we must apply hard constraints. We subtract the possibilities. We tell the AI: “I don’t care that you can do X, Y, and Z. You are only allowed to do A.”
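In code, “Scale by Subtraction” is just an allow-list enforced outside the model. This is a sketch under my own assumptions (the capability names and registry are made up); the design choice that matters is that the refusal happens in the dispatcher, not in a prompt asking the model to behave.

```python
from typing import Callable

# Hypothetical capability registry: everything the model *could* do.
ALL_CAPABILITIES: dict[str, Callable[[str], str]] = {
    "write_poetry": lambda prompt: "...",
    "generate_code": lambda prompt: "...",
    "brainstorm_architecture": lambda prompt: "...",
}

# Scale by subtraction: the allow-list is the architectural decision.
ALLOWED = {"brainstorm_architecture"}

def dispatch(capability: str, prompt: str) -> str:
    """Refuse anything outside the allow-list instead of trusting the model to self-restrict."""
    if capability not in ALLOWED:
        raise PermissionError(f"Capability '{capability}' has been subtracted from this system.")
    return ALL_CAPABILITIES[capability](prompt)
```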
We need to stop being impressed by the breadth of what AI can do and start being strict about what we allow it to do.
The Compute-to-Lookup Ratio
So, what is the solution? I believe we are missing a critical component in the modern AI stack: The Guardrail Router.
We need a module that sits before the model and decides: Does this actually require reasoning?
- If the answer exists in the documentation (the context window), use 100% Lookup.
- If the answer requires synthesis of new ideas, allow 10% Reasoning.
In a healthy enterprise system, the ratio should be heavily skewed. My estimate is that 80–90% of tasks should be solved via Lookup (Context/Search), and only the remaining 10–20% should trigger expensive Reasoning.
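Here is a rough sketch of what such a router could look like. The retrieval interface and the confidence threshold are assumptions: `search_index` stands in for whatever retrieval layer you already have (BM25, a vector store, grep over your docs), and the threshold would need tuning against your own relevance scores.

```python
from dataclasses import dataclass

@dataclass
class Route:
    strategy: str   # "lookup" or "reason"
    payload: str

# Assumed cutoff; tune against your own retrieval scores.
LOOKUP_CONFIDENCE_THRESHOLD = 0.75

def guardrail_router(query: str, search_index) -> Route:
    """Decide whether a request needs reasoning at all.

    If retrieval returns a confident hit, answer from the document.
    Only fall through to the expensive reasoning model when the
    lookup comes back empty-handed.
    """
    hits = search_index.search(query, top_k=3)  # assumed retrieval interface
    if hits and hits[0].score >= LOOKUP_CONFIDENCE_THRESHOLD:
        return Route(strategy="lookup", payload=hits[0].text)
    return Route(strategy="reason", payload=query)  # the 10-20% case
```

The router is deliberately dumb. Its whole job is to keep the expensive path from becoming the default path.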
If your agent is “thinking” for every request, you haven’t built an agent; you’ve built a philosophy major. And in production, we need engineers, not philosophers.
