In [Context Engineering (Part 1): The Architecture of Recall], we discussed Structure. In [Context Engineering (Part 2): The Temporal Index], we discussed Time and Trust.
Now, we arrive at the bottom line: Economics.
Building a demo is easy. You can throw a 100-page PDF at GPT-4o, wait 10 seconds, pay $0.05, and get a great answer. But in production, if you have 10,000 users, that math kills your company.
- Latency: Users won’t wait 10 seconds for “Hello.”
- Cost: Burning $0.05 per interaction destroys your margins.
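To make that math concrete (the traffic figure here is illustrative, not from any benchmark): 10,000 users averaging 20 interactions a day at $0.05 each is 10,000 × 20 × $0.05 = $10,000 per day, or roughly $3.6M a year in inference alone, before you have charged anyone a dime.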
The Senior Engineer knows that Performance is a Feature. We cannot treat every request like a PhD thesis. We need to be Frugal Architects.
1. The Heuristic Router (Speed > Smarts)
The Naive Approach:
“Let’s use a small LLM (like GPT-3.5) to classify the user’s intent, and then route it to the right model.”
The Engineering Reality:
This is “Model-on-Model” overhead. Even a small LLM takes 500ms+ to think. You are adding latency just to decide where to send the traffic. We need to be Fast, even if we are occasionally Wrong.
Use Deterministic Heuristics, not AI Classifiers. We can solve 80% of routing with simple logic that takes effectively 0ms:
- Rule 1: Is the query length < 50 characters? -> Send to Fast Model (GPT-4o-mini).
- Rule 2: Does it contain keywords like “Summary”, “Analyze”, “Compare”? -> Send to Smart Model (GPT-4o).
- Rule 3: Is it a greeting (“Hi”, “Thanks”)? -> Send to Canned Response (Zero Cost).
The goal isn’t 100% routing accuracy. The goal is instant response time for the trivial stuff, preserving the “Big Brain” budget for the hard stuff.
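A minimal sketch of such a router in Python. Everything here is a placeholder to adapt: the model names, the greeting set, and the keyword regex are assumptions, not a library API.

```python
import re

# Illustrative model identifiers; swap in whatever your stack uses.
FAST_MODEL = "gpt-4o-mini"
SMART_MODEL = "gpt-4o"

GREETINGS = {"hi", "hello", "hey", "thanks", "thank you", "bye"}
HARD_KEYWORDS = re.compile(r"\b(summar|analy[sz]|compar)\w*", re.IGNORECASE)


def route(query: str) -> str:
    """Deterministic routing: plain string checks, no model call, ~0ms."""
    text = query.strip().lower().rstrip("!.?")

    # Rule 3 first: greetings are free. A canned response costs zero tokens.
    if text in GREETINGS:
        return "canned"

    # Rule 2 next: heavy-reasoning keywords win even on short queries
    # ("Compare X and Y" is short but not trivial).
    if HARD_KEYWORDS.search(query):
        return SMART_MODEL

    # Rule 1: short, keyword-free queries go to the cheap model.
    if len(query) < 50:
        return FAST_MODEL

    # Default: long queries get the smart model.
    return SMART_MODEL
```

Note the ordering: the keyword check runs before the length check, so a short “Compare X and Y” still reaches the Smart Model instead of being misrouted by Rule 1.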
2. The Brutal Squeeze (Chopping > Summarizing)
The Naive Approach:
“The context is too long. Let’s ask an AI to summarize the conversation history to save space.”
The Engineering Reality:
Summarization is a trap.
- It costs money to generate the summary.
- It loses nuance. “I tried X and it failed” becomes “User attempted troubleshooting.” The specific error code is lost.
My Philosophy:
Chopping (FIFO) is better.
I prefer a brutal “Sliding Window.” We keep the last 10 turns perfectly intact; anything older than that is deleted outright, oldest first.
- Why? Because users rarely refer back to what they said 20 minutes ago. But they constantly refer to the exact code snippet they pasted 30 seconds ago.
- Summary = Lossy Compression.
- Chopping = Lossless Compression (of the recent past).
In a frugal architecture, we value Recent Precision over Vague History.
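A minimal sketch of that sliding window, assuming the usual chat-message dict format; the class name and the turn accounting are mine, not any library’s API.

```python
from collections import deque


class SlidingWindow:
    """Keep the last N turns verbatim (FIFO); older turns are chopped, never summarized."""

    def __init__(self, max_turns: int = 10):
        # A "turn" here is one user/assistant exchange, i.e. two messages.
        self._messages = deque(maxlen=max_turns * 2)

    def add(self, role: str, content: str) -> None:
        # Appending past maxlen silently evicts the oldest message: that is the chop.
        self._messages.append({"role": role, "content": content})

    def render(self, system_prompt: str) -> list[dict]:
        # The system prompt is pinned outside the window; only the history slides.
        return [{"role": "system", "content": system_prompt}, *self._messages]
```

The design choice is the `deque` with `maxlen`: eviction is automatic, deterministic, and free. There is no extra model call, and the most recent turns (the pasted code snippet from 30 seconds ago) survive byte-for-byte.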
3. The Trust Gateway (The Middleware Gap)
The Naive Approach:
“Let’s use a startup’s API that auto-routes our traffic to the cheapest model.”
The Engineering Reality:
No Enterprise CISO (Chief Information Security Officer) will send their proprietary data to a random middleware startup just to save 30% on tokens. The risk of data leakage is too high.
This layer, the “Model Gateway,” is critical, but it requires massive trust.
The Opportunity:
There is a gap here, but it’s not for a SaaS. It’s for Infrastructure.
- The Big Players: Microsoft (Azure AI Gateway) and Google will likely dominate this because they own the pipe.
- The Startup Play: Don’t build a SaaS Router. Build an On-Prem / Private Cloud Router.
The winner won’t be the one with the smartest routing algorithm; it will be the one the Enterprise trusts with the keys to the kingdom.
Conclusion: Efficient Intelligence
We are moving from the era of “Does it work?” to “Does it scale?”
- If you route every “Hello” to a PhD-level model, you are failing.
- If you summarize critical code into vague English, you are failing.
- If you leak enterprise data to save pennies, you are failing.
The Frugal Architect builds a system that is as cheap as possible for the 90% of noise, and as smart as possible for the 10% of signal.
Originally published at https://www.linkedin.com.