The AI Scaffolding Tax 💰: The Hidden 70% Nobody Warns You About When Building with LLMs
"The model is 30% of the work. The other 70% is everything around it. And nobody warned me."
You've seen the demos. A few lines of Python, an API call to GPT/Claude/Gemini, and boom: magic. Your app can summarize documents, write code, answer questions.
Then you try to ship it to production. And reality hits like a freight train. 🚂
The Demo-to-Production Gap Is a Chasm
Let me paint you a picture:
That clean openai.chat.completions.create() call? It's the tip of the iceberg. Beneath the surface lies what I call the Scaffolding Tax: the massive layer of infrastructure code that exists solely because you chose to use an LLM.
Here's what your "simple AI feature" actually requires in production:
🧱 The Scaffolding Stack

```
┌─────────────────────────────────────┐
│ Your "Simple" Feature               │ ← What stakeholders see
├─────────────────────────────────────┤
│ Prompt Engineering & Versioning     │
│ Context Window Management           │
│ Token Counting & Budget Control     │
│ Multi-Provider Abstraction          │
│ Response Parsing & Validation       │
│ Retry Logic & Fallback Chains       │
│ Rate Limiting & Queueing            │
│ Logging & Observability             │
│ Guardrails & Content Filtering      │
│ Caching & Cost Optimization         │
│ A/B Testing Framework               │
│ Prompt Injection Defense            │
└─────────────────────────────────────┘
  ↑ This is the Scaffolding Tax
```
Every single one of these is mandatory for production-grade AI apps. None of them have anything to do with your actual product.
Let's Talk Real Numbers 📊
I tracked the engineering hours across three AI-powered features we shipped last quarter:
| Component | Hours | % of Total |
|---|---|---|
| Core LLM logic (the "actual feature") | 38h | 28% |
| Context management & chunking | 22h | 16% |
| Error handling & retries | 18h | 13% |
| Guardrails & safety filters | 15h | 11% |
| Logging & observability | 12h | 9% |
| Token budgeting & cost controls | 10h | 7% |
| Multi-provider abstraction | 8h | 6% |
| Prompt versioning & testing | 7h | 5% |
| Caching layer | 5h | 4% |
| Total scaffolding | 97h | 72% |
Seventy-two percent. Almost three-quarters of our engineering effort went to infrastructure that exists only because we used an LLM. If we'd used a traditional algorithm for the same feature, those 97 hours would have been roughly 10.
That's the Scaffolding Tax. And it's due on every. single. feature.
The 12 Taxes You're Paying (Whether You Know It or Not)
1. 🔢 The Token Counting Tax
You can't just send text to an LLM. You need to count tokens before sending, because:
- Context windows have limits
- Costs are per-token
- Chunking strategies depend on token counts
So now you need a tokenizer library, a chunking algorithm, and budget enforcement – for every provider you support.
# "Simple" code that took 3 days to get right
def chunk_text(text, max_tokens=8000, overlap=200):
tokens = tokenizer.encode(text)
chunks = []
i = 0
while i < len(tokens):
chunk_tokens = tokens[i:i + max_tokens]
chunks.append(tokenizer.decode(chunk_tokens))
i += max_tokens - overlap
return chunks
This looks simple. The edge cases are not. What about multi-byte characters? What about code blocks that shouldn't be split? What about markdown headers that need context? 🤯
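One mitigation that helped us: split on structural boundaries first, and only fall back to raw token slicing for oversized paragraphs. A minimal sketch reusing the `tokenizer` and `chunk_text` from above; it still won't keep code fences intact, so treat it as a starting point, not a solution:

```python
def chunk_by_paragraph(text: str, max_tokens: int = 8000) -> list[str]:
    """Pack whole paragraphs into chunks; hard-split only paragraphs over the limit."""
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if len(tokenizer.encode(candidate)) <= max_tokens:
            current = candidate  # the paragraph fits; keep packing
            continue
        if current:
            chunks.append(current)
        if len(tokenizer.encode(para)) <= max_tokens:
            current = para
        else:
            # Oversized paragraph: fall back to the raw token splitter above
            chunks.extend(chunk_text(para, max_tokens))
            current = ""
    if current:
        chunks.append(current)
    return chunks
```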
2. 🧠 The Context Management Tax
Your LLM has no memory. Every call is stateless. So you need to:
- Maintain conversation history
- Decide what to include (summarize? truncate? sliding window?)
- Handle the "context window is full" event gracefully
- Manage token budgets across multiple context sources
```typescript
// The conversation manager nobody tells you about
interface Message { role: "system" | "user" | "assistant"; content: string }
interface Context { messages: Message[] }

class ConversationManager {
  private history: Message[] = [];
  private maxContextTokens: number;
  private summaryCache: Map<string, string> = new Map();

  constructor(maxContextTokens: number) {
    this.maxContextTokens = maxContextTokens;
  }

  async addMessage(msg: Message): Promise<Context> {
    this.history.push(msg);
    const totalTokens = this.countTokens(this.history);
    // Compress *before* the window overflows (80% high-water mark)
    if (totalTokens > this.maxContextTokens * 0.8) {
      await this.compressHistory(); // ← This is a whole project
    }
    return this.buildContext();
  }

  // This method alone is 200+ lines in production
  private async compressHistory() { /* ... */ }
  private countTokens(msgs: Message[]): number { /* real tokenizer call here */ return 0; }
  private buildContext(): Context { return { messages: this.history }; }
}
```
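To make the shape of that project concrete, here's a minimal Python sketch of one common compression strategy: keep recent turns verbatim and collapse everything older into a single summary message. The two helpers are deliberately naive stand-ins:

```python
def count_tokens(msg: dict) -> int:
    # Crude heuristic (~4 chars per token); swap in a real tokenizer in production
    return max(1, len(msg["content"]) // 4)

def summarize_with_llm(msgs: list[dict]) -> str:
    # Placeholder: in production this is another LLM call (and another cost)
    return " / ".join(m["content"][:40] for m in msgs)

def compress_history(history: list[dict], budget: int) -> list[dict]:
    """Keep recent turns verbatim; collapse older ones into one summary message."""
    recent: list[dict] = []
    older = list(history)
    used = 0
    # Walk backwards from the newest message, keeping what fits in half the budget
    while older and used + count_tokens(older[-1]) <= budget // 2:
        msg = older.pop()
        used += count_tokens(msg)
        recent.insert(0, msg)
    if not older:
        return recent
    summary = summarize_with_llm(older)
    return [{"role": "system", "content": f"Summary of earlier conversation: {summary}"}, *recent]
```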
3. 🔀 The Multi-Provider Tax
"What if OpenAI goes down?" your boss asks on day two.
So now you're abstracting across OpenAI, Anthropic, Google, and maybe a local model. Each has:
- Different API formats
- Different token limits
- Different rate limits
- Different error codes
- Different streaming behaviors
You're basically building a mini-cloud abstraction layer. For text completion. 😫
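Here's the shape of that layer, sketched in Python under the simplest possible assumptions (a shared `complete()` interface; real adapters must also normalize errors, rate limits, and streaming):

```python
from typing import Protocol

class LLMProvider(Protocol):
    def complete(self, messages: list[dict], max_tokens: int) -> str: ...

class OpenAIAdapter:
    def complete(self, messages: list[dict], max_tokens: int) -> str:
        # Translate to OpenAI's request shape; map its error codes and limits
        raise NotImplementedError

class AnthropicAdapter:
    def complete(self, messages: list[dict], max_tokens: int) -> str:
        # Anthropic expects the system prompt as a separate field, not a message
        raise NotImplementedError

def route(providers: list[LLMProvider], messages: list[dict]) -> str:
    """Try providers in order; surface the last failure if all of them are down."""
    last_error: Exception | None = None
    for provider in providers:
        try:
            return provider.complete(messages, max_tokens=1024)
        except Exception as exc:  # each SDK raises its own exception types
            last_error = exc
    raise RuntimeError("all providers failed") from last_error
```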
4. 🛡️ The Guardrails Tax
Users will try to break your AI. They will:
- Ask it to reveal system prompts
- Try to make it say offensive things
- Attempt prompt injection attacks
- Feed it adversarial inputs
So you need:
- Input sanitization
- Output filtering
- Topic restrictions
- PII detection
- Prompt injection detection
Each of these is a mini-project with its own edge cases.
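For a taste of how shallow the first line of defense starts out, here's a naive pattern screen in Python. The pattern list is a hypothetical starter set; real systems layer classifiers and canary tokens on top, because regexes only catch the lazy attacks:

```python
import re

# Hypothetical starter list; attackers iterate faster than this file does
INJECTION_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (
        r"ignore (all |any )?(previous|prior) instructions",
        r"reveal (your )?(system|hidden) prompt",
        r"you are now (in )?developer mode",
    )
]

def looks_like_injection(user_input: str) -> bool:
    """First-pass screen only; route hits to a stricter check, don't just block."""
    return any(p.search(user_input) for p in INJECTION_PATTERNS)
```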
5. 💸 The Cost Control Tax
"Can you add AI to this feature?" = "Can you add unpredictable variable costs to this feature?"
You need:
- Per-user token budgets
- Per-feature cost tracking
- Alert thresholds
- Graceful degradation when budget is hit
- Cost attribution (which feature is burning money?)
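A minimal sketch of the per-user budget piece (in-memory for illustration; production needs persistence, time windows, and per-feature attribution):

```python
from collections import defaultdict

class TokenBudget:
    """Track per-user token spend; signal when to degrade instead of calling the LLM."""

    def __init__(self, daily_limit: int):
        self.daily_limit = daily_limit
        self.used: dict[str, int] = defaultdict(int)

    def try_spend(self, user_id: str, tokens: int) -> bool:
        if self.used[user_id] + tokens > self.daily_limit:
            return False  # caller degrades: cached answer, smaller model, or honest error
        self.used[user_id] += tokens
        return True
```

The call site is one line (`if not budget.try_spend(user_id, estimate): ...`), but deciding what "degrade" means for each feature is where the real hours go.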
6. 🔍 The Observability Tax
When your AI gives a bad answer, how do you debug it?
You need:
- Full request/response logging (with PII redaction)
- Prompt version tracking
- Token usage metrics
- Latency percentiles
- Error rate monitoring
- Quality score tracking
Traditional software has logs. AI software needs a forensic lab.
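Here's a sketch of the minimum viable log record. The email regex is a stand-in for real PII redaction, which is its own mini-project:

```python
import json
import re
import time
import uuid

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def log_llm_call(prompt_version: str, prompt: str, response: str,
                 tokens_in: int, tokens_out: int, latency_ms: float) -> None:
    record = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "prompt_version": prompt_version,           # which prompt produced this answer?
        "prompt": EMAIL.sub("[REDACTED]", prompt),   # redact before anything is persisted
        "response": EMAIL.sub("[REDACTED]", response),
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "latency_ms": latency_ms,
    }
    print(json.dumps(record))  # stand-in for your real log pipeline
```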
7. 🔁 The Retry & Fallback Tax
LLM calls are nondeterministic and fail in ways traditional APIs don't. They:
- Time out
- Return malformed JSON
- Refuse valid requests
- Rate-limit you unexpectedly
- Occasionally hallucinate wildly
You need retry logic with exponential backoff, circuit breakers, fallback chains (try GPT-4 → try Claude → try Gemini → cache → degrade gracefully), and response validation.
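Here's the skeleton of that chain in Python: exponential backoff with jitter per provider, then fall through to the next one. Circuit breakers and response validation bolt on around this core:

```python
import random
import time

def call_with_retries(fn, attempts: int = 4, base_delay: float = 0.5):
    """Retry with exponential backoff plus jitter; re-raise after the final attempt."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.random() * 0.1)

def complete_with_fallback(messages: list[dict], providers: list) -> str | None:
    """Walk the fallback chain; each provider gets its own retry budget."""
    for provider in providers:
        try:
            return call_with_retries(lambda: provider.complete(messages, max_tokens=1024))
        except Exception:
            continue  # next provider in the chain
    return None  # final degrade: serve from cache or fail honestly
```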
8. ⏱️ The Latency Tax
LLM calls are slow. 1–30 seconds slow. So your entire UX architecture changes:
- Streaming responses become mandatory (see the sketch below)
- Loading states need to be thoughtful
- Optimistic UI patterns are essential
- Background processing becomes the norm
- Users need progress indicators, not spinners
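Streaming is the piece that reshapes the backend first. A minimal generator using the OpenAI Python SDK (v1-style client; other providers differ in the details):

```python
from openai import OpenAI

client = OpenAI()

def stream_answer(prompt: str):
    """Yield text deltas as they arrive so the UI can render progressively."""
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta
```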
9. 🧪 The Testing Tax
How do you write unit tests for nondeterministic output?
You don't. Not really. You build:
- Semantic similarity tests
- Golden dataset evaluations
- Human eval pipelines
- Regression test suites for prompts
- A/B testing infrastructure
Testing AI is fundamentally different from testing traditional software. And it's harder.
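What a semantic-similarity regression test looks like, sketched with a plain cosine function. `embed()`, `run_summarizer()`, and the golden dataset names are hypothetical stand-ins for your embedding model and eval data:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def test_summary_stays_on_topic():
    # embed(), run_summarizer(), golden_input/golden_answer: hypothetical helpers
    expected = embed(golden_answer)
    actual = embed(run_summarizer(golden_input))
    # Threshold tuned on historical runs; exact-match assertions are useless here
    assert cosine(expected, actual) > 0.85
```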
10. 📦 The Prompt Versioning Tax
Prompts are code. They need:
- Version control
- A/B testing
- Rollback capability
- Environment-specific variants (dev/staging/prod)
- Performance tracking per version
Most teams store prompts in strings. In code. Mixed with business logic. It works until it doesn't. 🙃
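A first step out of the strings-in-code trap is an explicit, immutable registry. A minimal sketch (file- or DB-backed in practice, with per-version metrics attached):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptVersion:
    name: str
    version: str
    template: str

REGISTRY: dict[tuple[str, str], PromptVersion] = {
    ("summarize-document", "v3"): PromptVersion(
        name="summarize-document",
        version="v3",
        template="Summarize the following document in {max_words} words:\n\n{document}",
    ),
}

def get_prompt(name: str, version: str) -> PromptVersion:
    # Versions are immutable: a bad rollout is fixed by pointing back to v2, not editing v3
    return REGISTRY[(name, version)]
```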
11. 🔒 The Security Tax
Your AI processes user input. That input might contain:
- Prompt injection attempts
- Data exfiltration payloads
- Adversarial examples
- PII that needs protection
You need input validation, output sanitization, access controls, and audit logging – all specific to the LLM context.
12. 🧩 The Integration Tax
Your AI doesn't live in isolation. It needs to:
- Call functions / use tools
- Access your database
- Respect user permissions
- Integrate with existing workflows
- Handle authentication
Each integration point multiplies the scaffolding complexity.
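The permission piece is the one teams forget: the model proposes a tool call, but your code decides whether this user may make it. A hedged sketch with hypothetical `TOOL_REGISTRY` and `permissions_for()` helpers:

```python
from typing import Callable

TOOL_REGISTRY: dict[str, Callable] = {}  # hypothetical: vetted tools only

def permissions_for(user_id: str) -> set[str]:
    return set()  # hypothetical: LLM tool calls inherit the user's ACLs

def dispatch_tool_call(user_id: str, name: str, args: dict):
    """Never execute a model-proposed call without validating tool, user, and args."""
    tool = TOOL_REGISTRY.get(name)
    if tool is None:
        raise ValueError(f"model requested unknown tool: {name!r}")
    if name not in permissions_for(user_id):
        raise PermissionError(f"user {user_id} may not call {name!r}")
    return tool(**args)  # in production, validate args against the tool's schema first
```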
The Real Architecture of an AI Feature
Here's what the architecture actually looks like for a "simple" AI-powered search feature:
```
User Query
     ↓
┌───────────────────────────────────────────────┐
│ Input Validation                              │ ← Prompt injection defense
│ PII Detection                                 │ ← Privacy compliance
└───────────────────────┬───────────────────────┘
                        ↓
┌───────────────────────┴───────────────────────┐
│            Context Assembly Engine            │
│  ┌─────────────┐  ┌──────────┐  ┌───────────┐ │
│  │ User Prefs  │  │ History  │  │ Knowledge │ │
│  │ (filtered)  │  │ (summed) │  │ (RAG'd)   │ │
│  └─────────────┘  └──────────┘  └───────────┘ │
│ Token Budget Manager                          │ ← Counts, truncates, prioritizes
└───────────────────────┬───────────────────────┘
                        ↓
┌───────────────────────┴───────────────────────┐
│                Provider Router                │
│    ┌─────────┐   ┌─────────┐   ┌─────────┐    │
│    │ OpenAI  │   │ Claude  │   │ Gemini  │    │
│    │ (fast)  │   │ (smart) │   │ (cheap) │    │
│    └─────────┘   └─────────┘   └─────────┘    │
│ Rate Limiter · Circuit Breaker                │
└───────────────────────┬───────────────────────┘
                        ↓
┌───────────────────────┴───────────────────────┐
│               Response Pipeline               │
│ Parse → Validate → Filter → Transform         │
│ Log → Track → Cache → Attribute Cost          │
└───────────────────────┬───────────────────────┘
                        ↓
┌───────────────────────┴───────────────────────┐
│                Output Delivery                │
│ Stream to client → Update state → Notify      │
└───────────────────────────────────────────────┘
```
This is for one feature. One query box. One AI call.
Now multiply this by every AI-powered feature in your product. 😅
How to Survive the Scaffolding Tax
Okay, doom and gloom over. Here's how smart teams are managing this:
✅ 1. Build a Scaffolding Layer First, Features Second
Don't build scaffolding per-feature. Build a shared AI platform layer:
```typescript
// Instead of this per-feature:
const response = await openai.chat.completions.create({
  model: "gpt-4",
  messages: [...],
});

// Build this once:
const result = await aiPlatform.complete({
  task: "summarize-document",
  input: document,
  userId: user.id,
  budget: { maxTokens: 4000, maxCost: 0.05 },
  fallback: "cache-or-degrade",
});
```
The platform handles routing, budgeting, logging, retries, and guardrails. Features just declare intent.
✅ 2. Use AI Middleware Libraries
The ecosystem is catching up. Tools that handle the scaffolding:
- LangChain / LlamaIndex – context management, chains, agents
- Guardrails AI – output validation and structured extraction
- LiteLLM – multi-provider abstraction (100+ providers, one API)
- LangSmith / Helicone – observability and cost tracking
- PromptLayer / Promptfoo – prompt versioning and testing
Don't build what you can buy or borrow. The scaffolding tax is real, but you don't have to pay it from scratch.
✅ 3. Make the Tax Visible
Track scaffolding hours separately in your project management:
```
Feature: AI-powered document summary
├── Core logic: 8h (actual feature)
├── Scaffolding: 34h (infrastructure)
│   ├── Context mgmt: 10h
│   ├── Error handling: 8h
│   ├── Guardrails: 6h
│   ├── Logging: 5h
│   └── Cost controls: 5h
└── Tax ratio: 81%
```
When leadership sees the real numbers, they make better decisions about which features deserve the AI treatment.
✅ 4. Start With the Hardest Scaffolding First
Most teams build the feature first and bolt on scaffolding later. Flip it:
- Week 1: Provider abstraction, logging, cost controls
- Week 2: Guardrails, retry logic, testing framework
- Week 3: Now build the actual feature
The feature will ship faster because the infrastructure is ready. And you won't be retrofitting security at 2 AM before launch.
✅ 5. Know When NOT to Use an LLM
Not every problem needs an AI. Seriously.
| Problem | AI? | Why? |
|---|---|---|
| Sentiment analysis | ✅ Yes | NLP is hard; LLMs are great at it |
| Email validation | ❌ No | Regex exists. Use it. |
| Code generation | ✅ Yes | Complex output; LLMs excel |
| Null check | ❌ No | Please don't. |
| Document summarization | ✅ Yes | Core LLM strength |
| Sorting a list | ❌ No | `.sort()` doesn't hallucinate |
The Scaffolding Tax is $0 for features that don't use LLMs. Choose wisely.
The Uncomfortable Truth
Here it is, the thing nobody wants to say out loud:
The AI Scaffolding Tax means your team is building a platform company whether you want to or not.
Every AI feature you ship adds to your internal platform. You're not just building a product anymore; you're building the infrastructure to build the product. That's a fundamentally different kind of engineering effort, and it needs to be resourced accordingly.
Companies that treat AI features like regular features will drown in scaffolding debt. Companies that acknowledge the tax and invest in the platform will move 10x faster.
The scaffolding isn't waste. It's the real product. The LLM is just the engine; the scaffolding is the car. 🚗
TL;DR 📝
- 70% of AI engineering effort goes to infrastructure, not features
- The "Scaffolding Tax" includes: token counting, context management, guardrails, logging, cost controls, multi-provider support, testing, and more
- Build a platform layer first, features second
- Use existing tools (LangChain, LiteLLM, Guardrails AI) instead of building from scratch
- Make the tax visible โ track scaffolding hours separately
- Know when NOT to use an LLM โ not every problem needs AI
- Companies that invest in the scaffolding platform will win
What's Your Scaffolding Horror Story? 😬
I want to hear from you. What's the most ridiculous piece of infrastructure you've had to build just to make an LLM work in production? How much of your engineering time goes to scaffolding vs. actual features?
Drop a comment below. Let's commiserate. 🍻
If this post saved you from a scaffolding surprise, give it a reaction ❤️ and follow for more honest takes on building with AI. No hype, just engineering.
Cover image: The AI Iceberg – the model is the tip; the scaffolding is everything beneath the surface.