DEV Community: Abhijeet Hiwale

Why Your Prompt Is Only 5% of What the Model Sees

Abhijeet Hiwale — Mon, 22 Jun 2026 05:43:20 +0000

Most developers think they're prompting AI. They're actually injecting a tiny message into a much larger machine — and the machine is mostly running without them.

Here's the uncomfortable math: in production AI systems, the user's actual prompt is often less than 5% of the total context sent to the model. The other 95%? System instructions, retrieved documents, conversation history, injected data, tool results, and examples the developer constructed before your message even arrived.

This distinction has a name: context engineering. And if you don't understand it, you'll keep blaming the model for problems that are actually yours.

What the model actually sees

When you type a message into ChatGPT or any AI product, you're not talking directly to the model. You're contributing to a larger document — the full context window — that gets assembled behind the scenes before any inference happens.

Here's a simplified version of what that looks like for a tool like Cursor when a developer types seven words — "Add error handling to this function":

[System prompt: You are an expert software engineer. Write clean, production-ready code. Follow the existing coding style...]
[Current file: 500-2000 tokens of your code]
[Related files: 300-1000 tokens of imports, types, interfaces]
[Project structure: This is a TypeScript/Next.js project using Prisma ORM]
[Recent edits: what you changed in the last 5 minutes]
[Error messages: current terminal output]
[User message: "Add error handling to this function."]

Total context: 2,000–5,000 tokens. Your message: 7 words.

That's why Cursor writes code that actually fits your project — correct imports, matching style, right error types. The model itself isn't smarter. The context construction is.

Five layers that actually shape the output

Working through an ML cohort recently, one framework stuck with me as genuinely useful — breaking context down into five layers. Each one narrows the probability space the model draws from.

Layer 1: Role. Tell the model who it is. "You are a senior backend engineer" shifts vocabulary, depth, and assumptions. The model draws from patterns in its training data that match that role.

Layer 2: Task. Be specific about what you want. "Give me 3 options with tradeoffs" is different from "explain this." The model needs the shape of the output before it can produce a good one.

Layer 3: Knowledge. This is the most powerful layer. Inject context the model doesn't have — your codebase, your domain, your constraints. A model with your specific context beats a bigger model with a generic prompt every time.

Layer 4: Format. Define the structure. Bullet points, max two sentences each, with an example. The model is trained on millions of formatted documents and follows formatting instructions precisely.

Layer 5: Constraints. Say what you don't want. "No generic advice. No paid ads. Only approaches that work for developer tools." This eliminates the parts of the probability space you're not interested in.

The difference between a prompt that uses zero of these layers and one that uses all five isn't incremental. It's the difference between a model averaging across all possible responses to a topic versus drawing from a small, highly relevant slice.

The same model, completely different behavior

Here's the thing that took a while to internalize: the model's weights don't change. What changes is the context.

Claude, for example, has a system prompt you never see — a set of behavioral instructions baked in before your message arrives. That's what shapes its honesty about uncertainty, its tendency to show reasoning, its refusal to make things up. Change the system prompt, change the behavior. Same model, same parameters, completely different assistant.

This is also why the same base model powers completely different products. The AI that answers your customer support query and the AI that writes your code are often the same underlying model with different context construction.

Coming from a backend engineering background — building APIs, managing microservices, writing systems where every component is traceable — this framing clicked immediately. Context engineering is just configuration. The model is the runtime. What you inject determines what runs.

What this means practically

If your AI feature is producing mediocre output, the default move is to reach for a bigger model or a better prompt. Most of the time, the actual fix is in the context you're constructing.

Before upgrading the model, ask:

What does the model actually see when a request arrives?
Is the relevant context being retrieved and injected, or assumed?
Are the role, task, format, and constraints explicitly defined — or left for the model to guess?

RAG systems are essentially automated context engineering. Instead of manually figuring out what the model needs to know, you retrieve it dynamically from a vector database and inject it into the prompt. The model's job stays the same — next-token prediction over whatever it sees. The engineering work is in making sure it sees the right things.

The shift from "prompt engineering" to "context engineering" sounds like semantics. It's not. Prompt engineering treats the user's message as the thing to optimize. Context engineering treats the entire input — everything the model sees — as the system to design.

That reframe changes what you build, what you debug, and what you blame when something goes wrong.

Lets connect on LinkedIn : https://www.linkedin.com/in/abhijeethiwale/

Your AI Isn't Broken. Your Architecture Is.

Abhijeet Hiwale — Sun, 21 Jun 2026 06:49:36 +0000

Everyone blames hallucination. I've started blaming the design.

I work on a fintech banking platform — Java, Spring Boot, microservices. When a payment fails, we don't shrug and say "the network is probabilistic." We trace it. We find the exact hop where something went wrong. We fix it.

But when an LLM-powered feature fails, the default reaction is usually: "yeah, AI hallucinates sometimes."

And after going through a structured ML cohort over the last few weeks, I think I finally understand why.

The model is working fine. The pipeline isn't.

Large language models are probabilistic by design. They don't look up answers — they generate the most statistically likely next token given context. That means they will occasionally produce plausible-sounding output that isn't grounded in fact.

This is a known property, not a bug. The mistake is building systems that treat this probabilistic step as if it were a deterministic one.

Here's a concrete example. Say you're building a banking chatbot that needs to:

Parse the user's intent ("show me last month's transactions over ₹5000")
Query the transactions database
Format and summarize the results
Respond to the user

Steps 2 and 3 are deterministic. There's a correct answer. The transactions either exist or they don't. The sum is either right or wrong.

If you route those steps through an LLM — asking it to generate a SQL query, run it mentally, summarize the output — you've introduced a probabilistic component where zero ambiguity is acceptable. In a financial context, a "plausible-sounding" transaction summary that's 3% wrong is not a minor UX issue. It's a compliance problem.

The math compounds fast

Here's what most people miss when they start chaining LLM calls together.

If each step in your pipeline has a 90% success rate — which sounds fine — and you have 5 steps, your overall pipeline reliability is:

0.9 × 0.9 × 0.9 × 0.9 × 0.9 = ~59%

A 5-step agentic workflow where every node is an LLM call fails 4 out of 10 times. Not because any single step is broken. Because the architecture is wrong.

This is something I think about in terms of how we handle fraud detection on our platform. The ML model's job is to score a transaction — is this pattern anomalous? That's genuinely probabilistic. Pattern matching under uncertainty is exactly what the model is good at.

But the downstream decision — block the card, flag for review, let it pass — that's a deterministic rule engine. Hard thresholds. Business logic. Audit trails. Putting an LLM in that loop would be architecturally insane, regardless of how good the model is.

The model handles ambiguity. The function handles decisions.

The part nobody talks about in tutorials

Every LLM tutorial shows you the happy path. Very few show you where the model should be completely absent from the pipeline.

The design question worth asking: which parts of this workflow require genuine judgment or language understanding, and which parts have a correct, verifiable answer?

LLM's job: extract intent, handle ambiguity, generate natural language.
Function call / API / rule engine's job: everything with a ground truth.

This isn't a new insight — it's basically what tool-use and function calling were invented for. The model decides what to do. A real function actually does it. But a lot of builders still treat function calling as a nice-to-have instead of a load-bearing architectural decision.

Where to actually look when your AI feature breaks

For a while I thought the hard part was getting the model to behave. Prompt engineering. Fine-tuning. Better retrieval.

The cohort work I've been doing shifted that. The models are actually pretty capable. What's hard is:

Knowing which parts of your pipeline should never touch the model
Building the decision layer that acts on model output — the bridge between a score or a label and an actual system action
Tracing failures accurately so you don't blame the model when the architecture is wrong

If your AI feature is unreliable, the honest diagnostic question is: how many of my pipeline steps are probabilistic that shouldn't be? The answer is usually more than you think.

Hallucination is real. But it's also one of the most convenient excuses in AI engineering right now.

Most of the failures I've seen — in projects, in tutorials, in production systems discussed in public postmortems — aren't the model generating nonsense. They're systems that were designed without a clear line between "where the LLM is appropriate" and "where a function call is appropriate."

Draw that line first. Build around it. Then see how often the model is actually the problem.