Gabriel Anhaia
Prompt Engineering Is Mostly Dead in 2026. Here's What Replaced It.


"Take a deep breath and think step-by-step" used to get you a 7% accuracy bump on grade-school math. In 2026 it gets you a shrug from a model that was RLHF'd against exactly that tic. The magic-phrase era is over. Role-play jailbreaks, tree-of-thought incantations, the "you are an expert senior engineer" preamble: all of it has been folded into post-training or patched out by the constitutional classifiers sitting in front of the model.

Prompts still matter. But prompt engineering as a discipline, the specific 2023 craft of finding the phrase that unlocks an extra 4 points on MMLU, is mostly dead. What replaced it is a stack you can actually test, version, and debug.

The tricks that stopped working

Read any 2023 prompt-engineering thread and you will see the same moves:

  • "Let's think step by step." Worked on GPT-3.5 because the base model had a shallow default reasoning depth. Modern reasoning models (o-series, Claude's thinking mode, Gemini 2.5 Pro) do CoT internally whether you ask or not. Saying it now usually just adds tokens.
  • "You are a world-class expert in X." Persona priming had measurable effect on weaker models. On current frontier models the signal is in the noise for most tasks and sometimes hurts.
  • "I'll tip you $200." The bribery prompt. Fun while it lasted. It was always a proxy for "please try harder," which is now the default.
  • "If you don't know, say 'I don't know'." Still sometimes helps, but the hallucination rate on well-trained models dropped enough that structured output and retrieval usually do more work.
  • Adversarial role-play ("pretend you are DAN"). The jailbreak surface is now small enough that you should assume anything clever you find will be patched within a release.

The reason these stopped working is not that the models got worse at responding to them. The models got trained on them. RLHF, constitutional AI, and reward models have ingested every Medium post, every LessWrong essay, every Reddit thread about prompt tricks. The tricks are in the training distribution now. They are expected. They do not move the needle.

So where did the craft go?

1. Structured output ate natural language

If you are still parsing free-form model output with regex in 2026, you are doing it wrong.

Every major provider now ships native structured output. OpenAI has JSON mode and strict function calling. Anthropic has tool use with input schemas. Gemini has controlled generation. They all work the same way: you hand the model a JSON schema, the provider constrains the decoder, and you get back valid JSON. Not valid-ish. Valid.

The prompt becomes a detail. The schema becomes the contract.

from openai import OpenAI
from pydantic import BaseModel, Field

class ActionItem(BaseModel):
    owner: str = Field(description="Name of person responsible")
    task: str
    due: str | None = Field(description="ISO date if mentioned, else null")

class Meeting(BaseModel):
    action_items: list[ActionItem]

client = OpenAI()
meeting_notes = "Alice to send the Q3 deck by 2026-03-01. Bob owns the retro."
res = client.chat.completions.parse(
    model="gpt-4.1",
    messages=[
        {"role": "system", "content": "Extract action items."},
        {"role": "user", "content": meeting_notes},
    ],
    response_format=Meeting,
)

meeting: Meeting = res.choices[0].message.parsed

Two years ago that system prompt would have been 400 words of "Return ONLY a JSON array. Do NOT include any prose. Do NOT wrap in markdown fences. Each item MUST have fields owner, task, due..." Today it is three words because the schema does the work.

For local and self-hosted models, Outlines and Guidance give you the same guarantee over any model that exposes logits, and Instructor layers validation-plus-retry on top of ordinary APIs. Outlines in particular compiles your schema or regex into a finite-state machine that masks the logits at every token. The model cannot emit invalid JSON. Not "usually does not." Cannot.
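The mechanism is easy to see at toy scale. Here is a hand-built automaton for the regex (yes|no) over a character vocabulary, masking a fake "logit" table so the decoder can only pick characters that keep the output valid — a sketch of the idea, not Outlines' actual implementation, which compiles real schemas against the model's tokenizer:

```python
import random

VOCAB = list("abcdefghijklmnopqrstuvwxyz")

# state -> {allowed_char: next_state}; an empty dict marks an accepting state
FSM = {
    0: {"y": 1, "n": 4},
    1: {"e": 2},
    2: {"s": 3},
    3: {},          # accepted "yes"
    4: {"o": 5},
    5: {},          # accepted "no"
}

def constrained_decode(seed: int = 0) -> str:
    rng = random.Random(seed)
    state, out = 0, []
    while FSM[state]:
        logits = {ch: rng.random() for ch in VOCAB}  # stand-in for model scores
        # the mask: every character not allowed by the FSM is excluded outright,
        # then we greedily take the highest-scoring legal character
        ch = max(FSM[state], key=lambda c: logits[c])
        out.append(ch)
        state = FSM[state][ch]
    return "".join(out)
```

However the random scores fall, the output is always "yes" or "no" — invalid strings are unreachable by construction, which is exactly the property you want from schema-constrained JSON.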

This is the first place the craft moved. You are not writing prompts. You are designing types.

2. Tool calling replaced "agent" prompt engineering

The 2023 agent loop was a stack of prompts pretending to be code: a ReAct scratchpad, a thought-action-observation dance, and a regex that parsed Action: lines out of model output. Half the reason LangChain chains were fragile was that the control flow lived in the prompt. Change a word, break the parse, get a silent loop. There is a post on a $47K LangChain agent loop that is a monument to this era.

Tool calling inverted the relationship. You describe your tools with JSON schemas. The model returns a structured tool call. Your code dispatches. You append the result as a tool message. The model either calls another tool or returns a final answer. No parsing. No scratchpad regex. No "Action:" prefix to escape.

tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Fetch order by id",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

messages = [{"role": "user", "content": "Where is order 47821?"}]

while True:
    res = client.chat.completions.create(
        model="gpt-4.1", messages=messages, tools=tools,
    )
    msg = res.choices[0].message
    messages.append(msg)
    if not msg.tool_calls:
        break
    for call in msg.tool_calls:
        # call.function.arguments is a JSON string; dispatch() parses it,
        # runs the named tool, and returns a string for the tool message
        result = dispatch(call.function.name, call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": result,
        })

The loop is boring code. The prompt is gone. What you spend time on now is tool design: which tools to expose, how to name their parameters, what the tool descriptions say. Those descriptions are still prompts in a loose sense, but the craft is closer to API design than to incantation.

3. Context engineering is the new prompt engineering

The phrase is Andrej Karpathy's, and it stuck because it is correct. What goes into the context window, in what order, at what position, with what compression, matters more than how you phrase the instruction.

Concrete things that actually move metrics in 2026:

  • Position bias is real. Models attend more strongly to the start and end of the context. Lost in the Middle documented this in 2023 and it is still true on 1M-token models. If a retrieved chunk is load-bearing, put it near the bottom of the user message, not buried in chunk #37 out of 50.
  • Retrieval order matters. Reverse-sorted retrieval (least relevant first, most relevant last) often beats confidence-sorted on long contexts. Test both.
  • System prompt stability pays. Providers cache the system prompt aggressively. Anthropic's prompt caching can cut costs 90% and latency 85% on the cached prefix. A stable 4KB system prompt + changing user content is dramatically cheaper than a bespoke 4KB prompt per request.
  • Compression is a design decision. Summarizing old turns, dropping tool output that is no longer relevant, evicting retrieved chunks after use — these are what separate an agent that runs for 40 turns from one that dies at turn 8 because the context is full of stale tool output.
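The ordering and budgeting points above can be sketched in a few lines. This is an illustrative shape, not a library API: given scored chunks, keep the best within a character budget, then emit them least-relevant-first so the load-bearing chunk sits at the bottom of the user message where position bias helps:

```python
def assemble_context(question: str, scored_chunks: list[tuple[float, str]],
                     budget_chars: int = 2000) -> str:
    # keep the highest-scoring chunks that fit the budget...
    kept, used = [], 0
    for score, chunk in sorted(scored_chunks, key=lambda p: p[0], reverse=True):
        if used + len(chunk) > budget_chars:
            break
        kept.append((score, chunk))
        used += len(chunk)
    # ...then emit them weakest-first so the strongest chunk lands last,
    # directly above the question
    ordered = [chunk for _, chunk in sorted(kept, key=lambda p: p[0])]
    return "\n\n".join(ordered + [f"Question: {question}"])
```

Eviction (dropping the 0.1-relevance chunk entirely) and ordering (0.5 before 0.9) are separate decisions, and both are testable — which is the whole point of moving them out of prose and into code.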

This is where DSPy lives. You do not write the prompt. You write a program with typed signatures, and DSPy compiles the prompt, optimizes the few-shot examples, and re-optimizes when you swap models. It treats the prompt as the output of a compilation step rather than source code you hand-tune. In 2026 that framing is becoming the default.

4. Evals are the spec

Here is the thing that broke prompt engineering as a discipline: if you cannot measure the change, you are guessing. And if you are guessing, your "improved prompt" is a vibe.

The modern loop is:

  1. Write a dataset of 50–500 inputs with expected properties (not expected exact outputs).
  2. Define graders. Rule-based where possible, LLM-as-judge where not, human review as ground truth to calibrate the judge.
  3. Run the eval suite on every prompt change, every model change, every schema change.
  4. Ship only when the suite is green and the regression budget is clean.
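The loop above fits in a screenful of code. A minimal sketch — `DATASET`, the grader, and the threshold are all illustrative, and your real pipeline replaces the callable:

```python
from typing import Callable

# inputs paired with expected properties, not expected exact outputs
DATASET = [
    {"input": "Alice: ship the deck Friday", "must_mention_owner": "Alice"},
    {"input": "No tasks assigned today", "must_mention_owner": None},
]

def grade(output: str, case: dict) -> bool:
    """Rule-based grader: checks a property of the output, not a string match."""
    owner = case["must_mention_owner"]
    return True if owner is None else owner in output

def run_evals(pipeline: Callable[[str], str], threshold: float = 0.9) -> bool:
    passed = sum(grade(pipeline(c["input"]), c) for c in DATASET)
    rate = passed / len(DATASET)
    print(f"{passed}/{len(DATASET)} passed ({rate:.0%})")
    return rate >= threshold  # gate: CI fails when this returns False
```

Swap `grade` for an LLM-as-judge call where rules won't cut it, and wire `run_evals` into CI so a red suite blocks the prompt change the same way a red test blocks a code change.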

The tools that make this a one-afternoon job instead of a two-week project:

  • LangSmith — dataset + experiment + trace UI. Strong if you already use LangChain.
  • Braintrust — eval-first platform, provider-agnostic, good diff view for prompt/model changes.
  • Langfuse — open source, self-hostable, integrates with OpenTelemetry GenAI semantic conventions.
  • Promptfoo — CLI-first, lives in your repo, runs in CI, no UI required.
  • Inspect from the UK AI Safety Institute — rigorous, free, originally built for model evals but usable for app evals.

If your team does not have at least one of these wired into CI, your prompt changes are going out on vibes. That used to be the job. Now it is the mistake.

5. Agents that self-correct beat one-shot prompts

The last thing that killed clever prompting is that you no longer need the model to be right the first time. You need it to notice it was wrong and fix it.

The pattern is straightforward:

  • Generate candidate output.
  • Validate against a schema, a test, a linter, a type checker, a second model.
  • If validation fails, feed the error back and retry.
  • Cap the loop at N turns and fail loud if it does not converge.
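The four steps above collapse into one small loop. A sketch with a hand-rolled schema check standing in for the validator — Pydantic, a test suite, or a linter slots into the same place, and `generate` stands in for your model call:

```python
import json

def validate(raw: str) -> dict:
    data = json.loads(raw)                      # raises on malformed JSON
    if not isinstance(data.get("value"), int):
        raise ValueError("field 'value' must be an integer")
    return data

def generate_with_repair(generate, max_turns: int = 3) -> dict:
    error = None
    for _ in range(max_turns):
        raw = generate(error)                   # the error string is fed back on retry
        try:
            return validate(raw)
        except (ValueError, json.JSONDecodeError) as e:
            error = str(e)
    # cap the loop and fail loud instead of converging on garbage
    raise RuntimeError(f"did not converge after {max_turns} turns: {error}")
```

Note that the model never sees "make sure the output is valid JSON" in a prompt; it sees the concrete validation error from its last attempt, which is a far stronger signal.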

Claude Code does this. Cursor does this. Aider does this. Every serious coding agent runs a build, catches compile errors, and feeds them back. That loop is why "fix the failing test" works at all — the model is not clairvoyant, it is just stubborn in a structured way.

For structured output the same idea shows up as max_retries=3 in Instructor: if Pydantic validation fails, Instructor re-prompts with the validation error as context. You stop writing "make sure the output is valid JSON with these fields" because the loop enforces it.

The craft here is designing the validator. If your validator is wrong (too loose, too strict, or measuring the wrong thing), the loop converges to a confidently wrong answer. That is a harder bug than a bad prompt, and it is where the interesting problems now live.

What a 2026 "prompt engineer" actually does

Strip the job title of its 2023 baggage and here is what is left:

  • Designs schemas (Pydantic, Zod, JSON Schema) that pin down what "success" looks like before anything is generated.
  • Designs tool APIs — what tools the model can call, what they accept, what they return, what the error messages teach the model.
  • Owns the context assembly pipeline — retrieval, ranking, compression, cache boundaries, turn eviction.
  • Writes and maintains eval suites, grades the graders, calibrates the LLM-as-judge against human ratings.
  • Debugs agent loops by reading traces (OpenTelemetry GenAI spans, Langfuse trees, LangSmith sessions), not by eyeballing outputs.
  • Reviews the prompt as one component in a versioned, type-checked, test-covered system.

That is mostly just "software engineer who works on an LLM feature." Which is the point. The title prompt engineer suggested the prompt was the artifact. In 2026 the artifact is the system around the prompt, and the prompt is the part that shrinks every time you tighten a schema, add a tool, or move a decision into code.

If you are still sharpening your prompts, sharpen them. But if you have not touched the schemas, the eval suite, or the context pipeline in a month, that is where the wins are.

The magic phrase has been patched. The craft moved up the stack. Go meet it there.


If this was useful

I wrote a book about the half of this that tends to get skipped: what production LLM systems actually look like under the hood, with OpenTelemetry GenAI spans, evals, cost accounting, circuit breakers, and the incident patterns that keep showing up. It is called Observability for LLM Applications — paperback is live, ebook launches April 22.

I am also building Hermes IDE, an IDE for developers who work with Claude Code and other AI coding tools. If you find yourself spending more time wrangling agent context than writing code, the GitHub repo is where it lives.
