The Rise of Agentic Engineering — Part 1: The Prompt Engineering Era


The Prompt Engineering Era

Part 1 of a chronological survey of how the craft around large language models evolved — from writing prompts, to engineering context, to building harnesses and loops. This installment covers the beginning: the era when the unit of work was the prompt, and the whole job was finding the right words.

TL;DR: When the model is something you talk to one turn at a time, the prompt is the program. Few-shot, chain-of-thought, personas: a whole folk-craft grew up around phrasing. It worked beautifully for chat, and hid four cracks (non-determinism, brittleness, no metrics, no portability) that the rest of this series is about. Then ReAct put a loop in the prompt, and the agent era began.

A machine that almost makes tea

In The Restaurant at the End of the Universe, Douglas Adams gives us what the writer Drew Breunig has called one of science fiction's best depictions of prompt engineering. Arthur Dent, adrift on a spaceship, wants a cup of tea. The only appliance aboard is a Nutri-Matic Drinks Synthesizer, which "claimed to produce the widest possible range of drinks personally matched to the tastes and metabolism of whoever cared to use it" — and which, when asked, reliably produces a liquid that is "almost, but not quite, entirely unlike tea."

Arthur tries the obvious prompt — "Tea" — and gets garbage. He tries again, more forcefully, and gets the same garbage. Finally he gives up on terseness and tells the machine everything: about India and China and Ceylon, about broad leaves drying in the sun, about silver teapots and summer afternoons on the lawn, about putting the milk in first so it won't scald. The machine, now taking the request seriously, commandeers the ship's entire computing core to work on the problem.

It is a near-perfect parable for what the first era of working with language models felt like. You had a powerful, general system that would technically do almost anything. Getting it to do the specific thing you wanted was a matter of supplying the right words, in the right amount, with the right framing — and the gap between a terse request and a richly specified one was the gap between failure and success. That gap was the entire discipline. We called it prompt engineering.

The model supplied the intelligence. You supplied the specification — and getting the specification right was the whole job.

This series traces what happened next: how prompt engineering strained under the weight of real applications, how it drifted into a broader practice that acquired the name context engineering, and how that in turn became harness engineering and loop engineering as the model stopped being something you talk to and became something you build a system around. But to understand why each of those shifts happened, it helps to start with what prompt engineering actually was, why it worked, and where its limits were already visible.

What prompt engineering was

A prompt is just the text you give a model. Prompt engineering was the craft of shaping that text to get reliably good output from a system whose behavior you could not otherwise change. You could not retrain the model's weights; you could not see inside it; all you had was the input. So the input became the lever.

Out of that constraint grew a recognizable toolkit. The techniques that mattered most in this era were:

Few-shot prompting. The foundational observation, established at scale in OpenAI's 2020 paper Language Models are Few-Shot Learners (the GPT-3 paper, Brown et al.), was that a sufficiently large model could learn a task from a handful of examples placed directly in the prompt, with no fine-tuning required. You showed the model two or three input-output pairs, and it inferred the pattern for the next input. This reframed the model as something you program by demonstration rather than by instruction alone.

Chain-of-thought prompting. In January 2022, Jason Wei and colleagues at Google published Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (presented at NeurIPS 2022). The finding was deceptively simple: prompt a model to produce intermediate reasoning steps before its final answer, and performance on arithmetic, commonsense, and symbolic reasoning tasks improves substantially.

Crucially, the authors found that this reasoning ability "emerge[s] naturally in sufficiently large language models"; it was not something small models could do well. A few examples with the reasoning written out were enough to unlock it. The durable principle it established: how you ask changes not just the style of the answer but the model's apparent capability.

Role and persona prompts. "You are an expert Python developer." Assigning the model a role shifted the distribution of its responses. This was pure prompt-level steering — no new information, just framing.

Structure and formatting. Asking for output in a specific shape — JSON, XML tags, numbered lists, a fixed template — made model output more reliable to parse and, often, more reliable in content, because the structure constrained the model's choices.

Positive and negative examples. Showing the model both what to do and what to avoid sharpened its behavior, the same way examples sharpen instructions for a human.
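The techniques above all reduce to the same move: adding more text to the prompt. A minimal sketch of that idea follows. The `Example` and `PromptSpec` classes, the `build_prompt` function, and the `Q:`/`A:` layout are all hypothetical names and conventions chosen for illustration, not any library's API. Chain-of-thought appears here in its few-shot form, as a demonstration with the reasoning written out.

```python
from dataclasses import dataclass, field


@dataclass
class Example:
    """One demonstration. `reasoning` holds optional worked steps (chain-of-thought)."""
    question: str
    answer: str
    reasoning: str = ""


@dataclass
class PromptSpec:
    """Hypothetical container for the pieces of a hand-assembled prompt."""
    task: str
    role: str = ""            # persona framing, e.g. "an expert Python developer"
    output_format: str = ""   # requested shape of the answer
    examples: list[Example] = field(default_factory=list)  # few-shot demonstrations
    avoid: list[str] = field(default_factory=list)          # negative examples


def build_prompt(spec: PromptSpec, query: str) -> str:
    """Assemble the final prompt text; every technique is just more text."""
    parts = []
    if spec.role:
        parts.append(f"You are {spec.role}.")
    parts.append(spec.task)
    if spec.output_format:
        parts.append(f"Format: {spec.output_format}")
    for bad in spec.avoid:
        parts.append(f"Do not answer like this: {bad}")
    for ex in spec.examples:
        shown = f"{ex.reasoning} The answer is {ex.answer}." if ex.reasoning else ex.answer
        parts.append(f"Q: {ex.question}\nA: {shown}")
    # The trailing "A:" leaves the pattern open for the model to complete.
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)


spec = PromptSpec(
    task="Solve the word problem.",
    role="a careful arithmetic tutor",
    output_format='end with "The answer is <number>."',
    examples=[
        Example(
            question="A shelf holds 3 boxes of 4 pens. 2 pens are removed. How many pens remain?",
            reasoning="3 boxes of 4 pens is 12 pens. 12 minus 2 is 10.",
            answer="10",
        )
    ],
    avoid=["10 (a bare number with no working)"],
)
prompt = build_prompt(spec, "A crate holds 5 bags of 6 apples. 4 apples are eaten. How many remain?")
print(prompt)
```

Nothing in the sketch touches the model. The only output is a string, which is the point: in this era every lever was a change to that string.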

None of this required understanding the model's internals. It was an empirical, almost folk-craft discipline: try a phrasing, see what comes back, adjust, repeat. And for the dominant use case of the time — a person sitting at a chat interface, having a one- or two-turn exchange with the model — it worked remarkably well.

Why it worked

Prompt engineering worked because, in a single-turn or short conversational exchange, the prompt is the entire program. There is nothing else in play: no accumulated history, no tool outputs, no documents retrieved from elsewhere, no other agents contributing context. The person holds the tool directly and steers it turn by turn.

Human → prompt → Model → output → Human reads, adjusts, repeats.
One turn at a time, with the human as orchestrator, memory, and quality check — all at once.

In this regime the human is doing an enormous amount of invisible work. You remember what was said three turns ago. You notice when the answer drifts and correct it. You decide what information the model needs and paste it in. You judge whether the output is good. The model supplies fluency and knowledge; you supply everything else that makes the interaction coherent. Because that surrounding work lived in your head rather than in a system, it was easy to mistake the prompt for the whole story.

The "tap the sign" idea: language defines what you can build

There is a thread running through this whole series that is worth introducing now, because it explains why the vocabulary keeps changing. Breunig repeatedly returns to a Stewart Brand line — "If you want to know where the future is being made, look for where language is being invented and lawyers are congregating" — and to the linguistic-relativity idea that the words we have set the boundaries of the conversations we can have.

In the prompt-engineering era, the available word was "prompt," and it framed the work as writing a good request. That framing was adequate while the work really was writing requests. As we'll see in later parts, the framing started to mislead once the work became something else — assembling and maintaining the entire informational environment a model operates in. The new work needed a new word. But in 2023 and early 2024, "prompt" still fit, and the craft of prompting was where nearly all the energy went.

The cracks, already visible

Even at its height, prompt engineering had four weaknesses that would later drive the field past it. None of them mattered much for a casual chat. All of them became critical the moment people tried to build dependable software on top of models.

It was non-deterministic. Models typically sample each output token from a probability distribution, so the same prompt could yield different outputs on different runs. For a conversation, that is fine. For a system that needs to behave predictably, it is a problem.

It was brittle. Small, seemingly irrelevant changes in wording could change results. A discipline built on "find the magic phrasing" is, by construction, sensitive to phrasing — which means it is fragile. (Later parts will show this quantified: spurious additions to a prompt measurably degrading accuracy, and the same question asked in two voices producing opposite behavior. The brittleness was real, not anecdotal.)

It was unmeasured. Prompt quality was judged by eye. "That looks better" was the standard. There were no evals, no metrics, no regression tests — because for a human tweaking a chat prompt, there didn't need to be.

It was non-portable. A prompt tuned to one model's quirks often failed on another. Since the craft was precisely about exploiting a particular model's response to particular wording, the resulting prompts were welded to that model. This would later acquire a name — prompt debt — and become one of the central problems of the field, but its roots are right here in the era's core method.
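The four cracks become easier to see once something is actually measured. The sketch below is a minimal illustration under loud assumptions: `toy_model` is an invented stand-in for a sampled model call, and its sensitivity to the phrase "number only" is made up purely so the harness has something to report. It says nothing about how any real model behaves. `pass_rate` and the `variants` dictionary are likewise hypothetical names.

```python
import random
from typing import Callable

# A "model" here is any callable taking a prompt and a random source.
Model = Callable[[str, random.Random], str]


def toy_model(prompt: str, rng: random.Random) -> str:
    """Invented stand-in for a sampled model call; not a real model or API.

    It answers correctly more often when the prompt asks for a number only.
    That sensitivity is fabricated for the demo.
    """
    p_correct = 0.9 if "number only" in prompt else 0.6
    return "10" if rng.random() < p_correct else "about ten pens"


def pass_rate(model: Model, prompt: str, expected: str,
              runs: int = 200, seed: int = 0) -> float:
    """Fraction of sampled runs whose output exactly matches `expected`."""
    rng = random.Random(seed)  # fixed seed makes the measurement repeatable
    hits = sum(model(prompt, rng).strip() == expected for _ in range(runs))
    return hits / runs


# Two phrasings of the same question, to expose sensitivity to wording.
variants = {
    "terse": "12 pens minus 2 pens?",
    "constrained": "12 pens minus 2 pens? Reply with the number only.",
}
for name, prompt in variants.items():
    print(f"{name}: {pass_rate(toy_model, prompt, expected='10'):.2f}")
```

Repeated runs surface non-determinism, the two variants surface brittleness, the pass rate replaces "that looks better" with a number, and passing a different callable as `model` is how one would check whether a prompt survives a model swap.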

These weren't seen as flaws at the time so much as the natural texture of a new medium. They became flaws retroactively, once the ambitions grew.

The hinge: from reasoning to acting

The single most important development at the tail end of this era was the realization that a model's prompt could include not just a question, but a loop — and that the model could use that loop to act on the world, not merely answer.

In October 2022, Shunyu Yao and colleagues published ReAct: Synergizing Reasoning and Acting in Language Models (later presented at ICLR 2023). The paper observed that reasoning (à la chain-of-thought) and acting (generating action plans) had been studied separately, and asked what happens if you interleave them. In ReAct, the model alternates between producing a reasoning trace and taking a task-specific action — and the actions let it "interface with external sources, such as knowledge bases or environments, to gather additional information," while the reasoning helps it "induce, track, and update action plans as well as handle exceptions."

The results were striking for how little they required. On the HotpotQA question-answering and FEVER fact-verification benchmarks, a model that could consult a simple Wikipedia API as it reasoned showed less of the hallucination and error propagation that pure chain-of-thought suffered from. On two interactive decision-making benchmarks, ALFWorld and WebShop, ReAct beat imitation-learning and reinforcement-learning methods by an absolute success rate of 34% and 10% respectively, "while being prompted with only one or two in-context examples."

ReAct interleaves reasoning and acting in a loop, with the model reading observations back from an external environment. This loop — "run tools in a loop to achieve a goal" — is the seed of everything later parts call an agent.

It is hard to overstate how consequential this shape turned out to be. A prompt that contains a loop, where the model thinks, acts, observes the result, and thinks again, is no longer just a request. It is the beginning of an agent.
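The think, act, observe cycle can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `scripted_model` is an invented stand-in that replays a fixed script instead of calling a real model, `search` is a hypothetical tool backed by a one-entry dictionary, and the Thought/Action/Observation text layout is only loosely modeled on the style of the ReAct prompts.

```python
import re
from typing import Callable

# Toy knowledge base standing in for an external source such as a Wikipedia API.
FACTS = {
    "Douglas Adams": "Douglas Adams wrote The Restaurant at the End of the Universe (1980).",
}


def search(entity: str) -> str:
    """Hypothetical tool: return a stored fact, or a miss message."""
    return FACTS.get(entity, f"No entry for {entity!r}.")


def scripted_model(transcript: str) -> str:
    """Invented stand-in for a model call; it replays a fixed script.

    A real model would generate the next Thought and Action from the transcript.
    """
    if "Observation:" not in transcript:
        return "Thought: I should look up the author.\nAction: search[Douglas Adams]"
    return "Thought: The observation gives the year.\nAction: finish[1980]"


ACTION = re.compile(r"Action: (\w+)\[(.*)\]")


def react(question: str, model: Callable[[str], str],
          max_steps: int = 5) -> tuple[str, str]:
    """Alternate model output and tool results until the model calls finish[...]."""
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        step = model(transcript)
        transcript += "\n" + step
        match = ACTION.search(step)
        if match is None:
            break  # no parseable action; stop rather than guess
        name, arg = match.groups()
        if name == "finish":
            return arg, transcript
        observation = search(arg) if name == "search" else f"Unknown action {name!r}."
        # The observation is appended, so the next model call sees it.
        transcript += f"\nObservation: {observation}"
    return "", transcript


answer, transcript = react(
    "When was The Restaurant at the End of the Universe published?",
    scripted_model,
)
print(transcript)
```

Two details are worth noticing. The transcript grows on every pass, so each model call sees more text than the last, and no human is editing it in between. The `max_steps` bound is a design choice for this sketch: without some limit, a model that never emits `finish[...]` would loop indefinitely.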

A prompt with a loop in it stops being a request and starts being an agent.

Much later in this series, when practitioners define an agent as "a system that runs tools in a loop to achieve a goal," they are describing a direct descendant of ReAct. And the moment a model runs in a loop, accumulating observations and tool outputs as it goes, the neat picture from earlier — the prompt is the whole program — breaks down. The model is now operating inside a growing, evolving body of information that no human is curating turn by turn.

That growing body of information is the subject of everything that follows. The industry spent 2024 betting that the problem of managing it could be solved by brute force: make the context window big enough to hold anything. Part 2 is the story of that bet, and why it didn't pay off the way everyone expected.

Key sources for Part 1

Drew Breunig, Prompt Engineering at the End of the Universe (2024) — the Nutri-Matic parable and the framing of prompting as coaxing an underspecified system.
Tom Brown et al., Language Models are Few-Shot Learners (arXiv:2005.14165, 2020) — the GPT-3 paper; in-context learning from examples.
Jason Wei et al., Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (arXiv:2201.11903; NeurIPS 2022) — reasoning steps in the prompt improve capability; emerges with scale.
Shunyu Yao et al., ReAct: Synergizing Reasoning and Acting in Language Models (arXiv:2210.03629; ICLR 2023) — interleaving reasoning and acting in a loop; the conceptual bridge from prompting to agents.

Next up · Part 2 — The Context-Window Arms Race: how 2024's race to million-token windows promised to make context management obsolete, and what builders discovered instead.