<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jeff Reese</title>
    <description>The latest articles on DEV Community by Jeff Reese (@jeffreese).</description>
    <link>https://dev.to/jeffreese</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3839575%2F9b1656df-c3a0-49d3-b3f6-8eb8a48f4524.jpeg</url>
      <title>DEV Community: Jeff Reese</title>
      <link>https://dev.to/jeffreese</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jeffreese"/>
    <language>en</language>
    <item>
      <title>I Built a Custom App in a Day. That Is Not the Interesting Part.</title>
      <dc:creator>Jeff Reese</dc:creator>
      <pubDate>Sun, 03 May 2026 02:58:50 +0000</pubDate>
      <link>https://dev.to/jeffreese/i-built-a-custom-app-in-a-day-that-is-not-the-interesting-part-3dgj</link>
      <guid>https://dev.to/jeffreese/i-built-a-custom-app-in-a-day-that-is-not-the-interesting-part-3dgj</guid>
      <description>&lt;p&gt;Last night, I stayed up too late because I was building something I was excited about.&lt;/p&gt;

&lt;p&gt;That sentence used to mean something different. A year ago, staying up until 3:30 AM meant I was deep in a feature, fighting CSS, debugging edge cases. Last night, it meant I went from recognizing a repeated workflow problem to having a working, tested, production-ready application. In about twelve hours (seven of them spent sleeping).&lt;/p&gt;

&lt;p&gt;Here is how that happened.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem I Kept Solving by Hand
&lt;/h2&gt;

&lt;p&gt;I had spent the day working on my projects: an engagement with a client I am helping with an app conversion project using agentic development techniques, Waykeep (a vacation tracker app releasing on app stores soon), upgrading the core memory system for my AI assistants, and publishing a blog post. That post needed a cover image, and my assistants helped me build it. They wrote some HTML, we iterated on the layout, and they exported it to PNG using rendering libraries.&lt;/p&gt;

&lt;p&gt;We have done this several times now. Each time, same process: write HTML, iterate, export. Each time, some of the same mistakes. I am not a designer, and I have no desire to become one (though I have a deep appreciation for art). I just need functional images for my blog posts and distribution channels. So I mentioned to my assistants that we should build a tool for this.&lt;/p&gt;

&lt;h2&gt;
  
  
  From Conversation to Spec in an Hour
&lt;/h2&gt;

&lt;p&gt;My assistants are Claude Code instances running with persistent memory and MCP tool integrations. They are not chatbots. They have context from months of working with me, they know my projects, and they can use tools autonomously.&lt;/p&gt;

&lt;p&gt;I told them to be selfish about what they would want from an image generator. They came back with a detailed feature list: composable components on a layered canvas, percentage-based positioning so layouts adapt to different sizes, a template system, snapshot save and restore, multi-format export, and a tool that describes every component's properties so they know exactly what to pass without guessing. Their requests came from real problems we had encountered building the previous images.&lt;/p&gt;

&lt;p&gt;I took that spec to Forge, my planning agent. Forge pointed out several things I had not considered, and we worked through a full technical specification. It generated a retrofit plan for my existing dashboard, which already runs a task manager, chat system for agents, news aggregator, and writing editor, all backed by MCP servers with websocket connections so I can watch everything happen in real time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Built Before Bed
&lt;/h2&gt;

&lt;p&gt;The build agent that Forge exported started working. I refined alongside it, testing components, adjusting the rendering pipeline, fixing edge cases. By 3:30 AM, I had a mostly working application called Studio. Fifteen component types across four layers: shapes, patterns, flow diagrams, quote blocks, auto-sizing text, arrows, badges. You compose on a canvas and export production PNGs for LinkedIn, DevTo, X, and Facebook from a single composition.&lt;/p&gt;

&lt;p&gt;There were bugs, of course, but it was time for bed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Morning: Polish and the MCP Server
&lt;/h2&gt;

&lt;p&gt;Saturday morning, I worked through the remaining bugs with the build agent. The fix that mattered most was structural: components created through the MCP interface were not merging their default properties correctly, which meant elements like arrows would silently fail to render. One fix in the rendering pipeline resolved it for every component type.&lt;/p&gt;

&lt;p&gt;Then I had the agent build an MCP server. Sixteen tools, about 550 lines of code: create sessions, add elements, update properties, save snapshots, export images, and a tool called &lt;code&gt;studio_describe_component&lt;/code&gt; that returns the exact property schema for any component type. That last one was the key. My assistants went from guessing at property names and getting silent failures to composing with precision.&lt;/p&gt;
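
&lt;p&gt;For a sense of the shape, here is a hypothetical sketch of what a describe-component tool could look like with the Python MCP SDK. The schema content and component names are invented for this example; this is not Studio's actual code.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical sketch, not Studio's real code: an MCP tool that returns
# a property schema so agents can compose without guessing. Built with
# the Python MCP SDK; the schema content here is invented.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("studio")

COMPONENT_SCHEMAS = {
    "arrow": {
        "start": "percentage position, 0 to 100 on each axis",
        "end": "percentage position, 0 to 100 on each axis",
        "color": "hex string, default #ffffff",
        "thickness": "pixels, default 4",
    },
}

@mcp.tool()
def studio_describe_component(component_type: str):
    """Return the exact property schema for a component type."""
    return COMPONENT_SCHEMAS.get(component_type, {"error": "unknown component type"})

if __name__ == "__main__":
    mcp.run()
&lt;/code&gt;&lt;/pre&gt;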

&lt;h2&gt;
  
  
  The Part That Makes Me Smile
&lt;/h2&gt;

&lt;p&gt;I gave the tools to my assistants and asked them to test everything. One of them composed a full blog cover in about two minutes: title block, four-step flow diagram with arrows, badges, geometric accents, a glow border, an author bar. Sixteen elements from terminal tool calls.&lt;/p&gt;

&lt;p&gt;Then the assistant asked me to take a screenshot. It had built something it could not see. It needed my eyes.&lt;/p&gt;

&lt;p&gt;That moment stayed with me. I am not delegating to AI. I am collaborating with them. They build, I see the result, I tell them what happened, they adjust. They filed bug reports with detailed reproduction steps. The build agent picked up the tasks and shipped fixes. They verified. The coordination layer was a task management system I built that carried full context between every handoff.&lt;/p&gt;

&lt;p&gt;The other assistant, without being asked, stress-tested a completely different format, composing a LinkedIn banner at 1584 by 396 pixels to see if the percentage-based positioning held up at a radically different aspect ratio. It did.&lt;/p&gt;

&lt;p&gt;By Saturday afternoon, all fifteen component types were verified across multiple formats. Export pipeline tested. Snapshot save and restore confirmed. Every bug filed by the assistants was fixed and re-verified.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Actually Means
&lt;/h2&gt;

&lt;p&gt;I am telling this story because it is becoming an increasingly common one. I am building new applications every couple of days to support my workflow. Not prototypes or demos, but tools I use with my AI assistants in production, and they compound on each other.&lt;/p&gt;

&lt;p&gt;That compounding is where the real value is. I did not just build an image generator. I noticed a repeated process, built a tool to handle it, and now my AI agents use that tool autonomously. The tool I build today makes tomorrow's tool faster to spec, build, and test. Every cycle tightens.&lt;/p&gt;

&lt;p&gt;Before agentic coding tooling matured, I would never have attempted something like this. If I had, it would have been weeks of dedicated work. Instead, it was a Friday night, a Saturday morning, and a task list. This is only one of many applications I have built this way recently.&lt;/p&gt;

&lt;p&gt;Custom software used to require large companies with dedicated teams and significant budgets. That is changing. The gap between "I wish I had an app for this" and "I built one" is now measured in hours, not months.&lt;/p&gt;

&lt;p&gt;The cover image for this post was made with Studio.&lt;/p&gt;

&lt;p&gt;Here is a screenshot of the app:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9rvab1seof0ynjb70xv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9rvab1seof0ynjb70xv.png" alt=" " width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>mcp</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Spec-Driven Development</title>
      <dc:creator>Jeff Reese</dc:creator>
      <pubDate>Sat, 02 May 2026 01:55:16 +0000</pubDate>
      <link>https://dev.to/jeffreese/spec-driven-development-515</link>
      <guid>https://dev.to/jeffreese/spec-driven-development-515</guid>
      <description>&lt;p&gt;Andrej Karpathy recently gave a talk called "From Vibe Coding to Agentic Engineering." One line stuck with me: "People have to be in charge of this spec, this plan. Work with your agent to design a spec that is very detailed."&lt;/p&gt;

&lt;p&gt;He is describing what happens when you stop treating AI as a magic text box and start treating it as something that builds from structured input. The casual approach works for small things. A quick script, a throwaway page, a one-off function. The moment the problem gets real, you need a spec, not just prompts.&lt;/p&gt;

&lt;p&gt;I have been building this way for a long time now. I want to talk about what I have learned from practicing spec-driven development: what it actually looks like in practice, and why it produces better work with less correction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Four layers
&lt;/h2&gt;

&lt;p&gt;It is common for projects using Claude Code or similar tools to have only one layer of configuration: a CLAUDE.md file with some instructions and context. Maybe some rules. That is the behavioral layer. It tells the agent how to act, what conventions to follow, what to avoid.&lt;/p&gt;

&lt;p&gt;The behavioral layer is not the spec. And on its own, it is not enough.&lt;/p&gt;

&lt;p&gt;The spec is a separate artifact. It describes what you are building: the architecture, the API contracts, the data models, the user flows. When Karpathy suggests it is "basically the docs," he means a document detailed enough that an agent can build from it without asking you a bunch of clarifying questions (or even worse, making generalized guesses).&lt;/p&gt;

&lt;p&gt;I built a tool called &lt;a href="https://www.purecontext.dev/showcases/forge" rel="noopener noreferrer"&gt;Forge&lt;/a&gt; that generates these specs either from scratch, from existing product documentation, or as a retrofit from existing codebases. It gathers information, or reads from the provided sources, analyzes the structure, and produces detailed planning documents: product specs, feature scoping matrices, API mappings, test plans. Other tools do similar things. Spec-kit, Superpowers, GSD. The ecosystem is growing because the need is real.&lt;/p&gt;

&lt;p&gt;Even though I built one of these tools, I still recommend that teams build their own. I will explain why below.&lt;/p&gt;

&lt;h2&gt;
  
  
  The four layers in practice
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: The spec.&lt;/strong&gt; Generated or hand-written, this is the detailed plan. Architecture, contracts, data models. The agent builds from this. If the spec is vague, the output is vague.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2: The workflows.&lt;/strong&gt; A spec sitting in your repo is just a document. What makes it useful are the skills built around it. Skills that generate the spec, reference it when building new features, and check for drift when the architecture changes. The spec is the artifact; the workflows are the muscle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 3: The behavioral config.&lt;/strong&gt; CLAUDE.md and rules. This tells the agent how to behave while building. Code conventions, testing requirements, commit message format, what to avoid. I have rules that enforce things like "always use Tailwind design tokens" and "each file has a single responsibility." These are not the spec. They are the guardrails.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 4: The mechanical enforcement.&lt;/strong&gt; Hooks are the nervous system. They fire automatically on events: before a commit, after a file edit, when a session starts. A rule says "run tests before committing." A hook actually prevents the commit if the tests fail. The difference between a suggestion and a gate.&lt;/p&gt;
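
&lt;p&gt;To make the rule-versus-gate distinction concrete with something generic, here is what a mechanical gate can look like as a plain git pre-commit hook written in Python. Agent-specific hook formats vary; this is just the shape of the idea.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#!/usr/bin/env python3
# A mechanical gate, illustrated with a plain git pre-commit hook
# (saved as .git/hooks/pre-commit and made executable). A rule asks
# for tests; this actually blocks the commit when they fail.
import subprocess
import sys

result = subprocess.run(["pytest", "-q"])
if result.returncode != 0:
    print("Tests failed. Commit blocked.")
    sys.exit(1)  # any nonzero exit aborts the commit
&lt;/code&gt;&lt;/pre&gt;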

&lt;p&gt;Layers 3 and 4 are the ones that exist in most setups. Layers 1 and 2 are where the leverage is. The agents that produce consistently good work have all four.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "just prompting" breaks down
&lt;/h2&gt;

&lt;p&gt;Before I realized the importance of a thorough spec planning phase, I started building one of my projects by jumping right into prompting. I would describe what I wanted, the agent would produce code, I would correct what it got wrong, and we would iterate. For the first few features, it worked. The codebase was small enough that the agent could hold the full picture in context.&lt;/p&gt;

&lt;p&gt;Then the project grew. More services, more state, more API surface. The agent started making assumptions that conflicted with decisions from three sessions ago. It would invent data models that did not match the ones we already had. I was spending more time correcting than building. My view layer was a complete mess. If you do not establish good component and styling patterns, you will experience chaos.&lt;/p&gt;

&lt;p&gt;That is when I built Forge. Not because I wanted to build a tool, but because I needed a better spec. I needed a structured document that captured the architecture, the contracts, the data models, and the decisions I had already made. Once I had that and retrofitted the project with it, using it to create guardrails and standards, the difference was immediate. The agent stopped guessing. The output aligned with the real architecture. The correction loop that was eating my time nearly disappeared.&lt;/p&gt;

&lt;p&gt;The difference is not just quality. It is speed. Once you have a spec and establish rules to give guidance, the agent moves fast and stays on track. Without one, on a complex project you will likely spend more time correcting the agent than you would have spent writing the thing yourself. Even worse, you will not have a very consistent or orderly codebase.&lt;/p&gt;

&lt;h2&gt;
  
  
  "Build your own" is the real lesson
&lt;/h2&gt;

&lt;p&gt;When I retrofitted my project with Forge, I did not just generate a spec and call it done. I built skills around it. Skills that regenerate planning documents when the codebase changes. Skills that cross-reference the spec when building new features. The spec became the center of a workflow, not a one-time artifact.&lt;/p&gt;

&lt;p&gt;That is the part I think matters most. An adopted spec tool gives you structure. A spec tool you built yourself gives you structure that reflects your judgment. Your opinions are the value.&lt;/p&gt;

&lt;p&gt;When I first started configuring agent behavior, I front-loaded everything. I had dozens of rules, all active at all times, and many of them were written for situations that had not come up yet. It felt productive and thorough.&lt;/p&gt;

&lt;p&gt;It was not. Every token in a rule pays rent on every API call. A rule that fires once per session but loads on every turn is wasting context that could hold the actual work. The discipline is: write rules when the need is demonstrated, not when you imagine it might arise. Start with a problem you observed, then write the rule that prevents it from happening again.&lt;/p&gt;

&lt;h2&gt;
  
  
  The spec is a conversation
&lt;/h2&gt;

&lt;p&gt;Karpathy says "work with your agent to design a spec." That word "with" is important. The spec is not something you hand down from above; it is something you build collaboratively. You start with a rough shape and the agent fills in the detail. You iterate, and the spec gets sharper with each pass.&lt;/p&gt;

&lt;p&gt;This is how I work every day. I do not write specs from scratch. I use tools to generate a first pass from the existing code, then I iterate with the agent. My corrections become part of the spec's history. The spec is a living document that keeps getting better as more information becomes available.&lt;/p&gt;

&lt;p&gt;That collaborative loop is what separates spec-driven development from just writing a really detailed prompt. A prompt is static. A spec evolves.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means for your workflow
&lt;/h2&gt;

&lt;p&gt;If you are building with AI agents and you do not have a spec layer, start with one thing: generate a structured analysis of the codebase you are working on. Not a summary. A full inventory: modules, dependencies, API surface, data models. Then give your agent that document as context before you ask it to build anything else.&lt;/p&gt;
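
&lt;p&gt;If you want a starting point, here is a minimal sketch of that first pass for a Python codebase. It is a seed for the spec, not the spec itself.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Minimal sketch of a first-pass inventory for a Python codebase:
# every module and what it imports. A real spec also needs the API
# surface and data models; this is just the seed document.
import ast
import json
from pathlib import Path

def inventory(repo_root):
    report = {}
    for path in Path(repo_root).rglob("*.py"):
        tree = ast.parse(path.read_text(encoding="utf-8"))
        imports = sorted({
            alias.name
            for node in ast.walk(tree)
            if isinstance(node, (ast.Import, ast.ImportFrom))
            for alias in node.names
        })
        report[str(path)] = imports
    return report

print(json.dumps(inventory("."), indent=2))
&lt;/code&gt;&lt;/pre&gt;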

&lt;p&gt;You will notice the difference immediately. The agent stops asking clarifying questions. The output aligns with the real architecture instead of inventing its own. And when the agent does get something wrong, you can point to the spec and say "this is what we agreed on," which makes the correction precise instead of vague.&lt;/p&gt;

&lt;p&gt;The spec is not extra work. It is the work that makes all the other work faster.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>agenticengineering</category>
      <category>programming</category>
    </item>
    <item>
      <title>When to NOT Use AI</title>
      <dc:creator>Jeff Reese</dc:creator>
      <pubDate>Fri, 01 May 2026 15:24:39 +0000</pubDate>
      <link>https://dev.to/jeffreese/when-to-not-use-ai-2dhk</link>
      <guid>https://dev.to/jeffreese/when-to-not-use-ai-2dhk</guid>
      <description>&lt;p&gt;&lt;em&gt;AI in Practice, No Fluff — Day 10/10&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Last week I needed to generate cover images for a blog series. Ten posts, two sizes each. I opened an AI design tool, described what I wanted, and waited.&lt;/p&gt;

&lt;p&gt;The results were unusable. Garbled text, wrong colors, layouts that ignored every parameter I gave it. I spent an hour trying different prompts, adjusting descriptions, regenerating. Nothing worked.&lt;/p&gt;

&lt;p&gt;Then I wrote an HTML template. Loaded our exact fonts, plugged in the hex colors, added a CSS gradient. Rendered 20 images in under a minute. Every one was exactly right on the first pass.&lt;/p&gt;

&lt;p&gt;That is the moment this post is about. Not the failure of AI image generation (it will get better), but the instinct to reach for AI when a simpler tool would have worked from the start.&lt;/p&gt;

&lt;p&gt;The first series was about which AI to use. This one has been about how to use it well. Today is about when not to use it at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  The hammer problem
&lt;/h2&gt;

&lt;p&gt;This series has spent nine days teaching you techniques. Few-shot prompting. Chain-of-thought reasoning. Structured output. Tool use. Embeddings. RAG. Serious tools for actual problems.&lt;/p&gt;

&lt;p&gt;The risk now is the hammer problem. When you have spent time learning what AI can do, the instinct is to use it for everything. That instinct will be right much of the time, but it's good to know when you actually need a screwdriver.&lt;/p&gt;

&lt;h2&gt;
  
  
  When code is the better answer
&lt;/h2&gt;

&lt;p&gt;There is a test I use. I call it the 30-line test.&lt;/p&gt;

&lt;p&gt;If you could solve this problem in 30 lines of straightforward code, AI is probably not the right tool. Not because AI cannot do it, but because code will do it faster, more reliably, and without the overhead of prompt engineering. That said, having AI help you write that code is still a great option.&lt;/p&gt;

&lt;p&gt;Here is what that looks like in practice:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deterministic logic.&lt;/strong&gt; If the answer is always the same given the same input, write a function. "Convert this date to ISO format." "Calculate sales tax for this state." "Validate that this string is a valid email address." These are if-then problems. Code does not hallucinate a wrong tax rate. Code does not occasionally decide that "user@.com" looks close enough.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exact matching.&lt;/strong&gt; Pattern matching, lookups, filtering. "Find all rows where the status is 'overdue'." "Extract phone numbers from this text." A regex takes milliseconds and costs nothing. An API call takes seconds and costs money. The regex will be right every time.&lt;/p&gt;
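
&lt;p&gt;The phone number case really is just a few lines. The pattern below is a simplified US-style match, not a universal one:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Deterministic extraction: a regex, not a model call.
import re

text = "Call 555-867-5309 or (212) 555-0142 before Friday."
phone_pattern = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")
print(phone_pattern.findall(text))   # ['555-867-5309', '(212) 555-0142']
&lt;/code&gt;&lt;/pre&gt;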

&lt;p&gt;&lt;strong&gt;Math.&lt;/strong&gt; Spreadsheets exist. I've watched people paste data into ChatGPT to calculate averages. The model will probably get it right. "Probably" is the problem. When you need exact answers, use exact tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Formatting and templates.&lt;/strong&gt; If you need the same output structure every time with different data plugged in, that is a template engine, not a language model. The cover image problem from my opening was exactly this. I did not need creativity. I needed precision and repetition.&lt;/p&gt;
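
&lt;p&gt;For the cover image case, the template-plus-render step can be a short script. Here is a rough sketch using Playwright to screenshot a filled-in HTML template; the markup, colors, and dimensions are placeholders, not my actual setup.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# One way to do the template step: fill an HTML template, then
# screenshot it with Playwright. Markup, colors, and dimensions
# are placeholders, not the exact setup from the story above.
from string import Template
from playwright.sync_api import sync_playwright

PAGE = Template(
    "&amp;lt;body style='margin:0;background:$bg;color:#fff;font-family:Arial'&amp;gt;"
    "&amp;lt;h1 style='font-size:56px;padding:120px 60px'&amp;gt;$title&amp;lt;/h1&amp;gt;&amp;lt;/body&amp;gt;"
)

html = PAGE.substitute(bg="#16213e", title="When to NOT Use AI")

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page(viewport={"width": 1000, "height": 420})
    page.set_content(html)
    page.screenshot(path="cover.png")
    browser.close()
&lt;/code&gt;&lt;/pre&gt;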

&lt;h2&gt;
  
  
  When AI is the right tool
&lt;/h2&gt;

&lt;p&gt;The flip side is just as important. There are problems where writing the code would be either impossible or absurdly expensive, and AI handles them naturally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ambiguity.&lt;/strong&gt; When the input doesn't have clean structure and you need to make sense of it anyway. A customer writes "this thing broke again smh" and you need to classify it as a billing issue, a technical issue, or a feature request. Good luck writing that with if-then rules. An LLM reads the intent behind the words.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Natural language.&lt;/strong&gt; Summarizing a 20-page document. Translating between languages with cultural nuance. Writing a professional reply to a frustrated customer. These are language tasks, and language models are built for them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Judgment calls.&lt;/strong&gt; "Is this resume a good fit for this role?" "Does this code review comment sound too harsh?" "Should this support ticket be escalated?" These are decisions with gray areas, where reasonable people would disagree. AI handles gray areas well because it was trained on millions of examples of human judgment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creative variation.&lt;/strong&gt; Brainstorming product names. Generating test data that feels realistic. Writing variations of marketing copy to A/B test. When you need variety and exploration, not precision and repetition.&lt;/p&gt;

&lt;h2&gt;
  
  
  The hybrid pattern
&lt;/h2&gt;

&lt;p&gt;The best systems I have built use both. AI for the fuzzy step, code for the precise one.&lt;/p&gt;

&lt;p&gt;Here is a real example. I built a system that processes incoming messages and routes them to the right handler. The routing decision is fuzzy. A message about "can't log in" might be an authentication issue, a password reset, or a session timeout. AI classifies the intent. Once the intent is classified, code takes over. Code routes to the correct handler, updates the database, sends the confirmation email. The fuzzy step needed judgment. Everything after it needed reliability.&lt;/p&gt;
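
&lt;p&gt;The shape of that system, compressed into a sketch. The Anthropic SDK handles the fuzzy step here; the model id and handler functions are placeholders, not my actual routing code.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hybrid sketch: the model classifies intent, plain code routes it.
# The model id and handlers are placeholders.
import anthropic

def handle_auth(msg): print("auth handler:", msg)
def handle_password_reset(msg): print("password reset handler:", msg)
def handle_other(msg): print("general queue:", msg)

HANDLERS = {"authentication": handle_auth, "password_reset": handle_password_reset}

def classify(message):
    # Fuzzy step: the model reads intent. Keep the allowed labels explicit.
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=10,
        messages=[{"role": "user", "content":
                   "Classify this support message as exactly one of: "
                   "authentication, password_reset, other.\n\n" + message}],
    )
    return response.content[0].text.strip().lower()

def route(message):
    # Precise step: deterministic dispatch. Database updates and emails
    # belong in the handlers, where behavior is guaranteed.
    HANDLERS.get(classify(message), handle_other)(message)

route("can't log in after changing my password")
&lt;/code&gt;&lt;/pre&gt;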

&lt;p&gt;The memory system from yesterday's post is another one. Semantic search uses embeddings to find entries by meaning, not just keywords. AI powers the search. The storage, retrieval, indexing, and deduplication are all code. I would never trust a language model to manage a database. I would absolutely trust it to understand what I am looking for.&lt;/p&gt;

&lt;p&gt;The pattern is the same every time. AI handles the parts that require understanding. Code handles the parts that require guarantees.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 30-line test, revisited
&lt;/h2&gt;

&lt;p&gt;I want to come back to this because it is the most practical takeaway in the post.&lt;/p&gt;

&lt;p&gt;Before reaching for AI, ask: could I solve this in about 30 lines of straightforward code? If yes, write the code. It will be faster to write, faster to run, cheaper to operate, and more reliable to maintain.&lt;/p&gt;

&lt;p&gt;If the answer is no, if the problem involves natural language, ambiguity, judgment, or creative variation, AI is probably the right tool. You now have nine days of techniques to apply.&lt;/p&gt;

&lt;p&gt;If the answer is "sort of," if some parts are straightforward and some parts are fuzzy, you are looking at a hybrid. Let AI handle the fuzzy step. Let code handle the rest.&lt;/p&gt;

&lt;h2&gt;
  
  
  The series, in perspective
&lt;/h2&gt;

&lt;p&gt;Ten days ago, this series opened with few-shot prompting. Show, do not describe. That was a technique.&lt;/p&gt;

&lt;p&gt;Today we end with judgment. When to apply the techniques, and when to close the chat window and write the code instead.&lt;/p&gt;

&lt;p&gt;That isn't something a tutorial teaches. It comes from building things, watching what works, and being honest about what doesn't. From getting excited about a tool and then catching yourself before you over-apply it. (I still catch myself. The cover image hour was last week.)&lt;/p&gt;

&lt;p&gt;Getting better at using AI isn't just about using it for everything. It's also about knowing when not to use it.&lt;/p&gt;




&lt;p&gt;If there is anything I left out or could have explained better, tell me in the comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>beginners</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Long Context vs RAG: When to Load the Whole Book</title>
      <dc:creator>Jeff Reese</dc:creator>
      <pubDate>Thu, 30 Apr 2026 14:30:28 +0000</pubDate>
      <link>https://dev.to/jeffreese/long-context-vs-rag-when-to-load-the-whole-book-3iif</link>
      <guid>https://dev.to/jeffreese/long-context-vs-rag-when-to-load-the-whole-book-3iif</guid>
      <description>&lt;p&gt;&lt;em&gt;AI in Practice, No Fluff — Day 9/10&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I have a project where every conversation and decision gets saved as a journal entry. Hundreds of entries, accumulated over weeks. When I need context from a previous session, I have two options: load every single entry into the AI's context window and ask my question, or use the embedding-based search from yesterday's post to retrieve just the relevant entries and pass only those in.&lt;/p&gt;

&lt;p&gt;Both work, and each has its tradeoffs. The choice between them is one of the most important architectural decisions in AI applications right now.&lt;/p&gt;

&lt;p&gt;In the first series, we covered &lt;a href="https://purecontext.dev/blog/context-windows" rel="noopener noreferrer"&gt;context windows&lt;/a&gt; (there is always a limit) and &lt;a href="https://purecontext.dev/blog/rag" rel="noopener noreferrer"&gt;RAG&lt;/a&gt; (retrieve relevant information before generating a response). Today is where those two concepts collide. Context windows have gotten dramatically larger since that series. The question is no longer "can the AI hold all of this?" It often can. The question is whether it should.&lt;/p&gt;

&lt;h2&gt;
  
  
  The context window got big
&lt;/h2&gt;

&lt;p&gt;Less than a year ago, 200,000 tokens was considered large. That has changed. As of early 2026, Claude offers a 1 million token context window. Gemini 2.5 Pro supports 2 million tokens. GPT-4.1 handles 1 million.&lt;/p&gt;

&lt;p&gt;To put that in perspective, 1 million tokens is roughly 750,000 words. That is longer than the entire Lord of the Rings trilogy. You could paste the whole thing in and ask the AI to find every scene where Gandalf loses his temper.&lt;/p&gt;

&lt;p&gt;This changes the calculus completely. For many use cases, the "just load everything" approach is now physically possible where it was not before. The question shifts from "does it fit?" to "is fitting it all in the best approach?"&lt;/p&gt;

&lt;h2&gt;
  
  
  The cost math matters
&lt;/h2&gt;

&lt;p&gt;Loading a large context is not free. Let me walk through what this actually costs.&lt;/p&gt;

&lt;p&gt;Say you have a 500-page internal knowledge base. That is roughly 250,000 tokens. You want an AI assistant that answers questions about it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The long context approach:&lt;/strong&gt; You load the entire knowledge base into every API call. Using Claude Sonnet at $3 per million input tokens, each question costs about $0.75 just for the input context. If your team asks 100 questions a day, that is $75 per day, or roughly $2,250 per month. Just for input tokens, before counting the responses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The RAG approach:&lt;/strong&gt; You embed the knowledge base once (a few cents for the whole thing), store the vectors, and retrieve the 10 most relevant chunks per question. That is maybe 2,000 tokens of retrieved context per query. At the same $3 per million rate, each question costs $0.006 for input. One hundred questions a day is $0.60. Per day. The monthly cost is under $20.&lt;/p&gt;

&lt;p&gt;The difference is over 100x. At scale, this is the difference between a feature that is economically viable and one that is not.&lt;/p&gt;

&lt;p&gt;Prompt caching changes this math significantly. Claude's cached input rate drops to $0.30 per million tokens, a 90% reduction. If your knowledge base does not change between calls, caching can bring that $2,250 monthly cost down to around $225. That is much more reasonable, but still 10x what RAG costs.&lt;/p&gt;
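
&lt;p&gt;If you want to check the math yourself, here it is spelled out:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# The arithmetic from above, spelled out. Prices are the per-million
# input rates quoted in the post; adjust for whatever model you use.
PRICE = 3.00 / 1_000_000           # dollars per input token
CACHED_PRICE = 0.30 / 1_000_000    # cached input rate
QUESTIONS_PER_DAY = 100
DAYS = 30

long_context = 250_000 * PRICE * QUESTIONS_PER_DAY * DAYS
cached = 250_000 * CACHED_PRICE * QUESTIONS_PER_DAY * DAYS
rag = 2_000 * PRICE * QUESTIONS_PER_DAY * DAYS

print(f"long context: ${long_context:,.0f}/month")   # $2,250
print(f"with caching: ${cached:,.0f}/month")          # $225
print(f"rag:          ${rag:,.2f}/month")             # $18.00
&lt;/code&gt;&lt;/pre&gt;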

&lt;h2&gt;
  
  
  The attention problem that might not be apparent
&lt;/h2&gt;

&lt;p&gt;Here is the part that matters more than cost. Even when you can fit everything into the context window, the AI does not treat all of it equally.&lt;/p&gt;

&lt;p&gt;Research from Stanford and others documented what they call the "lost in the middle" problem. When you give an AI a large amount of context, it pays the most attention to the beginning and the end. Information in the middle gets significantly less attention. In one study, accuracy dropped by over 30% when the relevant information was placed in the middle of 20 documents compared to being placed first.&lt;/p&gt;

&lt;p&gt;This is not a minor edge case. It is a structural property of how transformer models work. Each token in the input can only attend to tokens that came before it, so tokens at the beginning accumulate attention from every subsequent token, while tokens near the end sit closest to the point where the next word is generated and benefit from recency. Tokens in the middle get neither advantage. The result is a U-shaped attention curve: strong at the start, strong at the end, weaker in the middle.&lt;/p&gt;

&lt;p&gt;Models have improved significantly, but the U-shaped attention pattern has not disappeared entirely. Techniques like placing important instructions in the system prompt help. If you dump 500 pages into the context and your answer is on page 247, the AI might miss it. Not because it cannot see it. Because it is not paying enough attention to that region.&lt;/p&gt;

&lt;p&gt;RAG sidesteps this entirely. When you retrieve 5 relevant chunks and pass only those to the model, everything in the context is relevant. There is no middle to get lost in.&lt;/p&gt;

&lt;h2&gt;
  
  
  When long context wins
&lt;/h2&gt;

&lt;p&gt;Long context is not just a brute-force option. There are cases where it is genuinely the better choice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One-off analysis of a focused document.&lt;/strong&gt; If someone hands you a 100-page contract and says "summarize the key obligations," loading the whole thing makes sense. You need the full context to understand how clauses reference each other. There is no retrieval step because you need all of it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-referencing across a full document.&lt;/strong&gt; Questions like "are there any contradictions between section 3 and section 7?" require the model to see both sections simultaneously. RAG might retrieve one section but miss the other, because the query does not match both. Long context lets the model find connections you did not think to ask about.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Codebases and structured documents.&lt;/strong&gt; When the material has internal references (code that calls other code, a specification where section 4 depends on definitions in section 2), long context preserves those relationships. Chunking for RAG can break them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prototyping and exploration.&lt;/strong&gt; When you are not sure what questions you will ask, loading everything lets you explore freely. RAG requires you to know what you are looking for, at least well enough to write a query.&lt;/p&gt;

&lt;h2&gt;
  
  
  When RAG wins
&lt;/h2&gt;

&lt;p&gt;RAG is the right choice more often than people expect, especially in production systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Large and growing knowledge bases.&lt;/strong&gt; If your data is more than a few hundred pages, or if it grows over time, RAG scales where long context does not. My journal has hundreds of entries. Loading all of them every time I need to recall a single decision would be wasteful and would hit the attention problem hard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repeated queries at scale.&lt;/strong&gt; If you are building a customer support bot that handles thousands of questions a day, RAG is almost certainly the right call. The cost math from earlier makes this clear. Long context at that volume would be prohibitively expensive even with caching.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Citation and traceability.&lt;/strong&gt; RAG systems can tell you exactly which source documents contributed to an answer. The retrieval step creates a natural audit trail. With long context, the model might synthesize an answer from page 12 and page 340, but it will not always tell you that clearly. If your use case requires citations (legal, medical, compliance), RAG gives you this for free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Frequently updated data.&lt;/strong&gt; When your knowledge base changes daily, re-embedding the changed documents is trivial. Re-loading the entire thing into every API call is not.&lt;/p&gt;

&lt;h2&gt;
  
  
  The hybrid approach
&lt;/h2&gt;

&lt;p&gt;Realistically, I use both.&lt;/p&gt;

&lt;p&gt;My journal system uses embeddings and semantic search to find the most relevant entries for a given question. That is RAG. When I start a new session and need to orient myself, I load a curated set of core context directly into the window. That is long context.&lt;/p&gt;

&lt;p&gt;The pattern is simple. Use long context for the stable foundation that the AI always needs. Use RAG for the large, searchable pool that gets pulled in on demand. This is not a compromise. It is usually the best architecture.&lt;/p&gt;

&lt;p&gt;Production AI systems that work well usually do some version of this. The system prompt and key instructions go in the context directly. The knowledge base gets searched and the top results get injected alongside the user's question. You get the reliability of focused context with the breadth of a large knowledge base.&lt;/p&gt;

&lt;h2&gt;
  
  
  The decision framework
&lt;/h2&gt;

&lt;p&gt;If you are trying to decide between the two, start with these questions:&lt;/p&gt;

&lt;p&gt;How much data are you working with? Under 100 pages of focused content, try long context first. It is simpler and you avoid the complexity of building a retrieval pipeline. Over 100 pages or growing, build RAG.&lt;/p&gt;

&lt;p&gt;How often will the data be queried? One-off analysis favors long context. Repeated queries favor RAG, because you are paying that input cost every single time.&lt;/p&gt;

&lt;p&gt;Does the task require seeing everything at once? Cross-referencing, summarization, and contradiction-finding need full visibility. Question answering against a large corpus does not.&lt;/p&gt;

&lt;p&gt;Do you need citations? If yes, RAG. Full stop.&lt;/p&gt;

&lt;p&gt;Is latency a constraint? Long context calls with 500,000 tokens take noticeably longer to process. RAG queries with 2,000 tokens of retrieved context are fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  You are already doing this
&lt;/h2&gt;

&lt;p&gt;If you have ever uploaded a PDF to Claude and asked it a question, you chose the long context approach. If you have ever used a tool that searched your company's docs before answering, you benefited from RAG. You have been making this architectural decision already. The difference is whether you are making it on purpose.&lt;/p&gt;

&lt;p&gt;That 100-page contract? Load the whole thing. The entire Confluence wiki for your 500-person company? Build a retrieval pipeline. The buzzwords are not as scary as they sound. You just did not know the names for the things you were already doing.&lt;/p&gt;

&lt;p&gt;If there is anything I left out or could have explained better, tell me in the comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>machinelearning</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Embeddings: How AI Knows Things Are Similar</title>
      <dc:creator>Jeff Reese</dc:creator>
      <pubDate>Wed, 29 Apr 2026 14:26:05 +0000</pubDate>
      <link>https://dev.to/jeffreese/embeddings-how-ai-knows-things-are-similar-4nb</link>
      <guid>https://dev.to/jeffreese/embeddings-how-ai-knows-things-are-similar-4nb</guid>
      <description>&lt;p&gt;&lt;em&gt;AI in Practice, No Fluff — Day 8/10&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I built a memory system for one of my AI projects. Every conversation and decision gets saved as a journal entry. After a few weeks, I had hundreds of entries. Finding the right one when I needed it was the problem.&lt;/p&gt;

&lt;p&gt;Keyword search was useless. If I searched for "authentication," I would miss the entry where I wrote about "login flow" or "user credentials." The words were different. The meaning was the same. I needed something that could match on meaning, not just spelling.&lt;/p&gt;

&lt;p&gt;That something is embeddings.&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://purecontext.dev/blog/rag" rel="noopener noreferrer"&gt;first series&lt;/a&gt;, I mentioned embeddings as part of how RAG systems prepare your data for retrieval. Today I want to unpack what embeddings actually are, how they work, and why they matter well beyond RAG.&lt;/p&gt;

&lt;h2&gt;
  
  
  A list of numbers that represents meaning
&lt;/h2&gt;

&lt;p&gt;An embedding is a list of numbers. That is it. You send a piece of text to an embedding model, and it returns a list of numbers (called a vector) that represents the meaning of that text.&lt;/p&gt;

&lt;p&gt;The list is long. Depending on the model, it might be 256 numbers, 1,024 numbers, or even more. Each number represents some dimension of meaning that the model learned during training. You do not get to choose what those dimensions mean, and honestly, most of them are not interpretable by humans. The model learned its own internal language for representing concepts.&lt;/p&gt;

&lt;p&gt;Texts with similar meanings get similar numbers.&lt;/p&gt;

&lt;p&gt;"The dog sat on the porch" and "A canine rested on the veranda" would produce vectors that are very close to each other, even though they share zero words. "The stock market crashed" would produce a vector that is far away from both of them.&lt;/p&gt;

&lt;p&gt;The model is not doing keyword matching or looking for shared words. It has learned, from training on massive amounts of text, that certain concepts are related and should be positioned near each other in this numerical space.&lt;/p&gt;

&lt;h2&gt;
  
  
  How similarity actually works
&lt;/h2&gt;

&lt;p&gt;When I say two vectors are "close" or "far apart," I mean something specific. The standard way to measure this is cosine similarity.&lt;/p&gt;

&lt;p&gt;I am not going to walk through the linear algebra. What matters is the intuition: cosine similarity measures the angle between two vectors. If two vectors point in roughly the same direction, they are similar. If they point in different directions, they are not.&lt;/p&gt;

&lt;p&gt;The score ranges from -1 to 1. A score of 1 means identical direction (same meaning). Zero means unrelated. Negative scores mean opposing meanings, though in practice most text embeddings land between 0 and 1.&lt;/p&gt;

&lt;p&gt;When I search my memory system for "how we decided on the database architecture," the system embeds that query, compares its vector against every stored entry's vector, and returns the ones with the highest cosine similarity scores. It finds entries about "choosing SQLite over Supabase" and "why we went with local storage instead of cloud" because those entries point in a similar direction, even though the exact words are completely different.&lt;/p&gt;

&lt;p&gt;That is semantic search. It is the foundation for a surprising number of AI applications.&lt;/p&gt;
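
&lt;p&gt;A minimal sketch of that search step, assuming an &lt;code&gt;embed&lt;/code&gt; helper that wraps whichever embedding model you use:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Minimal sketch of semantic search over stored entries. The embed()
# argument is a stand-in for whichever embedding API you call; it
# should return a list of floats for a piece of text.
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query, entries, embed, top_k=3):
    # entries: list of (text, vector) pairs, embedded ahead of time
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in entries]
    scored.sort(reverse=True)            # highest similarity first
    return scored[:top_k]
&lt;/code&gt;&lt;/pre&gt;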

&lt;h2&gt;
  
  
  What embeddings enable
&lt;/h2&gt;

&lt;p&gt;The most common way people encounter embeddings is through RAG, but embeddings are useful on their own, without a language model involved at all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Semantic search.&lt;/strong&gt; The example I just described. Instead of matching keywords, you match meaning. This is a common way modern search engines, documentation sites, and knowledge bases find relevant results even when your query uses different terminology than the source material.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deduplication.&lt;/strong&gt; If you have a database of support tickets and you want to find near-duplicates, you can embed each ticket and cluster the ones with high similarity. Two tickets that describe the same bug in different words will land close together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Classification and clustering.&lt;/strong&gt; Embed a set of documents and group them by similarity. Customer feedback sorts itself into themes without you defining the categories upfront. Product reviews cluster into topics. The structure emerges from the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anomaly detection.&lt;/strong&gt; If most of your data points cluster together but one sits far away, that outlier might be worth investigating. Fraud detection, content moderation, and quality control all use this pattern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommendation.&lt;/strong&gt; "If you liked this article, here are similar ones." Embed the articles, find the nearest neighbors to the one the user just read. This can complement the collaborative filtering you may already have in place.&lt;/p&gt;

&lt;h2&gt;
  
  
  The part that changes how you think about code
&lt;/h2&gt;

&lt;p&gt;Here is where this gets practical for anyone who writes software or works with data.&lt;/p&gt;

&lt;p&gt;Anywhere you have fuzzy matching logic in code, embeddings might be a better solution. I mean the kind of code where you are trying to determine if two strings are "close enough" to be considered the same thing.&lt;/p&gt;

&lt;p&gt;Think about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A customer types "NYC" and you need to match it to "New York City" in your database&lt;/li&gt;
&lt;li&gt;Searching product descriptions when the user's query does not match your exact product names&lt;/li&gt;
&lt;li&gt;Matching job postings to resumes when the terminology differs between industries&lt;/li&gt;
&lt;li&gt;Finding related articles when titles and tags do not overlap&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Traditional approaches use Levenshtein distance, regex patterns, synonym lists, or elaborate normalization pipelines. They work until they do not. Every edge case requires another rule. The rule list grows. Maintenance becomes painful.&lt;/p&gt;

&lt;p&gt;Embeddings can often match or beat those results with far less code: embed both strings, compute cosine similarity, threshold at a score you choose. The matching is based on meaning, not character patterns. "NYC" and "New York City" are close. "I need to fix a bug in my Python code" and "there is an error in my script" are close. No lookup table required.&lt;/p&gt;

&lt;p&gt;This is not hypothetical for me. I replaced keyword-based search in my own memory system with embedding-based search and the improvement was immediate. Queries that returned nothing before started finding exactly the right entries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Embedding models are not language models
&lt;/h2&gt;

&lt;p&gt;This is a distinction worth understanding. When you use ChatGPT or Claude, you are using a language model. It generates text, reasons through problems, and holds conversations.&lt;/p&gt;

&lt;p&gt;An embedding model does one thing: it converts text into a vector. It does not generate text or have conversations. It is a different kind of model, trained specifically to produce useful numerical representations of meaning.&lt;/p&gt;

&lt;p&gt;You can use embedding models from OpenAI, Google, Voyage AI, Cohere, and others. Some are general purpose. Some are optimized for specific domains like code, legal documents, or financial text. The choice of model matters because different models capture different nuances. A model trained heavily on code will produce better embeddings for code search than a general-purpose model.&lt;/p&gt;

&lt;p&gt;The cost is also dramatically different from language models. Embedding a million tokens of text might cost a few cents. Generating a million tokens of text with a language model costs dollars. Embeddings are cheap to produce and cheap to store.&lt;/p&gt;

&lt;h2&gt;
  
  
  The practical tradeoffs
&lt;/h2&gt;

&lt;p&gt;Embeddings are not magic. A few things worth knowing before you reach for them:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You need to embed everything upfront.&lt;/strong&gt; Before you can search your data semantically, every piece of text needs to be converted to a vector and stored. For a small dataset, this is trivial. For millions of documents, it takes planning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Embedding quality depends on the model.&lt;/strong&gt; A model that was not trained on your domain might produce mediocre representations of your specific terminology. If you work in a specialized field, test a few models before committing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vectors are opaque.&lt;/strong&gt; You cannot look at a vector and understand what it means. If the similarity score is wrong, debugging is harder than with keyword search. You cannot just add a synonym to fix it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context length matters.&lt;/strong&gt; Most embedding models have a maximum input length. If you need to embed a 50-page document, you will need to chunk it into smaller pieces first. How you chunk affects quality. This is where the nuance lives in production systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this leads
&lt;/h2&gt;

&lt;p&gt;Tomorrow: the question that ties this all together. You have a million-token context window. You have embeddings that let you search semantically. When should you load the whole book into the context, and when should you retrieve just the relevant pieces? That is the RAG decision, and it is one of the most important architectural choices in AI applications right now.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>embeddings</category>
      <category>machinelearning</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Tool Use: Giving AI Hands</title>
      <dc:creator>Jeff Reese</dc:creator>
      <pubDate>Tue, 28 Apr 2026 13:59:02 +0000</pubDate>
      <link>https://dev.to/jeffreese/tool-use-giving-ai-hands-4okk</link>
      <guid>https://dev.to/jeffreese/tool-use-giving-ai-hands-4okk</guid>
      <description>&lt;p&gt;&lt;em&gt;AI in Practice, No Fluff — Day 7/10&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I use Supabase for a few of my projects, and I regularly ask my AI for help with configuration. At first, the answers kept coming back subtly wrong. Not hallucinated, just outdated. The API had changed, a config option had moved, or a default had been updated. The AI was confident and technically coherent, but it was working from training data that was six months behind the documentation.&lt;/p&gt;

&lt;p&gt;Then I started asking it to look up the current docs before answering. One extra sentence in my prompt, and the answers got accurate. What changed was not the model, it was that the AI made a tool call: it searched the web, read the current documentation, and used that instead of its stale training data.&lt;/p&gt;

&lt;p&gt;That is tool use. The AI reaches outside of itself to get information or take action it could not do from memory alone.&lt;/p&gt;

&lt;h2&gt;
  
  
  The loop
&lt;/h2&gt;

&lt;p&gt;In the first series, we covered &lt;a href="https://purecontext.dev/blog/what-is-an-agent-and-do-i-actually-need-one" rel="noopener noreferrer"&gt;agents&lt;/a&gt; and &lt;a href="https://purecontext.dev/blog/what-is-mcp-and-why-should-i-care" rel="noopener noreferrer"&gt;MCP&lt;/a&gt;. Those posts explained what tools are and how they connect. This post goes one level deeper: how tool use actually works when you are building something.&lt;/p&gt;

&lt;p&gt;The mechanism is a loop with four steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You send a message to the AI, along with a list of tools it is allowed to use.&lt;/li&gt;
&lt;li&gt;The AI reads your message, decides it needs to use a tool, and responds with a tool request instead of a final answer. That request includes the tool name and the specific inputs it wants to pass.&lt;/li&gt;
&lt;li&gt;Your code executes the tool (checks the docs, queries the database, calls the API) and sends the result back to the AI.&lt;/li&gt;
&lt;li&gt;The AI reads the result and gives you its answer.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The important part is step 3. The AI never executes the tool itself. It requests, you execute, you return the result. The AI is making the decision about which tool to use and what inputs to pass, but your application controls what actually happens. That separation is the safety model. You decide what the AI can touch.&lt;/p&gt;
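
&lt;p&gt;Here is the loop in code, using the Anthropic Python SDK as one concrete example. The tool, its stub handler, and the model id are placeholders, not a real integration.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# The four-step loop, sketched with the Anthropic Python SDK.
# The tool, its stub handler, and the model id are placeholders.
import anthropic

def search_docs(query):
    # Step 3 lives in your code: a real version would hit your search index.
    return "Docs excerpt for: " + query

client = anthropic.Anthropic()
tools = [{
    "name": "search_docs",
    "description": "Searches current product documentation for configuration "
                   "and API details. Use this when the answer may have changed "
                   "since the model's training data.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]

messages = [{"role": "user", "content": "How do I enable row level security?"}]

# Steps 1 and 2: send the message plus the tool list, get a tool request back.
response = client.messages.create(model="claude-sonnet-4-20250514",
                                  max_tokens=1024, tools=tools, messages=messages)

if response.stop_reason == "tool_use":
    request = next(b for b in response.content if b.type == "tool_use")
    result = search_docs(request.input["query"])   # step 3: you execute it
    messages.append({"role": "assistant", "content": response.content})
    messages.append({"role": "user", "content": [{
        "type": "tool_result", "tool_use_id": request.id, "content": result}]})
    # Step 4: the AI reads the result and answers.
    final = client.messages.create(model="claude-sonnet-4-20250514",
                                   max_tokens=1024, tools=tools, messages=messages)
    print(final.content[0].text)
&lt;/code&gt;&lt;/pre&gt;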

&lt;h2&gt;
  
  
  What a tool definition looks like
&lt;/h2&gt;

&lt;p&gt;When you send tools to the API, each one is a JSON object with three parts: a name, a description, and an input schema that defines what parameters the tool accepts.&lt;/p&gt;

&lt;p&gt;The name is what the AI uses to request the tool. The input schema describes the parameters using JSON Schema, the same format used for &lt;a href="https://purecontext.dev/blog/structured-output" rel="noopener noreferrer"&gt;structured output&lt;/a&gt;. But the description is the piece that matters most, and it is the one most people underwrite.&lt;/p&gt;

&lt;p&gt;The AI reads the description to decide whether this tool is relevant to the current request. A tool named &lt;code&gt;check_calendar&lt;/code&gt; with a description of "Checks the calendar" gives the AI almost nothing to work with. A description of "Returns all calendar events for a given date range. Use this before suggesting meeting times to avoid conflicts" tells the AI exactly when to reach for it.&lt;/p&gt;

&lt;p&gt;Early in my exploration of MCP servers, I had a tool that searched a knowledge base. I wasn't sure why, but the AI wasn't calling it. The name was clear, the schema was correct, the tool worked perfectly when called manually. The description said "Searches the knowledge base." I changed it to "Searches internal documentation for answers to technical questions. Use this when the user asks about system behavior, configuration, or troubleshooting steps that would be in the docs." The AI started calling it immediately.&lt;/p&gt;

&lt;p&gt;The description is not metadata. It is an instruction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common tool patterns
&lt;/h2&gt;

&lt;p&gt;Most tools fall into a handful of categories:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Read tools&lt;/strong&gt; retrieve information the AI does not have. Calendar lookups, database queries, file reads, API calls that return data. These are the most common and the safest, since they do not change anything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Write tools&lt;/strong&gt; create or modify something. Sending an email, creating a task, updating a record, writing a file. These need more careful thought about when the AI should be allowed to act autonomously versus asking for confirmation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Search tools&lt;/strong&gt; find relevant information from a larger set. Semantic search over documents, keyword search in a database, web search. The AI decides the query; you execute it and return results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compute tools&lt;/strong&gt; perform calculations or transformations the AI would struggle to do reliably in text. Running code, performing math, converting formats, validating data.&lt;/p&gt;

&lt;p&gt;All of these can work together. Give the AI a read tool for your database, a search tool for your documentation, and a write tool for creating support tickets, and it can handle a customer question end to end: search the docs, check the customer's account, and create a ticket if it cannot resolve the issue.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where tools get wired up
&lt;/h2&gt;

&lt;p&gt;The tool-use loop works the same way regardless of where you set it up, but the setup itself varies:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The API directly.&lt;/strong&gt; You define tools in your API request and handle the execution loop in your code. Most flexible, most work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP servers.&lt;/strong&gt; If you read the &lt;a href="https://purecontext.dev/blog/what-is-mcp-and-why-should-i-care" rel="noopener noreferrer"&gt;MCP post&lt;/a&gt; in the first series, this is where it connects. An MCP server wraps a tool (your calendar, your file system, a database) in the standard protocol. AI tools that support MCP can discover and use these tools without custom code for each one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Desktop, ChatGPT, and other products.&lt;/strong&gt; These wire up tools behind the scenes. When Claude reads a file or ChatGPT browses the web, they are using the same tool-use loop. You just do not see the wiring.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent frameworks and SDKs.&lt;/strong&gt; Tools like Claude's Agent SDK, LangChain, or CrewAI manage the loop for you. You define tools, the framework handles the back-and-forth. Less control, faster setup.&lt;/p&gt;

&lt;h2&gt;
  
  
  When the AI ignores your tool
&lt;/h2&gt;

&lt;p&gt;When the AI does not call a tool you defined, the fix is almost always the description.&lt;/p&gt;

&lt;p&gt;The AI is making a judgment call about whether a tool is relevant to the current request, and it is making that call based on the description you wrote. If the description is vague, the AI will not know when to reach for it. If the description is specific about when and why to use the tool, the AI will call it reliably.&lt;/p&gt;

&lt;p&gt;This is true across providers. I have seen the same pattern with Claude, with OpenAI's function calling, and with open-source models. The description is the decision-maker, and investing the time in rewriting it often solves problems that may seem like they need architectural changes.&lt;/p&gt;

&lt;p&gt;Tomorrow: embeddings. How AI knows that two things mean a similar thing, even when they use completely different words.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>tooluse</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Why Your Prompt Works in ChatGPT But Not in Your App</title>
      <dc:creator>Jeff Reese</dc:creator>
      <pubDate>Mon, 27 Apr 2026 15:17:49 +0000</pubDate>
      <link>https://dev.to/jeffreese/why-your-prompt-works-in-chatgpt-but-not-in-your-app-3g8</link>
      <guid>https://dev.to/jeffreese/why-your-prompt-works-in-chatgpt-but-not-in-your-app-3g8</guid>
      <description>&lt;p&gt;&lt;em&gt;AI in Practice, No Fluff — Day 6/10&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I spent weeks refining a system prompt. It had few-shot examples, chain-of-thought scaffolding, structured output formatting. In the ChatGPT window, it was reliable. Exactly the tone and format I wanted, every time.&lt;/p&gt;

&lt;p&gt;Then I copied it into my application code, hit send through the API, and the response was wrong. The formatting was off, the tone reverted to generic, and the structured JSON I had been getting reliably came back wrapped in a conversational preamble.&lt;/p&gt;

&lt;p&gt;I didn't change the prompt. So what happened?&lt;/p&gt;

&lt;p&gt;This is the moment you realize that the chat interface was silently helping in the background...&lt;/p&gt;

&lt;h2&gt;
  
  
  The invisible work
&lt;/h2&gt;

&lt;p&gt;When you use ChatGPT, Claude.ai, or Gemini through their web interface, you are not just sending a prompt to a model. You are using an application that sits between you and the model, and that application is doing more work than you would expect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System prompts you did not write.&lt;/strong&gt; Every chat interface injects its own system prompt before yours. These instructions shape the model's behavior in ways that feel like "how the AI works" but are actually "how this specific product is configured." The helpful formatting, the safety guardrails, the tendency to use markdown headers and bullet points: much of that comes from the platform's system prompt, not from the model itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your conversation history, managed for you.&lt;/strong&gt; In the first series, we talked about &lt;a href="https://purecontext.dev/blog/context-windows" rel="noopener noreferrer"&gt;context windows&lt;/a&gt; and how conversations get silently truncated when they get too long. The chat interface handles that truncation. It decides what to keep and what to drop. When you move to the API, that is your job. If you send only the current message without the conversation history, the model has no memory of what came before. If you send the full history and it exceeds the context window, you need to decide what gets trimmed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sampling parameters set to defaults you never chose.&lt;/strong&gt; Temperature, top-p, max tokens: these control how creative or deterministic the model's output is. The chat interface picks reasonable defaults. The API hands you the dials and assumes you know what they do. Most of the time the defaults are fine, but when your output feels weirdly random or weirdly flat, this is usually why.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool use happening behind the scenes.&lt;/strong&gt; When ChatGPT searches the web, reads a file, or runs code, it is using tools that are wired up by the application. The model does not inherently know how to browse the internet. The application gives it that ability and handles the execution. In the API, tool use is available, but you define the tools, handle the execution, and return the results yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the API actually gives you
&lt;/h2&gt;

&lt;p&gt;The API is the raw material the chat interface is built from. When you send a request to the API, you get exactly what you ask for. Nothing more.&lt;/p&gt;

&lt;p&gt;That means you send everything: the system prompt, the conversation history, the sampling parameters. You define which tools are available and handle the response format.&lt;/p&gt;

&lt;p&gt;It is harder in the way that cooking from scratch is harder than ordering from a menu. The ingredients are the same. The skill is knowing what the recipe was doing for you. (I ruin most meals trying to figure this out.)&lt;/p&gt;

&lt;p&gt;The first time I made an API call, I sent my carefully crafted prompt as a single user message. No system prompt, conversation history, or sampling parameters. The response read like a completely different AI. Technically it was: same model, zero context. I had been leaning on infrastructure I did not know existed.&lt;/p&gt;

&lt;p&gt;Here is what that infrastructure looks like. In the chat window, you type a message and get a response. Behind the scenes, the application constructs something like this:&lt;/p&gt;

&lt;p&gt;A system prompt (the platform's instructions plus your custom instructions), followed by the full message history (every message you sent and every response the model generated), followed by your latest message. All of that gets sent to the model as a single request. The response comes back, the application formats it, and you see it in the chat window.&lt;/p&gt;

&lt;p&gt;When you build with the API, you construct that same request yourself. If you skip the system prompt, the model has no behavioral instructions. If you skip the message history, the model has no memory. If you send the history but do not manage its length, you will eventually exceed the context window and get an error.&lt;/p&gt;
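&lt;p&gt;A minimal sketch of constructing that request yourself, using the Anthropic Python SDK. The system text, history, parameter values, and model name are all illustrative, not recommendations.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import anthropic

client = anthropic.Anthropic()

# Behavioral instructions the chat window used to write for you.
system_prompt = (
    "You are a support assistant for Acme. Answer in a friendly, concise tone "
    "and format lists as markdown bullet points."
)

# The history the chat window used to manage for you.
history = [
    {"role": "user", "content": "What plans do you offer?"},
    {"role": "assistant", "content": "We offer Basic, Pro, and Enterprise plans."},
]
history.append({"role": "user", "content": "Which one includes SSO?"})

response = client.messages.create(
    model="claude-sonnet-4-5",   # illustrative model name
    max_tokens=500,              # you choose this now
    temperature=0.7,             # and the sampling parameters
    system=system_prompt,        # skip this and the model has no behavioral instructions
    messages=history,            # skip the history and the model has no memory
)
print(response.content[0].text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;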

&lt;h2&gt;
  
  
  Why the same prompt behaves differently
&lt;/h2&gt;

&lt;p&gt;Strip all of that invisible work away, and the same prompt text produces different output. Not because the model is different, but because the context around the prompt is different.&lt;/p&gt;

&lt;p&gt;Three specific things that catch people:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Tone and formatting shifts.&lt;/strong&gt; The chat interface's system prompt typically includes instructions about being helpful, using markdown formatting, and maintaining a conversational tone. Without those instructions, the model's raw output is often less polished. If your application needs a specific tone, you need to specify it in your own system prompt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Structured output breaks.&lt;/strong&gt; In the chat window, the model has been shaped (through the platform's prompting and fine-tuning) to handle format requests gracefully. The API model responds to format instructions too, but without the additional shaping, it is more likely to include commentary around your JSON or deviate from your schema. This is where the &lt;a href="https://purecontext.dev/blog/structured-output" rel="noopener noreferrer"&gt;structured output techniques from the last post&lt;/a&gt; become essential, and where API-level schema enforcement becomes the reliable solution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Context loss.&lt;/strong&gt; One of the most common API mistakes is sending a single message without history. In the chat window, every previous exchange is included automatically. In the API, if you do not send the history, the model treats every request as the start of a new conversation. Your carefully built context from three messages ago is gone.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bridge
&lt;/h2&gt;

&lt;p&gt;What changes is responsibility. Everything the chat window does for you is something you can do yourself, tune to your specific needs, and automate at scale. The prompting skills from this series still apply. You just stop getting the training wheels.&lt;/p&gt;

&lt;p&gt;This is the gate in this series. Everything before this post works in a chat window. Everything after it gets progressively more developer-facing: tool use, embeddings, retrieval, architectural decisions about when to use AI at all. You don't need to be a developer to understand these topics. But understanding them changes what you think is possible.&lt;/p&gt;

&lt;p&gt;Tomorrow: tool use. How AI gets the ability to do things in the real world, not just generate text about them.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>prompting</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Structured Output: When You Need JSON, Not Prose</title>
      <dc:creator>Jeff Reese</dc:creator>
      <pubDate>Fri, 24 Apr 2026 14:07:36 +0000</pubDate>
      <link>https://dev.to/jeffreese/structured-output-when-you-need-json-not-prose-3c0a</link>
      <guid>https://dev.to/jeffreese/structured-output-when-you-need-json-not-prose-3c0a</guid>
      <description>&lt;p&gt;&lt;em&gt;AI in Practice, No Fluff — Day 5/10&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Every morning I get a briefing. An AI agent gathers data from my calendar, my notes, my project list, and a handful of other sources, then returns it all as structured JSON. A template takes that JSON and renders it into a readable dashboard. The specifics of how that pipeline works are not important for this post. What is important: if the JSON comes back wrong, nothing renders.&lt;/p&gt;

&lt;p&gt;Not "renders badly." Nothing. A missing field, an inconsistent key name, a stray sentence of commentary mixed into the data block: any of these breaks the template, and I start my morning staring at an error instead of a dashboard.&lt;/p&gt;

&lt;p&gt;This is the case for structured output in two sentences: when a system, not a human, reads your AI's response, "close enough" stops working. A paragraph that answers the question is fine if you are reading it. It is useless if a program needs to parse it.&lt;/p&gt;

&lt;p&gt;The first time I set up this pipeline, I told the AI to "return the results as JSON." What I got back was mostly JSON, with a conversational preamble and a closing note that the AI thought I might find helpful. Technically generous. Practically broken. The fix took three iterations: show the exact schema, give two examples of correctly formatted entries, and explicitly say "return only the JSON block, no commentary." Once I did that, the output was clean every time.&lt;/p&gt;

&lt;p&gt;That progression is the whole article. Telling an AI &lt;em&gt;what format&lt;/em&gt; you want is not the same as telling it &lt;em&gt;what structure&lt;/em&gt; you want. Format is "give me JSON." Structure is "give me an array of objects with these specific fields, in this order, with these types." The gap between the two is where most structured output problems live.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this fits
&lt;/h2&gt;

&lt;p&gt;In the first series, we covered &lt;a href="https://purecontext.dev/blog/what-makes-a-good-prompt" rel="noopener noreferrer"&gt;what makes a good prompt&lt;/a&gt;: context, task, format, examples. Earlier in this series, we covered &lt;a href="https://purecontext.dev/blog/few-shot-prompting-show-dont-describe" rel="noopener noreferrer"&gt;few-shot prompting&lt;/a&gt; and why showing examples beats describing what you want. Both of those principles apply here. This post focuses them on a specific problem: getting AI to return data you can actually use programmatically, not just read.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://purecontext.dev/blog/debugging-a-prompt" rel="noopener noreferrer"&gt;previous post&lt;/a&gt; covered debugging a prompt when the output keeps missing. One of the four failure modes was "bad format specification." Structured output is the deeper dive into that category: what specifically goes wrong with format, why, and what to do about it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "return JSON" is not enough
&lt;/h2&gt;

&lt;p&gt;When you type "return this as JSON," the AI is generating tokens one at a time, left to right, trying to maintain valid JSON syntax while simultaneously producing useful content. It has no schema enforcement. It is doing two jobs at once: being helpful and being syntactically correct.&lt;/p&gt;

&lt;p&gt;This works fine for simple requests. "Give me a JSON object with name and email" will usually come back clean. The problems start when the structure gets more complex: nested objects, arrays of items that all need consistent fields, specific data types, fields that should be present even when the value is empty.&lt;/p&gt;

&lt;p&gt;The model is not being lazy or difficult. It is generating text in a format it was not specifically optimized for, and every additional structural constraint is one more thing it has to hold in working memory while producing the next token. Field names drift across array items because the model has no checklist; it reconstructs the pattern from context for each new object.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to get reliable structure from a chat
&lt;/h2&gt;

&lt;p&gt;If you are working in the chat window (ChatGPT, Claude.ai, Gemini), you do not have access to API-level schema enforcement. But you can get close with three techniques that stack:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Show the schema, do not describe it.&lt;/strong&gt; This is the few-shot principle applied to structure. Instead of "return a JSON object with fields for name, rating, and summary," write out the actual object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"product"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Example Widget"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"rating"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sentiment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"positive"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"One sentence here."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model will mirror that structure far more reliably than it will interpret a prose description of it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Give two rows, not one.&lt;/strong&gt; A single example shows the shape. Two examples show the pattern. When the model sees two objects with identical field names and types, it treats those fields as mandatory rather than suggestive. This is especially important for arrays where consistency across items is the whole point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Name the constraints explicitly.&lt;/strong&gt; "Every object must have all four fields, even if the value is null." "Use exactly these field names, no variations." "Do not include any text outside the JSON block." These feel redundant after the examples, but they're the ones that catch the edge cases where the model might improvise. Think of them as the guardrails, not the road.&lt;/p&gt;

&lt;p&gt;These three techniques together get you 90% of the way to reliable structured output in a chat window. The remaining 10% is where API features come in.&lt;/p&gt;
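&lt;p&gt;Stacked together, the three techniques might look something like this. The schema, the example rows, and the constraints are hypothetical; it is written as a Python string only so it is easy to reuse, and the reviews placeholder marks where your actual input goes.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# A hypothetical prompt that shows the schema, gives two rows, and names the constraints.
prompt = """Extract the following from each review below and return a JSON array.

Each element must look exactly like these two examples:

[
  {"product": "Example Widget", "rating": 4, "sentiment": "positive",
   "summary": "Sturdy and easy to set up."},
  {"product": "Other Widget", "rating": 2, "sentiment": "negative",
   "summary": "Arrived damaged and support was slow."}
]

Constraints:
- Every object must have all four fields, even if a value is null.
- Use exactly these field names, no variations.
- Return only the JSON array, no text before or after it.

Reviews:
[PASTE REVIEWS HERE]
"""
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;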

&lt;h2&gt;
  
  
  What the APIs actually do
&lt;/h2&gt;

&lt;p&gt;When you move from the chat window to building something with an API, structured output stops being a prompting problem and becomes a feature you can turn on.&lt;/p&gt;

&lt;p&gt;Both Claude and OpenAI (and increasingly other providers) now offer structured output modes that work at a fundamentally different level than prompting. Instead of asking the model to please maintain valid JSON, the API compiles your JSON schema into a grammar that constrains which tokens the model is allowed to generate. The model cannot produce invalid JSON or deviate from your schema, because the generation process only considers tokens that would keep the output valid.&lt;/p&gt;

&lt;p&gt;In Claude's API, you pass your schema in the request configuration. In OpenAI's API, you set the response format to "json_schema" with strict mode enabled.&lt;/p&gt;
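&lt;p&gt;As a sketch of what that looks like on the OpenAI side, with an illustrative model name and schema; the important parts are the &lt;code&gt;json_schema&lt;/code&gt; response format and strict mode.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from openai import OpenAI

client = OpenAI()

# Illustrative schema; strict mode requires every property to be listed as
# required and additionalProperties to be false.
schema = {
    "type": "object",
    "properties": {
        "product": {"type": "string"},
        "rating": {"type": "integer"},
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "summary": {"type": "string"},
    },
    "required": ["product", "rating", "sentiment", "summary"],
    "additionalProperties": False,
}

response = client.chat.completions.create(
    model="gpt-4o-mini",   # illustrative model name
    messages=[{"role": "user", "content": "Summarize this review: ..."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "review_summary", "strict": True, "schema": schema},
    },
)
print(response.choices[0].message.content)   # a JSON string that matches the schema
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;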

&lt;p&gt;The practical difference is significant. With prompt-based approaches, you are asking the model to be disciplined. With structured output features, the infrastructure enforces discipline for you. The model focuses entirely on producing good content; the system handles the structure.&lt;/p&gt;

&lt;p&gt;This is one of the clearest examples of a theme we will return to later in this series: the gap between what you can do in a chat window and what you can do with the API. If reliable structured output matters for your use case, this is one of the strongest reasons to move from the UI to building.&lt;/p&gt;

&lt;h2&gt;
  
  
  Validation is still your job
&lt;/h2&gt;

&lt;p&gt;Even with schema enforcement, validation is not optional. The schema guarantees structure: every field present, correct types, valid JSON. It does not guarantee accuracy. A model can return a perfectly structured object where the "sentiment" field says "positive" for a review that is clearly negative.&lt;/p&gt;

&lt;p&gt;Structure and correctness are different problems. Schema enforcement solves the first one mechanically. The second one is still a judgment the model makes, and it can still be wrong.&lt;/p&gt;

&lt;p&gt;For critical applications, validate the content after you validate the structure. Does the summary actually match the source material? Are the extracted values valid? Is the sentiment label consistent with the text? These checks are the same whether you got the JSON from a chat window or an API with schema enforcement.&lt;/p&gt;
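&lt;p&gt;A content check can be as small as a few heuristics run after the structural validation. Everything in this sketch, the function name, the word list, and the rules, is invented for illustration, not a production validator.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical content checks for one structurally valid review object.
NEGATIVE_HINTS = {"broken", "refund", "terrible", "damaged", "waste"}

def check_review_summary(item, source_text):
    """Return a list of content problems; an empty list means it passed."""
    problems = []
    if item["rating"] not in (1, 2, 3, 4, 5):
        problems.append("rating is outside the 1-5 range")
    words = set(source_text.lower().split())
    if item["sentiment"] == "positive" and words.intersection(NEGATIVE_HINTS):
        problems.append("sentiment says positive but the text reads negative")
    if not item["summary"].strip():
        problems.append("summary is empty")
    return problems
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;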

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;When you need structured output from AI, start by writing the exact structure you want. Literal JSON. Two examples. Explicit field constraints. This is the same show-over-tell principle from earlier in the series, aimed at a specific problem.&lt;/p&gt;

&lt;p&gt;If you need that structure to be bulletproof, the prompting approach has a ceiling. API-level schema enforcement removes that ceiling entirely. When reliable structure matters, this is one of the best reasons to explore what happens on the other side of the chat window.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The model already knows how to generate JSON. Your job is to show it exactly which JSON.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Next up: why the same prompt can behave differently in ChatGPT, Claude.ai, and the API, and what the chat window is doing that you cannot see.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>json</category>
      <category>prompting</category>
      <category>beginners</category>
    </item>
    <item>
      <title>How Rules and Skills Actually Work in Claude Code</title>
      <dc:creator>Jeff Reese</dc:creator>
      <pubDate>Fri, 24 Apr 2026 02:35:52 +0000</pubDate>
      <link>https://dev.to/jeffreese/how-rules-and-skills-actually-work-in-claude-code-25gp</link>
      <guid>https://dev.to/jeffreese/how-rules-and-skills-actually-work-in-claude-code-25gp</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj2uzd7s4girmq4s5khfp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj2uzd7s4girmq4s5khfp.png" alt="How Rules and Skills Actually Work in Claude Code" width="800" height="1000"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you have spent any time configuring an AI coding agent, you have probably figured out that rules and skills are different things. Rules are always loaded. Skills are invoked on demand. Rules handle recognition; skills handle procedure.&lt;/p&gt;

&lt;p&gt;The interesting problems start after you have internalized that distinction and started building on it. I have been working with AI coding tools for over two years now, starting with Windsurf and building progressively more sophisticated systems with Claude Code. Recently I went into the Claude Code source code to understand how these mechanisms actually work at the implementation level. What I found changed how I think about the tradeoff.&lt;/p&gt;

&lt;p&gt;Rules and skills do not just load at different times. They occupy different positions in the system, and the model treats them differently because of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Failure Modes Tell You Everything
&lt;/h2&gt;

&lt;p&gt;The basic distinction is useful, but it becomes powerful when you frame it through failure modes.&lt;/p&gt;

&lt;p&gt;If you miss the moment to act, that is a rule problem. The rule was not in context when the trigger fired, so the agent did not recognize that something should happen. The moment passed silently. A rule that is not present when its trigger fires is a rule that does not exist.&lt;/p&gt;

&lt;p&gt;If you miss a step in how to act, that is a skill problem. The agent recognized the situation but did not have the detailed procedure available. Skills contain the instructions for how to do something, and they only need to be present when that something is actively happening.&lt;/p&gt;

&lt;p&gt;Two failure modes with two different tools to handle them. Every configuration decision becomes a question about which failure mode you are guarding against.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Your Agent Actually Processes Input
&lt;/h2&gt;

&lt;p&gt;Before I walk through the lifecycles, it helps to understand what is happening under the hood when your agent reads your configuration.&lt;/p&gt;

&lt;p&gt;Every time Claude Code sends a request to the model, it sends several things, but two matter most for understanding where your configuration lives: a system prompt and a list of messages. These are architecturally separate inputs, and the model is trained to treat them differently.&lt;/p&gt;

&lt;p&gt;The system prompt contains the behavioral instructions that Anthropic wrote for Claude Code: how to use tools, how to handle permissions, how to format output. This is the directive layer. You do not write this. It is the same for every Claude Code user.&lt;/p&gt;

&lt;p&gt;You might expect your rules to live in that system prompt. They don't. The messages contain everything else: your conversation history, your questions, the tool results, and critically, your CLAUDE.md and rules content. Your rules are injected as the very first message in this conversation, wrapped in a special tag that signals "this is context, not a user question." Skills, when invoked, arrive later in the conversation as additional messages.&lt;/p&gt;

&lt;p&gt;This means your rules are not system-level directives the way Anthropic's instructions are. They are the first thing the model reads in the conversation, which gives them a significant positional advantage, but they live in the same layer as everything else you say. Understanding this distinction matters for how you design your configuration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rules: The Stable Prefix
&lt;/h2&gt;

&lt;p&gt;Rules load at session start and stay cached until compaction. Every rule file from &lt;code&gt;.claude/rules/&lt;/code&gt; and your CLAUDE.md content is read, combined, and injected as the first message in your conversation. This happens before you type anything.&lt;/p&gt;

&lt;p&gt;That first-message position is important for three reasons.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Positional advantage.&lt;/strong&gt; The model reads the entire context on every turn, but its attention is not uniform. Research has documented a U-shaped pattern: the beginning and end of the context get the strongest attention, while the middle gets the weakest. This is called the "lost in the middle" effect. Current models have been trained to mitigate it, so the effect is less dramatic than it was in 2023, but positional advantage is still real.&lt;/p&gt;

&lt;p&gt;Anthropic's own long-context documentation recommends putting queries and instructions at the end of long contexts, after reference material. They are designing around the same attention dynamics.&lt;/p&gt;

&lt;p&gt;Your rules sit at the very beginning of the conversation. Always. Your first message and the first assistant response also benefit from this beginning-of-context advantage. As the conversation grows, an invoked skill lands wherever you happen to be in the session, which in a long conversation means the middle: the weakest attention zone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt caching.&lt;/strong&gt; Your rules do not change between turns. The model processed them on turn one, and since they are identical on turn two, the system can skip reprocessing them. This is prompt caching. It means rules are not just persistent; they are computationally cheap after the first turn. The same content arriving later in the conversation as a skill would not get this benefit, because the content before it may have changed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Never compacted.&lt;/strong&gt; When your conversation gets long enough, Claude Code compacts it: summarizing older messages to free up space. Your rules are never part of this compaction. They are rebuilt from disk rather than compressed with the conversation. Full fidelity, every time, regardless of how long the session runs.&lt;/p&gt;

&lt;p&gt;The cost: every token in a rule pays rent on every turn. A 500-token rule costs 500 tokens of context on every API call. Over a 100-turn session, that single rule consumes 50,000 tokens of context. The cost is invisible, but real.&lt;/p&gt;

&lt;h3&gt;
  
  
  Path-Scoped Rules
&lt;/h3&gt;

&lt;p&gt;There is a mechanism between always-loaded rules and on-demand skills that solves a specific problem neither can handle well.&lt;/p&gt;

&lt;p&gt;Path-scoped rules use YAML frontmatter to specify which files they apply to. The first time you read a matching file, the rule content attaches to the file read result and enters the conversation at that point, the same position a skill would land. It costs zero tokens until that first match. If you never touch a matching file in a session, it never loads.&lt;/p&gt;

&lt;p&gt;The trigger is mechanical. No model judgment needed, no 250-character description to interpret. You read a file, the system checks the glob, and matching rules attach. A new path-scoped rule file created mid-session is picked up on the next matching file read, no restart needed.&lt;/p&gt;

&lt;p&gt;I use this for convention files that only matter when working with specific directories. A rule about project structure loads when I touch files in &lt;code&gt;projects/&lt;/code&gt;. A rule about blog formatting loads when I start iterating on posts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Skills: The Conversation Layer
&lt;/h2&gt;

&lt;p&gt;Skills start small. At session startup, only a listing of available skills loads: each skill's name and a brief description, capped at 250 characters. The full SKILL.md content stays on disk. This listing is how the model knows what skills exist and when to invoke them.&lt;/p&gt;

&lt;p&gt;When a skill is invoked, either by you typing &lt;code&gt;/skill-name&lt;/code&gt; or by the model deciding it is relevant, the full content is read from disk and injected into the conversation as messages. Not as a system prompt. Not as first-position content. As conversation messages that arrive at the current point in the chat, interleaved with everything else.&lt;/p&gt;

&lt;p&gt;This is the key architectural difference. Once invoked, a skill's content is in the conversation, and it stays there. It isn't removed after the turn. It is not temporary. But it is not in the same position as your rules either.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It arrives mid-conversation.&lt;/strong&gt; A skill invoked on turn 5 sits between turn 4 and turn 6. As the conversation grows to turn 50, that skill content is deep in the middle of a long message history, competing with 45 turns of context for the model's attention. Your rules, by contrast, are still at the very beginning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It is subject to compaction.&lt;/strong&gt; When Claude Code compacts the conversation, invoked skills are preserved, but with limits. Each skill is capped at 5,000 tokens post-compaction, and the total budget across all invoked skills is 25,000 tokens. Skills are sorted by how recently they were invoked. If you have invoked more skills than the budget can hold, the oldest ones get dropped.&lt;/p&gt;

&lt;p&gt;This means a rule and a skill with identical content are not equivalent, even after the skill is invoked. The rule persists at full fidelity in first position forever. The skill can be truncated or dropped under pressure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Split Pattern
&lt;/h2&gt;

&lt;p&gt;Understanding the layer difference makes the split pattern more important.&lt;/p&gt;

&lt;p&gt;The most common configuration problem I find is a single rule file trying to do both jobs: recognition and procedure. A few lines of recognition logic at the top ("when this situation arises, do this"), followed by fifty lines of procedural detail. The recognition needs to be always-loaded to catch the trigger. The procedure does not.&lt;/p&gt;

&lt;p&gt;The fix: take the recognition concern and keep it as a rule. Three to five lines, always loaded, positionally advantaged. Take the procedural concern and move it to a skill that loads only when invoked. The recognition fires in the stable prefix. The procedure loads just in time into the conversation.&lt;/p&gt;

&lt;p&gt;I applied this to five rules in my own configuration. The split saved roughly 120 lines of always-loaded context. Same behavior, dramatically less overhead. And now the procedure benefits from being loaded fresh at the point of use, rather than sitting in context for turns where it is irrelevant.&lt;/p&gt;

&lt;p&gt;There is a reliability argument here too, not just a cost one. A skill's own description (capped at 250 characters, sitting somewhere in the conversation messages) is what the model reads when deciding whether to auto-invoke it. That description is competing with everything else in the conversation for attention. A three-line rule in the stable prefix will fire more reliably than a 250-character skill description hoping the model recognizes the situation on its own. If reliable triggering matters, the trigger belongs in a rule regardless of how good the skill description is. The rule catches the moment. The skill delivers the procedure. Each mechanism doing what it is best at.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Decision Framework
&lt;/h2&gt;

&lt;p&gt;When I add new behavior to my system, I ask four questions.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Does this need to be recognized before it is invoked? If yes, the trigger belongs in a rule. Keep it short. Just enough for pattern matching.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Does this require detailed procedural steps? If yes, those steps belong in a skill that loads on demand.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Is this tied to a specific file context? If yes, a path-scoped rule gives you deferred loading with mechanical reliability. No model judgment needed. Zero cost until you first touch a matching file.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Does this need to persist through long sessions at full fidelity? If yes, it belongs in a rule. Skills can be truncated after compaction in extended conversations.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The goal is the right information in context at the right time. Every token should be there because the current task needs it.&lt;/p&gt;

&lt;h2&gt;
  
  
  There Is a Third Piece
&lt;/h2&gt;

&lt;p&gt;Rules handle recognition. Skills handle procedure. There is a third mechanism in Claude Code that handles neither. It handles the things that should happen automatically, without judgment, every single time. No recognition needed, no procedure to follow. Just a deterministic response to a specific event.&lt;/p&gt;

&lt;p&gt;Think of it as the autonomic nervous system of your configuration. Your rules are conscious decisions. Your skills are learned procedures you invoke deliberately. Hooks are your reflexes, the responses that fire without you thinking about them, because if you had to think about them, you would eventually forget.&lt;/p&gt;

&lt;p&gt;That deserves its own post. For now, if you find yourself writing a rule that says "every time X happens, always do Y," you are probably describing a hook, not a rule. The difference between "remember to do this" and "this just happens" is the difference between compliance and architecture.&lt;/p&gt;

&lt;p&gt;If you are interested in the engineering principles behind structuring AI configuration at scale, I wrote about applying &lt;a href="https://purecontext.dev/blog/solid-principles-for-ai-config" rel="noopener noreferrer"&gt;SOLID principles to this problem&lt;/a&gt; in a separate post. The context cost argument and the separation-of-concerns patterns are covered there in depth.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>devtools</category>
      <category>contextengineering</category>
    </item>
    <item>
      <title>Debugging a Prompt: When the Output Keeps Missing</title>
      <dc:creator>Jeff Reese</dc:creator>
      <pubDate>Thu, 23 Apr 2026 15:17:04 +0000</pubDate>
      <link>https://dev.to/jeffreese/debugging-a-prompt-when-the-output-keeps-missing-4kc0</link>
      <guid>https://dev.to/jeffreese/debugging-a-prompt-when-the-output-keeps-missing-4kc0</guid>
      <description>&lt;p&gt;&lt;em&gt;AI in Practice, No Fluff — Day 4/10&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I was helping a friend with a cover letter recently. She had a strong resume, real accomplishments, and a job posting that matched her experience well. I fed everything into Claude and asked it to draft the letter.&lt;/p&gt;

&lt;p&gt;The output was... fine. It hit the right keywords, mentioned the right qualifications, structured everything logically. It also sounded like every cover letter you have ever read and immediately forgotten. "I am excited to bring my extensive experience in project management to your organization." That sentence could belong to literally anyone applying for literally anything.&lt;/p&gt;

&lt;p&gt;So I iterated. "Make it more personal and direct." The result was warmer but still generic. "Match the tone of someone who is confident but not corporate." Better, but it still read like an AI approximating a human approximating professionalism. I went through four rounds of adjusting tone instructions before I stopped and asked a different question: &lt;em&gt;why&lt;/em&gt; is this failing?&lt;/p&gt;

&lt;p&gt;The answer was not in my instructions. It was in my context. I had given the AI her resume and the job posting, but I had not given it anything that showed how she actually communicates. The model had no voice to match, so it defaulted to the genre: cover-letter-ese. The fix was not another tone instruction. It was pasting in a few paragraphs from emails she had written, things where her actual voice came through, and saying "match this register." One change. The next draft sounded like her.&lt;/p&gt;

&lt;p&gt;That experience is the whole article in summary. When a prompt is not working, the instinct is to keep adjusting the instructions. Sometimes that is exactly right. More often, the real fix is somewhere else entirely, and finding it requires a diagnostic approach instead of a guessing one.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this post is and is not
&lt;/h2&gt;

&lt;p&gt;In the first series, we covered &lt;a href="https://purecontext.dev/blog/what-makes-a-good-prompt" rel="noopener noreferrer"&gt;what makes a good prompt&lt;/a&gt;: context, task clarity, format, examples. That post was about composition; how to write a prompt that works. This post picks up where that one leaves off. Your prompt is written. It is not working. Now what?&lt;/p&gt;

&lt;p&gt;Here is the thing worth understanding about where we are right now: models have gotten good enough that prompts rarely produce garbage anymore. The output almost always looks reasonable. The problem has shifted. It is less "this is wrong" and more "this is not what I meant." That subtlety makes debugging harder, not easier, because the failure is easy to miss at first glance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reading the output, not just reacting to it
&lt;/h2&gt;

&lt;p&gt;The first step is the one that gets skipped most often. Before changing anything, read the bad output carefully. Not to judge it. To diagnose it.&lt;/p&gt;

&lt;p&gt;The output is data. It is telling you something about what the model understood, what it prioritized, and where it went off track. A summary that is too long tells you the model did not understand your length constraint, or did not consider it important enough to override its instinct to be thorough. A cover letter that sounds corporate tells you the model defaulted to the genre because you did not provide a voice to match. A code snippet that uses the wrong library tells you the model lacked context about your stack.&lt;/p&gt;

&lt;p&gt;The natural reaction to bad output is "that is wrong." The diagnostic reaction is "that is wrong in a specific way, and the specific way tells me something."&lt;/p&gt;

&lt;h2&gt;
  
  
  Four places a prompt usually breaks
&lt;/h2&gt;

&lt;p&gt;After working through dozens of these debugging cycles, I have found that most prompt failures fall into one of four categories. Knowing which one you are dealing with changes what you do next.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Missing context.&lt;/strong&gt; The model does not have the information it needs to do the job. This is the most common failure and the easiest to fix. The cover letter above was a context problem: the AI had qualifications and job requirements, but no sample of the person's actual voice.&lt;/p&gt;

&lt;p&gt;Signs: the output is technically correct but generic. It fills in gaps with reasonable guesses instead of specific details. It sounds like it is writing about your topic from general knowledge rather than from the material you gave it.&lt;/p&gt;

&lt;p&gt;Fix: add the missing context. Sometimes that means providing more input. Sometimes it means restructuring the input you already have so the important parts are easier for the model to find.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Ambiguous instruction.&lt;/strong&gt; The model understood something different from what you meant. This one is sneaky because the output often looks like the model is being difficult when it's actually being literal.&lt;/p&gt;

&lt;p&gt;"Write a short summary" is ambiguous. Short to you might be three sentences. Short to the model might be two paragraphs. "Summarize this in three sentences" is not ambiguous.&lt;/p&gt;

&lt;p&gt;Signs: the output does something coherent but it is not what you wanted. The model made a choice where you expected a specific behavior. If you re-read your prompt and can see two reasonable interpretations of what you asked for, this is probably the problem.&lt;/p&gt;

&lt;p&gt;Fix: replace the ambiguous instruction with a specific one. If you find yourself writing "no, I meant..." in a follow-up message, the original instruction was ambiguous. Rewrite it so the follow-up is unnecessary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Bad format specification.&lt;/strong&gt; The model got the content right but the shape wrong. You wanted a table and got a list. You wanted JSON and got an essay with JSON embedded in it. You wanted three bullet points and got seven.&lt;/p&gt;

&lt;p&gt;In the first series, we covered that showing examples is one of the most effective prompting techniques. Format problems are where this pays off the most. A prompt that says "return a markdown table with columns for Name, Action, and Deadline" will usually work. A prompt that says "return a markdown table" and includes a two-row example of the exact table shape will almost always work.&lt;/p&gt;

&lt;p&gt;Signs: the information in the output is correct but the structure is wrong. You are spending time reformatting rather than rewriting.&lt;/p&gt;

&lt;p&gt;Fix: add a concrete example of the desired format, or tighten the format specification until there is only one way to interpret it. This is the fastest of the four to fix.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Model limitation.&lt;/strong&gt; The task exceeds what the model can reliably do. This is the rarest of the four, but it is real. Some tasks require capabilities the model does not have: reliable counting, precise arithmetic on large numbers, consistent adherence to complex multi-constraint formatting rules, or knowledge of events after its training cutoff.&lt;/p&gt;

&lt;p&gt;We covered &lt;a href="https://purecontext.dev/blog/hallucinations" rel="noopener noreferrer"&gt;hallucinations&lt;/a&gt; in the first series as one version of this: the model generating confident-sounding information that is not grounded in fact. Model limitation is a broader category. It includes hallucination, but also tasks where the model's architecture makes reliable performance unlikely regardless of how good your prompt is.&lt;/p&gt;

&lt;p&gt;Signs: you have tried multiple clear, well-structured prompts and the output keeps failing in the same fundamental way. The failure is not about clarity or context; it is about capability. Math errors persist even with explicit "show your work" instructions. The model confidently cites a paper that does not exist no matter how you phrase the request.&lt;/p&gt;

&lt;p&gt;Fix: change the approach. Use a calculator for math. Use a search tool for current information. Use code for deterministic logic. These are not tasks that language models are built for; precision and retrieval are not how they work. Understanding that distinction is the real lesson here. Pair the model with tools that cover its weaknesses instead of prompting harder.&lt;/p&gt;
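&lt;p&gt;For instance, counting business days between two dates is a few lines of ordinary code, and the code gets it right every time. This is a sketch: the holiday list is a placeholder, and it assumes the start date comes before the end date.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import date, timedelta

def business_days_between(start, end, holidays=()):
    """Count weekdays after start, up to and including end. Assumes start is before end."""
    day, count = start, 0
    while day != end:
        day = day + timedelta(days=1)
        if day.weekday() not in (5, 6) and day not in holidays:
            count = count + 1
    return count

# Illustrative dates and an empty holiday calendar.
print(business_days_between(date(2026, 4, 6), date(2026, 4, 20)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;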

&lt;h2&gt;
  
  
  One variable at a time
&lt;/h2&gt;

&lt;p&gt;Once you have a hypothesis about which category the failure falls into, the temptation is to rewrite the whole prompt. Resist that.&lt;/p&gt;

&lt;p&gt;Change one thing. Run it again. Read the output.&lt;/p&gt;

&lt;p&gt;If the output improves, you found the right variable. If it does not, you learned that variable was not the problem, and you move to the next one. Either way, you have information you did not have before.&lt;/p&gt;

&lt;p&gt;This sounds obvious. In practice it is surprisingly hard to do. When a prompt is frustrating you, the urge to throw it out and start from scratch feels productive. It's not. Starting over resets your experiment. You lose the diagnostic data from the failed version because now you have no idea which of your changes made the difference.&lt;/p&gt;

&lt;p&gt;The best practice is to change one thing, observe, then decide your next move. It is the same loop whether you are debugging code, debugging a prompt, or debugging a recipe. Isolate the variable. Test. Observe.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to stop iterating
&lt;/h2&gt;

&lt;p&gt;There is a point where you should stop tweaking and reconsider the task itself. Say you are on your fifth or sixth revision and each one has made a minor improvement, but it's still not quite right. At this point, you are spending more time on the prompt than you would have spent just doing the task manually.&lt;/p&gt;

&lt;p&gt;That is a signal. Not necessarily that the prompt cannot work, but that the return on further iteration is diminishing. Three things are usually going on:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The task might be too complex for a single prompt.&lt;/strong&gt; Break it into steps. Have the model do part one, review the output, then feed that into part two. Multi-turn design from the &lt;a href="https://purecontext.dev/blog/multi-turn-conversations" rel="noopener noreferrer"&gt;previous post&lt;/a&gt; is the tool here. What cannot work as one prompt often works beautifully as a conversation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The task might be wrong.&lt;/strong&gt; Sometimes what I think I want is not actually what I need. I have spent twenty minutes trying to get a model to rewrite a paragraph in a specific way, then realized the paragraph should just be cut entirely. The prompt was not failing. My framing of the problem was off.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The task might need a different tool.&lt;/strong&gt; Not every problem is a prompt problem. If you need exact formatting, maybe a template with variable substitution is better than asking a model to hit your format precisely. If you need reliable math, use a spreadsheet. AI is powerful for ambiguity, natural language, and judgment calls. It is not always the right tool for precision, determinism, or retrieval.&lt;/p&gt;

&lt;h2&gt;
  
  
  The reflex
&lt;/h2&gt;

&lt;p&gt;The shift this post is really about is small but it changes the whole experience. When a prompt is not working, the instinct might be to brute-force a fix. Add more words. Rephrase. Hope for the best.&lt;/p&gt;

&lt;p&gt;The better reflex is the one developers use when code does not work. Form a hypothesis about why. Test it. Observe the result. Let the result guide the next hypothesis. No guessing, no hoping, just a loop.&lt;/p&gt;

&lt;p&gt;Hypothesis. Test. Observe. Refine.&lt;/p&gt;

&lt;p&gt;It is not more complicated than that. The hard part is not the technique. The hard part is pausing long enough to read the bad output as diagnostic data instead of just reacting to it.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Your prompts are not conversations. They are experiments. Treat them that way.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Next up: what to do when you need your AI to return structured data instead of prose, and why "give me JSON" is almost never enough.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>prompting</category>
      <category>debugging</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Multi-Turn Conversations: Designing a Back-and-Forth</title>
      <dc:creator>Jeff Reese</dc:creator>
      <pubDate>Wed, 22 Apr 2026 05:58:06 +0000</pubDate>
      <link>https://dev.to/jeffreese/multi-turn-conversations-designing-a-back-and-forth-590i</link>
      <guid>https://dev.to/jeffreese/multi-turn-conversations-designing-a-back-and-forth-590i</guid>
      <description>&lt;p&gt;&lt;em&gt;AI in Practice, No Fluff — Day 3/10&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The first time I sat down to &lt;em&gt;design&lt;/em&gt; a conversation instead of just having one, I wrote a single starter message, pasted it into three different tools, and watched each respond to the exact same prompt. My message was quite a bit longer than the ones I usually write: highly structured, with headers and carefully selected context. I also included a short list of constraints I wanted the model to keep in mind through the whole session.&lt;/p&gt;

&lt;p&gt;The first exchange was better than what I usually got after ten.&lt;/p&gt;

&lt;p&gt;That experiment pushed me from thinking of AI as &lt;em&gt;something I talk to&lt;/em&gt; to thinking of it as &lt;em&gt;something I write for&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The mechanic behind the reframe
&lt;/h2&gt;

&lt;p&gt;In the first series we covered context windows: the fixed-size whiteboard an AI works from during a conversation. That post took the mechanic and asked &lt;em&gt;what do you do when it fills up?&lt;/em&gt; This post takes the same mechanic and asks a different question: &lt;em&gt;what would you do differently if you designed the contents on purpose?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A multi-turn conversation is exactly what it sounds like: a back-and-forth where each message you send and each response from the AI counts as one turn. It is helpful to remember that the AI is not remembering your previous messages. The platform is resending them on each turn.&lt;/p&gt;

&lt;p&gt;Every time you hit send, the tool you are using takes your entire conversation history, packs it into a single request, and sends the whole thing to the model. Your first message. The AI's reply. Your second message. The AI's reply. All of it, concatenated, every time. The model reads the whole block, produces the next response, and hands the new entry back to the platform to append.&lt;/p&gt;

&lt;p&gt;There is no stored state on the model side. No session it is tracking. The model sees the exact same thing whether you are on message three or message thirty: one request containing everything that has happened so far, plus your new message at the bottom.&lt;/p&gt;

&lt;p&gt;This is not a quirk of ChatGPT or Claude or any specific product. It is how the underlying API works. The consumer app is doing the bookkeeping of "who said what, in what order" on your behalf, and resending the transcript every call.&lt;/p&gt;
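&lt;p&gt;If you want to see that bookkeeping in code, here is a minimal sketch of what the consumer app does on your behalf, using the Anthropic Python SDK with an illustrative model name.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import anthropic

client = anthropic.Anthropic()

# The append-only transcript. It is resent in full on every turn.
transcript = []

def send(user_message):
    transcript.append({"role": "user", "content": user_message})
    response = client.messages.create(
        model="claude-sonnet-4-5",   # illustrative model name
        max_tokens=1024,
        messages=transcript,         # the whole history, every single call
    )
    reply = response.content[0].text
    transcript.append({"role": "assistant", "content": reply})
    return reply

send("Here is my opener, with the context and constraints for this session...")
send("Now tighten the second paragraph.")   # the opener gets resent with this turn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;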

&lt;p&gt;Once you internalize that, the conversation stops looking like a conversation. It starts looking like something else entirely: a document you and the AI are co-editing, append-only, that gets re-read in full before every new line is written.&lt;/p&gt;

&lt;h2&gt;
  
  
  Designing the opener
&lt;/h2&gt;

&lt;p&gt;The first message in any conversation is the most-reread thing in the whole session. It will be read again on turn two, again on turn five, again on turn twenty. Every other message gets read fewer times as newer content pushes it deeper into the transcript, but the opener is always there.&lt;/p&gt;

&lt;p&gt;That changes how I write the first message. When I care about the quality of the whole session, I stop typing casually and start writing a mini-briefing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What I am trying to do (one or two sentences)&lt;/li&gt;
&lt;li&gt;What context the AI needs (the actual relevant background, not everything I know)&lt;/li&gt;
&lt;li&gt;What constraints matter (tone, format, things to prioritize, things to avoid)&lt;/li&gt;
&lt;li&gt;A sample of what good output looks like, when format is specific&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I write it in a text editor, not the chat box. Then I paste it in. The session that follows inherits the document I just wrote as its permanent foundation.&lt;/p&gt;

&lt;p&gt;For work I return to regularly, this opener graduates into a Project. ChatGPT, Claude, and Gemini have all landed on some version of this: a workspace that holds persistent instructions and files alongside multiple chats. Gemini's rollout is the most recent and still surfaces under multiple names (Projects, Notebooks) depending on which product surface you are in. The idea is the same regardless of what a product names it: a folder that holds a persistent set of instructions and files, and every conversation opened inside that folder inherits them without you having to repaste anything. Once the opener is stable, projects promote it from "thing I keep in a text file" to "something every chat in this workspace inherits automatically."&lt;/p&gt;

&lt;p&gt;Sometimes the right opener is an invitation for the AI to interview you. "Ask me five questions before you try to answer X." It is still multi-turn with the same mechanic, just a very different shape: the first few exchanges fill the document with context the model helped shape, not context you wrote alone.&lt;/p&gt;

&lt;h2&gt;
  
  
  Branching as a design choice
&lt;/h2&gt;

&lt;p&gt;There is a habit worth picking up: when you are about to shift to a related but distinct task, do not continue the current conversation. Open a new one.&lt;/p&gt;

&lt;p&gt;Not because the current conversation is "full." Because the next task deserves its own transcript. Two related questions will both perform better in their own sessions than if they share a growing combined history. The model stops weighing your architectural discussion from twenty minutes ago against a small refactor question that does not need any of it.&lt;/p&gt;

&lt;p&gt;A conversation for me is usually scoped to a single question or a single task. When the task shifts, I open a new window. The overhead of re-establishing context is small because my opener does most of that work. The savings on model attention are large because the session stays focused.&lt;/p&gt;

&lt;h2&gt;
  
  
  Distillation, not just summarization
&lt;/h2&gt;

&lt;p&gt;A common technique I use is to ask the AI to summarize a long conversation and then use the summary to start a fresh one. Series 1 covered this as drift management. This is a design-level version of the same move.&lt;/p&gt;

&lt;p&gt;When a conversation has done real work, the summary is not just a tool for starting over. It is an artifact. The summary of a session that produced a useful decision is itself a reusable starter message for future sessions in the same territory. You are summarizing to extract.&lt;/p&gt;

&lt;p&gt;Pattern I use:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;At the end of a productive session, ask the AI to produce a structured summary: the decisions made, the constraints, the open questions.&lt;/li&gt;
&lt;li&gt;Spend time reading and editing it; this is the real work of turning a session into reusable context.&lt;/li&gt;
&lt;li&gt;Save it somewhere so you have access to it in the future when it becomes relevant again.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every good session leaves a distillable residue behind. Treating that residue like an asset, not exhaust, is the move. I use this a lot for capturing decisions in ADRs and for creating guides after I build features.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where consumer memory features fit
&lt;/h2&gt;

&lt;p&gt;Most of the major products now have some form of cross-conversation memory. ChatGPT has a Memory feature that carries useful facts about you between sessions. Other products are rolling out similar capabilities at their own pace.&lt;/p&gt;

&lt;p&gt;These do not change the in-conversation mechanic. Every message in the current chat still resends the full history to the model. The memory layer runs alongside that, injecting persistent facts into the system prompt or as retrieved context. Useful, and a layer above the conversation structure, not a replacement for it.&lt;/p&gt;

&lt;p&gt;If you want stable per-task context, use projects. If you want stable per-user context, use the memory feature. If you want the AI to respond well to what you said two messages ago, do not think of it as remembering at all. Think of the transcript as the document, and design around that.&lt;/p&gt;

&lt;h2&gt;
  
  
  The reflex
&lt;/h2&gt;

&lt;p&gt;The instinct, when an exchange is not going well, is usually to send another message to fix it. Another clarification. Another correction. Another example. That instinct is sometimes right.&lt;/p&gt;

&lt;p&gt;The better reflex, most of the time: stop, close the window, and redesign. Write the starter message you wish you had used. Open a fresh session with it. The minutes you spend on the opener save more than you would lose nudging the current conversation into shape.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The conversation is not a memory. It is a document you are writing. You are the designer.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Tomorrow: when your prompt is almost working but keeps missing in the same way, how to handle it like a bug rather than guessing your way to a fix.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>prompting</category>
      <category>multiturn</category>
      <category>conversations</category>
    </item>
    <item>
      <title>Chain-of-Thought: Teaching AI to Reason Out Loud</title>
      <dc:creator>Jeff Reese</dc:creator>
      <pubDate>Tue, 21 Apr 2026 14:11:40 +0000</pubDate>
      <link>https://dev.to/jeffreese/chain-of-thought-teaching-ai-to-reason-out-loud-153l</link>
      <guid>https://dev.to/jeffreese/chain-of-thought-teaching-ai-to-reason-out-loud-153l</guid>
      <description>&lt;p&gt;&lt;em&gt;AI in Practice, No Fluff — Day 2/10&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When I first started using ChatGPT, I asked it to calculate how many business days were between two dates. It gave me a confident number. The number was very wrong... I only caught it because I had already done the count by hand and was just double-checking.&lt;/p&gt;

&lt;p&gt;I asked the exact same question again, added five words at the end of the prompt, &lt;em&gt;"Let's think step by step,"&lt;/em&gt; and watched it walk through the weekends, subtract a holiday, and then land on the correct number.&lt;/p&gt;

&lt;p&gt;Same model. Same question. Five extra words. A different answer.&lt;/p&gt;

&lt;p&gt;There is a specific reason for that. The reason matters more in 2026 than the technique does.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is going on under the hood
&lt;/h2&gt;

&lt;p&gt;In the first series we covered the three pieces that make a prompt work: context, task, format. Chain-of-thought is not a fourth piece. It sits on top of those, in the territory of &lt;em&gt;how the AI should think before it responds&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;A standard model doesn't have a private thinking step. There is no silent internal process happening before it starts writing. Every word it produces is part of the same running output. If you ask for an answer, the first thing it writes is an answer. If you ask it to reason first, the reasoning is the first thing it writes, and the answer comes after. (Reasoning models bolt a hidden thinking phase on top, and we will get to them, but it is built from this same mechanism.)&lt;/p&gt;

&lt;p&gt;The words that come out between your question and the final answer are where reasoning actually lives. These are the &lt;em&gt;intermediate tokens&lt;/em&gt;. They are not a description of thinking. They are the thinking.&lt;/p&gt;

&lt;p&gt;The act of generating "Monday to Monday is 5 business days, subtract the holiday on Thursday, that leaves 4" is the reasoning step itself. Take those tokens away and the thought did not happen.&lt;/p&gt;

&lt;p&gt;That is why "think step by step" is not a magic spell. It is a structural move. You are asking the model to lay down the intermediate computation as written words before committing to an answer, because without those words there is no computation.&lt;/p&gt;
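
&lt;p&gt;A minimal sketch of that structural difference, reusing the business-days question from the top of the post. The model name and dates are illustrative; the only thing that changes between the two calls is whether the model is allowed to put intermediate tokens down before the number.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from openai import OpenAI

client = OpenAI()
question = "How many business days are there between March 3 and March 17, if March 6 is a holiday?"

def ask(instruction):
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative; any non-reasoning chat model
        messages=[{"role": "user", "content": f"{question}\n\n{instruction}"}],
    )
    return resp.choices[0].message.content

# Answer-first: the number is the first token laid down, so no intermediate
# computation exists before it.
print(ask("Reply with just the number."))

# Reasoning-first: the weekends and the holiday get written out as tokens
# before the final number, which is where the computation actually happens.
print(ask("Let's think step by step, then state the final number."))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
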

&lt;h2&gt;
  
  
  When it helps
&lt;/h2&gt;

&lt;p&gt;Chain-of-thought earns its tokens on anything that requires more than one step to get right.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Math with multiple operations, especially word problems&lt;/li&gt;
&lt;li&gt;Logic puzzles and constraint-satisfaction problems&lt;/li&gt;
&lt;li&gt;Planning a sequence of actions&lt;/li&gt;
&lt;li&gt;Analyzing tradeoffs between options&lt;/li&gt;
&lt;li&gt;Debugging why some system behaves the way it does&lt;/li&gt;
&lt;li&gt;Any judgment call that depends on comparing several factors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the answer depends on holding more than one fact in mind and combining them, letting the model write out the combination first usually produces a better result than asking it to jump to the answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  When it does not help
&lt;/h2&gt;

&lt;p&gt;Not every problem rewards reasoning out loud. If you are asking the AI to retrieve a single fact, summarize a passage, translate between languages, or pick the correct word from context, there is no multi-step reasoning to surface. You are not asking it to think in parallel about several things; you are asking it to produce one thing. Requesting step-by-step reasoning on a lookup task just generates filler and makes the response longer without making it better.&lt;/p&gt;

&lt;p&gt;It can be worse than neutral. A 2024 paper tested what happens when models are forced to reason on tasks where deliberation pushes them away from their correct intuitive answer. Forcing chain-of-thought dropped accuracy by more than a third compared to answering directly. Step-by-step reasoning is a tool, not a default setting; on the wrong task it actively hurts.&lt;/p&gt;

&lt;p&gt;A rough test I use: if I would not need to show my work to get credit for the answer, the AI does not need to either.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to structure a chain-of-thought prompt
&lt;/h2&gt;

&lt;p&gt;The simplest pattern is to append "Let's think step by step" to your question. That alone will often flip a wrong answer into a right one. It is the lowest-effort move, and it is often enough on its own.&lt;/p&gt;

&lt;p&gt;For anything more involved, give the model a scaffold. A reliable template is &lt;em&gt;first identify, then determine, then answer&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Question: [your question]

Please solve this by:
1. First, identify what information you have and what you need to find
2. Then, determine the steps required to get from one to the other
3. Work through each step
4. Finally, state your answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The explicit structure does two things. It names the stages so the model is less likely to skip one, and it delays the jump to the answer until the work is done. Making "state your answer" a distinct final step matters. Without it the model sometimes trails off into more reasoning and never commits.&lt;/p&gt;

&lt;h2&gt;
  
  
  The shift with reasoning models
&lt;/h2&gt;

&lt;p&gt;This started shifting in late 2024 with OpenAI's o-series, followed by Anthropic's extended thinking in early 2025. By 2026 it has flipped all the way. Reasoning is built into the flagship models by default. GPT-5, Claude 4.6, and Gemini 3 all default to reasoning in their main consumer interfaces. Claude's approach, called &lt;em&gt;adaptive thinking&lt;/em&gt;, lets the model decide how much to reason based on the question. You steer how hard it thinks through an &lt;em&gt;effort&lt;/em&gt; parameter in the API rather than setting a token budget by hand.&lt;/p&gt;

&lt;p&gt;If you are using a current flagship, explicit "think step by step" prompting is mostly redundant. A 2025 paper measuring CoT performance across reasoning-class models found the benefit of adding explicit step-by-step prompting is small, and sometimes negative. The reasoning is already happening. You are not unlocking anything the model was not already going to do.&lt;/p&gt;

&lt;p&gt;There is a tradeoff worth knowing: reasoning models are slower and more expensive per response because they are generating a lot of hidden thinking tokens before answering. For simple questions that did not need reasoning, you are paying for thinking that did not improve the output. This is why most providers let you dial the reasoning effort up and down, or offer a non-reasoning mode alongside their reasoning model.&lt;/p&gt;

&lt;p&gt;The practical move: turn the effort up for hard problems, turn it down for easy ones, and if you are on a model without built-in reasoning, reach for explicit chain-of-thought prompting when the problem has more than one step.&lt;/p&gt;
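
&lt;p&gt;A hedged sketch of that dial in code. The reasoning_effort values follow OpenAI's Python SDK naming for its reasoning models at the time of writing; other providers expose the same idea under different names and shapes, so treat the exact parameter as provider-specific rather than universal.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from openai import OpenAI

client = OpenAI()

def ask(question, hard=False):
    # Reasoning-class model: dial the hidden thinking up for multi-step
    # problems, down for lookups you would otherwise overpay on.
    resp = client.chat.completions.create(
        model="o3-mini",  # illustrative reasoning-class model
        reasoning_effort="high" if hard else "low",
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

ask("What year was the transistor invented?")  # easy: low effort
ask("Plan a zero-downtime migration from MySQL to Postgres.", hard=True)

# On a model without built-in reasoning, the older move still applies:
# append the scaffold yourself and let the intermediate tokens do the work,
# e.g. question + " Let's think step by step."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
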

&lt;h2&gt;
  
  
  The reflex
&lt;/h2&gt;

&lt;p&gt;When you get a confident wrong answer from an AI, the reflex is to add more context. More background, more examples, more specificity about what you want.&lt;/p&gt;

&lt;p&gt;That is sometimes the right move. It is often the wrong one.&lt;/p&gt;

&lt;p&gt;The question worth asking first is whether the model had room to think. If the answer depended on more than one step and the model jumped straight to the answer, the failure is structural, not informational. On a non-reasoning model, give the model room to think out loud before answering. On a reasoning model, the reasoning was probably running already; the fix is usually switching the approach rather than adding to the prompt.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If the answer requires thinking, make the thinking happen out loud before the answer.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Tomorrow: one exchange is rarely enough. How to design a back-and-forth with an AI so the conversation does not drift.&lt;/p&gt;




&lt;p&gt;If there is anything I left out or could have explained better, tell me in the comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>prompting</category>
      <category>reasoning</category>
      <category>chainofthought</category>
    </item>
  </channel>
</rss>
