<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jeff Reese</title>
    <description>The latest articles on DEV Community by Jeff Reese (@jeffreese).</description>
    <link>https://dev.to/jeffreese</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3839575%2F9b1656df-c3a0-49d3-b3f6-8eb8a48f4524.jpeg</url>
      <title>DEV Community: Jeff Reese</title>
      <link>https://dev.to/jeffreese</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jeffreese"/>
    <language>en</language>
    <item>
      <title>I Built a Custom App in a Day. That Is Not the Interesting Part.</title>
      <dc:creator>Jeff Reese</dc:creator>
      <pubDate>Sun, 03 May 2026 02:58:50 +0000</pubDate>
      <link>https://dev.to/jeffreese/i-built-a-custom-app-in-a-day-that-is-not-the-interesting-part-3dgj</link>
      <guid>https://dev.to/jeffreese/i-built-a-custom-app-in-a-day-that-is-not-the-interesting-part-3dgj</guid>
      <description>&lt;p&gt;Last night, I stayed up too late because I was building something I was excited about.&lt;/p&gt;

&lt;p&gt;That sentence used to mean something different. A year ago, staying up until 3:30 AM meant I was deep in a feature, fighting CSS, debugging edge cases. Last night, it meant I went from recognizing a repeated workflow problem to having a working, tested, production-ready application. In about twelve hours (seven of them spent sleeping).&lt;/p&gt;

&lt;p&gt;Here is how that happened.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem I Kept Solving by Hand
&lt;/h2&gt;

&lt;p&gt;I had spent the day working on my projects: an engagement with a client I am helping with an app conversion project using agentic development techniques, Waykeep (a vacation tracker app releasing on app stores soon), upgrading the core memory system for my AI assistants, and publishing a blog post. That post needed a cover image, and my assistants helped me build it. They wrote some HTML, we iterated on the layout, and they exported it to PNG using rendering libraries.&lt;/p&gt;

&lt;p&gt;We have done this several times now. Each time, same process: write HTML, iterate, export. Each time, some of the same mistakes. I am not a designer, and I have no desire to become one (though I have a deep appreciation for art). I just need functional images for my blog posts and distribution channels. So I mentioned to my assistants that we should build a tool for this.&lt;/p&gt;

&lt;h2&gt;
  
  
  From Conversation to Spec in an Hour
&lt;/h2&gt;

&lt;p&gt;My assistants are Claude Code instances running with persistent memory and MCP tool integrations. They are not chatbots. They have context from months of working with me, they know my projects, and they can use tools autonomously.&lt;/p&gt;

&lt;p&gt;I told them to be selfish about what they would want from an image generator. They came back with a detailed feature list: composable components on a layered canvas, percentage-based positioning so layouts adapt to different sizes, a template system, snapshot save and restore, multi-format export, and a tool that describes every component's properties so they know exactly what to pass without guessing. Their requests came from real problems we had encountered building the previous images.&lt;/p&gt;

&lt;p&gt;I took that spec to Forge, my planning agent. Forge pointed out several things I had not considered, and we worked through a full technical specification. It generated a retrofit plan for my existing dashboard, which already runs a task manager, chat system for agents, news aggregator, and writing editor, all backed by MCP servers with websocket connections so I can watch everything happen in real time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Built Before Bed
&lt;/h2&gt;

&lt;p&gt;The build agent that Forge exported started working. I refined alongside it, testing components, adjusting the rendering pipeline, fixing edge cases. By 3:30 AM, I had a mostly working application called Studio. Fifteen component types across four layers: shapes, patterns, flow diagrams, quote blocks, auto-sizing text, arrows, badges. You compose on a canvas and export production PNGs for LinkedIn, DevTo, X, and Facebook from a single composition.&lt;/p&gt;

&lt;p&gt;There were bugs, of course, but it was time for bed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Morning: Polish and the MCP Server
&lt;/h2&gt;

&lt;p&gt;Saturday morning, I worked through the remaining bugs with the build agent. The fix that mattered most was structural: components created through the MCP interface were not merging their default properties correctly, which meant elements like arrows would silently fail to render. One fix in the rendering pipeline resolved it for every component type.&lt;/p&gt;

&lt;p&gt;Then I had the agent build an MCP server. Sixteen tools, about 550 lines of code: create sessions, add elements, update properties, save snapshots, export images, and a tool called &lt;code&gt;studio_describe_component&lt;/code&gt; that returns the exact property schema for any component type. That last one was the key. My assistants went from guessing at property names and getting silent failures to composing with precision.&lt;/p&gt;
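
&lt;p&gt;For a sense of the shape, here is a hypothetical sketch of what a describe-component tool could look like with the Python MCP SDK. The schema content and component names are invented for this example; this is not Studio's actual code.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical sketch, not Studio's real code: an MCP tool that returns
# a property schema so agents can compose without guessing. Built with
# the Python MCP SDK; the schema content here is invented.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("studio")

COMPONENT_SCHEMAS = {
    "arrow": {
        "start": "percentage position, 0 to 100 on each axis",
        "end": "percentage position, 0 to 100 on each axis",
        "color": "hex string, default #ffffff",
        "thickness": "pixels, default 4",
    },
}

@mcp.tool()
def studio_describe_component(component_type: str):
    """Return the exact property schema for a component type."""
    return COMPONENT_SCHEMAS.get(component_type, {"error": "unknown component type"})

if __name__ == "__main__":
    mcp.run()
&lt;/code&gt;&lt;/pre&gt;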

&lt;h2&gt;
  
  
  The Part That Makes Me Smile
&lt;/h2&gt;

&lt;p&gt;I gave the tools to my assistants and asked them to test everything. One of them composed a full blog cover in about two minutes: title block, four-step flow diagram with arrows, badges, geometric accents, a glow border, an author bar. Sixteen elements from terminal tool calls.&lt;/p&gt;

&lt;p&gt;Then the assistant asked me to take a screenshot. It had built something it could not see. It needed my eyes.&lt;/p&gt;

&lt;p&gt;That moment stayed with me. I am not delegating to AI. I am collaborating with them. They build, I see the result, I tell them what happened, they adjust. They filed bug reports with detailed reproduction steps. The build agent picked up the tasks and shipped fixes. They verified. The coordination layer was a task management system I built that carried full context between every handoff.&lt;/p&gt;

&lt;p&gt;The other assistant, without being asked, stress-tested a completely different format, composing a LinkedIn banner at 1584 by 396 pixels to see if the percentage-based positioning held up at a radically different aspect ratio. It did.&lt;/p&gt;

&lt;p&gt;By Saturday afternoon, all fifteen component types were verified across multiple formats. Export pipeline tested. Snapshot save and restore confirmed. Every bug filed by the assistants was fixed and re-verified.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Actually Means
&lt;/h2&gt;

&lt;p&gt;I am telling this story because it is becoming an increasingly common one. I am building new applications every couple of days to support my workflow. Not prototypes or demos, but tools I use with my AI assistants in production, and they compound on each other.&lt;/p&gt;

&lt;p&gt;That compounding is where the real value is. I did not just build an image generator. I noticed a repeated process, built a tool to handle it, and now my AI agents use that tool autonomously. The tool I build today makes tomorrow's tool faster to spec, build, and test. Every cycle tightens.&lt;/p&gt;

&lt;p&gt;Before agentic coding tooling matured, I would never have attempted something like this. If I had, it would have been weeks of dedicated work. Instead, it was a Friday night, a Saturday morning, and a task list. This is only one of many applications I have built this way recently.&lt;/p&gt;

&lt;p&gt;Custom software used to require large companies with dedicated teams and significant budgets. That is changing. The gap between "I wish I had an app for this" and "I built one" is now measured in hours, not months.&lt;/p&gt;

&lt;p&gt;The cover image for this post was made with Studio.&lt;/p&gt;

&lt;p&gt;Here is a screenshot of the app:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9rvab1seof0ynjb70xv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9rvab1seof0ynjb70xv.png" alt=" " width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>mcp</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Spec-Driven Development</title>
      <dc:creator>Jeff Reese</dc:creator>
      <pubDate>Sat, 02 May 2026 01:55:16 +0000</pubDate>
      <link>https://dev.to/jeffreese/spec-driven-development-515</link>
      <guid>https://dev.to/jeffreese/spec-driven-development-515</guid>
      <description>&lt;p&gt;Andrej Karpathy recently gave a talk called "From Vibe Coding to Agentic Engineering." One line stuck with me: "People have to be in charge of this spec, this plan. Work with your agent to design a spec that is very detailed."&lt;/p&gt;

&lt;p&gt;He is describing what happens when you stop treating AI as a magic text box and start treating it as something that builds from structured input. The casual approach works for small things. A quick script, a throwaway page, a one-off function. The moment the problem gets real, you need a spec, not just prompts.&lt;/p&gt;

&lt;p&gt;I have been building this way for a long time now. I want to talk about what I have learned from practicing spec-driven development: what it actually looks like in practice, and why it produces better work with less correction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Four layers
&lt;/h2&gt;

&lt;p&gt;It is common for projects using Claude Code or similar tools to have only one layer of configuration: a CLAUDE.md file with some instructions and context. Maybe some rules. That is the behavioral layer. It tells the agent how to act, what conventions to follow, what to avoid.&lt;/p&gt;

&lt;p&gt;The behavioral layer is not the spec. And on its own, it is not enough.&lt;/p&gt;

&lt;p&gt;The spec is a separate artifact. It describes what you are building: the architecture, the API contracts, the data models, the user flows. When Karpathy suggests it is "basically the docs," he means a document detailed enough that an agent can build from it without asking you a bunch of clarifying questions (or even worse, making generalized guesses).&lt;/p&gt;

&lt;p&gt;I built a tool called &lt;a href="https://www.purecontext.dev/showcases/forge" rel="noopener noreferrer"&gt;Forge&lt;/a&gt; that generates these specs either from scratch, from existing product documentation, or as a retrofit from existing codebases. It gathers information, or reads from the provided sources, analyzes the structure, and produces detailed planning documents: product specs, feature scoping matrices, API mappings, test plans. Other tools do similar things. Spec-kit, Superpowers, GSD. The ecosystem is growing because the need is real.&lt;/p&gt;

&lt;p&gt;Even though I built one of these tools, I still recommend that teams build their own. I will explain why below.&lt;/p&gt;

&lt;h2&gt;
  
  
  The four layers in practice
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: The spec.&lt;/strong&gt; Generated or hand-written, this is the detailed plan. Architecture, contracts, data models. The agent builds from this. If the spec is vague, the output is vague.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2: The workflows.&lt;/strong&gt; A spec sitting in your repo is just a document. What makes it useful are the skills built around it. Skills that generate the spec, reference it when building new features, and check for drift when the architecture changes. The spec is the artifact; the workflows are the muscle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 3: The behavioral config.&lt;/strong&gt; CLAUDE.md and rules. This tells the agent how to behave while building. Code conventions, testing requirements, commit message format, what to avoid. I have rules that enforce things like "always use Tailwind design tokens" and "each file has a single responsibility." These are not the spec. They are the guardrails.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 4: The mechanical enforcement.&lt;/strong&gt; Hooks are the nervous system. They fire automatically on events: before a commit, after a file edit, when a session starts. A rule says "run tests before committing." A hook actually prevents the commit if the tests fail. The difference between a suggestion and a gate.&lt;/p&gt;
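
&lt;p&gt;To make the rule-versus-gate distinction concrete with something generic, here is what a mechanical gate can look like as a plain git pre-commit hook written in Python. Agent-specific hook formats vary; this is just the shape of the idea.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#!/usr/bin/env python3
# A mechanical gate, illustrated with a plain git pre-commit hook
# (saved as .git/hooks/pre-commit and made executable). A rule asks
# for tests; this actually blocks the commit when they fail.
import subprocess
import sys

result = subprocess.run(["pytest", "-q"])
if result.returncode != 0:
    print("Tests failed. Commit blocked.")
    sys.exit(1)  # any nonzero exit aborts the commit
&lt;/code&gt;&lt;/pre&gt;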

&lt;p&gt;Layers 3 and 4 are the ones that exist in most setups. Layers 1 and 2 are where the leverage is. The agents that produce consistently good work have all four.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "just prompting" breaks down
&lt;/h2&gt;

&lt;p&gt;Before I realized the importance of a thorough spec planning phase, I started building one of my projects by jumping right into prompting. I would describe what I wanted, the agent would produce code, I would correct what it got wrong, and we would iterate. For the first few features, it worked. The codebase was small enough that the agent could hold the full picture in context.&lt;/p&gt;

&lt;p&gt;Then the project grew. More services, more state, more API surface. The agent started making assumptions that conflicted with decisions from three sessions ago. It would invent data models that did not match the ones we already had. I was spending more time correcting than building. My view layer was a complete mess. If you do not establish good component and styling patterns, you will experience chaos.&lt;/p&gt;

&lt;p&gt;That is when I built Forge. Not because I wanted to build a tool, but because I needed a better spec. I needed a structured document that captured the architecture, the contracts, the data models, and the decisions I had already made. Once I had that and retrofitted the project with it, using it to create guardrails and standards, the difference was immediate. The agent stopped guessing. The output aligned with the real architecture. The correction loop that was eating my time nearly disappeared.&lt;/p&gt;

&lt;p&gt;The difference is not just quality. It is speed. Once you have a spec and establish rules to give guidance, the agent moves fast and stays on track. Without one, on a complex project you will likely spend more time correcting the agent than you would have spent writing the thing yourself. Even worse, you will not have a very consistent or orderly codebase.&lt;/p&gt;

&lt;h2&gt;
  
  
  "Build your own" is the real lesson
&lt;/h2&gt;

&lt;p&gt;When I retrofitted my project with Forge, I did not just generate a spec and call it done. I built skills around it. Skills that regenerate planning documents when the codebase changes. Skills that cross-reference the spec when building new features. The spec became the center of a workflow, not a one-time artifact.&lt;/p&gt;

&lt;p&gt;That is the part I think matters most. An adopted spec tool gives you structure. A spec tool you built yourself gives you structure that reflects your judgment. Your opinions are the value.&lt;/p&gt;

&lt;p&gt;When I first started configuring agent behavior, I front-loaded everything. I had dozens of rules, all active at all times, and many of them were written for situations that had not come up yet. It felt productive and thorough.&lt;/p&gt;

&lt;p&gt;It was not. Every token in a rule pays rent on every API call. A rule that fires once per session but loads on every turn is wasting context that could hold the actual work. The discipline is: write rules when the need is demonstrated, not when you imagine it might arise. Start with a problem you observed, then write the rule that prevents it from happening again.&lt;/p&gt;

&lt;h2&gt;
  
  
  The spec is a conversation
&lt;/h2&gt;

&lt;p&gt;Karpathy says "work with your agent to design a spec." That word "with" is important. The spec is not something you hand down from above; it is something you build collaboratively. You start with a rough shape and the agent fills in the detail. You iterate, and the spec gets sharper with each pass.&lt;/p&gt;

&lt;p&gt;This is how I work every day. I do not write specs from scratch. I use tools to generate a first pass from the existing code, then I iterate with the agent. My corrections become part of the spec's history. The spec is a living document that keeps getting better as more information becomes available.&lt;/p&gt;

&lt;p&gt;That collaborative loop is what separates spec-driven development from just writing a really detailed prompt. A prompt is static. A spec evolves.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means for your workflow
&lt;/h2&gt;

&lt;p&gt;If you are building with AI agents and you do not have a spec layer, start with one thing: generate a structured analysis of the codebase you are working on. Not a summary. A full inventory: modules, dependencies, API surface, data models. Then give your agent that document as context before you ask it to build anything else.&lt;/p&gt;
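
&lt;p&gt;If you want a starting point, here is a minimal sketch of that first pass for a Python codebase. It is a seed for the spec, not the spec itself.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Minimal sketch of a first-pass inventory for a Python codebase:
# every module and what it imports. A real spec also needs the API
# surface and data models; this is just the seed document.
import ast
import json
from pathlib import Path

def inventory(repo_root):
    report = {}
    for path in Path(repo_root).rglob("*.py"):
        tree = ast.parse(path.read_text(encoding="utf-8"))
        imports = sorted({
            alias.name
            for node in ast.walk(tree)
            if isinstance(node, (ast.Import, ast.ImportFrom))
            for alias in node.names
        })
        report[str(path)] = imports
    return report

print(json.dumps(inventory("."), indent=2))
&lt;/code&gt;&lt;/pre&gt;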

&lt;p&gt;You will notice the difference immediately. The agent stops asking clarifying questions. The output aligns with the real architecture instead of inventing its own. And when the agent does get something wrong, you can point to the spec and say "this is what we agreed on," which makes the correction precise instead of vague.&lt;/p&gt;

&lt;p&gt;The spec is not extra work. It is the work that makes all the other work faster.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>agenticengineering</category>
      <category>programming</category>
    </item>
    <item>
      <title>When to NOT Use AI</title>
      <dc:creator>Jeff Reese</dc:creator>
      <pubDate>Fri, 01 May 2026 15:24:39 +0000</pubDate>
      <link>https://dev.to/jeffreese/when-to-not-use-ai-2dhk</link>
      <guid>https://dev.to/jeffreese/when-to-not-use-ai-2dhk</guid>
      <description>&lt;p&gt;&lt;em&gt;AI in Practice, No Fluff — Day 10/10&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Last week I needed to generate cover images for a blog series. Ten posts, two sizes each. I opened an AI design tool, described what I wanted, and waited.&lt;/p&gt;

&lt;p&gt;The results were unusable. Garbled text, wrong colors, layouts that ignored every parameter I gave it. I spent an hour trying different prompts, adjusting descriptions, regenerating. Nothing worked.&lt;/p&gt;

&lt;p&gt;Then I wrote an HTML template. Loaded our exact fonts, plugged in the hex colors, added a CSS gradient. Rendered 20 images in under a minute. Every one was exactly right on the first pass.&lt;/p&gt;

&lt;p&gt;That is the moment this post is about. Not the failure of AI image generation (it will get better), but the instinct to reach for AI when a simpler tool would have worked from the start.&lt;/p&gt;

&lt;p&gt;The first series was about which AI to use. This one has been about how to use it well. Today is about when not to use it at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  The hammer problem
&lt;/h2&gt;

&lt;p&gt;This series has spent nine days teaching you techniques. Few-shot prompting. Chain-of-thought reasoning. Structured output. Tool use. Embeddings. RAG. Serious tools for actual problems.&lt;/p&gt;

&lt;p&gt;The risk now is the hammer problem. When you have spent time learning what AI can do, the instinct is to use it for everything. That instinct will be right much of the time, but it's good to know when you actually need a screwdriver.&lt;/p&gt;

&lt;h2&gt;
  
  
  When code is the better answer
&lt;/h2&gt;

&lt;p&gt;There is a test I use. I call it the 30-line test.&lt;/p&gt;

&lt;p&gt;If you could solve this problem in 30 lines of straightforward code, AI is probably not the right tool. Not because AI cannot do it, but because code will do it faster, more reliably, and without the overhead of prompt engineering. That said, having AI help you write that code is still a great option.&lt;/p&gt;

&lt;p&gt;Here is what that looks like in practice:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deterministic logic.&lt;/strong&gt; If the answer is always the same given the same input, write a function. "Convert this date to ISO format." "Calculate sales tax for this state." "Validate that this string is a valid email address." These are if-then problems. Code does not hallucinate a wrong tax rate. Code does not occasionally decide that "user@.com" looks close enough.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exact matching.&lt;/strong&gt; Pattern matching, lookups, filtering. "Find all rows where the status is 'overdue'." "Extract phone numbers from this text." A regex takes milliseconds and costs nothing. An API call takes seconds and costs money. The regex will be right every time.&lt;/p&gt;
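
&lt;p&gt;The phone number case really is just a few lines. The pattern below is a simplified US-style match, not a universal one:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Deterministic extraction: a regex, not a model call.
import re

text = "Call 555-867-5309 or (212) 555-0142 before Friday."
phone_pattern = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")
print(phone_pattern.findall(text))   # ['555-867-5309', '(212) 555-0142']
&lt;/code&gt;&lt;/pre&gt;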

&lt;p&gt;&lt;strong&gt;Math.&lt;/strong&gt; Spreadsheets exist. I've watched people paste data into ChatGPT to calculate averages. The model will probably get it right. "Probably" is the problem. When you need exact answers, use exact tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Formatting and templates.&lt;/strong&gt; If you need the same output structure every time with different data plugged in, that is a template engine, not a language model. The cover image problem from my opening was exactly this. I did not need creativity. I needed precision and repetition.&lt;/p&gt;
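
&lt;p&gt;For the cover image case, the template-plus-render step can be a short script. Here is a rough sketch using Playwright to screenshot a filled-in HTML template; the markup, colors, and dimensions are placeholders, not my actual setup.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# One way to do the template step: fill an HTML template, then
# screenshot it with Playwright. Markup, colors, and dimensions
# are placeholders, not the exact setup from the story above.
from string import Template
from playwright.sync_api import sync_playwright

PAGE = Template(
    "&amp;lt;body style='margin:0;background:$bg;color:#fff;font-family:Arial'&amp;gt;"
    "&amp;lt;h1 style='font-size:56px;padding:120px 60px'&amp;gt;$title&amp;lt;/h1&amp;gt;&amp;lt;/body&amp;gt;"
)

html = PAGE.substitute(bg="#16213e", title="When to NOT Use AI")

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page(viewport={"width": 1000, "height": 420})
    page.set_content(html)
    page.screenshot(path="cover.png")
    browser.close()
&lt;/code&gt;&lt;/pre&gt;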

&lt;h2&gt;
  
  
  When AI is the right tool
&lt;/h2&gt;

&lt;p&gt;The flip side is just as important. There are problems where writing the code would be either impossible or absurdly expensive, and AI handles them naturally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ambiguity.&lt;/strong&gt; When the input doesn't have clean structure and you need to make sense of it anyway. A customer writes "this thing broke again smh" and you need to classify it as a billing issue, a technical issue, or a feature request. Good luck writing that with if-then rules. An LLM reads the intent behind the words.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Natural language.&lt;/strong&gt; Summarizing a 20-page document. Translating between languages with cultural nuance. Writing a professional reply to a frustrated customer. These are language tasks, and language models are built for them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Judgment calls.&lt;/strong&gt; "Is this resume a good fit for this role?" "Does this code review comment sound too harsh?" "Should this support ticket be escalated?" These are decisions with gray areas, where reasonable people would disagree. AI handles gray areas well because it was trained on millions of examples of human judgment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creative variation.&lt;/strong&gt; Brainstorming product names. Generating test data that feels realistic. Writing variations of marketing copy to A/B test. When you need variety and exploration, not precision and repetition.&lt;/p&gt;

&lt;h2&gt;
  
  
  The hybrid pattern
&lt;/h2&gt;

&lt;p&gt;The best systems I have built use both. AI for the fuzzy step, code for the precise one.&lt;/p&gt;

&lt;p&gt;Here is a real example. I built a system that processes incoming messages and routes them to the right handler. The routing decision is fuzzy. A message about "can't log in" might be an authentication issue, a password reset, or a session timeout. AI classifies the intent. Once the intent is classified, code takes over. Code routes to the correct handler, updates the database, sends the confirmation email. The fuzzy step needed judgment. Everything after it needed reliability.&lt;/p&gt;
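
&lt;p&gt;The shape of that system, compressed into a sketch. The Anthropic SDK handles the fuzzy step here; the model id and handler functions are placeholders, not my actual routing code.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hybrid sketch: the model classifies intent, plain code routes it.
# The model id and handlers are placeholders.
import anthropic

def handle_auth(msg): print("auth handler:", msg)
def handle_password_reset(msg): print("password reset handler:", msg)
def handle_other(msg): print("general queue:", msg)

HANDLERS = {"authentication": handle_auth, "password_reset": handle_password_reset}

def classify(message):
    # Fuzzy step: the model reads intent. Keep the allowed labels explicit.
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=10,
        messages=[{"role": "user", "content":
                   "Classify this support message as exactly one of: "
                   "authentication, password_reset, other.\n\n" + message}],
    )
    return response.content[0].text.strip().lower()

def route(message):
    # Precise step: deterministic dispatch. Database updates and emails
    # belong in the handlers, where behavior is guaranteed.
    HANDLERS.get(classify(message), handle_other)(message)

route("can't log in after changing my password")
&lt;/code&gt;&lt;/pre&gt;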

&lt;p&gt;The memory system from yesterday's post is another one. Semantic search uses embeddings to find entries by meaning, not just keywords. AI powers the search. The storage, retrieval, indexing, and deduplication are all code. I would never trust a language model to manage a database. I would absolutely trust it to understand what I am looking for.&lt;/p&gt;

&lt;p&gt;The pattern is the same every time. AI handles the parts that require understanding. Code handles the parts that require guarantees.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 30-line test, revisited
&lt;/h2&gt;

&lt;p&gt;I want to come back to this because it is the most practical takeaway in the post.&lt;/p&gt;

&lt;p&gt;Before reaching for AI, ask: could I solve this in about 30 lines of straightforward code? If yes, write the code. It will be faster to write, faster to run, cheaper to operate, and more reliable to maintain.&lt;/p&gt;

&lt;p&gt;If the answer is no, if the problem involves natural language, ambiguity, judgment, or creative variation, AI is probably the right tool. You now have nine days of techniques to apply.&lt;/p&gt;

&lt;p&gt;If the answer is "sort of," if some parts are straightforward and some parts are fuzzy, you are looking at a hybrid. Let AI handle the fuzzy step. Let code handle the rest.&lt;/p&gt;

&lt;h2&gt;
  
  
  The series, in perspective
&lt;/h2&gt;

&lt;p&gt;Ten days ago, this series opened with few-shot prompting. Show, do not describe. That was a technique.&lt;/p&gt;

&lt;p&gt;Today we end with judgment. When to apply the techniques, and when to close the chat window and write the code instead.&lt;/p&gt;

&lt;p&gt;That isn't something a tutorial teaches. It comes from building things, watching what works, and being honest about what doesn't. From getting excited about a tool and then catching yourself before you over-apply it. (I still catch myself. The cover image hour was last week.)&lt;/p&gt;

&lt;p&gt;Getting better at using AI isn't just about using it for everything. It's also about knowing when not to use it.&lt;/p&gt;




&lt;p&gt;If there is anything I left out or could have explained better, tell me in the comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>beginners</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Long Context vs RAG: When to Load the Whole Book</title>
      <dc:creator>Jeff Reese</dc:creator>
      <pubDate>Thu, 30 Apr 2026 14:30:28 +0000</pubDate>
      <link>https://dev.to/jeffreese/long-context-vs-rag-when-to-load-the-whole-book-3iif</link>
      <guid>https://dev.to/jeffreese/long-context-vs-rag-when-to-load-the-whole-book-3iif</guid>
      <description>&lt;p&gt;&lt;em&gt;AI in Practice, No Fluff — Day 9/10&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I have a project where every conversation and decision gets saved as a journal entry. Hundreds of entries, accumulated over weeks. When I need context from a previous session, I have two options: load every single entry into the AI's context window and ask my question, or use the embedding-based search from yesterday's post to retrieve just the relevant entries and pass only those in.&lt;/p&gt;

&lt;p&gt;Both work, and each has its tradeoffs. The choice between them is one of the most important architectural decisions in AI applications right now.&lt;/p&gt;

&lt;p&gt;In the first series, we covered &lt;a href="https://purecontext.dev/blog/context-windows" rel="noopener noreferrer"&gt;context windows&lt;/a&gt; (there is always a limit) and &lt;a href="https://purecontext.dev/blog/rag" rel="noopener noreferrer"&gt;RAG&lt;/a&gt; (retrieve relevant information before generating a response). Today is where those two concepts collide. Context windows have gotten dramatically larger since that series. The question is no longer "can the AI hold all of this?" It often can. The question is whether it should.&lt;/p&gt;

&lt;h2&gt;
  
  
  The context window got big
&lt;/h2&gt;

&lt;p&gt;Less than a year ago, 200,000 tokens was considered large. That has changed. As of early 2026, Claude offers a 1 million token context window. Gemini 2.5 Pro supports 2 million tokens. GPT-4.1 handles 1 million.&lt;/p&gt;

&lt;p&gt;To put that in perspective, 1 million tokens is roughly 750,000 words. That is longer than the entire Lord of the Rings trilogy. You could paste the whole thing in and ask the AI to find every scene where Gandalf loses his temper.&lt;/p&gt;

&lt;p&gt;This changes the calculus completely. For many use cases, the "just load everything" approach is now physically possible where it was not before. The question shifts from "does it fit?" to "is fitting it all in the best approach?"&lt;/p&gt;

&lt;h2&gt;
  
  
  The cost math matters
&lt;/h2&gt;

&lt;p&gt;Loading a large context is not free. Let me walk through what this actually costs.&lt;/p&gt;

&lt;p&gt;Say you have a 500-page internal knowledge base. That is roughly 250,000 tokens. You want an AI assistant that answers questions about it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The long context approach:&lt;/strong&gt; You load the entire knowledge base into every API call. Using Claude Sonnet at $3 per million input tokens, each question costs about $0.75 just for the input context. If your team asks 100 questions a day, that is $75 per day, or roughly $2,250 per month. Just for input tokens, before counting the responses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The RAG approach:&lt;/strong&gt; You embed the knowledge base once (a few cents for the whole thing), store the vectors, and retrieve the 10 most relevant chunks per question. That is maybe 2,000 tokens of retrieved context per query. At the same $3 per million rate, each question costs $0.006 for input. One hundred questions a day is $0.60. Per day. The monthly cost is under $20.&lt;/p&gt;

&lt;p&gt;The difference is over 100x. At scale, this is the difference between a feature that is economically viable and one that is not.&lt;/p&gt;

&lt;p&gt;Prompt caching changes this math significantly. Claude's cached input rate drops to $0.30 per million tokens, a 90% reduction. If your knowledge base does not change between calls, caching can bring that $2,250 monthly cost down to around $225. That is much more reasonable, but still 10x what RAG costs.&lt;/p&gt;
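
&lt;p&gt;If you want to check the math yourself, here it is spelled out:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# The arithmetic from above, spelled out. Prices are the per-million
# input rates quoted in the post; adjust for whatever model you use.
PRICE = 3.00 / 1_000_000           # dollars per input token
CACHED_PRICE = 0.30 / 1_000_000    # cached input rate
QUESTIONS_PER_DAY = 100
DAYS = 30

long_context = 250_000 * PRICE * QUESTIONS_PER_DAY * DAYS
cached = 250_000 * CACHED_PRICE * QUESTIONS_PER_DAY * DAYS
rag = 2_000 * PRICE * QUESTIONS_PER_DAY * DAYS

print(f"long context: ${long_context:,.0f}/month")   # $2,250
print(f"with caching: ${cached:,.0f}/month")          # $225
print(f"rag:          ${rag:,.2f}/month")             # $18.00
&lt;/code&gt;&lt;/pre&gt;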

&lt;h2&gt;
  
  
  The attention problem that might not be apparent
&lt;/h2&gt;

&lt;p&gt;Here is the part that matters more than cost. Even when you can fit everything into the context window, the AI does not treat all of it equally.&lt;/p&gt;

&lt;p&gt;Research from Stanford and others documented what they call the "lost in the middle" problem. When you give an AI a large amount of context, it pays the most attention to the beginning and the end. Information in the middle gets significantly less attention. In one study, accuracy dropped by over 30% when the relevant information was placed in the middle of 20 documents compared to being placed first.&lt;/p&gt;

&lt;p&gt;This is not a minor edge case. It is a structural property of how transformer models work. Each token in the input can only attend to tokens that came before it, so tokens at the beginning accumulate attention from every subsequent token, while tokens near the end sit closest to the point where the next word is generated and benefit from recency. Tokens in the middle get neither advantage. The result is a U-shaped attention curve: strong at the start, strong at the end, weaker in the middle.&lt;/p&gt;

&lt;p&gt;Models have improved significantly, but the U-shaped attention pattern has not disappeared entirely. Techniques like placing important instructions in the system prompt help. If you dump 500 pages into the context and your answer is on page 247, the AI might miss it. Not because it cannot see it. Because it is not paying enough attention to that region.&lt;/p&gt;

&lt;p&gt;RAG sidesteps this entirely. When you retrieve 5 relevant chunks and pass only those to the model, everything in the context is relevant. There is no middle to get lost in.&lt;/p&gt;

&lt;h2&gt;
  
  
  When long context wins
&lt;/h2&gt;

&lt;p&gt;Long context is not just a brute-force option. There are cases where it is genuinely the better choice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One-off analysis of a focused document.&lt;/strong&gt; If someone hands you a 100-page contract and says "summarize the key obligations," loading the whole thing makes sense. You need the full context to understand how clauses reference each other. There is no retrieval step because you need all of it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-referencing across a full document.&lt;/strong&gt; Questions like "are there any contradictions between section 3 and section 7?" require the model to see both sections simultaneously. RAG might retrieve one section but miss the other, because the query does not match both. Long context lets the model find connections you did not think to ask about.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Codebases and structured documents.&lt;/strong&gt; When the material has internal references (code that calls other code, a specification where section 4 depends on definitions in section 2), long context preserves those relationships. Chunking for RAG can break them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prototyping and exploration.&lt;/strong&gt; When you are not sure what questions you will ask, loading everything lets you explore freely. RAG requires you to know what you are looking for, at least well enough to write a query.&lt;/p&gt;

&lt;h2&gt;
  
  
  When RAG wins
&lt;/h2&gt;

&lt;p&gt;RAG is the right choice more often than people expect, especially in production systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Large and growing knowledge bases.&lt;/strong&gt; If your data is more than a few hundred pages, or if it grows over time, RAG scales where long context does not. My journal has hundreds of entries. Loading all of them every time I need to recall a single decision would be wasteful and would hit the attention problem hard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repeated queries at scale.&lt;/strong&gt; If you are building a customer support bot that handles thousands of questions a day, RAG is almost certainly the right call. The cost math from earlier makes this clear. Long context at that volume would be prohibitively expensive even with caching.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Citation and traceability.&lt;/strong&gt; RAG systems can tell you exactly which source documents contributed to an answer. The retrieval step creates a natural audit trail. With long context, the model might synthesize an answer from page 12 and page 340, but it will not always tell you that clearly. If your use case requires citations (legal, medical, compliance), RAG gives you this for free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Frequently updated data.&lt;/strong&gt; When your knowledge base changes daily, re-embedding the changed documents is trivial. Re-loading the entire thing into every API call is not.&lt;/p&gt;

&lt;h2&gt;
  
  
  The hybrid approach
&lt;/h2&gt;

&lt;p&gt;Realistically, I use both.&lt;/p&gt;

&lt;p&gt;My journal system uses embeddings and semantic search to find the most relevant entries for a given question. That is RAG. When I start a new session and need to orient myself, I load a curated set of core context directly into the window. That is long context.&lt;/p&gt;

&lt;p&gt;The pattern is simple. Use long context for the stable foundation that the AI always needs. Use RAG for the large, searchable pool that gets pulled in on demand. This is not a compromise. It is usually the best architecture.&lt;/p&gt;

&lt;p&gt;Production AI systems that work well usually do some version of this. The system prompt and key instructions go in the context directly. The knowledge base gets searched and the top results get injected alongside the user's question. You get the reliability of focused context with the breadth of a large knowledge base.&lt;/p&gt;

&lt;h2&gt;
  
  
  The decision framework
&lt;/h2&gt;

&lt;p&gt;If you are trying to decide between the two, start with these questions:&lt;/p&gt;

&lt;p&gt;How much data are you working with? Under 100 pages of focused content, try long context first. It is simpler and you avoid the complexity of building a retrieval pipeline. Over 100 pages or growing, build RAG.&lt;/p&gt;

&lt;p&gt;How often will the data be queried? One-off analysis favors long context. Repeated queries favor RAG, because you are paying that input cost every single time.&lt;/p&gt;

&lt;p&gt;Does the task require seeing everything at once? Cross-referencing, summarization, and contradiction-finding need full visibility. Question answering against a large corpus does not.&lt;/p&gt;

&lt;p&gt;Do you need citations? If yes, RAG. Full stop.&lt;/p&gt;

&lt;p&gt;Is latency a constraint? Long context calls with 500,000 tokens take noticeably longer to process. RAG queries with 2,000 tokens of retrieved context are fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  You are already doing this
&lt;/h2&gt;

&lt;p&gt;If you have ever uploaded a PDF to Claude and asked it a question, you chose the long context approach. If you have ever used a tool that searched your company's docs before answering, you benefited from RAG. You have been making this architectural decision already. The difference is whether you are making it on purpose.&lt;/p&gt;

&lt;p&gt;That 100-page contract? Load the whole thing. The entire Confluence wiki for your 500-person company? Build a retrieval pipeline. The buzzwords are not as scary as they sound. You just did not know the names for the things you were already doing.&lt;/p&gt;

&lt;p&gt;If there is anything I left out or could have explained better, tell me in the comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>machinelearning</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Embeddings: How AI Knows Things Are Similar</title>
      <dc:creator>Jeff Reese</dc:creator>
      <pubDate>Wed, 29 Apr 2026 14:26:05 +0000</pubDate>
      <link>https://dev.to/jeffreese/embeddings-how-ai-knows-things-are-similar-4nb</link>
      <guid>https://dev.to/jeffreese/embeddings-how-ai-knows-things-are-similar-4nb</guid>
      <description>&lt;p&gt;&lt;em&gt;AI in Practice, No Fluff — Day 8/10&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I built a memory system for one of my AI projects. Every conversation and decision gets saved as a journal entry. After a few weeks, I had hundreds of entries. Finding the right one when I needed it was the problem.&lt;/p&gt;

&lt;p&gt;Keyword search was useless. If I searched for "authentication," I would miss the entry where I wrote about "login flow" or "user credentials." The words were different. The meaning was the same. I needed something that could match on meaning, not just spelling.&lt;/p&gt;

&lt;p&gt;That something is embeddings.&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://purecontext.dev/blog/rag" rel="noopener noreferrer"&gt;first series&lt;/a&gt;, I mentioned embeddings as part of how RAG systems prepare your data for retrieval. Today I want to unpack what embeddings actually are, how they work, and why they matter well beyond RAG.&lt;/p&gt;

&lt;h2&gt;
  
  
  A list of numbers that represents meaning
&lt;/h2&gt;

&lt;p&gt;An embedding is a list of numbers. That is it. You send a piece of text to an embedding model, and it returns a list of numbers (called a vector) that represents the meaning of that text.&lt;/p&gt;

&lt;p&gt;The list is long. Depending on the model, it might be 256 numbers, 1,024 numbers, or even more. Each number represents some dimension of meaning that the model learned during training. You do not get to choose what those dimensions mean, and honestly, most of them are not interpretable by humans. The model learned its own internal language for representing concepts.&lt;/p&gt;

&lt;p&gt;Texts with similar meanings get similar numbers.&lt;/p&gt;

&lt;p&gt;"The dog sat on the porch" and "A canine rested on the veranda" would produce vectors that are very close to each other, even though they share zero words. "The stock market crashed" would produce a vector that is far away from both of them.&lt;/p&gt;

&lt;p&gt;The model is not doing keyword matching or looking for shared words. It has learned, from training on massive amounts of text, that certain concepts are related and should be positioned near each other in this numerical space.&lt;/p&gt;

&lt;h2&gt;
  
  
  How similarity actually works
&lt;/h2&gt;

&lt;p&gt;When I say two vectors are "close" or "far apart," I mean something specific. The standard way to measure this is cosine similarity.&lt;/p&gt;

&lt;p&gt;I am not going to walk through the linear algebra. What matters is the intuition: cosine similarity measures the angle between two vectors. If two vectors point in roughly the same direction, they are similar. If they point in different directions, they are not.&lt;/p&gt;

&lt;p&gt;The score ranges from -1 to 1. A score of 1 means identical direction (same meaning). Zero means unrelated. Negative scores mean opposing meanings, though in practice most text embeddings land between 0 and 1.&lt;/p&gt;

&lt;p&gt;When I search my memory system for "how we decided on the database architecture," the system embeds that query, compares its vector against every stored entry's vector, and returns the ones with the highest cosine similarity scores. It finds entries about "choosing SQLite over Supabase" and "why we went with local storage instead of cloud" because those entries point in a similar direction, even though the exact words are completely different.&lt;/p&gt;

&lt;p&gt;That is semantic search. It is the foundation for a surprising number of AI applications.&lt;/p&gt;
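
&lt;p&gt;A minimal sketch of that search step, assuming an &lt;code&gt;embed&lt;/code&gt; helper that wraps whichever embedding model you use:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Minimal sketch of semantic search over stored entries. The embed()
# argument is a stand-in for whichever embedding API you call; it
# should return a list of floats for a piece of text.
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query, entries, embed, top_k=3):
    # entries: list of (text, vector) pairs, embedded ahead of time
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in entries]
    scored.sort(reverse=True)            # highest similarity first
    return scored[:top_k]
&lt;/code&gt;&lt;/pre&gt;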

&lt;h2&gt;
  
  
  What embeddings enable
&lt;/h2&gt;

&lt;p&gt;The most common way people encounter embeddings is through RAG, but embeddings are useful on their own, without a language model involved at all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Semantic search.&lt;/strong&gt; The example I just described. Instead of matching keywords, you match meaning. This is a common way modern search engines, documentation sites, and knowledge bases find relevant results even when your query uses different terminology than the source material.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deduplication.&lt;/strong&gt; If you have a database of support tickets and you want to find near-duplicates, you can embed each ticket and cluster the ones with high similarity. Two tickets that describe the same bug in different words will land close together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Classification and clustering.&lt;/strong&gt; Embed a set of documents and group them by similarity. Customer feedback sorts itself into themes without you defining the categories upfront. Product reviews cluster into topics. The structure emerges from the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anomaly detection.&lt;/strong&gt; If most of your data points cluster together but one sits far away, that outlier might be worth investigating. Fraud detection, content moderation, and quality control all use this pattern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommendation.&lt;/strong&gt; "If you liked this article, here are similar ones." Embed the articles, find the nearest neighbors to the one the user just read. This can complement the collaborative filtering you may already have in place.&lt;/p&gt;

&lt;h2&gt;
  
  
  The part that changes how you think about code
&lt;/h2&gt;

&lt;p&gt;Here is where this gets practical for anyone who writes software or works with data.&lt;/p&gt;

&lt;p&gt;Anywhere you have fuzzy matching logic in code, embeddings might be a better solution. I mean the kind of code where you are trying to determine if two strings are "close enough" to be considered the same thing.&lt;/p&gt;

&lt;p&gt;Think about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A customer types "NYC" and you need to match it to "New York City" in your database&lt;/li&gt;
&lt;li&gt;Searching product descriptions when the user's query does not match your exact product names&lt;/li&gt;
&lt;li&gt;Matching job postings to resumes when the terminology differs between industries&lt;/li&gt;
&lt;li&gt;Finding related articles when titles and tags do not overlap&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Traditional approaches use Levenshtein distance, regex patterns, synonym lists, or elaborate normalization pipelines. They work until they do not. Every edge case requires another rule. The rule list grows. Maintenance becomes painful.&lt;/p&gt;

&lt;p&gt;Embeddings can often match or beat those results with far less code: embed both strings, compute cosine similarity, threshold at a score you choose. The matching is based on meaning, not character patterns. "NYC" and "New York City" are close. "I need to fix a bug in my Python code" and "there is an error in my script" are close. No lookup table required.&lt;/p&gt;

&lt;p&gt;This is not hypothetical for me. I replaced keyword-based search in my own memory system with embedding-based search and the improvement was immediate. Queries that returned nothing before started finding exactly the right entries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Embedding models are not language models
&lt;/h2&gt;

&lt;p&gt;This is a distinction worth understanding. When you use ChatGPT or Claude, you are using a language model. It generates text, reasons through problems, and holds conversations.&lt;/p&gt;

&lt;p&gt;An embedding model does one thing: it converts text into a vector. It does not generate text or have conversations. It is a different kind of model, trained specifically to produce useful numerical representations of meaning.&lt;/p&gt;

&lt;p&gt;You can use embedding models from OpenAI, Google, Voyage AI, Cohere, and others. Some are general purpose. Some are optimized for specific domains like code, legal documents, or financial text. The choice of model matters because different models capture different nuances. A model trained heavily on code will produce better embeddings for code search than a general-purpose model.&lt;/p&gt;

&lt;p&gt;The cost is also dramatically different from language models. Embedding a million tokens of text might cost a few cents. Generating a million tokens of text with a language model costs dollars. Embeddings are cheap to produce and cheap to store.&lt;/p&gt;

&lt;h2&gt;
  
  
  The practical tradeoffs
&lt;/h2&gt;

&lt;p&gt;Embeddings are not magic. A few things worth knowing before you reach for them:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You need to embed everything upfront.&lt;/strong&gt; Before you can search your data semantically, every piece of text needs to be converted to a vector and stored. For a small dataset, this is trivial. For millions of documents, it takes planning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Embedding quality depends on the model.&lt;/strong&gt; A model that was not trained on your domain might produce mediocre representations of your specific terminology. If you work in a specialized field, test a few models before committing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vectors are opaque.&lt;/strong&gt; You cannot look at a vector and understand what it means. If the similarity score is wrong, debugging is harder than with keyword search. You cannot just add a synonym to fix it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context length matters.&lt;/strong&gt; Most embedding models have a maximum input length. If you need to embed a 50-page document, you will need to chunk it into smaller pieces first. How you chunk affects quality. This is where the nuance lives in production systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this leads
&lt;/h2&gt;

&lt;p&gt;Tomorrow: the question that ties this all together. You have a million-token context window. You have embeddings that let you search semantically. When should you load the whole book into the context, and when should you retrieve just the relevant pieces? That is the RAG decision, and it is one of the most important architectural choices in AI applications right now.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>embeddings</category>
      <category>machinelearning</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Tool Use: Giving AI Hands</title>
      <dc:creator>Jeff Reese</dc:creator>
      <pubDate>Tue, 28 Apr 2026 13:59:02 +0000</pubDate>
      <link>https://dev.to/jeffreese/tool-use-giving-ai-hands-4okk</link>
      <guid>https://dev.to/jeffreese/tool-use-giving-ai-hands-4okk</guid>
      <description>&lt;p&gt;&lt;em&gt;AI in Practice, No Fluff — Day 7/10&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I use Supabase for a few of my projects, and I regularly ask my AI for help with configuration. At first, the answers kept coming back subtly wrong. Not hallucinated, just outdated. The API had changed, a config option had moved, or a default had been updated. The AI was confident and technically coherent, but it was working from training data that was six months behind the documentation.&lt;/p&gt;

&lt;p&gt;Then I started asking it to look up the current docs before answering. One extra sentence in my prompt, and the answers got accurate. What changed was not the model, it was that the AI made a tool call: it searched the web, read the current documentation, and used that instead of its stale training data.&lt;/p&gt;

&lt;p&gt;That is tool use. The AI reaches outside of itself to get information or take action it could not do from memory alone.&lt;/p&gt;

&lt;h2&gt;
  
  
  The loop
&lt;/h2&gt;

&lt;p&gt;In the first series, we covered &lt;a href="https://purecontext.dev/blog/what-is-an-agent-and-do-i-actually-need-one" rel="noopener noreferrer"&gt;agents&lt;/a&gt; and &lt;a href="https://purecontext.dev/blog/what-is-mcp-and-why-should-i-care" rel="noopener noreferrer"&gt;MCP&lt;/a&gt;. Those posts explained what tools are and how they connect. This post goes one level deeper: how tool use actually works when you are building something.&lt;/p&gt;

&lt;p&gt;The mechanism is a loop with four steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You send a message to the AI, along with a list of tools it is allowed to use.&lt;/li&gt;
&lt;li&gt;The AI reads your message, decides it needs to use a tool, and responds with a tool request instead of a final answer. That request includes the tool name and the specific inputs it wants to pass.&lt;/li&gt;
&lt;li&gt;Your code executes the tool (checks the docs, queries the database, calls the API) and sends the result back to the AI.&lt;/li&gt;
&lt;li&gt;The AI reads the result and gives you its answer.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The important part is step 3. The AI never executes the tool itself. It requests, you execute, you return the result. The AI is making the decision about which tool to use and what inputs to pass, but your application controls what actually happens. That separation is the safety model. You decide what the AI can touch.&lt;/p&gt;
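
&lt;p&gt;Here is the loop in code, using the Anthropic Python SDK as one concrete example. The tool, its stub handler, and the model id are placeholders, not a real integration.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# The four-step loop, sketched with the Anthropic Python SDK.
# The tool, its stub handler, and the model id are placeholders.
import anthropic

def search_docs(query):
    # Step 3 lives in your code: a real version would hit your search index.
    return "Docs excerpt for: " + query

client = anthropic.Anthropic()
tools = [{
    "name": "search_docs",
    "description": "Searches current product documentation for configuration "
                   "and API details. Use this when the answer may have changed "
                   "since the model's training data.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]

messages = [{"role": "user", "content": "How do I enable row level security?"}]

# Steps 1 and 2: send the message plus the tool list, get a tool request back.
response = client.messages.create(model="claude-sonnet-4-20250514",
                                  max_tokens=1024, tools=tools, messages=messages)

if response.stop_reason == "tool_use":
    request = next(b for b in response.content if b.type == "tool_use")
    result = search_docs(request.input["query"])   # step 3: you execute it
    messages.append({"role": "assistant", "content": response.content})
    messages.append({"role": "user", "content": [{
        "type": "tool_result", "tool_use_id": request.id, "content": result}]})
    # Step 4: the AI reads the result and answers.
    final = client.messages.create(model="claude-sonnet-4-20250514",
                                   max_tokens=1024, tools=tools, messages=messages)
    print(final.content[0].text)
&lt;/code&gt;&lt;/pre&gt;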

&lt;h2&gt;
  
  
  What a tool definition looks like
&lt;/h2&gt;

&lt;p&gt;When you send tools to the API, each one is a JSON object with three parts: a name, a description, and an input schema that defines what parameters the tool accepts.&lt;/p&gt;

&lt;p&gt;The name is what the AI uses to request the tool. The input schema describes the parameters using JSON Schema, the same format used for &lt;a href="https://purecontext.dev/blog/structured-output" rel="noopener noreferrer"&gt;structured output&lt;/a&gt;. But the description is the piece that matters most, and it is the one most people underwrite.&lt;/p&gt;

&lt;p&gt;The AI reads the description to decide whether this tool is relevant to the current request. A tool named &lt;code&gt;check_calendar&lt;/code&gt; with a description of "Checks the calendar" gives the AI almost nothing to work with. A description of "Returns all calendar events for a given date range. Use this before suggesting meeting times to avoid conflicts" tells the AI exactly when to reach for it.&lt;/p&gt;

&lt;p&gt;Early in my exploration of MCP servers, I had a tool that searched a knowledge base. I wasn't sure why, but the AI wasn't calling it. The name was clear, the schema was correct, the tool worked perfectly when called manually. The description said "Searches the knowledge base." I changed it to "Searches internal documentation for answers to technical questions. Use this when the user asks about system behavior, configuration, or troubleshooting steps that would be in the docs." The AI started calling it immediately.&lt;/p&gt;

&lt;p&gt;The description is not metadata. It is an instruction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common tool patterns
&lt;/h2&gt;

&lt;p&gt;Most tools fall into a handful of categories:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Read tools&lt;/strong&gt; retrieve information the AI does not have. Calendar lookups, database queries, file reads, API calls that return data. These are the most common and the safest, since they do not change anything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Write tools&lt;/strong&gt; create or modify something. Sending an email, creating a task, updating a record, writing a file. These need more careful thought about when the AI should be allowed to act autonomously versus asking for confirmation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Search tools&lt;/strong&gt; find relevant information from a larger set. Semantic search over documents, keyword search in a database, web search. The AI decides the query; you execute it and return results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compute tools&lt;/strong&gt; perform calculations or transformations the AI would struggle to do reliably in text. Running code, performing math, converting formats, validating data.&lt;/p&gt;

&lt;p&gt;All of these can work together. Give the AI a read tool for your database, a search tool for your documentation, and a write tool for creating support tickets, and it can handle a customer question end to end: search the docs, check the customer's account, and create a ticket if it cannot resolve the issue.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where tools get wired up
&lt;/h2&gt;

&lt;p&gt;The tool-use loop works the same way regardless of where you set it up, but the setup itself varies:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The API directly.&lt;/strong&gt; You define tools in your API request and handle the execution loop in your code. Most flexible, most work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP servers.&lt;/strong&gt; If you read the &lt;a href="https://purecontext.dev/blog/what-is-mcp-and-why-should-i-care" rel="noopener noreferrer"&gt;MCP post&lt;/a&gt; in the first series, this is where it connects. An MCP server wraps a tool (your calendar, your file system, a database) in the standard protocol. AI tools that support MCP can discover and use these tools without custom code for each one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Desktop, ChatGPT, and other products.&lt;/strong&gt; These wire up tools behind the scenes. When Claude reads a file or ChatGPT browses the web, they are using the same tool-use loop. You just do not see the wiring.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent frameworks and SDKs.&lt;/strong&gt; Tools like Claude's Agent SDK, LangChain, or CrewAI manage the loop for you. You define tools, the framework handles the back-and-forth. Less control, faster setup.&lt;/p&gt;

&lt;h2&gt;
  
  
  When the AI ignores your tool
&lt;/h2&gt;

&lt;p&gt;When the AI does not call a tool you defined, the fix is almost always the description.&lt;/p&gt;

&lt;p&gt;The AI is making a judgment call about whether a tool is relevant to the current request, and it is making that call based on the description you wrote. If the description is vague, the AI will not know when to reach for it. If the description is specific about when and why to use the tool, the AI will call it reliably.&lt;/p&gt;

&lt;p&gt;This is true across providers. I have seen the same pattern with Claude, with OpenAI's function calling, and with open-source models. The description is the decision-maker, and investing the time in rewriting it often solves problems that may seem like they need architectural changes.&lt;/p&gt;

&lt;p&gt;Tomorrow: embeddings. How AI knows that two things mean a similar thing, even when they use completely different words.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>tooluse</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Why Your Prompt Works in ChatGPT But Not in Your App</title>
      <dc:creator>Jeff Reese</dc:creator>
      <pubDate>Mon, 27 Apr 2026 15:17:49 +0000</pubDate>
      <link>https://dev.to/jeffreese/why-your-prompt-works-in-chatgpt-but-not-in-your-app-3g8</link>
      <guid>https://dev.to/jeffreese/why-your-prompt-works-in-chatgpt-but-not-in-your-app-3g8</guid>
      <description>&lt;p&gt;&lt;em&gt;AI in Practice, No Fluff — Day 6/10&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I spent weeks refining a system prompt. It had few-shot examples, chain-of-thought scaffolding, structured output formatting. In the ChatGPT window, it was reliable. Exactly the tone and format I wanted, every time.&lt;/p&gt;

&lt;p&gt;Then I copied it into my application code, hit send through the API, and the response was wrong. The formatting was off, the tone reverted to generic, and the structured JSON I had been getting reliably came back wrapped in a conversational preamble.&lt;/p&gt;

&lt;p&gt;I didn't change the prompt. So what happened?&lt;/p&gt;

&lt;p&gt;This is the moment you realize that the chat interface was silently helping in the background...&lt;/p&gt;

&lt;h2&gt;
  
  
  The invisible work
&lt;/h2&gt;

&lt;p&gt;When you use ChatGPT, Claude.ai, or Gemini through their web interface, you are not just sending a prompt to a model. You are using an application that sits between you and the model, and that application is doing more work than you would expect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System prompts you did not write.&lt;/strong&gt; Every chat interface injects its own system prompt before yours. These instructions shape the model's behavior in ways that feel like "how the AI works" but are actually "how this specific product is configured." The helpful formatting, the safety guardrails, the tendency to use markdown headers and bullet points: much of that comes from the platform's system prompt, not from the model itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your conversation history, managed for you.&lt;/strong&gt; In the first series, we talked about &lt;a href="https://purecontext.dev/blog/context-windows" rel="noopener noreferrer"&gt;context windows&lt;/a&gt; and how conversations get silently truncated when they get too long. The chat interface handles that truncation. It decides what to keep and what to drop. When you move to the API, that is your job. If you send only the current message without the conversation history, the model has no memory of what came before. If you send the full history and it exceeds the context window, you need to decide what gets trimmed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sampling parameters set to defaults you never chose.&lt;/strong&gt; Temperature, top-p, max tokens: these control how creative or deterministic the model's output is. The chat interface picks reasonable defaults. The API hands you the dials and assumes you know what they do. Most of the time the defaults are fine, but when your output feels weirdly random or weirdly flat, this is usually why.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool use happening behind the scenes.&lt;/strong&gt; When ChatGPT searches the web, reads a file, or runs code, it is using tools that are wired up by the application. The model does not inherently know how to browse the internet. The application gives it that ability and handles the execution. In the API, tool use is available, but you define the tools, handle the execution, and return the results yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the API actually gives you
&lt;/h2&gt;

&lt;p&gt;The API is the raw material the chat interface is built from. When you send a request to the API, you get exactly what you ask for. Nothing more.&lt;/p&gt;

&lt;p&gt;That means you send everything: the system prompt, the conversation history, the sampling parameters. You define which tools are available and handle the response format.&lt;/p&gt;

&lt;p&gt;It is harder in the way that cooking from scratch is harder than ordering from a menu. The ingredients are the same. The skill is knowing what the recipe was doing for you. (I ruin most meals trying to figure this out.)&lt;/p&gt;

&lt;p&gt;The first time I made an API call, I sent my carefully crafted prompt as a single user message. No system prompt, conversation history, or sampling parameters. The response read like a completely different AI. Technically it was: same model, zero context. I had been leaning on infrastructure I did not know existed.&lt;/p&gt;

&lt;p&gt;Here is what that infrastructure looks like. In the chat window, you type a message and get a response. Behind the scenes, the application constructs something like this:&lt;/p&gt;

&lt;p&gt;A system prompt (the platform's instructions plus your custom instructions), followed by the full message history (every message you sent and every response the model generated), followed by your latest message. All of that gets sent to the model as a single request. The response comes back, the application formats it, and you see it in the chat window.&lt;/p&gt;

&lt;p&gt;When you build with the API, you construct that same request yourself. If you skip the system prompt, the model has no behavioral instructions. If you skip the message history, the model has no memory. If you send the history but do not manage its length, you will eventually exceed the context window and get an error.&lt;/p&gt;
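&lt;p&gt;A minimal sketch of constructing that request yourself, using the Anthropic Python SDK. The system text, history, parameter values, and model name are all illustrative, not recommendations.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import anthropic

client = anthropic.Anthropic()

# Behavioral instructions the chat window used to write for you.
system_prompt = (
    "You are a support assistant for Acme. Answer in a friendly, concise tone "
    "and format lists as markdown bullet points."
)

# The history the chat window used to manage for you.
history = [
    {"role": "user", "content": "What plans do you offer?"},
    {"role": "assistant", "content": "We offer Basic, Pro, and Enterprise plans."},
]
history.append({"role": "user", "content": "Which one includes SSO?"})

response = client.messages.create(
    model="claude-sonnet-4-5",   # illustrative model name
    max_tokens=500,              # you choose this now
    temperature=0.7,             # and the sampling parameters
    system=system_prompt,        # skip this and the model has no behavioral instructions
    messages=history,            # skip the history and the model has no memory
)
print(response.content[0].text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;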

&lt;h2&gt;
  
  
  Why the same prompt behaves differently
&lt;/h2&gt;

&lt;p&gt;Strip all of that invisible work away, and the same prompt text produces different output. Not because the model is different, but because the context around the prompt is different.&lt;/p&gt;

&lt;p&gt;Three specific things that catch people:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Tone and formatting shifts.&lt;/strong&gt; The chat interface's system prompt typically includes instructions about being helpful, using markdown formatting, and maintaining a conversational tone. Without those instructions, the model's raw output is often less polished. If your application needs a specific tone, you need to specify it in your own system prompt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Structured output breaks.&lt;/strong&gt; In the chat window, the model has been shaped (through the platform's prompting and fine-tuning) to handle format requests gracefully. The API model responds to format instructions too, but without the additional shaping, it is more likely to include commentary around your JSON or deviate from your schema. This is where the &lt;a href="https://purecontext.dev/blog/structured-output" rel="noopener noreferrer"&gt;structured output techniques from the last post&lt;/a&gt; become essential, and where API-level schema enforcement becomes the reliable solution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Context loss.&lt;/strong&gt; One of the most common API mistakes is sending a single message without history. In the chat window, every previous exchange is included automatically. In the API, if you do not send the history, the model treats every request as the start of a new conversation. Your carefully built context from three messages ago is gone.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bridge
&lt;/h2&gt;

&lt;p&gt;What changes is responsibility. Everything the chat window does for you is something you can do yourself, tune to your specific needs, and automate at scale. The prompting skills from this series still apply. You just stop getting the training wheels.&lt;/p&gt;

&lt;p&gt;This is the gate in this series. Everything before this post works in a chat window. Everything after it gets progressively more developer-facing: tool use, embeddings, retrieval, architectural decisions about when to use AI at all. You don't need to be a developer to understand these topics. But understanding them changes what you think is possible.&lt;/p&gt;

&lt;p&gt;Tomorrow: tool use. How AI gets the ability to do things in the real world, not just generate text about them.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>prompting</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Structured Output: When You Need JSON, Not Prose</title>
      <dc:creator>Jeff Reese</dc:creator>
      <pubDate>Fri, 24 Apr 2026 14:07:36 +0000</pubDate>
      <link>https://dev.to/jeffreese/structured-output-when-you-need-json-not-prose-3c0a</link>
      <guid>https://dev.to/jeffreese/structured-output-when-you-need-json-not-prose-3c0a</guid>
      <description>&lt;p&gt;&lt;em&gt;AI in Practice, No Fluff — Day 5/10&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Every morning I get a briefing. An AI agent gathers data from my calendar, my notes, my project list, and a handful of other sources, then returns it all as structured JSON. A template takes that JSON and renders it into a readable dashboard. The specifics of how that pipeline works are not important for this post. What is important: if the JSON comes back wrong, nothing renders.&lt;/p&gt;

&lt;p&gt;Not "renders badly." Nothing. A missing field, an inconsistent key name, a stray sentence of commentary mixed into the data block: any of these breaks the template, and I start my morning staring at an error instead of a dashboard.&lt;/p&gt;

&lt;p&gt;This is the case for structured output in two sentences: when a system, not a human, reads your AI's response, "close enough" stops working. A paragraph that answers the question is fine if you are reading it. It is useless if a program needs to parse it.&lt;/p&gt;

&lt;p&gt;The first time I set up this pipeline, I told the AI to "return the results as JSON." What I got back was mostly JSON, with a conversational preamble and a closing note that the AI thought I might find helpful. Technically generous. Practically broken. The fix took three iterations: show the exact schema, give two examples of correctly formatted entries, and explicitly say "return only the JSON block, no commentary." Once I did that, the output was clean every time.&lt;/p&gt;

&lt;p&gt;That progression is the whole article. Telling an AI &lt;em&gt;what format&lt;/em&gt; you want is not the same as telling it &lt;em&gt;what structure&lt;/em&gt; you want. Format is "give me JSON." Structure is "give me an array of objects with these specific fields, in this order, with these types." The gap between the two is where most structured output problems live.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this fits
&lt;/h2&gt;

&lt;p&gt;In the first series, we covered &lt;a href="https://purecontext.dev/blog/what-makes-a-good-prompt" rel="noopener noreferrer"&gt;what makes a good prompt&lt;/a&gt;: context, task, format, examples. Earlier in this series, we covered &lt;a href="https://purecontext.dev/blog/few-shot-prompting-show-dont-describe" rel="noopener noreferrer"&gt;few-shot prompting&lt;/a&gt; and why showing examples beats describing what you want. Both of those principles apply here. This post focuses them on a specific problem: getting AI to return data you can actually use programmatically, not just read.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://purecontext.dev/blog/debugging-a-prompt" rel="noopener noreferrer"&gt;previous post&lt;/a&gt; covered debugging a prompt when the output keeps missing. One of the four failure modes was "bad format specification." Structured output is the deeper dive into that category: what specifically goes wrong with format, why, and what to do about it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "return JSON" is not enough
&lt;/h2&gt;

&lt;p&gt;When you type "return this as JSON," the AI is generating tokens one at a time, left to right, trying to maintain valid JSON syntax while simultaneously producing useful content. It has no schema enforcement. It is doing two jobs at once: being helpful and being syntactically correct.&lt;/p&gt;

&lt;p&gt;This works fine for simple requests. "Give me a JSON object with name and email" will usually come back clean. The problems start when the structure gets more complex: nested objects, arrays of items that all need consistent fields, specific data types, fields that should be present even when the value is empty.&lt;/p&gt;

&lt;p&gt;The model is not being lazy or difficult. It is generating text in a format it was not specifically optimized for, and every additional structural constraint is one more thing it has to hold in working memory while producing the next token. Field names drift across array items because the model has no checklist; it reconstructs the pattern from context for each new object.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to get reliable structure from a chat
&lt;/h2&gt;

&lt;p&gt;If you are working in the chat window (ChatGPT, Claude.ai, Gemini), you do not have access to API-level schema enforcement. But you can get close with three techniques that stack:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Show the schema, do not describe it.&lt;/strong&gt; This is the few-shot principle applied to structure. Instead of "return a JSON object with fields for name, rating, and summary," write out the actual object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"product"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Example Widget"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"rating"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sentiment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"positive"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"One sentence here."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model will mirror that structure far more reliably than it will interpret a prose description of it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Give two rows, not one.&lt;/strong&gt; A single example shows the shape. Two examples show the pattern. When the model sees two objects with identical field names and types, it treats those fields as mandatory rather than suggestive. This is especially important for arrays where consistency across items is the whole point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Name the constraints explicitly.&lt;/strong&gt; "Every object must have all four fields, even if the value is null." "Use exactly these field names, no variations." "Do not include any text outside the JSON block." These feel redundant after the examples, but they're the ones that catch the edge cases where the model might improvise. Think of them as the guardrails, not the road.&lt;/p&gt;

&lt;p&gt;These three techniques together get you 90% of the way to reliable structured output in a chat window. The remaining 10% is where API features come in.&lt;/p&gt;
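&lt;p&gt;Stacked together, the three techniques might look something like this. The schema, the example rows, and the constraints are hypothetical; it is written as a Python string only so it is easy to reuse, and the reviews placeholder marks where your actual input goes.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# A hypothetical prompt that shows the schema, gives two rows, and names the constraints.
prompt = """Extract the following from each review below and return a JSON array.

Each element must look exactly like these two examples:

[
  {"product": "Example Widget", "rating": 4, "sentiment": "positive",
   "summary": "Sturdy and easy to set up."},
  {"product": "Other Widget", "rating": 2, "sentiment": "negative",
   "summary": "Arrived damaged and support was slow."}
]

Constraints:
- Every object must have all four fields, even if a value is null.
- Use exactly these field names, no variations.
- Return only the JSON array, no text before or after it.

Reviews:
[PASTE REVIEWS HERE]
"""
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;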

&lt;h2&gt;
  
  
  What the APIs actually do
&lt;/h2&gt;

&lt;p&gt;When you move from the chat window to building something with an API, structured output stops being a prompting problem and becomes a feature you can turn on.&lt;/p&gt;

&lt;p&gt;Both Claude and OpenAI (and increasingly other providers) now offer structured output modes that work at a fundamentally different level than prompting. Instead of asking the model to please maintain valid JSON, the API compiles your JSON schema into a grammar that constrains which tokens the model is allowed to generate. The model cannot produce invalid JSON or deviate from your schema, because the generation process only considers tokens that would keep the output valid.&lt;/p&gt;

&lt;p&gt;In Claude's API, you pass your schema in the request configuration. In OpenAI's API, you set the response format to "json_schema" with strict mode enabled.&lt;/p&gt;
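&lt;p&gt;As a sketch of what that looks like on the OpenAI side, with an illustrative model name and schema; the important parts are the &lt;code&gt;json_schema&lt;/code&gt; response format and strict mode.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from openai import OpenAI

client = OpenAI()

# Illustrative schema; strict mode requires every property to be listed as
# required and additionalProperties to be false.
schema = {
    "type": "object",
    "properties": {
        "product": {"type": "string"},
        "rating": {"type": "integer"},
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "summary": {"type": "string"},
    },
    "required": ["product", "rating", "sentiment", "summary"],
    "additionalProperties": False,
}

response = client.chat.completions.create(
    model="gpt-4o-mini",   # illustrative model name
    messages=[{"role": "user", "content": "Summarize this review: ..."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "review_summary", "strict": True, "schema": schema},
    },
)
print(response.choices[0].message.content)   # a JSON string that matches the schema
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;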

&lt;p&gt;The practical difference is significant. With prompt-based approaches, you are asking the model to be disciplined. With structured output features, the infrastructure enforces discipline for you. The model focuses entirely on producing good content; the system handles the structure.&lt;/p&gt;

&lt;p&gt;This is one of the clearest examples of a theme we will return to later in this series: the gap between what you can do in a chat window and what you can do with the API. If reliable structured output matters for your use case, this is one of the strongest reasons to move from the UI to building.&lt;/p&gt;

&lt;h2&gt;
  
  
  Validation is still your job
&lt;/h2&gt;

&lt;p&gt;Even with schema enforcement, validation is not optional. The schema guarantees structure: every field present, correct types, valid JSON. It does not guarantee accuracy. A model can return a perfectly structured object where the "sentiment" field says "positive" for a review that is clearly negative.&lt;/p&gt;

&lt;p&gt;Structure and correctness are different problems. Schema enforcement solves the first one mechanically. The second one is still a judgment the model makes, and it can still be wrong.&lt;/p&gt;

&lt;p&gt;For critical applications, validate the content after you validate the structure. Does the summary actually match the source material? Are the extracted values valid? Is the sentiment label consistent with the text? These checks are the same whether you got the JSON from a chat window or an API with schema enforcement.&lt;/p&gt;
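&lt;p&gt;A content check can be as small as a few heuristics run after the structural validation. Everything in this sketch, the function name, the word list, and the rules, is invented for illustration, not a production validator.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical content checks for one structurally valid review object.
NEGATIVE_HINTS = {"broken", "refund", "terrible", "damaged", "waste"}

def check_review_summary(item, source_text):
    """Return a list of content problems; an empty list means it passed."""
    problems = []
    if item["rating"] not in (1, 2, 3, 4, 5):
        problems.append("rating is outside the 1-5 range")
    words = set(source_text.lower().split())
    if item["sentiment"] == "positive" and words.intersection(NEGATIVE_HINTS):
        problems.append("sentiment says positive but the text reads negative")
    if not item["summary"].strip():
        problems.append("summary is empty")
    return problems
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;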

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;When you need structured output from AI, start by writing the exact structure you want. Literal JSON. Two examples. Explicit field constraints. This is the same show-over-tell principle from earlier in the series, aimed at a specific problem.&lt;/p&gt;

&lt;p&gt;If you need that structure to be bulletproof, the prompting approach has a ceiling. API-level schema enforcement removes that ceiling entirely. When reliable structure matters, this is one of the best reasons to explore what happens on the other side of the chat window.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The model already knows how to generate JSON. Your job is to show it exactly which JSON.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Next up: why the same prompt can behave differently in ChatGPT, Claude.ai, and the API, and what the chat window is doing that you cannot see.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>json</category>
      <category>prompting</category>
      <category>beginners</category>
    </item>
    <item>
      <title>How Rules and Skills Actually Work in Claude Code</title>
      <dc:creator>Jeff Reese</dc:creator>
      <pubDate>Fri, 24 Apr 2026 02:35:52 +0000</pubDate>
      <link>https://dev.to/jeffreese/how-rules-and-skills-actually-work-in-claude-code-25gp</link>
      <guid>https://dev.to/jeffreese/how-rules-and-skills-actually-work-in-claude-code-25gp</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj2uzd7s4girmq4s5khfp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj2uzd7s4girmq4s5khfp.png" alt="How Rules and Skills Actually Work in Claude Code" width="800" height="1000"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you have spent any time configuring an AI coding agent, you have probably figured out that rules and skills are different things. Rules are always loaded. Skills are invoked on demand. Rules handle recognition; skills handle procedure.&lt;/p&gt;

&lt;p&gt;The interesting problems start after you have internalized that distinction and started building on it. I have been working with AI coding tools for over two years now, starting with Windsurf and building progressively more sophisticated systems with Claude Code. Recently I went into the Claude Code source code to understand how these mechanisms actually work at the implementation level. What I found changed how I think about the tradeoff.&lt;/p&gt;

&lt;p&gt;Rules and skills do not just load at different times. They occupy different positions in the system, and the model treats them differently because of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Failure Modes Tell You Everything
&lt;/h2&gt;

&lt;p&gt;The basic distinction is useful, but it becomes powerful when you frame it through failure modes.&lt;/p&gt;

&lt;p&gt;If you miss the moment to act, that is a rule problem. The rule was not in context when the trigger fired, so the agent did not recognize that something should happen. The moment passed silently. A rule that is not present when its trigger fires is a rule that does not exist.&lt;/p&gt;

&lt;p&gt;If you miss a step in how to act, that is a skill problem. The agent recognized the situation but did not have the detailed procedure available. Skills contain the instructions for how to do something, and they only need to be present when that something is actively happening.&lt;/p&gt;

&lt;p&gt;Two failure modes with two different tools to handle them. Every configuration decision becomes a question about which failure mode you are guarding against.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Your Agent Actually Processes Input
&lt;/h2&gt;

&lt;p&gt;Before I walk through the lifecycles, it helps to understand what is happening under the hood when your agent reads your configuration.&lt;/p&gt;

&lt;p&gt;Every time Claude Code sends a request to the model, it sends several things, but two matter most for understanding where your configuration lives: a system prompt and a list of messages. These are architecturally separate inputs, and the model is trained to treat them differently.&lt;/p&gt;

&lt;p&gt;The system prompt contains the behavioral instructions that Anthropic wrote for Claude Code: how to use tools, how to handle permissions, how to format output. This is the directive layer. You do not write this. It is the same for every Claude Code user.&lt;/p&gt;

&lt;p&gt;You might expect your rules to live in that system prompt. They don't. The messages contain everything else: your conversation history, your questions, the tool results, and critically, your CLAUDE.md and rules content. Your rules are injected as the very first message in this conversation, wrapped in a special tag that signals "this is context, not a user question." Skills, when invoked, arrive later in the conversation as additional messages.&lt;/p&gt;

&lt;p&gt;This means your rules are not system-level directives the way Anthropic's instructions are. They are the first thing the model reads in the conversation, which gives them a significant positional advantage, but they live in the same layer as everything else you say. Understanding this distinction matters for how you design your configuration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rules: The Stable Prefix
&lt;/h2&gt;

&lt;p&gt;Rules load at session start and stay cached until compaction. Every rule file from &lt;code&gt;.claude/rules/&lt;/code&gt; and your CLAUDE.md content is read, combined, and injected as the first message in your conversation. This happens before you type anything.&lt;/p&gt;

&lt;p&gt;That first-message position is important for three reasons.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Positional advantage.&lt;/strong&gt; The model reads the entire context on every turn, but its attention is not uniform. Research has documented a U-shaped pattern: the beginning and end of the context get the strongest attention, while the middle gets the weakest. This is called the "lost in the middle" effect. Current models have been trained to mitigate it, so the effect is less dramatic than it was in 2023, but positional advantage is still real.&lt;/p&gt;

&lt;p&gt;Anthropic's own long-context documentation recommends putting queries and instructions at the end of long contexts, after reference material. They are designing around the same attention dynamics.&lt;/p&gt;

&lt;p&gt;Your rules sit at the very beginning of the conversation. Always. Your first message and the first assistant response also benefit from this beginning-of-context advantage. As the conversation grows, an invoked skill lands wherever you happen to be in the session, which in a long conversation means the middle: the weakest attention zone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt caching.&lt;/strong&gt; Your rules do not change between turns. The model processed them on turn one, and since they are identical on turn two, the system can skip reprocessing them. This is prompt caching. It means rules are not just persistent; they are computationally cheap after the first turn. The same content arriving later in the conversation as a skill would not get this benefit, because the content before it may have changed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Never compacted.&lt;/strong&gt; When your conversation gets long enough, Claude Code compacts it: summarizing older messages to free up space. Your rules are never part of this compaction. They are rebuilt from disk rather than compressed with the conversation. Full fidelity, every time, regardless of how long the session runs.&lt;/p&gt;

&lt;p&gt;The cost: every token in a rule pays rent on every turn. A 500-token rule costs 500 tokens of context on every API call. Over a 100-turn session, that single rule consumes 50,000 tokens of context. The cost is invisible, but real.&lt;/p&gt;

&lt;h3&gt;
  
  
  Path-Scoped Rules
&lt;/h3&gt;

&lt;p&gt;There is a mechanism between always-loaded rules and on-demand skills that solves a specific problem neither can handle well.&lt;/p&gt;

&lt;p&gt;Path-scoped rules use YAML frontmatter to specify which files they apply to. The first time you read a matching file, the rule content attaches to the file read result and enters the conversation at that point, the same position a skill would land. It costs zero tokens until that first match. If you never touch a matching file in a session, it never loads.&lt;/p&gt;

&lt;p&gt;The trigger is mechanical. No model judgment needed, no 250-character description to interpret. You read a file, the system checks the glob, and matching rules attach. A new path-scoped rule file created mid-session is picked up on the next matching file read, no restart needed.&lt;/p&gt;

&lt;p&gt;I use this for convention files that only matter when working with specific directories. A rule about project structure loads when I touch files in &lt;code&gt;projects/&lt;/code&gt;. A rule about blog formatting loads when I start iterating on posts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Skills: The Conversation Layer
&lt;/h2&gt;

&lt;p&gt;Skills start small. At session startup, only a listing of available skills loads: each skill's name and a brief description, capped at 250 characters. The full SKILL.md content stays on disk. This listing is how the model knows what skills exist and when to invoke them.&lt;/p&gt;

&lt;p&gt;When a skill is invoked, either by you typing &lt;code&gt;/skill-name&lt;/code&gt; or by the model deciding it is relevant, the full content is read from disk and injected into the conversation as messages. Not as a system prompt. Not as first-position content. As conversation messages that arrive at the current point in the chat, interleaved with everything else.&lt;/p&gt;

&lt;p&gt;This is the key architectural difference. Once invoked, a skill's content is in the conversation, and it stays there. It isn't removed after the turn. It is not temporary. But it is not in the same position as your rules either.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It arrives mid-conversation.&lt;/strong&gt; A skill invoked on turn 5 sits between turn 4 and turn 6. As the conversation grows to turn 50, that skill content is deep in the middle of a long message history, competing with 45 turns of context for the model's attention. Your rules, by contrast, are still at the very beginning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It is subject to compaction.&lt;/strong&gt; When Claude Code compacts the conversation, invoked skills are preserved, but with limits. Each skill is capped at 5,000 tokens post-compaction, and the total budget across all invoked skills is 25,000 tokens. Skills are sorted by how recently they were invoked. If you have invoked more skills than the budget can hold, the oldest ones get dropped.&lt;/p&gt;

&lt;p&gt;This means a rule and a skill with identical content are not equivalent, even after the skill is invoked. The rule persists at full fidelity in first position forever. The skill can be truncated or dropped under pressure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Split Pattern
&lt;/h2&gt;

&lt;p&gt;Understanding the layer difference makes the split pattern more important.&lt;/p&gt;

&lt;p&gt;The most common configuration problem I find is a single rule file trying to do both jobs: recognition and procedure. A few lines of recognition logic at the top ("when this situation arises, do this"), followed by fifty lines of procedural detail. The recognition needs to be always-loaded to catch the trigger. The procedure does not.&lt;/p&gt;

&lt;p&gt;The fix: take the recognition concern and keep it as a rule. Three to five lines, always loaded, positionally advantaged. Take the procedural concern and move it to a skill that loads only when invoked. The recognition fires in the stable prefix. The procedure loads just in time into the conversation.&lt;/p&gt;

&lt;p&gt;I applied this to five rules in my own configuration. The split saved roughly 120 lines of always-loaded context. Same behavior, dramatically less overhead. And now the procedure benefits from being loaded fresh at the point of use, rather than sitting in context for turns where it is irrelevant.&lt;/p&gt;

&lt;p&gt;There is a reliability argument here too, not just a cost one. A skill's own description (capped at 250 characters, sitting somewhere in the conversation messages) is what the model reads when deciding whether to auto-invoke it. That description is competing with everything else in the conversation for attention. A three-line rule in the stable prefix will fire more reliably than a 250-character skill description hoping the model recognizes the situation on its own. If reliable triggering matters, the trigger belongs in a rule regardless of how good the skill description is. The rule catches the moment. The skill delivers the procedure. Each mechanism doing what it is best at.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Decision Framework
&lt;/h2&gt;

&lt;p&gt;When I add new behavior to my system, I ask four questions.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Does this need to be recognized before it is invoked? If yes, the trigger belongs in a rule. Keep it short. Just enough for pattern matching.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Does this require detailed procedural steps? If yes, those steps belong in a skill that loads on demand.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Is this tied to a specific file context? If yes, a path-scoped rule gives you deferred loading with mechanical reliability. No model judgment needed. Zero cost until you first touch a matching file.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Does this need to persist through long sessions at full fidelity? If yes, it belongs in a rule. Skills can be truncated after compaction in extended conversations.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The goal is the right information in context at the right time. Every token should be there because the current task needs it.&lt;/p&gt;

&lt;h2&gt;
  
  
  There Is a Third Piece
&lt;/h2&gt;

&lt;p&gt;Rules handle recognition. Skills handle procedure. There is a third mechanism in Claude Code that handles neither. It handles the things that should happen automatically, without judgment, every single time. No recognition needed, no procedure to follow. Just a deterministic response to a specific event.&lt;/p&gt;

&lt;p&gt;Think of it as the autonomic nervous system of your configuration. Your rules are conscious decisions. Your skills are learned procedures you invoke deliberately. Hooks are your reflexes, the responses that fire without you thinking about them, because if you had to think about them, you would eventually forget.&lt;/p&gt;

&lt;p&gt;That deserves its own post. For now, if you find yourself writing a rule that says "every time X happens, always do Y," you are probably describing a hook, not a rule. The difference between "remember to do this" and "this just happens" is the difference between compliance and architecture.&lt;/p&gt;

&lt;p&gt;If you are interested in the engineering principles behind structuring AI configuration at scale, I wrote about applying &lt;a href="https://purecontext.dev/blog/solid-principles-for-ai-config" rel="noopener noreferrer"&gt;SOLID principles to this problem&lt;/a&gt; in a separate post. The context cost argument and the separation-of-concerns patterns are covered there in depth.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>devtools</category>
      <category>contextengineering</category>
    </item>
    <item>
      <title>Debugging a Prompt: When the Output Keeps Missing</title>
      <dc:creator>Jeff Reese</dc:creator>
      <pubDate>Thu, 23 Apr 2026 15:17:04 +0000</pubDate>
      <link>https://dev.to/jeffreese/debugging-a-prompt-when-the-output-keeps-missing-4kc0</link>
      <guid>https://dev.to/jeffreese/debugging-a-prompt-when-the-output-keeps-missing-4kc0</guid>
      <description>&lt;p&gt;&lt;em&gt;AI in Practice, No Fluff — Day 4/10&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I was helping a friend with a cover letter recently. She had a strong resume, real accomplishments, and a job posting that matched her experience well. I fed everything into Claude and asked it to draft the letter.&lt;/p&gt;

&lt;p&gt;The output was... fine. It hit the right keywords, mentioned the right qualifications, structured everything logically. It also sounded like every cover letter you have ever read and immediately forgotten. "I am excited to bring my extensive experience in project management to your organization." That sentence could belong to literally anyone applying for literally anything.&lt;/p&gt;

&lt;p&gt;So I iterated. "Make it more personal and direct." The result was warmer but still generic. "Match the tone of someone who is confident but not corporate." Better, but it still read like an AI approximating a human approximating professionalism. I went through four rounds of adjusting tone instructions before I stopped and asked a different question: &lt;em&gt;why&lt;/em&gt; is this failing?&lt;/p&gt;

&lt;p&gt;The answer was not in my instructions. It was in my context. I had given the AI her resume and the job posting, but I had not given it anything that showed how she actually communicates. The model had no voice to match, so it defaulted to the genre: cover-letter-ese. The fix was not another tone instruction. It was pasting in a few paragraphs from emails she had written, things where her actual voice came through, and saying "match this register." One change. The next draft sounded like her.&lt;/p&gt;

&lt;p&gt;That experience is the whole article in summary. When a prompt is not working, the instinct is to keep adjusting the instructions. Sometimes that is exactly right. More often, the real fix is somewhere else entirely, and finding it requires a diagnostic approach instead of a guessing one.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this post is and is not
&lt;/h2&gt;

&lt;p&gt;In the first series, we covered &lt;a href="https://purecontext.dev/blog/what-makes-a-good-prompt" rel="noopener noreferrer"&gt;what makes a good prompt&lt;/a&gt;: context, task clarity, format, examples. That post was about composition; how to write a prompt that works. This post picks up where that one leaves off. Your prompt is written. It is not working. Now what?&lt;/p&gt;

&lt;p&gt;Here is the thing worth understanding about where we are right now: models have gotten good enough that prompts rarely produce garbage anymore. The output almost always looks reasonable. The problem has shifted. It is less "this is wrong" and more "this is not what I meant." That subtlety makes debugging harder, not easier, because the failure is easy to miss at first glance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reading the output, not just reacting to it
&lt;/h2&gt;

&lt;p&gt;The first step is the one that gets skipped most often. Before changing anything, read the bad output carefully. Not to judge it. To diagnose it.&lt;/p&gt;

&lt;p&gt;The output is data. It is telling you something about what the model understood, what it prioritized, and where it went off track. A summary that is too long tells you the model did not understand your length constraint, or did not consider it important enough to override its instinct to be thorough. A cover letter that sounds corporate tells you the model defaulted to the genre because you did not provide a voice to match. A code snippet that uses the wrong library tells you the model lacked context about your stack.&lt;/p&gt;

&lt;p&gt;The natural reaction to bad output is "that is wrong." The diagnostic reaction is "that is wrong in a specific way, and the specific way tells me something."&lt;/p&gt;

&lt;h2&gt;
  
  
  Four places a prompt usually breaks
&lt;/h2&gt;

&lt;p&gt;After working through dozens of these debugging cycles, I have found that most prompt failures fall into one of four categories. Knowing which one you are dealing with changes what you do next.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Missing context.&lt;/strong&gt; The model does not have the information it needs to do the job. This is the most common failure and the easiest to fix. The cover letter above was a context problem: the AI had qualifications and job requirements, but no sample of the person's actual voice.&lt;/p&gt;

&lt;p&gt;Signs: the output is technically correct but generic. It fills in gaps with reasonable guesses instead of specific details. It sounds like it is writing about your topic from general knowledge rather than from the material you gave it.&lt;/p&gt;

&lt;p&gt;Fix: add the missing context. Sometimes that means providing more input. Sometimes it means restructuring the input you already have so the important parts are easier for the model to find.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Ambiguous instruction.&lt;/strong&gt; The model understood something different from what you meant. This one is sneaky because the output often looks like the model is being difficult when it's actually being literal.&lt;/p&gt;

&lt;p&gt;"Write a short summary" is ambiguous. Short to you might be three sentences. Short to the model might be two paragraphs. "Summarize this in three sentences" is not ambiguous.&lt;/p&gt;

&lt;p&gt;Signs: the output does something coherent but it is not what you wanted. The model made a choice where you expected a specific behavior. If you re-read your prompt and can see two reasonable interpretations of what you asked for, this is probably the problem.&lt;/p&gt;

&lt;p&gt;Fix: replace the ambiguous instruction with a specific one. If you find yourself writing "no, I meant..." in a follow-up message, the original instruction was ambiguous. Rewrite it so the follow-up is unnecessary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Bad format specification.&lt;/strong&gt; The model got the content right but the shape wrong. You wanted a table and got a list. You wanted JSON and got an essay with JSON embedded in it. You wanted three bullet points and got seven.&lt;/p&gt;

&lt;p&gt;In the first series, we covered that showing examples is one of the most effective prompting techniques. Format problems are where this pays off the most. A prompt that says "return a markdown table with columns for Name, Action, and Deadline" will usually work. A prompt that says "return a markdown table" and includes a two-row example of the exact table shape will almost always work.&lt;/p&gt;

&lt;p&gt;Signs: the information in the output is correct but the structure is wrong. You are spending time reformatting rather than rewriting.&lt;/p&gt;

&lt;p&gt;Fix: add a concrete example of the desired format, or tighten the format specification until there is only one way to interpret it. This is the fastest of the four to fix.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Model limitation.&lt;/strong&gt; The task exceeds what the model can reliably do. This is the rarest of the four, but it is real. Some tasks require capabilities the model does not have: reliable counting, precise arithmetic on large numbers, consistent adherence to complex multi-constraint formatting rules, or knowledge of events after its training cutoff.&lt;/p&gt;

&lt;p&gt;We covered &lt;a href="https://purecontext.dev/blog/hallucinations" rel="noopener noreferrer"&gt;hallucinations&lt;/a&gt; in the first series as one version of this: the model generating confident-sounding information that is not grounded in fact. Model limitation is a broader category. It includes hallucination, but also tasks where the model's architecture makes reliable performance unlikely regardless of how good your prompt is.&lt;/p&gt;

&lt;p&gt;Signs: you have tried multiple clear, well-structured prompts and the output keeps failing in the same fundamental way. The failure is not about clarity or context; it is about capability. Math errors persist even with explicit "show your work" instructions. The model confidently cites a paper that does not exist no matter how you phrase the request.&lt;/p&gt;

&lt;p&gt;Fix: change the approach. Use a calculator for math. Use a search tool for current information. Use code for deterministic logic. These are not tasks that language models are built for; precision and retrieval are not how they work. Understanding that distinction is the real lesson here. Pair the model with tools that cover its weaknesses instead of prompting harder.&lt;/p&gt;
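&lt;p&gt;For instance, counting business days between two dates is a few lines of ordinary code, and the code gets it right every time. This is a sketch: the holiday list is a placeholder, and it assumes the start date comes before the end date.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import date, timedelta

def business_days_between(start, end, holidays=()):
    """Count weekdays after start, up to and including end. Assumes start is before end."""
    day, count = start, 0
    while day != end:
        day = day + timedelta(days=1)
        if day.weekday() not in (5, 6) and day not in holidays:
            count = count + 1
    return count

# Illustrative dates and an empty holiday calendar.
print(business_days_between(date(2026, 4, 6), date(2026, 4, 20)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;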

&lt;h2&gt;
  
  
  One variable at a time
&lt;/h2&gt;

&lt;p&gt;Once you have a hypothesis about which category the failure falls into, the temptation is to rewrite the whole prompt. Resist that.&lt;/p&gt;

&lt;p&gt;Change one thing. Run it again. Read the output.&lt;/p&gt;

&lt;p&gt;If the output improves, you found the right variable. If it does not, you learned that variable was not the problem, and you move to the next one. Either way, you have information you did not have before.&lt;/p&gt;

&lt;p&gt;This sounds obvious. In practice it is surprisingly hard to do. When a prompt is frustrating you, the urge to throw it out and start from scratch feels productive. It's not. Starting over resets your experiment. You lose the diagnostic data from the failed version because now you have no idea which of your changes made the difference.&lt;/p&gt;

&lt;p&gt;The best practice is to change one thing, observe, then decide your next move. It is the same loop whether you are debugging code, debugging a prompt, or debugging a recipe. Isolate the variable. Test. Observe.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to stop iterating
&lt;/h2&gt;

&lt;p&gt;There is a point where you should stop tweaking and reconsider the task itself. Say you are on your fifth or sixth revision and each one has made a minor improvement, but it's still not quite right. At this point, you are spending more time on the prompt than you would have spent just doing the task manually.&lt;/p&gt;

&lt;p&gt;That is a signal. Not necessarily that the prompt cannot work, but that the return on further iteration is diminishing. Three things are usually going on:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The task might be too complex for a single prompt.&lt;/strong&gt; Break it into steps. Have the model do part one, review the output, then feed that into part two. Multi-turn design from the &lt;a href="https://purecontext.dev/blog/multi-turn-conversations" rel="noopener noreferrer"&gt;previous post&lt;/a&gt; is the tool here. What cannot work as one prompt often works beautifully as a conversation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The task might be wrong.&lt;/strong&gt; Sometimes what I think I want is not actually what I need. I have spent twenty minutes trying to get a model to rewrite a paragraph in a specific way, then realized the paragraph should just be cut entirely. The prompt was not failing. My framing of the problem was off.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The task might need a different tool.&lt;/strong&gt; Not every problem is a prompt problem. If you need exact formatting, maybe a template with variable substitution is better than asking a model to hit your format precisely. If you need reliable math, use a spreadsheet. AI is powerful for ambiguity, natural language, and judgment calls. It is not always the right tool for precision, determinism, or retrieval.&lt;/p&gt;

&lt;h2&gt;
  
  
  The reflex
&lt;/h2&gt;

&lt;p&gt;The shift this post is really about is small but it changes the whole experience. When a prompt is not working, the instinct might be to brute-force a fix. Add more words. Rephrase. Hope for the best.&lt;/p&gt;

&lt;p&gt;The better reflex is the one developers use when code does not work. Form a hypothesis about why. Test it. Observe the result. Let the result guide the next hypothesis. No guessing, no hoping, just a loop.&lt;/p&gt;

&lt;p&gt;Hypothesis. Test. Observe. Refine.&lt;/p&gt;

&lt;p&gt;It is not more complicated than that. The hard part is not the technique. The hard part is pausing long enough to read the bad output as diagnostic data instead of just reacting to it.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Your prompts are not conversations. They are experiments. Treat them that way.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Next up: what to do when you need your AI to return structured data instead of prose, and why "give me JSON" is almost never enough.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>prompting</category>
      <category>debugging</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Multi-Turn Conversations: Designing a Back-and-Forth</title>
      <dc:creator>Jeff Reese</dc:creator>
      <pubDate>Wed, 22 Apr 2026 05:58:06 +0000</pubDate>
      <link>https://dev.to/jeffreese/multi-turn-conversations-designing-a-back-and-forth-590i</link>
      <guid>https://dev.to/jeffreese/multi-turn-conversations-designing-a-back-and-forth-590i</guid>
      <description>&lt;p&gt;&lt;em&gt;AI in Practice, No Fluff — Day 3/10&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The first time I sat down to &lt;em&gt;design&lt;/em&gt; a conversation instead of just having one, I wrote a single starter message, pasted it into three different tools, and watched each respond to the exact same prompt. My message was quite a bit longer than the ones I usually write: highly structured, with headers and carefully selected context. I also included a short list of constraints I wanted the model to keep in mind through the whole session.&lt;/p&gt;

&lt;p&gt;The first exchange was better than what I usually got after ten.&lt;/p&gt;

&lt;p&gt;That experiment pushed me from thinking of AI as &lt;em&gt;something I talk to&lt;/em&gt; to thinking of it as &lt;em&gt;something I write for&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The mechanic behind the reframe
&lt;/h2&gt;

&lt;p&gt;In the first series we covered context windows: the fixed-size whiteboard an AI works from during a conversation. That post took the mechanic and asked &lt;em&gt;what do you do when it fills up?&lt;/em&gt; This post takes the same mechanic and asks a different question: &lt;em&gt;what would you do differently if you designed the contents on purpose?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A multi-turn conversation is exactly what it sounds like: a back-and-forth where each message you send and each response from the AI counts as one turn. It is helpful to remember that the AI is not remembering your previous messages. The platform is resending them on each turn.&lt;/p&gt;

&lt;p&gt;Every time you hit send, the tool you are using takes your entire conversation history, packs it into a single request, and sends the whole thing to the model. Your first message. The AI's reply. Your second message. The AI's reply. All of it, concatenated, every time. The model reads the whole block, produces the next response, and hands the new entry back to the platform to append.&lt;/p&gt;

&lt;p&gt;There is no stored state on the model side. No session it is tracking. The model sees the exact same thing whether you are on message three or message thirty: one request containing everything that has happened so far, plus your new message at the bottom.&lt;/p&gt;

&lt;p&gt;This is not a quirk of ChatGPT or Claude or any specific product. It is how the underlying API works. The consumer app is doing the bookkeeping of "who said what, in what order" on your behalf, and resending the transcript every call.&lt;/p&gt;
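&lt;p&gt;If you want to see that bookkeeping in code, here is a minimal sketch of what the consumer app does on your behalf, using the Anthropic Python SDK with an illustrative model name.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import anthropic

client = anthropic.Anthropic()

# The append-only transcript. It is resent in full on every turn.
transcript = []

def send(user_message):
    transcript.append({"role": "user", "content": user_message})
    response = client.messages.create(
        model="claude-sonnet-4-5",   # illustrative model name
        max_tokens=1024,
        messages=transcript,         # the whole history, every single call
    )
    reply = response.content[0].text
    transcript.append({"role": "assistant", "content": reply})
    return reply

send("Here is my opener, with the context and constraints for this session...")
send("Now tighten the second paragraph.")   # the opener gets resent with this turn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;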

&lt;p&gt;Once you internalize that, the conversation stops looking like a conversation. It starts looking like something else entirely: a document you and the AI are co-editing, append-only, that gets re-read in full before every new line is written.&lt;/p&gt;

&lt;h2&gt;
  
  
  Designing the opener
&lt;/h2&gt;

&lt;p&gt;The first message in any conversation is the most-reread thing in the whole session. It will be read again on turn two, again on turn five, again on turn twenty. Every other message gets read fewer times as newer content pushes it deeper into the transcript, but the opener is always there.&lt;/p&gt;

&lt;p&gt;That changes how I write the first message. When I care about the quality of the whole session, I stop typing casually and start writing a mini-briefing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What I am trying to do (one or two sentences)&lt;/li&gt;
&lt;li&gt;What context the AI needs (the actual relevant background, not everything I know)&lt;/li&gt;
&lt;li&gt;What constraints matter (tone, format, things to prioritize, things to avoid)&lt;/li&gt;
&lt;li&gt;A sample of what good output looks like, when format is specific&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I write it in a text editor, not the chat box. Then I paste it in. The session that follows inherits the document I just wrote as its permanent foundation.&lt;/p&gt;

&lt;p&gt;For work I return to regularly, this opener graduates into a Project. ChatGPT, Claude, and Gemini have all landed on some version of this: a workspace that holds persistent instructions and files alongside multiple chats. Gemini's rollout is the most recent and still surfaces under multiple names (Projects, Notebooks) depending on which product surface you are in. The idea is the same regardless of what a product names it: a folder that holds a persistent set of instructions and files, and every conversation opened inside that folder inherits them without you having to repaste anything. Once the opener is stable, projects promote it from "thing I keep in a text file" to "something every chat in this workspace inherits automatically."&lt;/p&gt;

&lt;p&gt;Sometimes the right opener is an invitation for the AI to interview you. "Ask me five questions before you try to answer X." It is still multi-turn with the same mechanic, just a very different shape: the first few exchanges fill the document with context the model helped shape, not context you wrote alone.&lt;/p&gt;

&lt;h2&gt;
  
  
  Branching as a design choice
&lt;/h2&gt;

&lt;p&gt;There is a habit worth picking up: when you are about to shift to a related but distinct task, do not continue the current conversation. Open a new one.&lt;/p&gt;

&lt;p&gt;Not because the current conversation is "full." Because the next task deserves its own transcript. Two related questions will both perform better in their own sessions than if they share a growing combined history. The model stops weighing your architectural discussion from twenty minutes ago against a small refactor question that does not need any of it.&lt;/p&gt;

&lt;p&gt;A conversation for me is usually scoped to a single question or a single task. When the task shifts, I open a new window. The overhead of re-establishing context is small because my opener does most of that work. The savings on model attention are large because the session stays focused.&lt;/p&gt;

&lt;h2&gt;
  
  
  Distillation, not just summarization
&lt;/h2&gt;

&lt;p&gt;A common technique I use is to ask the AI to summarize a long conversation and then use the summary to start a fresh one. Series 1 covered this as drift management. This is a design-level version of the same move.&lt;/p&gt;

&lt;p&gt;When a conversation has done real work, the summary is not just a tool for starting over. It is an artifact. The summary of a session that produced a useful decision is itself a reusable starter message for future sessions in the same territory. You are summarizing to extract.&lt;/p&gt;

&lt;p&gt;Pattern I use:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;At the end of a productive session, ask the AI to produce a structured summary: the decisions made, the constraints, the open questions.&lt;/li&gt;
&lt;li&gt;Spend time reading and editing it; this is the real work of turning a session into reusable context.&lt;/li&gt;
&lt;li&gt;Save it somewhere so you have access to it in the future when it becomes relevant again.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every good session leaves a distillable residue behind. Treating that residue like an asset, not exhaust, is the move. I use this a lot for capturing decisions in ADRs and for creating guides after I build features.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where consumer memory features fit
&lt;/h2&gt;

&lt;p&gt;Most of the major products now have some form of cross-conversation memory. ChatGPT has a Memory feature that carries useful facts about you between sessions. Other products are rolling out similar capabilities at their own pace.&lt;/p&gt;

&lt;p&gt;These do not change the in-conversation mechanic. Every message in the current chat still resends the full history to the model. The memory layer runs alongside that, injecting persistent facts into the system prompt or as retrieved context. Useful, and a layer above the conversation structure, not a replacement for it.&lt;/p&gt;

&lt;p&gt;If you want stable per-task context, use projects. If you want stable per-user context, use the memory feature. If you want the AI to respond well to what you said two messages ago, do not think of it as remembering at all. Think of the transcript as the document, and design around that.&lt;/p&gt;

&lt;h2&gt;
  
  
  The reflex
&lt;/h2&gt;

&lt;p&gt;The instinct, when an exchange is not going well, is usually to send another message to fix it. Another clarification. Another correction. Another example. That instinct is sometimes right.&lt;/p&gt;

&lt;p&gt;The better reflex, most of the time: stop, close the window, and redesign. Write the starter message you wish you had used. Open a fresh session with it. The minutes you spend on the opener save more than you would lose nudging the current conversation into shape.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The conversation is not a memory. It is a document you are writing. You are the designer.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Tomorrow: when your prompt is almost working but keeps missing in the same way, how to handle it like a bug rather than guessing your way to a fix.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>prompting</category>
      <category>multiturn</category>
      <category>conversations</category>
    </item>
    <item>
      <title>Chain-of-Thought: Teaching AI to Reason Out Loud</title>
      <dc:creator>Jeff Reese</dc:creator>
      <pubDate>Tue, 21 Apr 2026 14:11:40 +0000</pubDate>
      <link>https://dev.to/jeffreese/chain-of-thought-teaching-ai-to-reason-out-loud-153l</link>
      <guid>https://dev.to/jeffreese/chain-of-thought-teaching-ai-to-reason-out-loud-153l</guid>
      <description>&lt;p&gt;&lt;em&gt;AI in Practice, No Fluff — Day 2/10&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When I first started using ChatGPT, I asked it to calculate how many business days were between two dates. It gave me a confident number. The number was very wrong... I only caught it because I had already done the count by hand and was just double-checking.&lt;/p&gt;

&lt;p&gt;I asked the exact same question again, added five words at the end of the prompt, &lt;em&gt;"Let's think step by step,"&lt;/em&gt; and watched it walk through the weekends, subtract a holiday, and then land on the correct number.&lt;/p&gt;

&lt;p&gt;Same model. Same question. Five extra words. A different answer.&lt;/p&gt;

&lt;p&gt;There is a specific reason for that. The reason matters more in 2026 than the technique does.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is going on under the hood
&lt;/h2&gt;

&lt;p&gt;In the first series we covered the three pieces that make a prompt work: context, task, format. Chain-of-thought is not a fourth piece. It sits on top of those, in the territory of &lt;em&gt;how the AI should think before it responds&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;A standard model doesn't have a private thinking step. There is no silent internal process happening before it starts writing. Every word it produces is part of the same running output. If you ask for an answer, the first thing it writes is an answer. If you ask it to reason first, the reasoning is the first thing it writes, and the answer comes after. (Reasoning models bolt a hidden thinking phase on top, and we will get to them, but it is built from this same mechanism.)&lt;/p&gt;

&lt;p&gt;The words that come out between your question and the final answer are where reasoning actually lives. These are the &lt;em&gt;intermediate tokens&lt;/em&gt;. They are not a description of thinking. They are the thinking.&lt;/p&gt;

&lt;p&gt;The act of generating "Monday to Monday is 5 business days, subtract the holiday on Thursday, that leaves 4" is the reasoning step itself. Take those tokens away and the thought did not happen.&lt;/p&gt;

&lt;p&gt;That is why "think step by step" is not a magic spell. It is a structural move. You are asking the model to lay down the intermediate computation as written words before committing to an answer, because without those words there is no computation.&lt;/p&gt;
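
&lt;p&gt;A minimal sketch of that structural difference, reusing the business-days question from the top of the post. The model name and dates are illustrative; the only thing that changes between the two calls is whether the model is allowed to put intermediate tokens down before the number.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from openai import OpenAI

client = OpenAI()
question = "How many business days are there between March 3 and March 17, if March 6 is a holiday?"

def ask(instruction):
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative; any non-reasoning chat model
        messages=[{"role": "user", "content": f"{question}\n\n{instruction}"}],
    )
    return resp.choices[0].message.content

# Answer-first: the number is the first token laid down, so no intermediate
# computation exists before it.
print(ask("Reply with just the number."))

# Reasoning-first: the weekends and the holiday get written out as tokens
# before the final number, which is where the computation actually happens.
print(ask("Let's think step by step, then state the final number."))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
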

&lt;h2&gt;
  
  
  When it helps
&lt;/h2&gt;

&lt;p&gt;Chain-of-thought earns its tokens on anything that requires more than one step to get right.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Math with multiple operations, especially word problems&lt;/li&gt;
&lt;li&gt;Logic puzzles and constraint-satisfaction problems&lt;/li&gt;
&lt;li&gt;Planning a sequence of actions&lt;/li&gt;
&lt;li&gt;Analyzing tradeoffs between options&lt;/li&gt;
&lt;li&gt;Debugging why some system behaves the way it does&lt;/li&gt;
&lt;li&gt;Any judgment call that depends on comparing several factors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the answer depends on holding more than one fact in mind and combining them, letting the model write out the combination first usually produces a better result than asking it to jump to the answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  When it does not help
&lt;/h2&gt;

&lt;p&gt;Not every problem rewards reasoning out loud. If you are asking the AI to retrieve a single fact, summarize a passage, translate between languages, or pick the correct word from context, there is no multi-step reasoning to surface. You are not asking it to think in parallel about several things; you are asking it to produce one thing. Requesting step-by-step reasoning on a lookup task just generates filler and makes the response longer without making it better.&lt;/p&gt;

&lt;p&gt;It can be worse than neutral. A 2024 paper tested what happens when models are forced to reason on tasks where deliberation pushes them away from their correct intuitive answer. Forcing chain-of-thought dropped accuracy by more than a third compared to answering directly. Step-by-step reasoning is a tool, not a default setting; on the wrong task it actively hurts.&lt;/p&gt;

&lt;p&gt;A rough test I use: if I would not need to show my work to get credit for the answer, the AI does not need to either.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to structure a chain-of-thought prompt
&lt;/h2&gt;

&lt;p&gt;The simplest pattern is to append "Let's think step by step" to your question. That alone will often flip a wrong answer into a right one. It is the lowest-effort move, and it is often enough on its own.&lt;/p&gt;

&lt;p&gt;For anything more involved, give the model a scaffold. A reliable template is &lt;em&gt;first identify, then determine, then answer&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Question: [your question]

Please solve this by:
1. First, identify what information you have and what you need to find
2. Then, determine the steps required to get from one to the other
3. Work through each step
4. Finally, state your answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The explicit structure does two things. It names the stages so the model is less likely to skip one, and it delays the jump to the answer until the work is done. Making "state your answer" a distinct final step matters. Without it the model sometimes trails off into more reasoning and never commits.&lt;/p&gt;

&lt;h2&gt;
  
  
  The shift with reasoning models
&lt;/h2&gt;

&lt;p&gt;This started shifting in late 2024 with OpenAI's o-series, followed by Anthropic's extended thinking in early 2025. By 2026 it has flipped all the way. Reasoning is built into the flagship models by default. GPT-5, Claude 4.6, and Gemini 3 all default to reasoning in their main consumer interfaces. Claude's approach, called &lt;em&gt;adaptive thinking&lt;/em&gt;, lets the model decide how much to reason based on the question. You steer how hard it thinks through an &lt;em&gt;effort&lt;/em&gt; parameter in the API rather than setting a token budget by hand.&lt;/p&gt;

&lt;p&gt;If you are using a current flagship, explicit "think step by step" prompting is mostly redundant. A 2025 paper measuring CoT performance across reasoning-class models found the benefit of adding explicit step-by-step prompting is small, and sometimes negative. The reasoning is already happening. You are not unlocking anything the model was not already going to do.&lt;/p&gt;

&lt;p&gt;There is a tradeoff worth knowing: reasoning models are slower and more expensive per response because they are generating a lot of hidden thinking tokens before answering. For simple questions that did not need reasoning, you are paying for thinking that did not improve the output. This is why most providers let you dial the reasoning effort up and down, or offer a non-reasoning mode alongside their reasoning model.&lt;/p&gt;

&lt;p&gt;The practical move: turn the effort up for hard problems, turn it down for easy ones, and if you are on a model without built-in reasoning, reach for explicit chain-of-thought prompting when the problem has more than one step.&lt;/p&gt;
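
&lt;p&gt;A hedged sketch of that dial in code. The reasoning_effort values follow OpenAI's Python SDK naming for its reasoning models at the time of writing; other providers expose the same idea under different names and shapes, so treat the exact parameter as provider-specific rather than universal.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from openai import OpenAI

client = OpenAI()

def ask(question, hard=False):
    # Reasoning-class model: dial the hidden thinking up for multi-step
    # problems, down for lookups you would otherwise overpay on.
    resp = client.chat.completions.create(
        model="o3-mini",  # illustrative reasoning-class model
        reasoning_effort="high" if hard else "low",
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

ask("What year was the transistor invented?")  # easy: low effort
ask("Plan a zero-downtime migration from MySQL to Postgres.", hard=True)

# On a model without built-in reasoning, the older move still applies:
# append the scaffold yourself and let the intermediate tokens do the work,
# e.g. question + " Let's think step by step."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
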

&lt;h2&gt;
  
  
  The reflex
&lt;/h2&gt;

&lt;p&gt;When you get a confident wrong answer from an AI, the reflex is to add more context. More background, more examples, more specificity about what you want.&lt;/p&gt;

&lt;p&gt;That is sometimes the right move. It is often the wrong one.&lt;/p&gt;

&lt;p&gt;The question worth asking first is whether the model had room to think. If the answer depended on more than one step and the model jumped straight to the answer, the failure is structural, not informational. On a non-reasoning model, give the model room to think out loud before answering. On a reasoning model, the reasoning was probably running already; the fix is usually switching the approach rather than adding to the prompt.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If the answer requires thinking, make the thinking happen out loud before the answer.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Tomorrow: one exchange is rarely enough. How to design a back-and-forth with an AI so the conversation does not drift.&lt;/p&gt;




&lt;p&gt;If there is anything I left out or could have explained better, tell me in the comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>prompting</category>
      <category>reasoning</category>
      <category>chainofthought</category>
    </item>
  </channel>
</rss>
