<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: naoki_JPN</title>
    <description>The latest articles on DEV Community by naoki_JPN (@bokuno_log).</description>
    <link>https://dev.to/bokuno_log</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3849123%2Fb9f651cc-b700-4b86-abdc-5048c5a502d7.png</url>
      <title>DEV Community: naoki_JPN</title>
      <link>https://dev.to/bokuno_log</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/bokuno_log"/>
    <language>en</language>
    <item>
      <title>I shipped a paid web app in half a day using Claude Code + Codex (Garmin AI CSV converter)</title>
      <dc:creator>naoki_JPN</dc:creator>
      <pubDate>Thu, 07 May 2026 12:20:22 +0000</pubDate>
      <link>https://dev.to/bokuno_log/i-shipped-a-paid-web-app-in-half-a-day-using-claude-code-codex-garmin-ai-csv-converter-5b63</link>
      <guid>https://dev.to/bokuno_log/i-shipped-a-paid-web-app-in-half-a-day-using-claude-code-codex-garmin-ai-csv-converter-5b63</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This article reflects information as of May 2026.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What I built
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Garmin AI Export&lt;/strong&gt; — a web app that converts the activity, sleep, and health data ZIP exported from Garmin Connect into clean CSVs that ChatGPT, Gemini, and Claude can readily analyze.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://garmin-ai-export.vercel.app" rel="noopener noreferrer"&gt;https://garmin-ai-export.vercel.app&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you've worn a Garmin watch for years, you've accumulated a massive trail of data: runs, sleep records, heart rate, stress, Body Battery — all of it. Garmin Connect lets you export everything, but what you get is a tangled bundle of nested JSON and binary FIT files. Hand that directly to ChatGPT and tell it to "analyze this," and it won't know where to start.&lt;/p&gt;

&lt;p&gt;This app:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Takes the raw ZIP you exported from Garmin Connect&lt;/li&gt;
&lt;li&gt;Unpacks and parses it &lt;strong&gt;entirely in your browser&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Returns a ZIP containing four files — &lt;code&gt;activities.csv&lt;/code&gt;, &lt;code&gt;sleep.csv&lt;/code&gt;, &lt;code&gt;daily_health.csv&lt;/code&gt;, &lt;code&gt;laps.csv&lt;/code&gt; — plus an AI prompt template&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Feed that into the Code Interpreter on ChatGPT, Gemini, or Claude and you can immediately ask things like "show me how my pace has trended" or "find correlations between sleep and next-day performance."&lt;/p&gt;

&lt;p&gt;And here's the punchline: I built and shipped this — including the payment integration — &lt;strong&gt;in about half a day, by pairing Claude Code (Opus 4.7) with Codex (GPT-5 Codex)&lt;/strong&gt;. I'll get into the role-split below, but the short version is: Claude Code played PM and reviewer; Codex did the implementation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Next.js (App Router)&lt;/strong&gt; + Tailwind CSS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JSZip&lt;/strong&gt; — ZIP unpacking and re-compression&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;@streamparser/json&lt;/strong&gt; — streaming parser for huge JSON&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;fit-file-parser&lt;/strong&gt; — for Garmin's binary &lt;code&gt;.fit&lt;/code&gt; files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PapaParse&lt;/strong&gt; — CSV generation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Web Worker&lt;/strong&gt; — to keep the main thread responsive&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Square Payment Links&lt;/strong&gt; — payments (Apple Pay / Google Pay supported)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vercel&lt;/strong&gt; — hosting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key principle is &lt;strong&gt;"everything in the browser."&lt;/strong&gt; Garmin exports can be hundreds of megabytes to multiple gigabytes — there's no way you're streaming that to a server (Vercel's request body limit is 4.5MB; you're done before you start). Plus, on principle, I didn't want users' health data sitting on my server.&lt;/p&gt;

&lt;p&gt;So upload, parse, convert, and download all happen inside the user's browser. The only server-side code is a tiny API route that creates a Square payment link.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Claude Code and Codex split the work
&lt;/h2&gt;

&lt;p&gt;This project ran on two AI agents with &lt;strong&gt;clearly separated roles&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Responsibilities&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PM / Research&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Claude Code (Opus 4.7)&lt;/td&gt;
&lt;td&gt;Translate requests into GitHub Issues, manage labels, set priorities&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Implementation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Codex (GPT-5 Codex)&lt;/td&gt;
&lt;td&gt;Cut branches, write code, open Draft PRs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Code review&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Claude Code (Opus 4.7)&lt;/td&gt;
&lt;td&gt;Read PR diffs, leave sharp inline comments on GitHub&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Merge / deploy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Claude Code (Opus 4.7)&lt;/td&gt;
&lt;td&gt;Approve, squash-merge, verify production&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Approval / GO&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Me (the human)&lt;/td&gt;
&lt;td&gt;Just say "OK," "implement #X," "ship it"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The actual flow looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;I describe what I want
  → Claude Code creates a GitHub Issue (label: triage)
  → I say "OK" → label changes to ready
  → I say "implement #X"
  → Codex cuts a branch, writes code, opens a Draft PR
  → I say "please review"
  → Claude Code posts a thorough review on GitHub
  → Codex addresses the comments
  → Claude Code re-reviews → LGTM, squash merge
  → Vercel auto-deploys
  → Claude Code curl-checks production
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What's interesting about this setup is that &lt;strong&gt;using a different model for review actually works&lt;/strong&gt;. When Claude Code reads code that Codex wrote, there's no self-confirmation bias — comments like "&lt;code&gt;disabled={!result}&lt;/code&gt; is dead code here," "you're using a private JSZip API," "the worker isn't terminated on reset" surface naturally. You don't get this when the same model writes and reviews.&lt;/p&gt;

&lt;p&gt;I never wrote a line of code. I only said "I want this," "OK," and "ship it."&lt;/p&gt;

&lt;h2&gt;
  
  
  Things that bit me
&lt;/h2&gt;

&lt;p&gt;I stepped on a few mines worth documenting.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Parsing 135MB JSON on the main thread froze the page
&lt;/h3&gt;

&lt;p&gt;Garmin's &lt;code&gt;*_summarizedActivities.json&lt;/code&gt; is over &lt;strong&gt;100MB&lt;/strong&gt; for heavy users. A naive &lt;code&gt;JSON.parse(text)&lt;/code&gt; blocks the main thread for ~20 seconds and the browser starts asking the user if they want to kill the page.&lt;/p&gt;

&lt;p&gt;The fix is two-layered.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;(a) Move it to a Web Worker&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/workers/garmin-converter.worker.ts&lt;/span&gt;
&lt;span class="nb"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onmessage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;convertGarminExportCoreBuffer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;postProgress&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nb"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;postMessage&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;done&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;transfer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Putting the &lt;code&gt;ArrayBuffer&lt;/code&gt; in the &lt;code&gt;transfer&lt;/code&gt; list does a &lt;strong&gt;zero-copy&lt;/strong&gt; handoff to the main thread. Even a 100MB+ buffer crosses the worker boundary instantly because nothing is copied.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;(b) Stream-parse the JSON itself&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use &lt;code&gt;@streamparser/json&lt;/code&gt; to &lt;strong&gt;iterate over the giant array element-by-element&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;JSONParser&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@streamparser/json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;JSONParser&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$.*.summarizedActivitiesExport.*&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onValue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;extractActivityRow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;$.*&lt;/code&gt; prefix in &lt;code&gt;paths&lt;/code&gt; is a sneaky trap. The actual Garmin JSON shape is &lt;code&gt;[{"summarizedActivitiesExport": [...]}]&lt;/code&gt; — an &lt;strong&gt;array at the root&lt;/strong&gt;. So the path has to step into the root array's elements with &lt;code&gt;$.*&lt;/code&gt;; otherwise nothing matches and the CSV comes out empty. I forgot this on the first pass and spent way too long staring at empty output.&lt;/p&gt;
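&lt;p&gt;Stripped of the streaming machinery, the traversal that &lt;code&gt;$.*&lt;/code&gt; performs looks like this (a minimal sketch; the type and function names are illustrative, not from the app):&lt;/p&gt;

```typescript
// Illustrative sketch of the export shape that "$.*" has to step into:
// the root of the file is an array of wrapper objects.
type GarminExport = { summarizedActivitiesExport: unknown[] }[];

// Plain-object equivalent of the "$.*.summarizedActivitiesExport.*" path:
// iterate the root array, then each wrapper's inner activity array.
function collectActivities(root: GarminExport) {
  return root.flatMap((wrapper) => wrapper.summarizedActivitiesExport);
}
```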

&lt;h3&gt;
  
  
  2. Reaching into JSZip's private API and getting burned
&lt;/h3&gt;

&lt;p&gt;The first version of the code, written by Codex, used &lt;code&gt;zip.file().internalStream("uint8array")&lt;/code&gt;. It worked, but it's an undocumented JSZip internal API with no type definitions. Review caught it, and we rewrote it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;uint8array&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;chunkSize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// 1MB&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;offset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;offset&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;offset&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;chunkSize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// process 1MB at a time, yielding back to the event loop&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;yieldToEventLoop&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="nf"&gt;processChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subarray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;offset&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;chunkSize&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;yieldToEventLoop()&lt;/code&gt; is just &lt;code&gt;new Promise(resolve =&amp;gt; setTimeout(resolve, 0))&lt;/code&gt;. Without it, you're back to freezing the UI on huge files.&lt;/p&gt;
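&lt;p&gt;For completeness, that helper is exactly the one-liner described:&lt;/p&gt;

```typescript
// Zero-delay timeout: lets queued UI events run between 1MB chunks,
// which is what keeps the page responsive during a long parse.
function yieldToEventLoop() {
  return new Promise((resolve) => setTimeout(resolve, 0));
}
```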

&lt;h3&gt;
  
  
  3. The iOS Safari Blob + IndexedDB landmine
&lt;/h3&gt;

&lt;p&gt;For the payment gate, I needed to &lt;strong&gt;persist the converted ZIP somewhere&lt;/strong&gt; while the user bounces over to Square's checkout and back. The natural choice: stash a Blob in IndexedDB. But:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The biggest risk is the case where iOS Safari can't restore the Blob after payment.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This actually happens. iOS Safari has a &lt;strong&gt;long-standing bug where Blobs stored in IndexedDB come back corrupted&lt;/strong&gt; when retrieved, especially for larger Blobs. "Paid the money, can't get the file" is the worst possible UX, so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Blob → ArrayBuffer before storing&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arrayBuffer&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;files&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Reconstruct the Blob from the ArrayBuffer when retrieving&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;blob&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Blob&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nx"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;application/zip&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ArrayBuffer round-trips cleanly through IndexedDB even on iOS Safari. Catching this in review saved a lot of pain.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. The worker kept running after a reset
&lt;/h3&gt;

&lt;p&gt;If a user clicks "Reset" while a conversion is in progress, you have to actually terminate the worker — otherwise it leaks memory and burns CPU.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;activeWorker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Worker&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;abortConversion&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;activeWorker&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;terminate&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="nx"&gt;activeWorker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Easy to forget. A review comment of "is this still running in the background after reset?" is what surfaced it.&lt;/p&gt;
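&lt;p&gt;The start side of that bookkeeping matters just as much: each new conversion should replace, and terminate, any worker still in flight. A minimal sketch, with &lt;code&gt;startConversion&lt;/code&gt; and the &lt;code&gt;Terminable&lt;/code&gt; interface as illustrative names rather than the app's actual code:&lt;/p&gt;

```typescript
// Anything with terminate(): a real Worker in the browser, a stub in tests.
interface Terminable {
  terminate(): void;
}

let activeWorker: Terminable | null = null;

// Spawn a worker for a new conversion, first terminating any run still in
// flight so two conversions never overlap.
function startConversion(spawn: () => Terminable) {
  activeWorker?.terminate();
  activeWorker = spawn();
  return activeWorker;
}

// Same teardown as the reset handler.
function abortConversion() {
  activeWorker?.terminate();
  activeWorker = null;
}
```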

&lt;h2&gt;
  
  
  Adding payments
&lt;/h2&gt;

&lt;p&gt;I gated the download behind a paywall.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Square?
&lt;/h3&gt;

&lt;p&gt;Candidates were Stripe, PayPal, Square, Paddle. Reasons I picked Square:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stripe rejected my application&lt;/strong&gt; (approval for sole proprietors takes time)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apple Pay and Google Pay are included by default&lt;/strong&gt; (zero extra implementation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Payment Links API gives you hosted checkout in seconds&lt;/strong&gt; — no card form on my own site, so no PCI DSS scope&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The implementation is shockingly small. Server-side, you just hit Square's API to mint a payment link:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/app/api/square/payment-link/route.ts&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;squareResponse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nf"&gt;getSquareApiBaseUrl&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;&lt;span class="s2"&gt;/v2/online-checkout/payment-links`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;Authorization&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Bearer &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;SQUARE_ACCESS_TOKEN&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Square-Version&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2026-01-22&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;idempotency_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;crypto&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randomUUID&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
      &lt;span class="na"&gt;quick_pay&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Garmin AI Export&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;price_money&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;PRICE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;currency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;JPY&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;location_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;SQUARE_LOCATION_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;checkout_options&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;allow_tipping&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;redirect_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://garmin-ai-export.vercel.app?paid=true&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Redirect the user to the returned &lt;code&gt;payment_link.url&lt;/code&gt;, they pay (card or Apple Pay or Google Pay) on Square's hosted checkout, and they come back to your site with &lt;code&gt;?paid=true&lt;/code&gt; appended.&lt;/p&gt;
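&lt;p&gt;On the client, the glue is equally small. A hedged sketch: &lt;code&gt;PaymentLinkResponse&lt;/code&gt; mirrors the &lt;code&gt;payment_link.url&lt;/code&gt; field named above, and the &lt;code&gt;fetch&lt;/code&gt; call is illustrative:&lt;/p&gt;

```typescript
// Response shape of the API route, following Square's Payment Links API:
// the hosted-checkout URL lives at payment_link.url.
interface PaymentLinkResponse {
  payment_link: { url: string };
}

// Pull out the URL the browser should be redirected to.
function extractCheckoutUrl(response: PaymentLinkResponse) {
  return response.payment_link.url;
}

// In the browser (illustrative):
//   const res = await fetch("/api/square/payment-link", { method: "POST" });
//   window.location.href = extractCheckoutUrl(await res.json());
```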

&lt;h3&gt;
  
  
  Soft gate, by design
&lt;/h3&gt;

&lt;p&gt;I do &lt;strong&gt;not&lt;/strong&gt; verify on the server that the user actually paid. If &lt;code&gt;?paid=true&lt;/code&gt; is in the URL, the download unlocks.&lt;/p&gt;

&lt;p&gt;This is intentional:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I have no database and no user accounts, so there's no real way to verify (you could log purchases via webhook, but you'd need a way to identify users)&lt;/li&gt;
&lt;li&gt;For a tool meant for individual personal use, someone bypassing it by typing the URL is fine&lt;/li&gt;
&lt;li&gt;Philosophy: "If you feel like paying, please pay. If not, that's also fine."&lt;/li&gt;
&lt;/ul&gt;
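&lt;p&gt;The check itself reduces to a single query-string lookup (a sketch; &lt;code&gt;isUnlocked&lt;/code&gt; is an illustrative name):&lt;/p&gt;

```typescript
// The entire soft gate: unlock the download when ?paid=true is present.
function isUnlocked(search: string) {
  return new URLSearchParams(search).get("paid") === "true";
}

// In the page component (illustrative):
//   const unlocked = isUnlocked(window.location.search);
```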

&lt;h3&gt;
  
  
  About currency
&lt;/h3&gt;

&lt;p&gt;Square only lets you receive revenue in your &lt;strong&gt;account's home country currency&lt;/strong&gt;. My account is registered in Japan, so JPY only. When an overseas user pays via Apple Pay, Apple does the FX conversion on their side, and the merchant always sees JPY in the books. So an English UI with JPY pricing is perfectly workable for international users.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deploy and cost
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Moving from Vercel Hobby to Pro immediately
&lt;/h3&gt;

&lt;p&gt;The Vercel Hobby plan is &lt;strong&gt;strictly "personal, non-commercial use."&lt;/strong&gt; Running a paid site on Hobby violates the ToS and risks account termination.&lt;/p&gt;

&lt;p&gt;→ I upgraded to &lt;strong&gt;Pro ($20/month)&lt;/strong&gt; right after wiring up payments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bandwidth is a non-issue
&lt;/h3&gt;

&lt;p&gt;I worried at first: "These Garmin ZIPs can be hundreds of MB — am I going to blow through bandwidth?" But once I thought about it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uploading the Garmin ZIP → handled in-browser (never touches Vercel)&lt;/li&gt;
&lt;li&gt;Downloading the converted ZIP → from browser to local disk (never touches Vercel)&lt;/li&gt;
&lt;li&gt;Payment screen → on Square's domain (never touches Vercel)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Vercel only serves the &lt;strong&gt;initial page load (~2MB)&lt;/strong&gt; and the Square API call (~1KB). Pro's 1TB allotment covers ~500,000 users. The browser-only architecture really pays off here.&lt;/p&gt;

&lt;h2&gt;
  
  
  Google Analytics and a favicon
&lt;/h2&gt;

&lt;p&gt;Last touches: GA and a custom favicon.&lt;/p&gt;

&lt;p&gt;GA4 via &lt;code&gt;next/script&lt;/code&gt; with &lt;code&gt;afterInteractive&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;SHOULD_LOAD_GOOGLE_ANALYTICS&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="p"&gt;&amp;lt;&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Script&lt;/span&gt;
      &lt;span class="na"&gt;src&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;`https://www.googletagmanager.com/gtag/js?id=&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;GA_MEASUREMENT_ID&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
      &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"afterInteractive"&lt;/span&gt;
    &lt;span class="p"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Script&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"google-analytics"&lt;/span&gt; &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"afterInteractive"&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;`window.dataLayer = window.dataLayer || [];
        function gtag(){dataLayer.push(arguments);}
        gtag('js', new Date());
        gtag('config', '&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;GA_MEASUREMENT_ID&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;');`&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nc"&gt;Script&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;&amp;lt;/&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Favicons in Next.js App Router are &lt;strong&gt;convention-based&lt;/strong&gt; — drop &lt;code&gt;src/app/icon.tsx&lt;/code&gt; and &lt;code&gt;src/app/apple-icon.tsx&lt;/code&gt; into the tree and they're auto-served at &lt;code&gt;/icon&lt;/code&gt; and &lt;code&gt;/apple-icon&lt;/code&gt;. They're generated dynamically with &lt;code&gt;ImageResponse&lt;/code&gt;, so no external image file is required:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/app/app-icon-image.tsx&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;createAppIconResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;height&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ImageResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt; &lt;span class="na"&gt;style&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;background&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;#104c3f&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;borderRadius&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;20%&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="cm"&gt;/* ... */&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;svg&lt;/span&gt; &lt;span class="na"&gt;viewBox&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"0 0 24 24"&lt;/span&gt; &lt;span class="na"&gt;fill&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"none"&lt;/span&gt; &lt;span class="na"&gt;stroke&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"white"&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
          &lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="cm"&gt;/* FileArchive icon SVG paths */&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;svg&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nx"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Reflections on building with Claude Code + Codex
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What worked well&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The role-split is fast.&lt;/strong&gt; Claude Code focusing on requirement-shaping and review while Codex steadily implements is a great fit for "one human plus a fleet of AIs."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mixed-model review actually catches things.&lt;/strong&gt; When the same model both writes and reviews, it overlooks its own mistakes. A different model has no qualms about saying "wait, this is wrong."&lt;/li&gt;
&lt;li&gt;When I got stuck, Claude Code could surface fixes like "iOS Safari has a known IndexedDB Blob bug, switch to ArrayBuffer storage" without me prompting it.&lt;/li&gt;
&lt;li&gt;Codex is callable from CLI, so I can shoot off "implement this issue and open a PR" via Bash in one shot.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What I made sure to do&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Never let "I'd kinda want this" turn into "started implementing." I wrote an explicit rule in CLAUDE.md: implementation only starts on "implement #X."&lt;/li&gt;
&lt;li&gt;Reviews always go to a different model (Claude Code) so the implementer (Codex) doesn't grade its own homework.&lt;/li&gt;
&lt;li&gt;I don't say "merge it" — I say "ship it to main" / "release to prod" so deployments are an explicit step.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The "I'll just whip up a thing" mode is incredibly powerful. The flip side is that &lt;strong&gt;without checks, it'll happily over-build&lt;/strong&gt;, so the review-and-approval ritual is worth keeping.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrap-up
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Half a day → a working web app with payments, analytics, and a custom favicon&lt;/li&gt;
&lt;li&gt;Browser-only architecture wins on both server cost and privacy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude Code (PM/reviewer) + Codex (implementer)&lt;/strong&gt; as a two-agent setup gives you both speed and quality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're a Garmin user curious to throw your own activity data at ChatGPT, Gemini, or Claude, give it a try.&lt;/p&gt;

&lt;p&gt;🔗 &lt;strong&gt;&lt;a href="https://garmin-ai-export.vercel.app" rel="noopener noreferrer"&gt;https://garmin-ai-export.vercel.app&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>nextjs</category>
      <category>typescript</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Building Production AI Agents with Google Cloud ADK + Claude [30-min Workshop]</title>
      <dc:creator>naoki_JPN</dc:creator>
      <pubDate>Thu, 07 May 2026 07:15:54 +0000</pubDate>
      <link>https://dev.to/bokuno_log/building-production-ai-agents-with-google-cloud-adk-claude-30-min-workshop-1lll</link>
      <guid>https://dev.to/bokuno_log/building-production-ai-agents-with-google-cloud-adk-claude-30-min-workshop-1lll</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This article summarizes the following X post video (approx. 30 min) in English.&lt;br&gt;
Speaker: Ivan Nardini (Google Cloud Developer Relations Engineer, AI/ML) / Recorded at an Anthropic-hosted event.&lt;br&gt;
Original YouTube: &lt;a href="https://www.youtube.com/watch?v=TUysIAtxyrQ" rel="noopener noreferrer"&gt;Building AI agents with Claude in Google Cloud's Vertex AI | Code w/ Claude&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;You've built an AI agent — but can't ship it to production. That's the wall Ivan Nardini (Google Cloud) dismantles in this 30-minute workshop.&lt;/p&gt;

&lt;p&gt;Using ADK, MCP, Vertex AI Agent Engine, and A2A Protocol, he walks through building and deploying a multi-agent system powered by Claude — end to end.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why AI Agents Are Hard to Productionize
&lt;/h2&gt;

&lt;p&gt;Prototypes are easy. Production is hard. Three root causes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Challenge&lt;/th&gt;
&lt;th&gt;Details&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Fragmented landscape&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Too many frameworks — unclear what to choose&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hard to integrate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cross-framework agent communication is complex&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lack of ops &amp;amp; governance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Monitoring, logging, and scaling must all be hand-rolled&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Google Cloud's &lt;strong&gt;Agentic Stack&lt;/strong&gt; is designed to solve all three.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Google Cloud Agentic Stack
&lt;/h2&gt;

&lt;p&gt;Four layers, each targeting one of the above challenges:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Development Kit (ADK)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Open-source, code-first agent development framework&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Open protocol standardizing how apps provide context to LLMs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vertex AI Agent Engine&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Managed platform for deploying and scaling agents in production&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent2Agent (A2A) Protocol&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Open standard enabling cross-framework agent collaboration&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Demo 1: Build Your First Agent with 3 Files
&lt;/h2&gt;

&lt;p&gt;Using a birthday planner agent as the example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LlmAgent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.models.anthropic_llm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Claude&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.models.registry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LLMRegistry&lt;/span&gt;

&lt;span class="n"&gt;root_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LlmAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;birthday_planner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-3-7-sonnet@20250219&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;An agent that helps plan birthday parties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Handle guest lists, venue suggestions, and scheduling...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Just three files: &lt;code&gt;agent.py&lt;/code&gt;, &lt;code&gt;.env&lt;/code&gt;, &lt;code&gt;requirements.txt&lt;/code&gt;. One command to run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;adk run birthday_planner    &lt;span class="c"&gt;# CLI interaction&lt;/span&gt;
adk web                     &lt;span class="c"&gt;# Browser UI + debug view&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ADK supports &lt;code&gt;LlmAgent&lt;/code&gt;, &lt;code&gt;SequentialAgent&lt;/code&gt;, and other patterns — compatible with Claude, Gemini, and more.&lt;/p&gt;




&lt;h2&gt;
  
  
  Demo 2: Go Multi-Agent with MCP
&lt;/h2&gt;

&lt;p&gt;To extend the birthday planner to also schedule calendar events, you add two more agents and an orchestrator:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BirthdayPlannerAgent&lt;/strong&gt; — party suggestions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CalendarServiceAgent&lt;/strong&gt; — calendar operations via MCP server&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EventOrganizerAgent&lt;/strong&gt; — routes requests to the right agent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Connecting an MCP server is two lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;mcp_tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exit_stack&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;MCPToolset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_server&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;connection_params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;SseServerParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MCP_CALENDAR_SERVER_URL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LlmAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CalendarServiceAgent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-3-7-sonnet@20250219&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mcp_tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Any existing MCP server can be plugged in as a tool. The orchestrator auto-routes requests based on agent descriptions.&lt;/p&gt;
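&lt;p&gt;To make the routing idea concrete, here's a rough plain-Python sketch (names and logic are illustrative; ADK's real orchestrator has the LLM choose a sub-agent based on the descriptions, which keyword overlap only approximates):&lt;/p&gt;

```python
import re

# Illustrative sketch only: ADK's orchestrator asks the LLM to pick a
# sub-agent from these descriptions; keyword overlap stands in for that.
AGENTS = {
    "BirthdayPlannerAgent": "suggest party themes, venues, and guest activities",
    "CalendarServiceAgent": "create, list, and update calendar events",
}

def _words(text: str) -> set:
    """Lowercase word set, punctuation stripped."""
    return set(re.findall(r"[a-z]+", text.lower()))

def route(request: str) -> str:
    """Pick the agent whose description overlaps most with the request."""
    return max(AGENTS, key=lambda name: len(_words(request) & _words(AGENTS[name])))

print(route("list my calendar events"))  # CalendarServiceAgent
print(route("suggest party themes"))     # BirthdayPlannerAgent
```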




&lt;h2&gt;
  
  
  Demo 3: Deploy to Vertex AI Agent Engine
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;agent_engines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;root_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;requirements&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google-cloud-aiplatform[adk]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What you get automatically after deploy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Observability&lt;/strong&gt; via Cloud Trace / Logging / Monitoring&lt;/li&gt;
&lt;li&gt;Session management (persistent conversation history)&lt;/li&gt;
&lt;li&gt;Integration with Vertex AI Evaluation Service for continuous improvement&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Works with LangGraph, LangChain, LlamaIndex, and CrewAI too — not just ADK.&lt;/p&gt;




&lt;h2&gt;
  
  
  Bonus: A2A Protocol — Cross-Framework Agent Communication
&lt;/h2&gt;

&lt;p&gt;When you need a LangChain agent and an ADK agent to collaborate, you need a shared language: &lt;strong&gt;Agent2Agent (A2A) Protocol&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Two core concepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agent Card&lt;/strong&gt;: A digital business card for the agent — lets other agents discover what it can do&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent Skills&lt;/strong&gt;: Describes the agent's specific capabilities and API&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Built on HTTP and JSON-RPC, with enterprise-ready security included.&lt;/p&gt;
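&lt;p&gt;As a rough illustration, an Agent Card is just a small JSON document that other agents can fetch over HTTP. A minimal sketch (field names follow the spec's general shape and the endpoint URL is hypothetical; consult the A2A spec for the exact schema):&lt;/p&gt;

```python
import json

# Rough shape of an A2A Agent Card. Illustrative only: field names follow
# the spec's general structure, and the URL below is a hypothetical endpoint.
agent_card = {
    "name": "CalendarServiceAgent",
    "description": "Manages calendar events for the event organizer.",
    "url": "https://agents.example.com/calendar",  # hypothetical endpoint
    "version": "1.0.0",
    "skills": [
        {
            "id": "create_event",
            "name": "Create calendar event",
            "description": "Creates an event with a title, date, and guests.",
        }
    ],
}

# Other agents fetch this card over HTTP to discover what this agent can do.
print(json.dumps(agent_card, indent=2))
```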




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Takeaway&lt;/th&gt;
&lt;th&gt;Detail&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ADK: 3 files, 1 command&lt;/td&gt;
&lt;td&gt;Fastest path to a working agent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP: 2 lines&lt;/td&gt;
&lt;td&gt;Plug in any existing MCP server as a tool&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent Engine: zero-ops deploy&lt;/td&gt;
&lt;td&gt;Observability, scaling, sessions — all managed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A2A: break the framework wall&lt;/td&gt;
&lt;td&gt;Claude, Gemini, LangChain, CrewAI can coexist&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;ADK + MCP + Agent Engine + A2A gives you a complete stack from local dev to production scale.&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>claude</category>
      <category>agents</category>
      <category>mcp</category>
    </item>
    <item>
      <title>Anthropic's Prompting 101 — A Practical Guide to Building Production-Quality Claude Prompts</title>
      <dc:creator>naoki_JPN</dc:creator>
      <pubDate>Fri, 01 May 2026 05:27:23 +0000</pubDate>
      <link>https://dev.to/bokuno_log/anthropics-prompting-101-a-practical-guide-to-building-production-quality-claude-prompts-23k4</link>
      <guid>https://dev.to/bokuno_log/anthropics-prompting-101-a-practical-guide-to-building-production-quality-claude-prompts-23k4</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This article is an English translation of a Japanese summary of a ~25-minute video posted by &lt;a href="https://x.com/jota_snchez" rel="noopener noreferrer"&gt;@jota_snchez&lt;/a&gt; on X. Original video: &lt;a href="https://x.com/jota_snchez/status/2049898145346105395" rel="noopener noreferrer"&gt;https://x.com/jota_snchez/status/2049898145346105395&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Hannah Moran and Christian Ryan from Anthropic's Applied AI Team walk through prompt engineering best practices with live console demos.&lt;/p&gt;

&lt;p&gt;Using a real customer case — having Claude analyze Swedish car accident insurance forms — they show how to evolve a prompt through five versions, going from "Claude thinks it's a ski accident" to production-quality structured output. Every iteration is highly instructive.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Basic Prompt Structure
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjkzcy6cxecho0fykdvky.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjkzcy6cxecho0fykdvky.jpg" alt="Prompt structure with 5 elements (5:00)" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Anthropic recommends organizing prompts around 5 core elements:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Element&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Task description&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1–2 sentences defining Claude's role and the task&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Dynamic content&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Data, images, or retrieved information to process&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Detailed instructions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Step-by-step guidance on how to approach the task&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Examples (optional)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Few-shot samples&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Reminder of critical points&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Restate the most important rules at the end&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; For long prompts, &lt;strong&gt;repeating critical instructions at the end&lt;/strong&gt; is especially effective.&lt;/p&gt;
&lt;/blockquote&gt;
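&lt;p&gt;The five elements can be assembled as a plain template. A minimal sketch (the section wording is invented for illustration):&lt;/p&gt;

```python
# Sketch of the 5-element structure as a plain template; the section text
# here is invented for illustration, not copied from the demo.
def build_prompt(form_image_ref: str, sketch_image_ref: str) -> str:
    return "\n\n".join([
        # 1. Task description
        "You are an assistant for an auto insurance company, analyzing "
        "Swedish accident report forms to determine fault.",
        # 2. Dynamic content, delimited with XML tags
        f"<form>{form_image_ref}</form>",
        f"<sketch>{sketch_image_ref}</sketch>",
        # 3. Detailed step-by-step instructions
        "First list every checked box on the form, then analyze the sketch, "
        "then deliver your verdict.",
        # 4. (Few-shot examples would go here.)
        # 5. Reminder of critical points, repeated at the end
        "Remember: do not make a determination unless you are confident.",
    ])

prompt = build_prompt("[form image]", "[sketch image]")
print(prompt)
```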




&lt;h2&gt;
  
  
  Organizing Information: XML Tags
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchmqralhglingoowkfbm.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchmqralhglingoowkfbm.jpg" alt="How to organize information in prompts with XML tags (10:00)" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Claude excels with structured information. Anthropic's top recommendation is using &lt;strong&gt;XML tags&lt;/strong&gt; as delimiters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;user_preferences&amp;gt;&lt;/span&gt;
  {{USER_PREFERENCES}}
&lt;span class="nt"&gt;&amp;lt;/user_preferences&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Explicitly declares what's inside the tags&lt;/li&gt;
&lt;li&gt;Makes it easier for Claude to reference that information later in the prompt&lt;/li&gt;
&lt;li&gt;Clearer boundaries than Markdown, and more token-efficient&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Disorganized prompts are hard for Claude to parse and degrade output quality. XML tags alone can make a significant difference.&lt;/p&gt;




&lt;h2&gt;
  
  
  Live Demo: Building a Prompt Step by Step
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7qgxka0qmeiai7c1tba5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7qgxka0qmeiai7c1tba5.jpg" alt="6 steps to build a great prompt from scratch in the console (15:00)" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The demo task: "Determine which vehicle is at fault from a Swedish car accident report form." They built the prompt by adding elements one at a time in the console.&lt;/p&gt;

&lt;h3&gt;
  
  
  V1 → V2: Adding Task Context and Tone
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;V1's problem&lt;/strong&gt;: Claude output "a ski accident occurred on Chapman Gotham Street" — a wild miss because there was zero background context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What V2 added&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This is an auto insurance claims processing system&lt;/li&gt;
&lt;li&gt;Inputs are a Swedish accident report form and a hand-drawn sketch&lt;/li&gt;
&lt;li&gt;Do not make a determination if not confident (hallucination prevention)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;→ Claude now correctly identifies it as a car accident, but the verdict is still vague due to missing information.&lt;/p&gt;

&lt;h3&gt;
  
  
  V3: Adding Background Information to the System Prompt
&lt;/h3&gt;

&lt;p&gt;Added the form's structure (17 checkboxes, two columns for Vehicle A and B) to the system prompt.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; &lt;strong&gt;Static information belongs in the system prompt.&lt;/strong&gt;&lt;br&gt;
The form structure never changes. This type of static background is ideal for the system prompt — and &lt;strong&gt;maximizes prompt caching effectiveness&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;→ Form reading accuracy improved. Claude issued its first clear verdict: "Vehicle B is at fault."&lt;/p&gt;

&lt;h3&gt;
  
  
  V4: Detailed Step-by-Step Instructions (Order Matters)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. First, carefully examine the form and list every checked box
2. Then analyze the sketch (informed by what you learned from the form)
3. Deliver your final verdict
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;"Read the form before the sketch" is the critical ordering.&lt;/strong&gt; A hand-drawn sketch alone is meaningless — but once you've read the form and know you're dealing with a car accident, the sketch makes sense. Mirror the order a human would naturally work through this.&lt;/p&gt;

&lt;h3&gt;
  
  
  V5: Specifying Output Format
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F62cbj15n9a6xsv4k5am9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F62cbj15n9a6xsv4k5am9.jpg" alt="Anthropic Console V5 demo (21:00)" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Wrap your final verdict in &amp;lt;final_verdict&amp;gt; XML tags.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ The application can now extract just the information it needs (the verdict) from the XML tag. &lt;strong&gt;Ski accident misread → ambiguous → confident structured output&lt;/strong&gt; — the evolution is complete.&lt;/p&gt;
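&lt;p&gt;On the application side, extraction then becomes a one-liner. A minimal sketch (the response text is invented for illustration):&lt;/p&gt;

```python
import re

# Sketch: once the verdict is wrapped in <final_verdict> tags, the app can
# pull out just that field and ignore the surrounding reasoning text.
def extract_verdict(response: str):
    match = re.search(r"<final_verdict>(.*?)</final_verdict>", response, re.DOTALL)
    return match.group(1).strip() if match else None

response = (
    "Based on the checked boxes and the sketch, Vehicle B turned across "
    "Vehicle A's path.\n<final_verdict>Vehicle B is at fault.</final_verdict>"
)
print(extract_verdict(response))  # Vehicle B is at fault.
```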




&lt;h2&gt;
  
  
  Additional Techniques
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Few-Shot Examples
&lt;/h3&gt;

&lt;p&gt;Label difficult edge cases with human annotations and add them as examples. Images can be Base64-encoded and included in the samples. Production systems often carry dozens to hundreds of examples.&lt;/p&gt;
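&lt;p&gt;A sketch of how an annotated example image might be packaged (the dict follows the general shape of Anthropic's image content blocks; check the current API docs for the exact schema):&lt;/p&gt;

```python
import base64

# Sketch of packaging an example image for a few-shot message. The dict
# follows the general shape of Anthropic's image content blocks; check the
# current API documentation for the exact schema.
fake_image_bytes = b"\xff\xd8\xff\xe0..."  # stand-in for real JPEG bytes
encoded = base64.b64encode(fake_image_bytes).decode("ascii")

example_block = {
    "type": "image",
    "source": {"type": "base64", "media_type": "image/jpeg", "data": encoded},
}

print(example_block["source"]["media_type"])  # image/jpeg
```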

&lt;h3&gt;
  
  
  Conversation History
&lt;/h3&gt;

&lt;p&gt;For user-facing applications, passing prior conversation history as context improves accuracy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pre-fill (Specifying the Start of Output)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fovxi21fjyi2cq2pzayvk.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fovxi21fjyi2cq2pzayvk.jpg" alt="Controlling response format with pre-fill (23:00)" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Set a starting string in the Assistant role to force Claude's output format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;final_verdict&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;# ← pre-fill
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude will continue from &lt;code&gt;&amp;lt;final_verdict&amp;gt;&lt;/code&gt;. The same works for forcing JSON output.&lt;/p&gt;
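&lt;p&gt;Mechanically, the API response continues from the assistant text you supplied, so the application reconstructs the full output by concatenation. A minimal sketch (the completion text is invented for illustration):&lt;/p&gt;

```python
# Sketch: with pre-fill, the model's response continues from the assistant
# text you supplied, so the full output is prefill + completion.
prefill = "<final_verdict>"
completion = "Vehicle B is at fault.</final_verdict>"  # illustrative model output

full_output = prefill + completion
verdict = full_output[len("<final_verdict>"):-len("</final_verdict>")]
print(verdict)  # Vehicle B is at fault.

# The same trick forces JSON: pre-fill with "{" and the model must continue
# with valid object members.
```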

&lt;h3&gt;
  
  
  Extended Thinking
&lt;/h3&gt;

&lt;p&gt;Available in Claude 3.7+. Claude's reasoning process appears in &lt;code&gt;&amp;lt;thinking&amp;gt;&lt;/code&gt; tags.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️ Warning:&lt;/strong&gt; Treat Extended Thinking as a &lt;strong&gt;diagnostic tool&lt;/strong&gt;, not a permanent crutch. Use it to identify where Claude struggles, then encode those reasoning steps as explicit instructions in the system prompt. That approach achieves the same quality without Extended Thinking — and uses fewer tokens.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Technique&lt;/th&gt;
&lt;th&gt;Effect&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Explicit task context&lt;/td&gt;
&lt;td&gt;Prevents off-base interpretations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Static info in system prompt&lt;/td&gt;
&lt;td&gt;Maximizes prompt caching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;XML tag structure&lt;/td&gt;
&lt;td&gt;Improves information retrieval accuracy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Specify processing order&lt;/td&gt;
&lt;td&gt;Mirrors human reasoning order&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Specify output format&lt;/td&gt;
&lt;td&gt;Simplifies app integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Few-shot examples&lt;/td&gt;
&lt;td&gt;Improves accuracy on hard cases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-fill&lt;/td&gt;
&lt;td&gt;Forces output format&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extended Thinking&lt;/td&gt;
&lt;td&gt;Visualizes reasoning for debugging&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Prompt engineering is an &lt;strong&gt;iterative empirical science&lt;/strong&gt;. Build test cases, find failure patterns, encode fixes into the system prompt — keep running this loop to reach production quality.&lt;/p&gt;




&lt;h2&gt;
  
  
  Full Video Transcript
&lt;/h2&gt;




&lt;h3&gt;
  
  
  Opening (0:00)
&lt;/h3&gt;

&lt;p&gt;Hey everyone, thank you for joining us today for Prompting 101. My name is Hannah, I'm part of the Applied AI Team at Anthropic. With me is Christian, also from the Applied AI Team. Today we're going to take you through some prompting best practices using a real-world scenario and build up a prompt together.&lt;/p&gt;

&lt;p&gt;Prompt Engineering is the practice of writing clear instructions for the model, giving the model the context it needs to complete a task, and thinking through how to arrange that information for the best result. The best way to learn this is just to practice doing it.&lt;/p&gt;

&lt;p&gt;We're using an example inspired by a real customer we worked with — analyzing images and having Claude make a judgment about what it finds there. I don't speak the language this content is in, but luckily Christian and Claude both do.&lt;/p&gt;




&lt;h3&gt;
  
  
  Scenario Introduction (1:00)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Christian:&lt;/strong&gt; Imagine you're working for an auto insurance company, dealing with car insurance claims daily. You have two pieces of information: a car accident report form in Swedish (17 checkboxes detailing what happened) and a hand-drawn sketch of how the accident occurred. We want to pass these to Claude and determine who is at fault.&lt;/p&gt;

&lt;p&gt;Let's start by just throwing them into the console and seeing what happens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Console settings:&lt;/strong&gt; claude-sonnet (latest model), temperature 0, large max token budget.&lt;/p&gt;

&lt;p&gt;First prompt: "This is an accident report form. Determine what happened and who is at fault."&lt;/p&gt;

&lt;p&gt;Result: Claude thinks it's a ski accident on "Chapman Gotham Street" — apparently a very common street name in Sweden. You can understand this: the prompt does nothing to set the stage about what's actually taking place. Claude's first guess isn't terrible, but we have a lot of intuition we can bake in.&lt;/p&gt;




&lt;h3&gt;
  
  
  Best Practices: Prompt Structure (4:00)
&lt;/h3&gt;

&lt;p&gt;Prompt engineering is an iterative empirical science. We could start from a test case where Claude needs to understand it's in a vehicular environment, not a skiing one, and iteratively build the prompt from there.&lt;/p&gt;

&lt;p&gt;Anthropic's recommended structure:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Task description&lt;/strong&gt; — tell Claude what it's here to do, its role, what task it's trying to accomplish&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic content&lt;/strong&gt; — in this case, the images; may also be information retrieved from another system&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detailed instructions&lt;/strong&gt; — almost like a step-by-step list of how we want Claude to tackle the reasoning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Examples&lt;/strong&gt; — here's an example piece of content; here's how you should respond&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repeat critical instructions&lt;/strong&gt; — review the information with Claude, emphasize things that are extra critical, then tell Claude to go ahead&lt;/li&gt;
&lt;/ol&gt;
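&lt;p&gt;The five-part structure above can be sketched as a simple assembly function. The section contents here are illustrative placeholders for the insurance scenario, not text from the talk; XML-style tags are used as delimiters:&lt;/p&gt;

```python
# Assembles a prompt in Anthropic's recommended order. All section
# text is a hypothetical placeholder for the insurance scenario.

def build_prompt(form_description: str, examples: list[str]) -> str:
    parts = [
        # 1. Task description: role and goal
        "You assist a human claims adjuster reviewing Swedish "
        "car accident report forms.",
        # 2. Dynamic content, wrapped in clear delimiters
        f"<form>{form_description}</form>",
        # 3. Detailed step-by-step instructions
        "First list every checked box on the form. "
        "Then interpret the sketch using that list.",
        # 4. Few-shot examples of tricky, human-labeled cases
        "<examples>" + "\n".join(examples) + "</examples>",
        # 5. Repeat the critical instruction last
        "Remember: do not guess if you are not fully confident.",
    ]
    return "\n\n".join(parts)

prompt = build_prompt("Vehicle A: box 1. Vehicle B: box 12.", [])
```

&lt;p&gt;The ordering matters as much as the content: static context first, then the dynamic material, then instructions and examples, with the critical rule repeated at the end.&lt;/p&gt;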




&lt;h3&gt;
  
  
  Building V2 (6:00)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Christian:&lt;/strong&gt; Starting with task context. We want to give clearer instructions and make sure Claude understands what we're doing. We also add tone: Claude should be factual and confident. If Claude can understand what it's looking at, we want that assessment to be as clear and confident as possible.&lt;/p&gt;

&lt;p&gt;Back in the console, V2 explicitly labels the data — this is a car accident report form with Vehicle A and Vehicle B in left and right columns. The system prompt specifies that this AI system assists a human claims adjuster reviewing Swedish car accident report forms. It should not make an assessment if it's not fully confident.&lt;/p&gt;

&lt;p&gt;Running it: Claude now correctly identifies this as a car accident — not skiing. It picks up that Vehicle A checked box 1 and Vehicle B checked box 12. Scrolling down, Claude still says information is missing for a fully confident determination. This is great — it's behaving as instructed. But Claude is still missing a lot of information about what the form actually entails.&lt;/p&gt;




&lt;h3&gt;
  
  
  V3: Background Information and Structure (9:00)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Hannah:&lt;/strong&gt; Next we add background data, documents, and images. We actually know a lot about this form — it will be the same every single time. This is a great type of information to put into the system prompt, and a great candidate for prompt caching since it will always be the same. This helps Claude spend less time figuring out what the form is each time.&lt;/p&gt;

&lt;p&gt;Claude loves structure and organization. XML tags let you label what's inside them — &lt;code&gt;&amp;lt;user_preferences&amp;gt;&lt;/code&gt; tells Claude that everything wrapped in those tags relates to user preferences. Claude understands all types of delimiters; we prefer XML because its boundaries are clear and it's token-efficient.&lt;/p&gt;

&lt;p&gt;In V3, we tell Claude everything about the form: it's a Swedish car accident form, it'll have this title, two columns representing different vehicles, and what each of the 17 rows means. We also tell it that humans fill this out — so it won't be perfect, people might put a circle, might scribble, might not put an X in the box.&lt;/p&gt;

&lt;p&gt;Running it: Claude spends less time narrating the form to us because it already knows what it is. It gives us a list of what's checked, and Claude now confidently says Vehicle B is at fault based on the form and the sketch.&lt;/p&gt;




&lt;h3&gt;
  
  
  V4: Detailed Instructions (14:00)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Hannah:&lt;/strong&gt; One thing we really highlight: examples. Few-shot examples are a really powerful mechanism for steering Claude. You can bake in concrete accidents that were tricky for Claude to get right — with human-labeled correct conclusions. You can include visual examples using Base64-encoded images. This is how you push the limits of your LLM application. If you're building this for an insurance company, you might have tens, maybe hundreds of examples of difficult edge cases.&lt;/p&gt;

&lt;p&gt;Conversation history: not used here, but for user-facing apps with long history, this is the right place to bring that in.&lt;/p&gt;

&lt;p&gt;Next step: a reminder of the immediate task and important guidelines. Preventing hallucinations — we don't want Claude to invent details it's not finding in the data. If the sketch is unintelligible and even a human couldn't figure it out, we want Claude to be able to say that.&lt;/p&gt;

&lt;p&gt;In V4, we keep the system prompt the same and add a detailed task list. The order in which Claude analyzes this information is very important. You'd probably not look at the drawing first — it's just boxes and lines without context. But if you read the form first, understand we're talking about a car accident and see checkboxes indicating what vehicles were doing, then you know how to interpret the drawing.&lt;/p&gt;

&lt;p&gt;So: first, carefully examine the form, make sure you can tell what boxes are checked, make a list. Then move to the sketch, informed by what you learned.&lt;/p&gt;

&lt;p&gt;Running it: Claude now very carefully examines each and every box. It gives structured XML output: form analysis, accident summary, sketch analysis. It continues to say Vehicle B appears to be clearly at fault. With more complicated drawings and less clarity in forms, this step-by-step thinking is really impactful.&lt;/p&gt;




&lt;h3&gt;
  
  
  V5: Output Format and Pre-fill (19:00)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Christian:&lt;/strong&gt; Final step: we keep the system prompt the same and add important guidelines. The summary should be clear, concise, and accurate, and nothing should get in the way of Claude's assessment. Then output formatting: wrap the final verdict in &lt;code&gt;&amp;lt;final_verdict&amp;gt;&lt;/code&gt; XML tags so the application can extract just the verdict.&lt;/p&gt;

&lt;p&gt;Running it: much more succinct. At the end, the output is wrapped in &lt;code&gt;&amp;lt;final_verdict&amp;gt;&lt;/code&gt; tags. We've gone from a skiing accident, to an uncertain answer, to a cautious but correct assessment, and now to a strictly formatted, confident output we can build a real application around.&lt;/p&gt;
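&lt;p&gt;On the application side, extracting the tagged verdict can be as simple as a regex over the response. A sketch, with a made-up response string:&lt;/p&gt;

```python
import re

# Pull the final verdict out of a model response that wraps it in
# <final_verdict> tags, ignoring the reasoning text around it.
def extract_verdict(response: str):
    match = re.search(r"<final_verdict>(.*?)</final_verdict>",
                      response, flags=re.DOTALL)
    return match.group(1).strip() if match else None

response = (
    "Based on the form and the sketch...\n"
    "<final_verdict>Vehicle B is at fault.</final_verdict>"
)
print(extract_verdict(response))  # Vehicle B is at fault.
```

&lt;p&gt;Returning &lt;code&gt;None&lt;/code&gt; when the tags are missing gives the application a clean signal that the model failed to follow the format.&lt;/p&gt;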

&lt;p&gt;&lt;strong&gt;Christian:&lt;/strong&gt; Another key way to shape output is pre-filled responses. If you want structured JSON output, you just add that Claude needs to begin its output with a certain format. In the Assistant field, write &lt;code&gt;&amp;lt;final_verdict&amp;gt;&lt;/code&gt; or &lt;code&gt;{&lt;/code&gt; — Claude will continue from where you left off. This gives you greater control over output formatting without the preamble.&lt;/p&gt;
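&lt;p&gt;Concretely, a pre-filled request is just a messages list whose last turn is a partial assistant message; the model continues from that point. This is a payload sketch, not an exact API reference — the model name and field values are illustrative:&lt;/p&gt;

```python
# Sketch of a pre-filled request body: the conversation ends with a
# partial assistant turn, so the model continues from "{" and emits
# JSON with no preamble. The model name is a placeholder.
request_body = {
    "model": "claude-sonnet-latest",
    "max_tokens": 1024,
    "messages": [
        {"role": "user",
         "content": "Assess the attached accident report."},
        # Pre-fill: the model's reply starts exactly here.
        {"role": "assistant", "content": "{"},
    ],
}

assert request_body["messages"][-1]["role"] == "assistant"
```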

&lt;p&gt;Finally: Extended Thinking, available in Claude 3.7+. You can use it as a crutch for prompt engineering — enable it to make sure Claude has time to think. The beauty is that you can analyze the thinking transcript to understand how Claude works through the data, then bake those steps into the system prompt itself. That's more token-efficient, and it's key to making your system prompt a lot better.&lt;/p&gt;

&lt;p&gt;Thank you everyone for coming. We'll be around all day for questions. Don't miss "Prompting for Agents" and the Claude plays Pokémon demo!&lt;/p&gt;

</description>
      <category>claude</category>
      <category>promptengineering</category>
      <category>anthropic</category>
      <category>ai</category>
    </item>
    <item>
      <title>OpenAI Codex Desktop Complete Guide — Mastering Skills, Plugins &amp; Automations</title>
      <dc:creator>naoki_JPN</dc:creator>
      <pubDate>Thu, 30 Apr 2026 03:38:50 +0000</pubDate>
      <link>https://dev.to/bokuno_log/openai-codex-desktop-complete-guide-mastering-skills-plugins-automations-45k5</link>
      <guid>https://dev.to/bokuno_log/openai-codex-desktop-complete-guide-mastering-skills-plugins-automations-45k5</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This article is an English translation of a Japanese summary of a ~103-minute video by &lt;a href="https://x.com/DeRonin_" rel="noopener noreferrer"&gt;@DeRonin_&lt;/a&gt; on X. Original video: &lt;a href="https://twitter.com/DeRonin_/status/2048823420977119727" rel="noopener noreferrer"&gt;https://twitter.com/DeRonin_/status/2048823420977119727&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The OpenAI Codex desktop app is a comprehensive AI agent platform that goes far beyond coding assistance — covering design, document creation, research, and automation. This article summarizes the full 103-minute guide video.&lt;/p&gt;




&lt;h2&gt;
  
  
  Core Features of the Codex Desktop App
&lt;/h2&gt;

&lt;p&gt;Here are the key features introduced at the start of the video.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3quw9szx00d3ys0uzib2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3quw9szx00d3ys0uzib2.jpg" alt="Codex app feature overview and demo screen (0:00)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Project Management and File Organization
&lt;/h3&gt;

&lt;p&gt;Codex manages chats in "project" units, each linked 1:1 to a local folder on your computer. Files generated through chat are automatically saved to an &lt;code&gt;outputs/&lt;/code&gt; folder inside the project directory, and any file in that folder can be referenced with &lt;code&gt;@filename&lt;/code&gt;. You can open the folder instantly via the "Open in Finder" button.&lt;/p&gt;

&lt;h3&gt;
  
  
  Parallel Multitasking
&lt;/h3&gt;

&lt;p&gt;You can run multiple chat threads simultaneously. Even while one agent is working, you can start new tasks in another chat. A blue dot notification appears when a task completes, so you can check results and give the next instruction right away.&lt;/p&gt;

&lt;h3&gt;
  
  
  Skills and Plugins
&lt;/h3&gt;

&lt;p&gt;Skills are "reusable recipes"; plugins are "installable packages that bring those recipes into Codex." Hundreds of pre-built plugins exist for services like Google Calendar, Gmail, Figma, and Remotion. You can also combine external APIs with the &lt;code&gt;skill creator&lt;/code&gt; to build your own custom skills. Once created, skills can be invoked in future sessions with &lt;code&gt;/skill-name&lt;/code&gt; or &lt;code&gt;@skill-name&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Automations
&lt;/h3&gt;

&lt;p&gt;Set up recurring tasks with natural language — for example, "Every Friday at 4am, summarize my weekly calendar and send it via email." You can view, test, and edit automations from the Automations tab.&lt;/p&gt;

&lt;h3&gt;
  
  
  Computer Control
&lt;/h3&gt;

&lt;p&gt;The agent literally controls your mouse and keyboard. This enables working with GUI apps that have no API, such as building apps in Xcode or navigating a browser.&lt;/p&gt;

&lt;h3&gt;
  
  
  In-App Image Generation
&lt;/h3&gt;

&lt;p&gt;Generate images from prompts and use them directly in your workflow. The video demonstrated generating product images for a shoe brand and 10 iOS app icon variations. Transparent background generation is also supported.&lt;/p&gt;

&lt;h3&gt;
  
  
  Steer Feature
&lt;/h3&gt;

&lt;p&gt;Even while an agent is processing, you can paste text or images and immediately redirect it ("fix this part"). Normally prompts queue up and wait their turn, but the "Steer" button lets you interrupt instantly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Terminal Integration (Claude Code)
&lt;/h3&gt;

&lt;p&gt;For design-heavy tasks, you can launch Claude Code from the terminal with &lt;code&gt;claude --dangerously-skip-permissions&lt;/code&gt;. In the video, Claude Code was used to finalize landing pages and slide decks when Codex's design precision reached its limits.&lt;/p&gt;

&lt;h3&gt;
  
  
  Canva Export
&lt;/h3&gt;

&lt;p&gt;Created PowerPoint files can be opened in Canva with one click for manual finishing of the last 5–10%.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Difference Between Skills and Plugins
&lt;/h2&gt;

&lt;p&gt;The video's mid-section used the Excalidraw skill to auto-generate a structure diagram.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zrbxc78bjikcett8ies.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zrbxc78bjikcett8ies.jpg" alt="Skill and Plugin structure diagrammed with Excalidraw (10:00)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Skill&lt;/th&gt;
&lt;th&gt;Plugin&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Definition&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A reusable workflow package for specific tasks&lt;/td&gt;
&lt;td&gt;A unit that installs additional functionality into Codex&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Role&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Bundles instructions, resources, and scripts to extend Codex's task-handling ability&lt;/td&gt;
&lt;td&gt;Bundles skills, apps, MCP Servers, and integrations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Purpose&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A recipe that ensures Codex executes workflows reliably&lt;/td&gt;
&lt;td&gt;Provides access to connected systems and packaged tools&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Simple way to remember:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Skill = reusable recipe&lt;/li&gt;
&lt;li&gt;Plugin = installable package that brings that recipe into Codex&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Design Tool Integration (Paper / Figma)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffc9bm3t2qyarauf2616r.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffc9bm3t2qyarauf2616r.jpg" alt="Landing page being auto-generated in Paper Alpha (30:00)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Codex integrates with &lt;strong&gt;Paper (Alpha)&lt;/strong&gt;, a Figma-like design tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Demo flow:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Prompt: "Using the new Noo Shoo company logo image, create a landing page directly in Paper"&lt;/li&gt;
&lt;li&gt;Codex confirms Paper MCP actions and selects a transparent hero image&lt;/li&gt;
&lt;li&gt;Codex auto-decides design direction: editorial-tech, warm near-black neutral, cyan accents&lt;/li&gt;
&lt;li&gt;Auto-builds 4 sections: Hero, Performance Strip, Product Story, CTA/Footer&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Paper is a design tool built for AI agent collaboration, offering more intuitive operation than direct Figma editing.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Automations
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxgvdbib3cezucxt4u6yo.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxgvdbib3cezucxt4u6yo.jpg" alt="Automation settings screen (35:00)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Automations can be created just by typing "do X every week" in chat. The video demonstrated two:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weekly Calendar Summary&lt;/strong&gt;&lt;br&gt;
After connecting Google Calendar and Gmail plugins, just say "Every Friday at 4am, summarize this week's schedule and send it via email." Done. You can immediately see when the next run is scheduled.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monthly YouTube Report&lt;/strong&gt;&lt;br&gt;
After creating a YouTube Researcher skill with the SuperData API, instruct: "On the last day of each month, use that skill to analyze this month's videos and compile them into a Word document." The resulting report includes hook analysis and a views-ranked table — delivered automatically.&lt;/p&gt;


&lt;h2&gt;
  
  
  Part 2 Highlights: 6 Projects in Parallel
&lt;/h2&gt;

&lt;p&gt;In the second half of the video, using "Chorus (an AI agent learning app)" as the subject, the following 6 projects were created simultaneously:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Tools Used&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;iOS App (design + implementation)&lt;/td&gt;
&lt;td&gt;Swift, Xcode, Supabase, mobile design skill&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Web Landing Page&lt;/td&gt;
&lt;td&gt;Tally, React, Claude Code, Vercel&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Launch Video&lt;/td&gt;
&lt;td&gt;Remotion plugin, Claude Code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Investor Deck&lt;/td&gt;
&lt;td&gt;PowerPoint skill, Claude Code, Canva&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;X Post Automation&lt;/td&gt;
&lt;td&gt;Typefully skill&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Project Plan&lt;/td&gt;
&lt;td&gt;Markdown (checklist)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key: after giving instructions to one task, move on to the next without waiting. Stacking tasks serially like this becomes effective multitasking.&lt;/p&gt;


&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The Codex desktop app is a comprehensive AI agent platform covering &lt;strong&gt;not just coding, but design, documents, research, and automation&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Skills + Plugins&lt;/strong&gt; — automate any workflow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automations&lt;/strong&gt; — fully automate recurring research and report creation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Design tool integration&lt;/strong&gt; — applicable to non-engineer workflows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multitasking&lt;/strong&gt; (give instructions, then move on) is the core skill of the AI era&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Codex + Claude Code combination&lt;/strong&gt;: Codex for general orchestration, Claude Code for design-precision tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The ability to choose models and processing load based on task size and precision requirements is another strength of Codex.&lt;/p&gt;


&lt;h2&gt;
  
  
  Detailed Video Guide
&lt;/h2&gt;


&lt;h3&gt;
  
  
  Part 1 — Mastering the Basics
&lt;/h3&gt;
&lt;h4&gt;
  
  
  Download and Project Management
&lt;/h4&gt;

&lt;p&gt;Search "Codex app download" in your browser and download from chatgpt.com. The initial screen looks like ChatGPT's chat interface, but the internals are completely different.&lt;/p&gt;

&lt;p&gt;Codex's standout feature is &lt;strong&gt;project management linked to local folders&lt;/strong&gt;. Before starting a chat, you specify which folder to work in. That folder becomes the "project," and all files created by the agent are auto-saved to its &lt;code&gt;outputs/&lt;/code&gt; folder.&lt;/p&gt;

&lt;p&gt;From the project side panel you can open the folder in Finder or reference files with &lt;code&gt;@filename&lt;/code&gt;. Even with 30+ projects, Command+G search lets you find any chat instantly by name or content.&lt;/p&gt;

&lt;p&gt;In permission settings, "Full Access" mode lets the agent work without approval prompts. The recommended defaults are GPT-5.4 model and Extra High processing load.&lt;/p&gt;


&lt;h4&gt;
  
  
  Using Skills and Plugins
&lt;/h4&gt;

&lt;p&gt;Skills and plugins are often confused, but the essential difference is:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skills&lt;/strong&gt; are "recipes" for the agent — step-by-step instructions for executing specific tasks. &lt;strong&gt;Plugins&lt;/strong&gt; are those recipes packaged into installable units. Think of it as "plugin = container for skills."&lt;/p&gt;

&lt;p&gt;To explore what a new plugin can do, the fastest approach is to open a new chat and ask: &lt;code&gt;@Figma tell me everything you can do with this plugin&lt;/code&gt;. Clicking the ▼ (caret) on the response shows the thinking process too.&lt;/p&gt;


&lt;h4&gt;
  
  
  Practical Demo: Automating with Google Calendar + Gmail
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zrbxc78bjikcett8ies.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zrbxc78bjikcett8ies.jpg" alt="Google Calendar integration and automation setup (15:00)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Installing the Google Calendar plugin is as simple as selecting "Google Calendar" from Plugins and signing in via browser.&lt;/p&gt;

&lt;p&gt;After connecting, these operations complete in a single conversation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;"List all my events this week" → All calendar events displayed&lt;/li&gt;
&lt;li&gt;"Send me a weekly summary by email" → Sent via Gmail immediately&lt;/li&gt;
&lt;li&gt;"Set this as an automation every Friday at 4am" → Registered as a weekly task&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Automations tab shows next run time, status, and a test run button. After creation, you can edit with natural language like "always use the Gmail skill."&lt;/p&gt;


&lt;h4&gt;
  
  
  Generating Designs with Figma and Paper MCP
&lt;/h4&gt;

&lt;p&gt;The Figma plugin's main use is converting existing Figma boards to code. It's not suited for the reverse direction (having AI generate designs and place them in Figma).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Paper&lt;/strong&gt; (Alpha) fills that role. It's a design tool built for AI agent collaboration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Using the new shoe PNG (no background), create a landing page in Paper"
↓
Codex calls Paper MCP and decides design direction
↓
Auto-builds 4 sections: Hero, Performance Strip, Product Story, CTA
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Setting chat to "mini-window" mode lets you float Codex minimized to the side while viewing Paper.&lt;/p&gt;

&lt;p&gt;Codex also has a &lt;strong&gt;Steer&lt;/strong&gt; feature. Normally prompts queue up while AI is working, but pressing "Steer" lets you interrupt instantly. You can paste a screenshot and say "this part is overlapping, fix it" — and the agent course-corrects mid-task.&lt;/p&gt;




&lt;h4&gt;
  
  
  Building a Custom Skill: YouTube Researcher
&lt;/h4&gt;

&lt;p&gt;By combining external APIs, you can add capabilities Codex doesn't have natively. Here's the process using a YouTube transcript-fetching skill as an example:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Find an API&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Ask Codex "give me the top 5 APIs for getting YouTube transcripts" — it suggests SuperData, Transcript API, YouTube Transcript.io, etc. SuperData is free up to 100 requests/month.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Create the skill&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In a new chat, enter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Use skill creator to build a skill that fetches and summarizes
the latest 10 video transcripts from a specific channel using SuperData API.
API key: [paste here]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Typing &lt;code&gt;skill creator&lt;/code&gt; activates a skill-creation focused mode.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Use the skill&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After creation, open a new chat and type "YouTube Researcher" to use it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Research Riley Brown's latest 10 YouTube videos,
get transcripts and compile them into a document.
Include which videos performed well, with hook (intro) analysis.
Add thumbnails too.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The resulting report includes hook win/loss analysis — "Claude is taking over" (urgency, big market shift) and "Claude Code Leak" rated highly, while vibe-coding videos showed low performance.&lt;/p&gt;

&lt;p&gt;Afterwards, it was automated: "On the last day of each month, use this skill to analyze this month's videos and auto-create a Word report."&lt;/p&gt;




&lt;h3&gt;
  
  
  Part 2 — Building 6 Projects in Parallel
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; From here, "Chorus (an AI agent learning app)" is used as the subject to demo building 6 projects simultaneously. The core is "give instructions, then move on to the next" — serial task accumulation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  How to Set Up a Project Plan
&lt;/h4&gt;

&lt;p&gt;First create a "My New Business" folder and start a new project. The plan is created from chat as a Markdown file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Attach a screenshot and say:

"Looking at this, create a checklist-format project plan.
 Items: iOS app, web landing page,
 mobile app design, launch video, investor deck,
 X post automation — 6 items total.
 Include the app idea at the top."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Chorus app concept: A platform for learning about AI agents. An iOS app providing tool comparisons, a skills library (copy-paste ready), and learning content.&lt;/p&gt;




&lt;h4&gt;
  
  
  iOS App: From Design to TestFlight
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Creating the screen design skill&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Just paste the instructions exported from claude.ai/design's new design tool into Codex and say "create a mobile design skill that can do the same thing" — your custom skill is complete.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Using the mobile design skill, create screens for the Chorus app
 in basic Apple style"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result shows a prototype link with a 4-tab mockup: Learn, Platforms, Skills, Saved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Building in Xcode&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="s"&gt;"Create a Swift mobile app called Chorus.
 For now just display 'Hello, this is Chorus' in the center of the screen.
 When done, open the Xcode project."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pressing "Play" in Xcode + iOS Simulator (or real device) reflects the latest build each time. After integrating the screen designs, connect Supabase.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Supabase Connection and Auth&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Supabase has become a de facto standard database for AI agent workflows. After configuring MCP and restarting Codex, the connection is picked up. Post-restart, say "create all tables once connected" — skill categories, platforms, skills, and saved items tables are auto-generated.&lt;/p&gt;

&lt;p&gt;Authentication was implemented with email + password (Google Sign-In was attempted first, but Supabase's native email auth was the fastest path). Turn off email confirmation in Supabase and you can sign in immediately.&lt;/p&gt;

&lt;p&gt;The video walkthrough finished with an upload to TestFlight.&lt;/p&gt;




&lt;h4&gt;
  
  
  Web Landing Page: Tally + React + Vercel
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Preparing the form (tally.so)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Create a waitlist form in tally.so using a template with name and email fields. Copy the "embed code" when done.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Running as a React app&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;I'm using tally.so. Embed this form in the site
 and run it locally as a React app. We'll do design later.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Styling with Claude Code&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Since Codex struggles with design, call Claude Code from the terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude &lt;span class="nt"&gt;--dangerously-skip-permissions&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Forget all the styling on this page.
 Look at the Chorus app code and match the fonts and design.
 Keep the Tally embed as is.
 Minimal text, simple, conversion-focused."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude Code dramatically improves it in minutes. When done, "Deploy to Vercel and give me a public link" completes the process.&lt;/p&gt;




&lt;h4&gt;
  
  
  Launch Video: Remotion Plugin
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffc9bm3t2qyarauf2616r.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffc9bm3t2qyarauf2616r.jpg" alt="Creating motion graphics video with Remotion (55:00)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Install the Remotion plugin and just type &lt;code&gt;@remotion&lt;/code&gt; in a new chat.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Create a launch video for the Chorus app.
 As a test video: take the attached app screen screenshots,
 put them in iPhone mockups on a white background with animation.
 Get it running on localhost."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Opening &lt;code&gt;localhost:3031&lt;/code&gt; shows the timeline editor. Time is specified in &lt;strong&gt;seconds.frames&lt;/strong&gt; format (e.g., &lt;code&gt;2.20&lt;/code&gt; = 2 seconds 20 frames).&lt;/p&gt;
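&lt;p&gt;A quick sketch of how that notation maps to absolute frame numbers. This helper is not part of the Remotion plugin; it assumes a 30 fps composition (the actual rate depends on your composition settings):&lt;/p&gt;

```python
def to_frame(timestamp: str, fps: int = 30):
    """Convert a 'seconds.frames' timestamp (e.g. '2.20') to an absolute frame index."""
    seconds, _, frames = timestamp.partition(".")
    return int(seconds) * fps + int(frames or 0)

print(to_frame("2.20"))  # frame 80: 2 seconds * 30 fps + 20 frames
```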

&lt;p&gt;You can steer corrections at any point during processing. Display gridlines and pass coordinates to the agent (e.g., "X axis 1040, Y axis 540") for precise positioning.&lt;/p&gt;

&lt;p&gt;For elements that demand design precision (animations, color cards, cut quality), delegating to Claude Code produces dramatically better results. For BGM, just attach an MP3 file to the message and say "add this at 50% volume."&lt;/p&gt;




&lt;h4&gt;
  
  
  Investor Deck: Chat Fork and Canva Integration
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Fork the chat&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Right-click the mobile app chat and select "Fork into Local" to create a new chat inheriting the same context. Rename it "Investor Deck" and start working.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Analyze the app's features, icon, and style,
 then create an investor slide deck with the same design.
 Use the PowerPoint skill.
 Research what investors want in April 2026 and match the style."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Refine with Claude Code&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude &lt;span class="nt"&gt;--dangerously-skip-permissions&lt;/span&gt;
&lt;span class="s2"&gt;"Look at this deck, reduce text, increase visuals.
 Add charts and diagrams for readability. Don't add more slides."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Export to Canva&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A "Canva" icon appears next to the PowerPoint file. Click it and Canva opens for final touches. Animations can be added too.&lt;/p&gt;




&lt;h4&gt;
  
  
  X Post Automation: Typefully Skill
&lt;/h4&gt;

&lt;p&gt;Get an API key for Typefully (a scheduling tool for managing multiple X/Twitter accounts) and instruct:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Research the Typefully API and create a skill for full control.
 Test with the Riley Brown account (identify with fruit emoji).
 API key: [paste here]"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the skill is complete, automate it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Set up an automation to create 3 X post drafts every morning.
 Use the Typefully control skill."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Final Results: All Tasks Summary
&lt;/h3&gt;

&lt;p&gt;Results achieved by the end of the video:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;iOS App&lt;/td&gt;
&lt;td&gt;Published to TestFlight (Learn, Platforms, Skills, Saved features)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Web Landing Page&lt;/td&gt;
&lt;td&gt;Live on Vercel, Tally form working&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Launch Video&lt;/td&gt;
&lt;td&gt;First draft complete with Remotion + Claude Code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Investor Deck&lt;/td&gt;
&lt;td&gt;Exported to Canva, manually polished&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;X Post Automation&lt;/td&gt;
&lt;td&gt;3 daily drafts scheduled&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Project Plan&lt;/td&gt;
&lt;td&gt;All 6 items checked off&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Key lesson from the video:&lt;br&gt;
AI agents can take 1–2 hours per task. Instead of waiting for one to finish, give a new agent new instructions and move on, then repeat. Continuously stacking up tasks this way is the core of AI-era productivity.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>openai</category>
      <category>codex</category>
      <category>ai</category>
      <category>automation</category>
    </item>
    <item>
      <title>How We Build Effective Agents — 3 Principles from Anthropic's Barry Zhang</title>
      <dc:creator>naoki_JPN</dc:creator>
      <pubDate>Tue, 28 Apr 2026 07:03:39 +0000</pubDate>
      <link>https://dev.to/bokuno_log/how-we-build-effective-agents-3-principles-from-anthropics-barry-zhang-gfo</link>
      <guid>https://dev.to/bokuno_log/how-we-build-effective-agents-3-principles-from-anthropics-barry-zhang-gfo</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Source:&lt;/strong&gt; This article summarizes the ~15-minute talk by Barry Zhang (Anthropic) posted by &lt;a href="https://x.com/Ronycoder/status/2048649743602237493" rel="noopener noreferrer"&gt;@Ronycoder&lt;/a&gt; on X. Presented at the "Agents at Work" event.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Barry Zhang at Anthropic co-authored the blog post "Building Effective Agents" with his colleague Erik Schluntz. In this talk, he goes deeper on three core ideas from that post and shares practical lessons on building AI agents in production.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3kuw0w5dsovjvvflhre.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3kuw0w5dsovjvvflhre.jpg" alt="How We Build Effective Agents - Barry Zhang / Anthropic" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three core principles:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Don't build agents for everything&lt;/li&gt;
&lt;li&gt;Keep it simple&lt;/li&gt;
&lt;li&gt;Think like your agent&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Evolution of AI Systems
&lt;/h2&gt;

&lt;p&gt;Barry opened with a recap of how AI systems have evolved.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ezm6jh7ce3hhxpqe9ux.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ezm6jh7ce3hhxpqe9ux.jpg" alt="Single-LLM Features → Workflows → Agents → ?" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Single-LLM Features&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Summarization, classification, extraction — one LLM call, simple and complete&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Workflows&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multiple LLM calls orchestrated by code in predefined control flows. Allows trading off cost vs. agency for better performance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agents&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;LLMs deciding their own trajectories, operating almost independently based on environment feedback&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Next phase (?)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;More general-purpose single agents, or multi-agent collaboration — still unknown&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The trend is clear: more agency means more capability and usefulness, but also higher cost, latency, and consequences of errors. This tension is the foundation for all three principles.&lt;/p&gt;

&lt;h2&gt;
  
  
  Principle 1: Don't Build Agents for Everything
&lt;/h2&gt;

&lt;p&gt;Agents are a way to scale complex and valuable tasks — not a drop-in upgrade for every use case. So when &lt;em&gt;should&lt;/em&gt; you build one? Barry presented a four-item checklist.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk7zqn30o6ovxcilkzi9a.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk7zqn30o6ovxcilkzi9a.jpg" alt="Idea 1: Don't build agents for everything" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Check 1: Task Complexity
&lt;/h3&gt;

&lt;p&gt;Agents thrive in &lt;strong&gt;ambiguous problem spaces&lt;/strong&gt;. If you can map out the entire decision tree fairly easily, just build that explicitly and optimize each node. It's more cost-effective and gives you more control.&lt;/p&gt;

&lt;h3&gt;
  
  
  Check 2: Task Value
&lt;/h3&gt;

&lt;p&gt;Agents consume a lot of tokens. If your budget per task is around 10 cents (e.g., a high-volume customer support system), that only affords 30–50 tool calls; use a workflow instead. On the other hand, if your first thought is "I don't care how many tokens I spend, I just want to get the task done," agents are the right call.&lt;/p&gt;
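&lt;p&gt;The arithmetic behind that rule of thumb, using an illustrative price that is an assumption here rather than any model's actual rate:&lt;/p&gt;

```python
# Assumed, illustrative pricing: $3 per million tokens.
price_per_token = 3.00 / 1_000_000

budget_usd = 0.10  # the 10-cents-per-task scenario
affordable_tokens = budget_usd / price_per_token
print(f"{affordable_tokens:,.0f} tokens")  # 33,333 tokens

# An agent re-reading a 10-20K-token context on every step exhausts
# that budget after only a few tool-call iterations, which is why a
# fixed workflow is the better fit at this price point.
tokens_per_step = 15_000  # assumed context size per inference step
print(int(affordable_tokens // tokens_per_step), "steps")  # 2 steps
```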

&lt;h3&gt;
  
  
  Check 3: De-risk Critical Capabilities
&lt;/h3&gt;

&lt;p&gt;Before running an agent, check for significant bottlenecks in its trajectory. For a coding agent: can it write good code, debug, and recover from errors? Gaps here aren't fatal but will multiply your cost and latency — reduce scope and try again.&lt;/p&gt;

&lt;h3&gt;
  
  
  Check 4: Cost of Error and Error Discovery
&lt;/h3&gt;

&lt;p&gt;If errors are high-stakes and hard to discover, it's difficult to trust an agent with autonomous actions. You can mitigate with read-only access or human-in-the-loop, but these also limit how well the agent scales.&lt;/p&gt;
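&lt;p&gt;One common shape for those mitigations is a thin gate in front of the tool layer: read-only tools run freely, while anything state-changing waits for a human. A minimal sketch; the tool names and the &lt;code&gt;approve&lt;/code&gt; callback are hypothetical:&lt;/p&gt;

```python
READ_ONLY_TOOLS = {"read_file", "search", "list_directory"}  # hypothetical tool names

def gated_call(tool_name, args, run_tool, approve):
    """Run read-only tools directly; route state-changing ones through a reviewer."""
    if tool_name in READ_ONLY_TOOLS:
        return run_tool(tool_name, args)
    if approve(tool_name, args):  # human-in-the-loop checkpoint
        return run_tool(tool_name, args)
    return f"Action '{tool_name}' was rejected by the reviewer."
```

&lt;p&gt;The trade-off from the talk shows up directly: every &lt;code&gt;approve&lt;/code&gt; call is a human interruption, which caps how far the agent can scale.&lt;/p&gt;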

&lt;h3&gt;
  
  
  Why Coding Is a Great Agent Use Case
&lt;/h3&gt;

&lt;p&gt;Barry walked through the checklist with coding as an example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Complexity&lt;/strong&gt;: Going from design doc to PR is clearly ambiguous and complex&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Value&lt;/strong&gt;: Good code is highly valuable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capability&lt;/strong&gt;: Models are already great at many parts of the coding workflow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verifiability&lt;/strong&gt;: Output is easily verified through unit tests and CI — this is probably why so many successful coding agents exist today&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Principle 2: Keep It Simple
&lt;/h2&gt;

&lt;p&gt;"Agents are models using tools in a loop."&lt;/p&gt;

&lt;p&gt;Barry defined agents with this single sentence and identified three components.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmeajo7y5awrpo0pf4gwr.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmeajo7y5awrpo0pf4gwr.jpg" alt="Agents are models using tools in a loop" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;env&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Environment&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;system_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Goals, constraints, and how to act&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Environment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The system the agent operates in (filesystem, browser, APIs, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tools&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Interface for the agent to take actions and get feedback&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;System Prompt&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Defines the agent's goals, constraints, and ideal behavior&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A coding agent, a search agent, and a Computer Use agent look completely different on the surface — but internally at Anthropic, they share &lt;strong&gt;almost the same code backbone&lt;/strong&gt;. Only the environment varies by use case; the real design decisions are which tools to offer and what the system prompt says.&lt;/p&gt;
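&lt;p&gt;The slide's pseudocode can be fleshed out into something runnable. Everything below is invented for illustration (the fake model, the toy environment); only the loop shape mirrors the talk:&lt;/p&gt;

```python
class Environment:
    """The system the agent operates in; here just a single state string."""
    def __init__(self):
        self.state = "start"

class Tools:
    """The interface through which the agent acts and gets feedback."""
    def __init__(self, env):
        self.env = env
    def run(self, action):
        self.env.state = f"observed result of: {action}"
        return self.env.state

def fake_llm(prompt):
    """Stand-in for a real model call; a production agent would hit an LLM API."""
    return "finish" if "observed" in prompt else "take_first_step"

env = Environment()
tools = Tools(env)
system_prompt = "Goals, constraints, and how to act. "

# The loop from the slide, with a termination condition so it halts.
for _ in range(10):
    action = fake_llm(system_prompt + env.state)
    if action == "finish":
        break
    env.state = tools.run(action)

print(env.state)  # observed result of: take_first_step
```

&lt;p&gt;Swapping the environment, the tool set, and the prompt is all it takes to turn this backbone into a coding, search, or Computer Use agent.&lt;/p&gt;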

&lt;h3&gt;
  
  
  Complexity Kills Iteration Speed
&lt;/h3&gt;

&lt;p&gt;The reason Barry emphasizes simplicity is empirical: any upfront complexity kills iteration speed. Build these three components first — the highest ROI comes from getting them right. Optimization comes later.&lt;/p&gt;

&lt;p&gt;Examples of optimization after the basics are solid:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Coding / Computer Use&lt;/strong&gt;: Cache the directory to reduce cost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Search (many tools)&lt;/strong&gt;: Parallelize tool calls to reduce latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All cases&lt;/strong&gt;: Present the agent's progress to gain user trust&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Principle 3: Think Like Your Agent
&lt;/h2&gt;

&lt;p&gt;"Put yourself in their &lt;del&gt;shoes&lt;/del&gt; context window"&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbaf6fdylzcbitgwwon5n.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbaf6fdylzcbitgwwon5n.jpg" alt="Put yourself in their context window — LLM agent loop diagram" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Agents can exhibit remarkably sophisticated behavior. But at each step, what the model is actually doing is &lt;strong&gt;running inference on a limited context of 10–20K tokens&lt;/strong&gt;. Everything the model knows about the current state of the world fits in that context window.&lt;/p&gt;

&lt;p&gt;Barry illustrated this with a Computer Use exercise:&lt;/p&gt;




&lt;p&gt;&lt;em&gt;"Imagine you are a Computer Use agent."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You receive a static screenshot and a poorly written description. The only way you can affect the environment is through your tools. When you attempt a click, the wait for inference and tool execution is equivalent to &lt;strong&gt;closing your eyes for 3–5 seconds and using the computer in the dark&lt;/strong&gt;. Then your eyes open and you see another screenshot. Whatever you did could have worked, or you might have shut the computer down. You just don't know.&lt;/p&gt;




&lt;p&gt;After going through this (mildly uncomfortable) exercise, what the agent actually needs becomes obvious:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Screen resolution (so it knows where to click)&lt;/li&gt;
&lt;li&gt;Recommended actions and limitations (guardrails to avoid unnecessary exploration)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Using Claude to Evaluate Claude
&lt;/h3&gt;

&lt;p&gt;Barry shared a practical technique: throw the entire agent trajectory into Claude and ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Why do you think we made this decision here? Is there anything we could do to help you make better decisions?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You can also pass your system prompt directly to the model and ask "Is anything ambiguous? Can you follow this?" This shouldn't replace your own understanding of the context — but it gives you a much closer perspective on how the agent sees the world.&lt;/p&gt;
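&lt;p&gt;Mechanically this is just one prompt wrapping the whole trajectory. A sketch of building it; the trajectory text is a placeholder, and the actual API call is left as a comment so the example stays self-contained:&lt;/p&gt;

```python
def build_critique_prompt(trajectory: str):
    """Wrap a full agent trajectory in the two diagnostic questions from the talk."""
    return (
        "Below is the full trajectory of an agent run.\n\n"
        f"{trajectory}\n\n"
        "Why do you think we made this decision here? "
        "Is there anything we could do to help you make better decisions?"
    )

prompt = build_critique_prompt("[step 1] searched docs\n[step 2] clicked the wrong button")
# With the official anthropic SDK this would then be sent along the lines of:
# client.messages.create(model="...", max_tokens=1024,
#                        messages=[{"role": "user", "content": prompt}])
```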

&lt;h2&gt;
  
  
  Three Open Questions for the Future
&lt;/h2&gt;

&lt;p&gt;Barry closed with three things that are always on his mind as an AI engineer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Budget-aware agents&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Unlike workflows, we don't have good control over cost and latency for agents. Defining and enforcing budgets in terms of time, money, and tokens would unlock many more production use cases.&lt;/p&gt;
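&lt;p&gt;There is no standard mechanism for this yet, but a crude version can be bolted onto the basic loop by metering tokens and wall-clock time. All names here are hypothetical:&lt;/p&gt;

```python
import time

def run_with_budget(step, max_tokens=50_000, max_seconds=60.0):
    """Run a hypothetical agent `step` callable, which returns (tokens_used, done),
    until it finishes or one of the budgets is exhausted."""
    spent_tokens = 0
    start = time.monotonic()
    while True:
        tokens, done = step()
        spent_tokens += tokens
        if done:
            return "completed", spent_tokens
        if spent_tokens >= max_tokens:
            return "token_budget_exhausted", spent_tokens
        if time.monotonic() - start >= max_seconds:
            return "time_budget_exhausted", spent_tokens

# A fake step that never finishes, to show the token budget kicking in:
status, spent = run_with_budget(lambda: (20_000, False))
print(status, spent)  # token_budget_exhausted 60000
```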

&lt;p&gt;&lt;strong&gt;2. Self-evolving tools&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We're already using models to help iterate on tool descriptions. This should generalize into a meta-tool where agents design and improve their own tool ergonomics — making agents dramatically more general-purpose.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Multi-agent collaboration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Barry is personally convinced we'll see significantly more multi-agent collaboration in production by end of 2026. The benefits are clear: parallelization, separation of concerns, sub-agents protecting the main agent's context window. But the open question is: how do agents actually communicate with each other? Moving from synchronous user-assistant turns to asynchronous communication and mutual recognition between agents is the next frontier.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Principle&lt;/th&gt;
&lt;th&gt;Key Point&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Don't build agents for everything&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Evaluate use cases on 4 dimensions: complexity, value, capability, error cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Keep it simple&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Start with just Environment + Tools + System Prompt, optimize later&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Think like your agent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Step into the context window and experience the world as your agent does&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Getting agents to work reliably is still hard — but much of that difficulty comes from developers not understanding what the agent sees and doesn't see. These three principles are a practical starting point for bridging that gap.&lt;/p&gt;




&lt;h2&gt;
  
  
  Full Transcript
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;The following is a translated and lightly edited transcript of the talk. Some recognition errors may be present.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  [0:00–5:00] Part 1: The Evolution of Agents and the "Don't Build" Checklist
&lt;/h3&gt;

&lt;p&gt;My name is Barry, and today we're going to be talking about how we build effective agents. About two months ago, Eric and I wrote a blog post called "Building Effective Agents." In there, we shared some opinionated takes on what an agent is and isn't, and we gave some practical learnings that we've gained along the way. Today, I'd like to go deeper on three core ideas from the blog post and provide you with some personal musings at the end.&lt;/p&gt;

&lt;p&gt;Here are those ideas. First, don't build agents for everything. Second, keep it simple. And third, think like your agents.&lt;/p&gt;

&lt;p&gt;Let's first start with a recap of how we got here. Most of us probably started building very simple features — things like summarization, classification, extraction. Really simple things that felt like magic two to three years ago and have now become table stakes. Then, as we got more sophisticated and as products matured, we got more creative. One model call often wasn't enough. So we started orchestrating multiple model calls in predefined control flows. This basically gave us a way to trade off cost and agency for better performance, and we called these workflows. We believe this is the beginning of agentic systems.&lt;/p&gt;

&lt;p&gt;Now, models are even more capable. And we're seeing more and more domain-specific agents start to pop up in production. Unlike workflows, agents can decide their own trajectory and operate almost independently based on environment feedback. This is going to be our focus today.&lt;/p&gt;

&lt;p&gt;It's probably a little bit too early to name what the next phase of agentic systems is going to look like, especially in production. Single agents could become a lot more general purpose and more capable, or we could start to see collaboration and delegation in multi-agent settings. Regardless, the broad trend is that as we give these systems more agency, they become more useful and more capable — but as a result, the cost, latency, and consequences of errors also go up.&lt;/p&gt;

&lt;p&gt;That brings us to the first point. Don't build agents for everything. We think of agents as a way to scale complex and valuable tasks — they shouldn't be a drop-in upgrade for every use case.&lt;/p&gt;

&lt;p&gt;We talked a lot about workflows in the blog post because we really like them. They're a great, concrete way to deliver value today. So when should you build an agent? Here's our checklist.&lt;/p&gt;

&lt;p&gt;The first thing to consider is the complexity of your task. Agents really thrive in ambiguous problem spaces. If you can map out the entire decision tree pretty easily, just build that explicitly and optimize every node. It's a lot more cost-effective and gives you a lot more control.&lt;/p&gt;

&lt;p&gt;Next is the value of your task. Agents will cost you a lot of tokens, so the task really needs to justify the cost. If your budget per task is around 10 cents — for example, you're building a high-volume customer support system — that only affords you 30 to 50 tool calls. In that case, just use a workflow to solve the most common scenarios. You'll capture the majority of the value from there. On the other hand, if your first thought is "I don't care how many tokens I spend, I just want to get the task done" — please see me after the talk, the go-to-market team will love you.&lt;/p&gt;

&lt;p&gt;From there, we want to de-risk the critical capabilities. Make sure there are no significant bottlenecks in the agent's trajectory. If you're doing a coding agent, make sure it's able to write good code, debug, and recover from errors. If you do have gaps, they probably won't be fatal, but they will multiply your cost and latency. In that case, reduce the scope and try again.&lt;/p&gt;

&lt;p&gt;Finally, the last important thing to consider is the cost of error and error discovery. If your errors are going to be high-stakes and very hard to discover, it's going to be very difficult to trust the agent to take actions autonomously. You can mitigate this by limiting scope, using read-only access, or adding human-in-the-loop — but this will also limit how well you can scale.&lt;/p&gt;

&lt;p&gt;Let's see this checklist in action. Why is coding a great agent use case? First, going from a design doc to a PR is obviously a very ambiguous and complex task. Second, we know good code has a lot of value. Third, many of us already use Claude for coding, so we know it's great at many parts of the coding workflow.&lt;/p&gt;

&lt;h3&gt;
  
  
  [5:00–10:00] Part 2: Keep It Simple — The Three-Component Design
&lt;/h3&gt;

&lt;p&gt;And last, coding has this really nice property where the output is easily verifiable through unit tests and CI. That's probably why we're seeing so many creative and successful coding agents right now.&lt;/p&gt;

&lt;p&gt;Once you find a good use case for agents, this is the second core idea — keep it as simple as possible. Let me show you what I mean. This is what agents look like to us. They are models using tools in a loop. In this frame, three components define what an agent really looks like.&lt;/p&gt;

&lt;p&gt;First is the environment — the system that the agent is operating in. Then we have a set of tools, which offer an interface for the agent to take action and get feedback. Then we have the system prompt, which defines the goals, the constraints, and the ideal behavior for the agent to work in this environment. Then the model gets called in a loop — and that's agents.&lt;/p&gt;

&lt;p&gt;We've learned the hard way to keep this simple, because any complexity upfront is really going to kill iteration speed. Focusing on just these three basic components is going to give you by far the highest ROI. Optimizations can come later.&lt;/p&gt;

&lt;p&gt;There are three agent use cases that we've built for ourselves or our customers. They look very different on the product surface, very different in scope, very different in capability. But they share almost exactly the same backbone — almost the exact same code.&lt;/p&gt;

&lt;p&gt;The environment largely depends on your use case. So really, the only two design decisions are: what are the tools you want to offer to the agent, and what is the prompt you want to instruct your agent to follow.&lt;/p&gt;

&lt;p&gt;Once you've figured out these three basic components, you have a lot of optimization to do. For coding and computer use, you might want to cache the directory to reduce cost. For search, where you have a lot of tools, you can parallelize to reduce latency. And for almost all of these, we want to make sure to present the agent's progress in a way that gains user trust. But that's it. Keep it as simple as possible as you're iterating. Build these three components first, then optimize once the behavior is stable.&lt;/p&gt;

&lt;h3&gt;
  
  
  [10:00–14:46] Part 3: Think Like Your Agent &amp;amp; Open Questions
&lt;/h3&gt;

&lt;p&gt;All right, this is the last idea — think like your agent. I've seen a lot of builders, myself included, who develop agents from their own perspective and get confused when agents make a mistake. It seems counterintuitive to us. That's why we always recommend getting into the agent's context window.&lt;/p&gt;

&lt;p&gt;Agents can exhibit some really sophisticated behavior — they can look incredibly complex. But at each step, what the model is doing is just running inference on a very limited set of context. Everything the model knows about the current state of the world is explained in that 10 to 20K tokens. It's really helpful to limit ourselves to that context and see if it's actually sufficient and coherent. This gives you a much better understanding of how agents see the world and helps bridge the gap between our understanding and theirs.&lt;/p&gt;

&lt;p&gt;Imagine for a second that we're Computer Use agents now. What we're going to get is a static screenshot and a very poorly written description. We have tools and a task. We can think and reason all we want, but the only thing that's going to take effect in the environment is our tools.&lt;/p&gt;

&lt;p&gt;So we attempt a click without really seeing what's happening. While inference is running and tool execution is happening, this is basically equivalent to closing our eyes for three to five seconds and using the computer in the dark. Then we open our eyes and see another screenshot. Whatever we did could have worked — or we could have shut down the computer. We just don't know. It's a huge leap of faith, and the cycle starts again.&lt;/p&gt;

&lt;p&gt;I highly recommend trying to do a full task from the agent's perspective. It's a fascinating and only mildly uncomfortable experience. But once you go through it, it becomes very clear what the agent would have actually needed. It's crucial to know the screen resolution. It's helpful to have recommended actions and limitations — guardrails to avoid unnecessary exploration.&lt;/p&gt;

&lt;p&gt;Fortunately, we're building systems that speak our language. So we can just ask Claude to understand Claude. You can ask if any instructions in your system prompt are ambiguous. You can throw in a tool description and see whether the agent knows how to use the tool. And one thing we do quite frequently is throw the entire agent's trajectory into Claude and ask: "Why do you think we made this decision here? Is there anything we can do to help you make better decisions?"&lt;/p&gt;

&lt;p&gt;This shouldn't replace your own understanding of the context, but it will help you gain a much closer perspective on how the agent is seeing the world.&lt;/p&gt;

&lt;p&gt;All right, I've spent most of the talk on practical stuff. Let me indulge myself and share one slide of personal musings — my view on how this might evolve, and some open questions I think we need to answer together as AI engineers.&lt;/p&gt;

&lt;p&gt;First, I think we need to make agents a lot more budget-aware. Unlike workflows, we don't have great control over cost and latency for agents. Figuring this out will enable many more use cases by giving us the control we need to deploy them in production. The open question is: what's the best way to define and enforce a budget in terms of time, money, and tokens?&lt;/p&gt;

&lt;p&gt;Next is the concept of self-evolving tools. We're already using models to help iterate on tool descriptions. But this should generalize into a meta-tool where agents can design and improve their own tool ergonomics. This will make agents a lot more general-purpose as they adopt the tools they need for each use case.&lt;/p&gt;

&lt;p&gt;Finally — and I don't think this is a hot take anymore — I have a personal conviction that we'll see a lot more multi-agent collaboration in production by end of this year. They're well-parallelized, they have nice separation of concerns, and having sub-agents will really protect the main agent's context window. But the big open question is: how do these agents actually communicate with each other? We're currently in this very rigid frame of mostly synchronous user-assistant turns. How do we expand from there and build asynchronous communication? Roles that afford agents to communicate and recognize each other? I think that's going to be a big open question as we explore this multi-agent future.&lt;/p&gt;

&lt;p&gt;If you forget everything I said today, here are the three takeaways. First, don't build agents for everything. Second, if you do find a good use case, keep it as simple as possible for as long as possible. And finally, as you iterate, try to think like your agent: gain its perspective and help it do its job.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>llm</category>
      <category>anthropic</category>
    </item>
    <item>
      <title>The Complete Guide to OpenAI Codex Desktop: Skills, Plugins, Automations &amp; Parallel Multitasking</title>
      <dc:creator>naoki_JPN</dc:creator>
      <pubDate>Tue, 28 Apr 2026 05:32:09 +0000</pubDate>
      <link>https://dev.to/bokuno_log/the-complete-guide-to-openai-codex-desktop-skills-plugins-automations-parallel-multitasking-556p</link>
      <guid>https://dev.to/bokuno_log/the-complete-guide-to-openai-codex-desktop-skills-plugins-automations-parallel-multitasking-556p</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Source:&lt;/strong&gt; This article summarizes the ~103-minute video by &lt;a href="https://x.com/DeRonin_/status/2048823420977119727" rel="noopener noreferrer"&gt;@DeRonin_&lt;/a&gt;. All credit goes to the original creator.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The OpenAI Codex desktop app is an AI agent platform that goes far beyond coding assistance, handling design, document creation, research, and automation in one place. This article is an English summary (translated from Japanese) of the complete ~103-minute guide video.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Features of the Codex Desktop App
&lt;/h2&gt;

&lt;p&gt;Here are the features introduced in the video.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3quw9szx00d3ys0uzib2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3quw9szx00d3ys0uzib2.jpg" alt="Codex app feature overview (around 0:00)" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Project Management &amp;amp; File Organization
&lt;/h3&gt;

&lt;p&gt;Codex manages chats in "project" units, each linked 1-to-1 with a local folder on your computer. Files created by the agent are automatically saved inside an &lt;code&gt;outputs/&lt;/code&gt; subfolder of the project folder, and any file in the folder can be referenced with &lt;code&gt;@filename&lt;/code&gt;. You can open the folder in Finder at any time from the side panel.&lt;/p&gt;

&lt;h3&gt;
  
  
  Parallel Multi-tasking
&lt;/h3&gt;

&lt;p&gt;Multiple chat threads can run simultaneously. While one agent is executing, you can start new tasks in separate chats. A blue dot notification appears when a task completes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Skills &amp;amp; Plugins
&lt;/h3&gt;

&lt;p&gt;Skills are "reusable recipes." Plugins are installable packages that bring those recipes into Codex. Hundreds of pre-built plugins exist (Google Calendar, Gmail, Figma, Remotion, etc.), and you can create custom skills by combining external APIs with the &lt;code&gt;skill creator&lt;/code&gt; keyword. Skills are invoked with &lt;code&gt;/skill-name&lt;/code&gt; or &lt;code&gt;@skill-name&lt;/code&gt; in subsequent sessions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Automations
&lt;/h3&gt;

&lt;p&gt;Set recurring tasks with plain language: "Every Friday at 4pm, summarize this week's calendar and email it to me." The Automations tab shows a list, allows test runs, and supports natural-language edits.&lt;/p&gt;

&lt;h3&gt;
  
  
  Computer Control
&lt;/h3&gt;

&lt;p&gt;The agent literally controls your mouse and keyboard. It can operate any GUI app — including Xcode builds and browser interactions — even without an API.&lt;/p&gt;

&lt;h3&gt;
  
  
  In-App Image Generation
&lt;/h3&gt;

&lt;p&gt;Generate images from prompts and use them directly in your workflow. The video demos generating product photos (for a futuristic shoe brand) and 10 iOS app icon options with transparent backgrounds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Steer (Real-Time Steering)
&lt;/h3&gt;

&lt;p&gt;While the agent is running, you can paste text or a screenshot and redirect it immediately using the "Steer" button. Unlike normal queuing, this injects your prompt mid-execution.&lt;/p&gt;

&lt;h3&gt;
  
  
  Terminal Integration (Claude Code)
&lt;/h3&gt;

&lt;p&gt;For design-heavy tasks, launch Claude Code from the integrated terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude &lt;span class="nt"&gt;--dangerously-skip-permissions&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The video shows switching to Claude Code when Codex's design quality hits its limits — landing pages and slide decks came out significantly better.&lt;/p&gt;

&lt;h3&gt;
  
  
  Canva Export
&lt;/h3&gt;

&lt;p&gt;Click the Canva icon next to any generated PowerPoint file to open it directly in Canva for final polish.&lt;/p&gt;




&lt;h2&gt;
  
  
  Skills vs. Plugins — The Difference
&lt;/h2&gt;

&lt;p&gt;The video used an Excalidraw skill to auto-generate this diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zrbxc78bjikcett8ies.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zrbxc78bjikcett8ies.jpg" alt="Skills vs. Plugins diagram (around 10:00)" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Skill&lt;/th&gt;
&lt;th&gt;Plugin&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Definition&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reusable workflow package for a specific task&lt;/td&gt;
&lt;td&gt;Installable unit that adds capabilities to Codex&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Role&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A recipe that tells Codex how to execute a workflow&lt;/td&gt;
&lt;td&gt;Bundles skills, apps, MCP Servers, and integrations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Purpose&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reliable execution recipe&lt;/td&gt;
&lt;td&gt;Provides access to connected systems and tools&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Simple way to remember: Skill = reusable recipe. Plugin = installable package that brings the recipe into Codex.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Design Tool Integration (Paper / Figma)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffc9bm3t2qyarauf2616r.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffc9bm3t2qyarauf2616r.jpg" alt="Paper (Alpha) auto-generating a landing page (around 30:00)" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Codex integrates with &lt;strong&gt;Paper (Alpha)&lt;/strong&gt;, a Figma-like design tool built specifically for AI agent workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Demo flow:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;"Using the Noo Shoo logo image, build a landing page directly on this Paper board"&lt;/li&gt;
&lt;li&gt;Codex calls Paper MCP, selects the transparent hero image&lt;/li&gt;
&lt;li&gt;Auto-decides design direction: editorial-tech, warm near-black neutral, cyan accent&lt;/li&gt;
&lt;li&gt;Builds 4 sections automatically: Hero, Performance Strip, Product Story, CTA/Footer&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Paper is designed specifically for AI agent collaboration. It's more intuitive than direct Figma editing for generative design tasks.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Automations in Practice
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxgvdbib3cezucxt4u6yo.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxgvdbib3cezucxt4u6yo.jpg" alt="Automations setup screen (around 35:00)" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Creating an automation is as simple as typing "do X every Y." The video demos two:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weekly Calendar Summary&lt;/strong&gt;&lt;br&gt;
After connecting Google Calendar and Gmail plugins: "Every Friday at 4pm, summarize this week's events and email them to me." That's it — the automation is registered and the next scheduled run is visible immediately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monthly YouTube Report&lt;/strong&gt;&lt;br&gt;
After building a YouTube Researcher skill (SuperData API): "On the last day of every month, analyze that month's videos and generate a Word doc with hook analysis and a view-count table." Delivered automatically every month.&lt;/p&gt;


&lt;h2&gt;
  
  
  Part 2 Highlight: Building 6 Projects in Parallel
&lt;/h2&gt;

&lt;p&gt;In the second half of the video, six projects for the &lt;strong&gt;Chorus app&lt;/strong&gt; (an AI agent learning platform) were built simultaneously.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Tools Used&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;iOS App (design + implementation)&lt;/td&gt;
&lt;td&gt;Swift · Xcode · Supabase · Mobile Design Skill&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Web Landing Page&lt;/td&gt;
&lt;td&gt;Tally · React · Claude Code · Vercel&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Launch Video&lt;/td&gt;
&lt;td&gt;Remotion Plugin · Claude Code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Investor Deck&lt;/td&gt;
&lt;td&gt;PowerPoint Skill · Claude Code · Canva&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;X Post Automation&lt;/td&gt;
&lt;td&gt;Typefully Skill&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Project Plan&lt;/td&gt;
&lt;td&gt;Markdown (checklist)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key insight: once you send instructions to an agent, don't wait — move to the next task immediately. This "serial tasking" pattern is what makes AI-era multitasking work.&lt;/p&gt;


&lt;h2&gt;
  
  
  Detailed Video Guide
&lt;/h2&gt;


&lt;h3&gt;
  
  
  Part 1 — Mastering the Basics
&lt;/h3&gt;
&lt;h4&gt;
  
  
  Downloading &amp;amp; Project Management
&lt;/h4&gt;

&lt;p&gt;Search "Codex app download" in your browser and download from chatgpt.com. The first-time interface looks like ChatGPT, but the depth is entirely different.&lt;/p&gt;

&lt;p&gt;Codex's biggest feature is &lt;strong&gt;project management tied to local folders&lt;/strong&gt;. Before starting a chat, you specify which folder to work in. That folder becomes the "project," and all agent-created files are auto-saved to its &lt;code&gt;outputs/&lt;/code&gt; subfolder.&lt;/p&gt;

&lt;p&gt;From the side panel you can open the folder in Finder or reference files inside with &lt;code&gt;@filename&lt;/code&gt;. Even with 30+ projects, Command+G lets you search by chat name or content instantly.&lt;/p&gt;

&lt;p&gt;Set permissions to "Full Access" so the agent works without approval prompts. The video recommends the GPT-5.4 model with the Extra High effort level as defaults.&lt;/p&gt;


&lt;h4&gt;
  
  
  How to Use Skills and Plugins
&lt;/h4&gt;

&lt;p&gt;The core distinction:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skills&lt;/strong&gt; are "recipes" — instructions telling the agent how to execute a specific task. &lt;strong&gt;Plugins&lt;/strong&gt; are installable packages that bring those recipes into Codex. Think of a plugin as "the container for a skill."&lt;/p&gt;

&lt;p&gt;To learn what a new plugin can do, open a new chat and ask: &lt;code&gt;@Figma tell me everything you can do with this plugin&lt;/code&gt;. Click the caret (▼) in the response to see the agent's reasoning.&lt;/p&gt;


&lt;h4&gt;
  
  
  Hands-On: Automating with Google Calendar + Gmail
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zrbxc78bjikcett8ies.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zrbxc78bjikcett8ies.jpg" alt="Google Calendar integration and automation setup (around 15:00)" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Installing the Google Calendar plugin takes seconds: Plugins → Google Calendar → sign in via browser.&lt;/p&gt;

&lt;p&gt;Once connected, this entire workflow happens in one conversation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;"List this week's events for me" → Full calendar response&lt;/li&gt;
&lt;li&gt;"Email me a weekly summary" → Sent via Gmail immediately&lt;/li&gt;
&lt;li&gt;"Make this an automation for every Friday at 4pm" → Weekly task registered&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Automations tab shows next run time, status, and a test button. You can edit automations later with natural language.&lt;/p&gt;


&lt;h4&gt;
  
  
  Generating Designs with Figma and Paper MCP
&lt;/h4&gt;

&lt;p&gt;Figma's main use case is converting an &lt;strong&gt;existing&lt;/strong&gt; Figma board into code. It's less suited for having the AI generate designs and place them into Figma.&lt;/p&gt;

&lt;p&gt;That's where &lt;strong&gt;Paper (Alpha)&lt;/strong&gt; comes in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Using the new shoe PNG (no background), create a landing page on the open Paper board"
↓
Codex calls Paper MCP, decides design direction
↓
Auto-builds: Hero · Performance Strip · Product Story · CTA
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Right-click a chat and select "Open in mini window" to float the chat sidebar while working in Paper.&lt;/p&gt;

&lt;p&gt;Codex also has a &lt;strong&gt;Steer&lt;/strong&gt; feature. Normally, typing a prompt while the agent is running queues it for later. Clicking "Steer" injects your message immediately — great for pasting a screenshot and saying "this button is overlapping, fix it while you work."&lt;/p&gt;




&lt;h4&gt;
  
  
  Building a Custom Skill: YouTube Researcher
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Find an API&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Ask Codex: "Suggest the top 5 APIs for pulling YouTube transcripts." It returns SuperData, Transcript API, YouTube Transcript.io, and others; SuperData offers 100 free requests per month.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Create the skill&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In a new chat:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Use skill creator to build a skill that uses the SuperData API
to fetch and summarize the latest 10 videos from any YouTube channel.
API key: [paste here]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;skill creator&lt;/code&gt; keyword activates Codex's skill-building mode.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Use the skill&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Open a new chat and type &lt;code&gt;YouTube Researcher&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Look up Riley Brown's latest 10 YouTube videos,
pull the transcripts, and create a document.
Include which videos performed well, a hook analysis, and thumbnails.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The resulting report showed "Claude is taking over (high urgency, large market shift)" and "Claude Code Leak" as top-performing hooks. Vibe coding content underperformed. Then this was automated: "On the last day of every month, run this skill and create a Word report."&lt;/p&gt;




&lt;h3&gt;
  
  
  Part 2 — Building 6 Projects Simultaneously
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The following demos use "Chorus" (an AI agent learning iOS app) as the subject. The core idea: send instructions, then move to the next task without waiting.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Planning the Project
&lt;/h4&gt;

&lt;p&gt;Create a "My New Business" folder and start a new project. Create the plan as a Markdown file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Attach screenshot]

"Create a checklist-style plan from this.
 Items: iOS app · Web landing page · Mobile app design ·
 Launch video · Investor deck · X post automation.
 Include the app idea at the top."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Chorus concept: A platform for learning about AI agents. Compare agent tools, access a copy-paste skill library, and learn fundamentals — all in an iOS app.&lt;/p&gt;




&lt;h4&gt;
  
  
  iOS App: From Design to TestFlight
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Screen Design Skill&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Export the workflow from claude.ai/design, paste it into Codex, and say "create a mobile design skill that does the same thing." That's all it takes to build the custom skill.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Use the mobile design skill to create Chorus app screens
 in basic Apple style"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A prototype link appears showing Learn · Platforms · Skills · Saved in a 4-tab layout.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Xcode Build&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Create a new Swift mobile app project called Chorus.
 For now, just display 'Hello, this is Chorus' in the center.
 Open the Xcode project when done."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hit Play in Xcode and check each build in the iOS Simulator (or on a physical device). After integrating the screen designs, connect Supabase.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Supabase + Authentication&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Supabase has become the de facto database for AI-agent projects. After configuring the MCP server, restart Codex so the connection registers. Then: "If connected, create all tables." Tables for skill categories, platforms, skills, and saved items are generated automatically.&lt;/p&gt;

&lt;p&gt;Authentication was implemented with email + password (Google sign-in was initially tried, but Supabase's native email auth was the fastest path). Disable email confirmation in Supabase for instant login. The app eventually shipped to TestFlight.&lt;/p&gt;




&lt;h4&gt;
  
  
  Web Landing Page: Tally + React + Vercel
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Form setup (tally.so)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Create a waitlist form with name and email fields, then copy the embed code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run as React app&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"I'm using tally.so. Embed this form in the site and
 run it locally as a React app. We'll design it later."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Styling with Claude Code&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Codex struggles with design, so bring in Claude Code from the terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude &lt;span class="nt"&gt;--dangerously-skip-permissions&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Forget the current page styling completely.
 Look at the Chorus app code and match its font and design.
 Keep the Tally embed. Minimal text, simple, conversion-focused."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude Code improves it dramatically in minutes. Then: "Deploy to Vercel and give me the public link."&lt;/p&gt;




&lt;h4&gt;
  
  
  Launch Video: Remotion Plugin
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffc9bm3t2qyarauf2616r.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffc9bm3t2qyarauf2616r.jpg" alt="Building a motion graphic launch video with Remotion (around 55:00)" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Install the Remotion plugin and type &lt;code&gt;@remotion&lt;/code&gt; in a new chat.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Create a launch video for Chorus app.
 Start with a test: take app screenshots (attached),
 put them in iPhone mockups on a white background, and animate them.
 Run it locally."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;localhost:3031&lt;/code&gt; opens a timeline editor. Specify timing as &lt;strong&gt;seconds.frames&lt;/strong&gt; (e.g., &lt;code&gt;2.20&lt;/code&gt; = 2 seconds, 20 frames).&lt;/p&gt;
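&lt;p&gt;As a quick sanity check on that notation (an illustrative sketch, not from the video, and assuming a 30 fps composition), &lt;code&gt;seconds.frames&lt;/code&gt; converts to an absolute frame count like this:&lt;/p&gt;

```python
# Convert "seconds.frames" timing (e.g. "2.20" = 2 seconds + 20 frames)
# into an absolute frame count. fps=30 is an assumption for illustration;
# use your composition's actual frame rate.
def to_frames(timing: str, fps: int = 30) -> int:
    seconds, frames = timing.split(".")
    return int(seconds) * fps + int(frames)

print(to_frames("2.20"))  # 2 * 30 + 20 = 80
```

&lt;p&gt;So the &lt;code&gt;2.20&lt;/code&gt; from the example is frame 80 at 30 fps, not 2.2 seconds.&lt;/p&gt;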

&lt;p&gt;Use Steer to adjust in real time. Turn on grid lines and give the agent exact coordinates (e.g., "X:1040, Y:540") for precise placement.&lt;/p&gt;

&lt;p&gt;Hand off design-heavy sections (animation quality, color cards, cut timing) to Claude Code — the results are noticeably better. To add BGM, attach an MP3 and say "add this song at 50% volume."&lt;/p&gt;




&lt;h4&gt;
  
  
  Investor Deck: Chat Fork + Canva Export
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Fork the chat&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Right-click the mobile app chat → "Fork into Local." This creates a new chat with the same context. Rename it "Investor Deck."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Analyze the app features, icons, and style, then
 create a matching investor slide deck.
 Use the PowerPoint skill.
 Research what investors look for in April 2026 and match that style."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Refine with Claude Code&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;claude --dangerously-skip-permissions
"Review this deck and reduce text, add more visuals.
 Add charts and diagrams to improve readability. Don't add slides."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Export to Canva&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Click the Canva icon next to the PowerPoint file. It opens instantly in Canva for final 5–10% polish — add animations, adjust colors.&lt;/p&gt;




&lt;h4&gt;
  
  
  X Post Automation: Typefully Skill
&lt;/h4&gt;

&lt;p&gt;Get a Typefully API key (Settings → API → Create new key), then:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Search the Typefully API docs and build a skill that gives me
 full control. Test it on the Riley Brown account
 (use fruit emojis so I know which are yours).
 API key: [paste here]"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then automate it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Set up an automation to create 3 X post drafts every morning.
 Use the Typefully control skill."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Final Results
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Outcome&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;iOS App&lt;/td&gt;
&lt;td&gt;Published to TestFlight (Learn · Platforms · Skills · Saved)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Web Landing Page&lt;/td&gt;
&lt;td&gt;Live on Vercel · Tally form confirmed working&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Launch Video&lt;/td&gt;
&lt;td&gt;First draft complete (Remotion + Claude Code)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Investor Deck&lt;/td&gt;
&lt;td&gt;Exported to Canva · Final polish done&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;X Post Automation&lt;/td&gt;
&lt;td&gt;3 daily drafts scheduled&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Project Plan&lt;/td&gt;
&lt;td&gt;All 6 items checked off&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; AI agents can take 1–2 hours on complex tasks. Instead of waiting, the key is to keep sending new instructions to new agents and move on. That's the core productivity skill of the AI era.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>codex</category>
      <category>openai</category>
      <category>automation</category>
    </item>
    <item>
      <title>Practical Tips and Tricks for Claude Code: Mastering AI-Powered Development</title>
      <dc:creator>naoki_JPN</dc:creator>
      <pubDate>Tue, 28 Apr 2026 04:28:25 +0000</pubDate>
      <link>https://dev.to/bokuno_log/practical-tips-and-tricks-for-claude-code-mastering-ai-powered-development-5981</link>
      <guid>https://dev.to/bokuno_log/practical-tips-and-tricks-for-claude-code-mastering-ai-powered-development-5981</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Claude Code represents a new generation of AI-powered development assistants. Unlike earlier tools focused on completing individual lines of code, Claude Code is fully agentic—designed to take on entire features, whole functions, and complex bugs end to end.&lt;/p&gt;

&lt;p&gt;Boris Cherny, a technical staff member at Anthropic and creator of Claude Code, shares practical insights on how to leverage this powerful tool effectively in real-world development workflows. Whether you're new to Claude Code or looking to maximize its potential, these tips and tricks will transform how you approach development tasks.&lt;/p&gt;

&lt;p&gt;What sets Claude Code apart is its integration with your existing workflow. You don't need to abandon your IDE or terminal environment. Claude Code works seamlessly with VS Code, Xcode, JetBrains IDEs, and any terminal setup, whether local or remote via SSH.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcew3ocykm3qtyd57lj44.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcew3ocykm3qtyd57lj44.jpg" alt="Claude Code in Action" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started: Essential Setup and Configuration
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Initial Setup Steps
&lt;/h3&gt;

&lt;p&gt;When first launching Claude Code, Boris recommends taking time to properly configure your environment:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Terminal Configuration&lt;/strong&gt;: Run &lt;code&gt;/terminal-setup&lt;/code&gt; to enable Shift+Enter for new lines, making multi-line input natural without awkward backslash escapes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Theme Selection&lt;/strong&gt;: Use &lt;code&gt;/theme&lt;/code&gt; to configure your preferred appearance (light, dark, or custom themes).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub Integration&lt;/strong&gt;: Run &lt;code&gt;/install-github-app&lt;/code&gt; to set up the GitHub App integration, allowing you to mention Claude Code directly in GitHub issues and pull requests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool Customization&lt;/strong&gt;: Configure allowed tools to prevent repetitive permission prompts for tools you use frequently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accessibility Features&lt;/strong&gt;: Enable dictation on macOS through system settings under accessibility. This allows you to speak prompts directly using the dictation key—a powerful feature for hands-free development when typing feels tedious.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configuration Philosophy
&lt;/h3&gt;

&lt;p&gt;Claude Code is intentionally designed as a power tool without guardrails that funnel you toward a specific workflow. This flexibility means you need to establish your own patterns. The good news is that setup time invested upfront pays dividends in productivity.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Foundation: Code-Based Question and Answering
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why Start with Q&amp;amp;A
&lt;/h3&gt;

&lt;p&gt;Boris's primary recommendation: &lt;strong&gt;Start with code-based question and answering&lt;/strong&gt;. Don't jump immediately to code generation or editing—begin by asking questions about your codebase.&lt;/p&gt;

&lt;p&gt;This approach revolutionizes technical onboarding. At Anthropic, this method reduced new hire technical onboarding from 2-3 weeks to 2-3 days. Instead of bombarding senior engineers with questions, new developers can ask Claude Code directly about the codebase.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Claude Code Explores Your Code
&lt;/h3&gt;

&lt;p&gt;Unlike simple text search, Claude Code performs deep analysis:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Usage Analysis&lt;/strong&gt;: Ask how a piece of code is used throughout the codebase. Claude Code finds and contextualizes actual usage examples rather than just text matches.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Instantiation Patterns&lt;/strong&gt;: Inquire about how to properly instantiate classes or use APIs. Claude Code identifies real examples from your codebase showing correct usage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Git History Analysis&lt;/strong&gt;: Ask why functions have particular signatures or arguments. Claude Code examines git history to understand how features evolved, what issues were addressed, and why decisions were made.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GitHub Issue Context&lt;/strong&gt;: Cross-reference code with GitHub issues for deeper understanding of context and decision-making.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Commit Understanding&lt;/strong&gt;: Retrieve detailed commit messages and surrounding context to understand the "why" behind code decisions.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
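&lt;p&gt;Concrete questions in this style (the file, function, and class names here are hypothetical) might look like:&lt;/p&gt;

```
Where is the retry logic for outbound HTTP requests, and who calls it?
Why does createSession take a tenantId argument? Check the git history.
Show me a real example of instantiating PaymentClient correctly.
```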

&lt;h3&gt;
  
  
  Privacy and Performance Benefits
&lt;/h3&gt;

&lt;p&gt;An important advantage: Claude Code performs no indexing. Your codebase is never copied into a remote search index or used for model training. There's also no setup latency: start Claude Code and begin asking questions immediately. This architectural choice prioritizes both security and responsiveness.&lt;/p&gt;

&lt;h2&gt;
  
  
  Progressing to Code Editing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Understanding Agentic Code Editing
&lt;/h3&gt;

&lt;p&gt;Once comfortable with Q&amp;amp;A, the next step is code editing. Claude Code is given a minimal set of tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;File editing capabilities&lt;/li&gt;
&lt;li&gt;Bash command execution&lt;/li&gt;
&lt;li&gt;File searching&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What's remarkable is how these simple tools combine effectively. You don't specify which tools to use—you simply describe what you want, and Claude Code sequences the tools intelligently to accomplish your goals.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Planning Pattern
&lt;/h3&gt;

&lt;p&gt;Before Claude Code writes significant code, ask it to brainstorm and create plans. This practice prevents wasted effort:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Request Planning First&lt;/strong&gt;: "Before writing code, make a plan. Let me review it before you proceed."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iterative Refinement&lt;/strong&gt;: Discuss and refine the approach&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Approval Before Implementation&lt;/strong&gt;: Only after agreement does Claude Code write code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This might seem like extra steps, but it prevents the common scenario where Claude Code builds something incorrect that requires rework.&lt;/p&gt;
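&lt;p&gt;A planning request in this style (the module and feature names are hypothetical) might look like:&lt;/p&gt;

```
Before writing any code, read src/auth/ and propose a plan for adding
refresh-token support. List the files you would touch and wait for my
approval before implementing.
```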

&lt;h3&gt;
  
  
  Common Effective Incantations
&lt;/h3&gt;

&lt;p&gt;Certain prompting patterns become standard:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Commit and push for me"&lt;/strong&gt;: This seemingly simple request causes Claude Code to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Examine code changes&lt;/li&gt;
&lt;li&gt;Look at git history to understand commit conventions&lt;/li&gt;
&lt;li&gt;Create appropriately formatted commits&lt;/li&gt;
&lt;li&gt;Push to the current branch&lt;/li&gt;
&lt;li&gt;Create a pull request on GitHub&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All without explicit instruction—Claude Code figures this out because the underlying model is capable of understanding developer intent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4l9nhhyav3law2putin4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4l9nhhyav3law2putin4.jpg" alt="Claude Code Workflow" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Integrating Your Team's Tools
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Tool Categories
&lt;/h3&gt;

&lt;p&gt;As you advance, teaching Claude Code about your team's tools becomes crucial. Two primary categories exist:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CLIs and Scripts&lt;/strong&gt;: Tell Claude Code about custom command-line tools. Use help flags to let it discover how to use them. For frequently used tools, document them in your CLAUDE.md file (covered below).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP (Model Context Protocol) Servers&lt;/strong&gt;: Claude Code integrates with MCP servers, giving it access to standardized tool interfaces your team already uses.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Power of Tool Integration
&lt;/h3&gt;

&lt;p&gt;When Claude Code understands your team's tools, it becomes exponentially more powerful. New engineers joining the codebase can access all tools immediately through Claude Code, eliminating individual setup friction.&lt;/p&gt;

&lt;p&gt;Common tool integration patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Testing frameworks and runners&lt;/li&gt;
&lt;li&gt;Deployment and DevOps tools&lt;/li&gt;
&lt;li&gt;Code analysis and linting tools&lt;/li&gt;
&lt;li&gt;Custom build systems&lt;/li&gt;
&lt;li&gt;Monitoring and logging systems&lt;/li&gt;
&lt;/ul&gt;
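&lt;p&gt;Teaching Claude Code a custom CLI can be as simple as a prompt like this (the script name is hypothetical):&lt;/p&gt;

```
Run ./scripts/deploy --help to learn how our deploy tool works,
then deploy the current branch to staging.
```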

&lt;h2&gt;
  
  
  Feedback Loops and Iteration
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Critical Pattern: Verify Before Acting
&lt;/h3&gt;

&lt;p&gt;A powerful workflow pattern:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Claude Code proposes a plan&lt;/li&gt;
&lt;li&gt;You review and refine the plan&lt;/li&gt;
&lt;li&gt;You approve it&lt;/li&gt;
&lt;li&gt;Claude Code implements it&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Automated Iteration with Feedback Tools
&lt;/h3&gt;

&lt;p&gt;Where Claude Code truly excels is when you provide feedback mechanisms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unit Tests&lt;/strong&gt;: If Claude Code can run unit tests, it can verify its work and iterate automatically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Screenshot Testing&lt;/strong&gt;: Provide screenshot capabilities (via tools like Puppeteer for web, simulators for mobile), and Claude Code will iterate on implementations until they match requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UI Feedback&lt;/strong&gt;: For any visual component, tools that provide visual feedback enable automatic iteration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pattern: &lt;strong&gt;Give Claude Code a way to check its work&lt;/strong&gt;. It will iterate automatically, improving results with each feedback cycle. This often produces near-perfect results after 2-3 iterations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Workflow Recommendations by Task Type
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Exploration + Planning&lt;/strong&gt;: Before code changes, discuss and plan&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing-Driven Development&lt;/strong&gt;: Enable unit test running for automatic iteration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visual Development&lt;/strong&gt;: Use screenshots and visual feedback tools for UI development&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rapid Iteration&lt;/strong&gt;: Set up feedback mechanisms for automated refinement&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Context Management: The CLAUDE.md System
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Understanding CLAUDE.md
&lt;/h3&gt;

&lt;p&gt;As you work more deeply with Claude Code, providing context becomes crucial. The CLAUDE.md file system handles this elegantly:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Project-Level CLAUDE.md&lt;/strong&gt;: Place it in your project root (the directory where you run Claude Code). It is automatically included in every session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CLAUDE.local.md&lt;/strong&gt;: For personal preferences not shared with the team. Keep it in .gitignore.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nested CLAUDE.md Files&lt;/strong&gt;: Place them in subdirectories. Claude Code automatically includes the relevant ones when working in those directories.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Goes in CLAUDE.md
&lt;/h3&gt;

&lt;p&gt;Keep CLAUDE.md concise—it is loaded into context, so excessive documentation wastes tokens. Include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Common bash commands&lt;/strong&gt;: Frequently-used commands specific to your project&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Style guides&lt;/strong&gt;: Link or summarize coding standards&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Core files&lt;/strong&gt;: List important files developers should know about&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architecture notes&lt;/strong&gt;: Key architectural decisions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool documentation&lt;/strong&gt;: How to use project-specific tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Anthropic's internal CLAUDE.md for their main codebase includes common commands, style guidelines, and critical files—nothing verbose.&lt;/p&gt;
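&lt;p&gt;A minimal project CLAUDE.md in this spirit (the commands and paths are illustrative, not from any real project) might look like:&lt;/p&gt;

```markdown
# Project notes for Claude Code

## Common commands
- npm run test:unit   # fast unit tests
- npm run lint:fix    # lint with auto-fix

## Style
- TypeScript strict mode; prefer async/await over promise chains

## Core files
- src/api/router.ts   # all HTTP routes register here
```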

&lt;h3&gt;
  
  
  Context Hierarchy
&lt;/h3&gt;

&lt;p&gt;Claude Code respects a hierarchy of context sources:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Project Config&lt;/strong&gt;: Project-specific settings checked into version control&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User Config&lt;/strong&gt;: Personal settings and preferences&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise Policies&lt;/strong&gt;: Organization-wide settings applied automatically&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Practical Context Management Scenarios
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Shared Context&lt;/strong&gt;: Write a CLAUDE.md once and share it with your team. Network effects mean one person's effort benefits everyone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory Files&lt;/strong&gt;: Start a message with &lt;code&gt;#&lt;/code&gt; to teach Claude Code about patterns it missed; it saves the note to a memory file of your choice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool Documentation&lt;/strong&gt;: For tools you use frequently, document how to use them in CLAUDE.md rather than explaining each time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Permissions Management&lt;/strong&gt;: Use enterprise policies to pre-approve common commands across your organization, or block dangerous operations universally.&lt;/p&gt;
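&lt;p&gt;As a sketch, pre-approved and blocked commands can be expressed in a Claude Code settings file; the specific rules below are illustrative, not a recommendation:&lt;/p&gt;

```json
{
  "permissions": {
    "allow": ["Bash(npm run test:*)", "Bash(git diff:*)"],
    "deny": ["Bash(rm -rf:*)", "Read(.env)"]
  }
}
```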

&lt;h2&gt;
  
  
  Advanced Features and Keyboard Shortcuts
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Essential Keyboard Shortcuts
&lt;/h3&gt;

&lt;p&gt;Terminal interfaces make some features hard to discover. Key shortcuts include:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shift+Tab&lt;/strong&gt;: Enter auto-accept edit mode. Bash commands still require approval, but file edits are automatically accepted. Useful when Claude Code is on a roll with test iteration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;#&lt;/strong&gt; (pound sign): Teach Claude Code something. For example, "# Remember to use the async/await pattern in this codebase." Claude Code incorporates this into memory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;!&lt;/strong&gt; (exclamation): Drop into bash mode to run a command directly. Output enters the context window, so Claude Code sees both command and results on the next turn. Useful for long-running processes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Escape&lt;/strong&gt;: Stop Claude Code at any time. Never corrupts the session. Use for course-correcting mid-task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Escape Escape&lt;/strong&gt; (double press): Jump back in conversation history.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Control+R&lt;/strong&gt;: Toggle verbose output to see the full detail of Claude Code's current work, matching what it sees in its context window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resume Flag&lt;/strong&gt;: After closing, run &lt;code&gt;claude --resume&lt;/code&gt; to pick a past session, or &lt;code&gt;claude --continue&lt;/code&gt; to reopen the most recent one.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fypxn0b54xekfm57bt7yn.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fypxn0b54xekfm57bt7yn.jpg" alt="Claude Code Advanced Features" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Claude Code SDK: Programmatic Access
&lt;/h2&gt;

&lt;p&gt;For advanced users and automation scenarios, the Claude Code SDK provides programmatic access to the same capabilities available in the CLI.&lt;/p&gt;

&lt;h3&gt;
  
  
  SDK Basics
&lt;/h3&gt;

&lt;p&gt;The SDK can be invoked with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Custom prompts&lt;/li&gt;
&lt;li&gt;Specified allowed tools&lt;/li&gt;
&lt;li&gt;Output format options (JSON, streaming JSON, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Unix Philosophy Integration
&lt;/h3&gt;

&lt;p&gt;Think of Claude Code as a "super-intelligent Unix utility." You can:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude-code &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Analyze this data"&lt;/span&gt; &lt;span class="nt"&gt;--tools&lt;/span&gt; allow-cmd-execution
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pipe data in and out:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git status | claude-code &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"What changes matter here?"&lt;/span&gt;
jq &lt;span class="s1"&gt;'.data'&lt;/span&gt; large-file.json | claude-code &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Find issues"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The possibilities are nearly endless:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fetch large logs and ask Claude Code for insights&lt;/li&gt;
&lt;li&gt;Process data from GCP buckets&lt;/li&gt;
&lt;li&gt;Analyze outputs from monitoring systems&lt;/li&gt;
&lt;li&gt;Transform data formats automatically&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Advanced Multi-Session Patterns
&lt;/h2&gt;

&lt;p&gt;For advanced users, the most productive setup involves multiple Claude Code sessions running in parallel:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SSH Sessions&lt;/strong&gt;: Run Claude Code on remote servers via SSH&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;tmux&lt;/strong&gt;: Manage multiple terminal panes within a single session&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Git Worktrees&lt;/strong&gt;: Use git worktrees for isolation while running multiple Claude Code instances&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple Checkouts&lt;/strong&gt;: Clone the same repository multiple times for parallel work&lt;/li&gt;
&lt;/ul&gt;
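&lt;p&gt;As a sketch of the worktree pattern, the demo below isolates two branches in a throwaway repo; in real use you would launch &lt;code&gt;claude&lt;/code&gt; inside each worktree directory:&lt;/p&gt;

```shell
set -e
# Throwaway repo to demonstrate one-worktree-per-session isolation
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "init"

# Each worktree gets its own branch and working directory, so
# parallel Claude Code sessions never step on each other's edits.
git worktree add -q "$repo-feature-a" -b feature-a
git worktree add -q "$repo-feature-b" -b feature-b

git worktree list   # main checkout plus two feature worktrees
```

Each worktree shares the same object database, so branches stay cheap while edits stay isolated.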

&lt;p&gt;While Anthropic is actively improving parallel session support, power users are already leveraging these patterns to manage significant parallel workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Implementation Challenges
&lt;/h2&gt;

&lt;p&gt;When asked about the hardest implementation challenges, Boris highlights the complexity of managing bash command safety:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Challenge&lt;/strong&gt;: Bash commands can change system state unexpectedly and require careful handling. Yet requiring approval for every command destroys productivity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution&lt;/strong&gt;: Claude Code implements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read-only command detection&lt;/li&gt;
&lt;li&gt;Static analysis to identify safely-combinable commands&lt;/li&gt;
&lt;li&gt;Complex tiered permission systems&lt;/li&gt;
&lt;li&gt;User-configurable safety levels&lt;/li&gt;
&lt;li&gt;Enterprise-wide policy enforcement&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach scales across different coding environments, from Docker containers to bare metal systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multimodal Capabilities
&lt;/h2&gt;

&lt;p&gt;An often-overlooked feature: Claude Code is fully multimodal. You can provide images through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Drag and drop&lt;/strong&gt;: Drop images directly into the chat&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File paths&lt;/strong&gt;: Reference images on your filesystem&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Copy and paste&lt;/strong&gt;: Paste images directly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Common use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provide design mockups: Drop a UI mockup image, ask Claude Code to implement it&lt;/li&gt;
&lt;li&gt;Get iteration feedback: Screenshot results, compare with mockups, request refinements&lt;/li&gt;
&lt;li&gt;Visual debugging: Share application screenshots for Claude Code to analyze&lt;/li&gt;
&lt;/ul&gt;
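&lt;p&gt;A typical multimodal loop (the file path here is hypothetical) might be driven by a prompt like:&lt;/p&gt;

```
Here is a mockup: ./designs/settings-page.png
Implement it, take a screenshot of the result, compare it with the
mockup, and keep iterating until they match.
```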

&lt;h2&gt;
  
  
  Integration with Your Existing Tools
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Recommended Setup Progression
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start Simple&lt;/strong&gt;: Master codebase Q&amp;amp;A before moving to editing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Setup Tools Gradually&lt;/strong&gt;: Add your team's tools incrementally&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build Context&lt;/strong&gt;: Create your project CLAUDE.md incrementally&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Leverage Memory&lt;/strong&gt;: Use the &lt;code&gt;#&lt;/code&gt; memory shortcut to teach Claude Code team-specific patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automate Gradually&lt;/strong&gt;: Set up feedback loops for iterative improvement&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Onboarding Your Team
&lt;/h3&gt;

&lt;p&gt;If introducing Claude Code to your team:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start everyone with codebase Q&amp;amp;A&lt;/li&gt;
&lt;li&gt;Don't jump straight to code generation&lt;/li&gt;
&lt;li&gt;Let people experience the tool's capabilities organically&lt;/li&gt;
&lt;li&gt;Share useful CLAUDE.md content and memory tips&lt;/li&gt;
&lt;li&gt;Build patterns together as a team&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Current Usage at Anthropic
&lt;/h2&gt;

&lt;p&gt;As evidence of Claude Code's effectiveness, about 80% of technical staff at Anthropic use Claude Code daily. This includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Software engineers across all teams&lt;/li&gt;
&lt;li&gt;Research scientists working with notebooks and scripts&lt;/li&gt;
&lt;li&gt;DevOps engineers automating infrastructure&lt;/li&gt;
&lt;li&gt;Data scientists building analysis pipelines&lt;/li&gt;
&lt;li&gt;System engineers maintaining tooling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The breadth of usage demonstrates Claude Code's versatility and effectiveness across different technical domains.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Start with Q&amp;amp;A&lt;/strong&gt;: Before code generation, master question-answering about your codebase&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plan Before Implementation&lt;/strong&gt;: Ask Claude Code to brainstorm and plan before writing significant code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Setup Matters&lt;/strong&gt;: Invest in proper configuration and CLAUDE.md for dramatically better results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context is King&lt;/strong&gt;: The more context you provide, the smarter Claude Code's decisions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feedback Loops&lt;/strong&gt;: Provide ways for Claude Code to verify its work, and it will iterate automatically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool Integration&lt;/strong&gt;: Teaching Claude Code about your team's tools multiplies its effectiveness&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keyboard Shortcuts&lt;/strong&gt;: Master key shortcuts for faster, more natural interaction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No Indexing&lt;/strong&gt;: Your code stays local and private, never uploaded or used for model training&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Modal&lt;/strong&gt;: Leverage image support for design-driven development and visual feedback&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel Workflows&lt;/strong&gt;: Advanced users can run multiple Claude Code sessions for parallel work&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Claude Code isn't just an incremental improvement—it's a transformative tool that fundamentally changes how developers approach their work. By mastering these practical tips and tricks, you'll unlock dramatically improved productivity and code quality.&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>ai</category>
      <category>productivity</category>
      <category>development</category>
    </item>
    <item>
      <title>How I Built a Personal Training Coach in One Week</title>
      <dc:creator>naoki_JPN</dc:creator>
      <pubDate>Fri, 24 Apr 2026 10:58:10 +0000</pubDate>
      <link>https://dev.to/bokuno_log/how-i-built-a-personal-training-coach-in-one-week-47m</link>
      <guid>https://dev.to/bokuno_log/how-i-built-a-personal-training-coach-in-one-week-47m</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This article reflects information as of April 2026.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;"What if I had an AI coach that knows all my Garmin data?"&lt;/p&gt;

&lt;p&gt;That thought sparked &lt;strong&gt;Stride Mate&lt;/strong&gt; — a personal training coach app. The first commit was on April 17th, and one week later it has Square billing, Garmin online sync, and a usage dashboard running.&lt;/p&gt;

&lt;p&gt;This is the 7-day development journal.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is Stride Mate?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Stride Mate&lt;/strong&gt; (&lt;a href="https://stride-mate.vercel.app" rel="noopener noreferrer"&gt;https://stride-mate.vercel.app&lt;/a&gt;) is an AI coach you talk to via LINE, backed by your Garmin training data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"How was my training last week?"&lt;/li&gt;
&lt;li&gt;"What should I do this week for my upcoming race?"&lt;/li&gt;
&lt;li&gt;"My HRV has been low lately — should I rest?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It answers these questions by referencing your actual activity history, heart rate, and sleep data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tech Stack
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Tech&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Chat UI&lt;/td&gt;
&lt;td&gt;LINE Messaging API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Web (Dashboard)&lt;/td&gt;
&lt;td&gt;Next.js 15 / Vercel&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Background Sync&lt;/td&gt;
&lt;td&gt;Node.js / Railway&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DB&lt;/td&gt;
&lt;td&gt;Supabase (PostgreSQL)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI&lt;/td&gt;
&lt;td&gt;Claude API (Haiku / Sonnet)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Billing&lt;/td&gt;
&lt;td&gt;Square Subscriptions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Garmin&lt;/td&gt;
&lt;td&gt;garmin-connect (unofficial API)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Day 1 (4/17): Getting the MVP Running
&lt;/h2&gt;

&lt;h3&gt;
  
  
  From design to implementation
&lt;/h3&gt;

&lt;p&gt;Created the repo at 1pm. Locked in the architecture in 3 hours, pushed MVP by 5pm.&lt;/p&gt;

&lt;p&gt;What the MVP had working:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LINE Webhook -&amp;gt; Claude API -&amp;gt; reply&lt;/li&gt;
&lt;li&gt;Supabase Auth (LINE OAuth)&lt;/li&gt;
&lt;li&gt;Garmin ZIP upload from dashboard&lt;/li&gt;
&lt;li&gt;Activity storage and RAG search&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The deployment grind
&lt;/h3&gt;

&lt;p&gt;Railway Nixpacks could not resolve the &lt;code&gt;@tcb/shared&lt;/code&gt; monorepo package. Switched to a Dockerfile build. Finally deployed after 8pm.&lt;/p&gt;




&lt;h2&gt;
  
  
  Day 2-3 (4/18-19): Auth Bug and ZIP Import Fix
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The ES256 problem with LINE OAuth
&lt;/h3&gt;

&lt;p&gt;The next day, login was broken. Supabase's custom OIDC support does not handle ES256: LINE's JWKS returns ES256-signed tokens, while Supabase expects HS256.&lt;/p&gt;

&lt;p&gt;Solution: manually implemented LINE OAuth in a Next.js API route.&lt;/p&gt;

&lt;h3&gt;
  
  
  ZIP import OOM
&lt;/h3&gt;

&lt;p&gt;Garmin export ZIPs can be several GB. Loading the whole thing into memory crashed with OOM.&lt;/p&gt;

&lt;p&gt;Fixed by fetching via signed URL and processing as a stream instead of using &lt;code&gt;storage.download()&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Day 4-5 (4/20-21): Architecture Rethink
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Dropped RAG
&lt;/h3&gt;

&lt;p&gt;Started with vector embeddings, but direct SQL was more accurate for Garmin data.&lt;/p&gt;

&lt;p&gt;"Last week mileage" or "30-day HRV trend" is answered precisely with &lt;code&gt;WHERE date BETWEEN&lt;/code&gt;, not vector similarity. Deleted the embeddings table and all RAG logic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Garmin Online Sync Problem
&lt;/h3&gt;

&lt;p&gt;Added auto-sync from Garmin Connect at 3am daily. But while the &lt;strong&gt;initial full fetch works, incremental diff sync is unstable&lt;/strong&gt;: sessions expire and date ranges are limited.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strava fills the gap
&lt;/h3&gt;

&lt;p&gt;Strava has an official OAuth API with webhooks. Since Garmin Connect can sync activities to Strava automatically, every run flows into Stride Mate too.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Initial full import: Garmin ZIP&lt;/li&gt;
&lt;li&gt;Daily incremental: Strava Webhook + official API&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Day 5-6 (4/21-22): Square Billing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The webhook that was never registered
&lt;/h3&gt;

&lt;p&gt;Billing was implemented but the plan never switched. Investigation: &lt;strong&gt;Square Webhooks had never been registered&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Perfect code means nothing if the webhook endpoint is never registered in Square's dashboard.&lt;/p&gt;

&lt;h3&gt;
  
  
  Square cancellation event gotcha
&lt;/h3&gt;

&lt;p&gt;I was handling &lt;code&gt;subscription.canceled&lt;/code&gt; — an event type that does not exist.&lt;/p&gt;

&lt;p&gt;Cancellations come through &lt;code&gt;subscription.updated&lt;/code&gt; with &lt;code&gt;status === "CANCELED"&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;subscription.updated&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;CANCELED&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;DEACTIVATED&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;users&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;free&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;plan_expires_at&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;eq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;square_customer_id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Day 7 (4/24): Dashboard Consolidation
&lt;/h2&gt;

&lt;p&gt;Tried tabs, got feedback that tabs are conceptually the same as separate pages. Removed them. Everything on one scrolling page.&lt;/p&gt;




&lt;h2&gt;
  
  
  7 Lessons from 7 Days
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Register webhooks, don't just write handlers&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Structured data needs SQL, not RAG&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stream large files from the start&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Implement non-standard auth manually&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;When buildpacks fail on a monorepo, switch to a Dockerfile&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prefer official APIs with webhooks for incremental sync&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tabs are just pages: one scrolling page is simpler&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Current State
&lt;/h2&gt;

&lt;p&gt;Working: LINE chat with Garmin data, ZIP bulk import, Garmin auto-sync, Strava integration, Square billing, usage dashboard.&lt;/p&gt;

&lt;p&gt;Currently just me and a few friends. If you use Garmin, give it a try. Reactions will determine how far this gets built out.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://stride-mate.vercel.app" rel="noopener noreferrer"&gt;https://stride-mate.vercel.app&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;Building with Claude Code, I kept running into the same pattern: code works, configuration missing. The skeleton comes together fast now — but the connective tissue (webhook registration, spec edge cases) still bites just as hard.&lt;/p&gt;

</description>
      <category>garmin</category>
      <category>linebot</category>
      <category>nextjs</category>
      <category>claudecode</category>
    </item>
    <item>
      <title>I Built a Claude Code Plugin That Simultaneously Posts to Zenn and dev.to With Just "Publish the article"</title>
      <dc:creator>naoki_JPN</dc:creator>
      <pubDate>Fri, 17 Apr 2026 04:31:47 +0000</pubDate>
      <link>https://dev.to/bokuno_log/i-built-a-claude-code-plugin-that-simultaneously-posts-to-zenn-and-devto-with-just-publish-the-4pmj</link>
      <guid>https://dev.to/bokuno_log/i-built-a-claude-code-plugin-that-simultaneously-posts-to-zenn-and-devto-with-just-publish-the-4pmj</guid>
      <description>&lt;p&gt;I built a plugin for Claude Code that simultaneously posts to Zenn (in Japanese) and dev.to (in English translation) just by saying "publish the article."&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;zenn-post&lt;/strong&gt; — Claude Code Plugin / Skill&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/bokuno-studio/zenn-post-cc-plugin" rel="noopener noreferrer"&gt;https://github.com/bokuno-studio/zenn-post-cc-plugin&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This plugin handles the entire article publishing workflow — creating drafts, formatting Markdown, running git push, and calling the dev.to API — all delegated to Claude Code.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Can Do
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;"Publish the article" → &lt;strong&gt;simultaneously posts&lt;/strong&gt; to Zenn (Japanese) and dev.to (English translation)&lt;/li&gt;
&lt;li&gt;Pass a Notion page URL → reads the content, converts it to an article, and posts it&lt;/li&gt;
&lt;li&gt;Just describe a topic verbally → fully automated from writing to git push&lt;/li&gt;
&lt;li&gt;Automatically converts Mermaid diagrams to PNG locally (no external services)&lt;/li&gt;
&lt;li&gt;Preview before publishing → runs &lt;code&gt;git push&lt;/code&gt; after your OK
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Upload this Notion page to Zenn"
"Publish the article"          ← Posts to both Zenn + dev.to
"Post to Zenn only"            ← Zenn only
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why I Built It
&lt;/h2&gt;

&lt;p&gt;The Zenn posting workflow was quietly tedious:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a Markdown file&lt;/li&gt;
&lt;li&gt;Write the frontmatter&lt;/li&gt;
&lt;li&gt;Write the content&lt;/li&gt;
&lt;li&gt;Change &lt;code&gt;published: true&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;git commit &amp;amp; push&lt;/li&gt;
&lt;li&gt;(If also posting to dev.to) Translate to English and call the API&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Doing this every time felt like a chore, so I thought "let Claude Code handle all of it."&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical Highlights
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Implemented as a Claude Code Skill (SKILL.md)
&lt;/h3&gt;

&lt;p&gt;Claude Code has a plugin/skill feature where you can register a skill just by writing a prompt in &lt;code&gt;SKILL.md&lt;/code&gt;. No code needed at all — you just define "how it should behave" in natural language.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;skills/zenn-post/SKILL.md  ← that's it
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
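&lt;p&gt;As a rough illustration, here is a minimal sketch of what such a skill file could contain, written as a shell snippet that generates it. The frontmatter fields and instruction wording are my own guesses for illustration, not the plugin's actual &lt;code&gt;SKILL.md&lt;/code&gt;:&lt;/p&gt;

```shell
# Hypothetical minimal SKILL.md; the frontmatter fields and the instruction
# wording are illustrative, not copied from the real plugin.
mkdir -p skills/zenn-post
printf '%s\n' \
  '---' \
  'name: zenn-post' \
  'description: Publish articles to Zenn and dev.to on request' \
  '---' \
  'When the user says "publish the article":' \
  '1. Create a Markdown draft with Zenn frontmatter and set published: true.' \
  '2. Show a preview and wait for the user OK, then git commit and push.' \
  '3. Translate the article to English and POST it to the dev.to API.' \
  > skills/zenn-post/SKILL.md
```

&lt;p&gt;The whole behavior definition is prose; Claude Code interprets it at runtime.&lt;/p&gt;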



&lt;h3&gt;
  
  
  Direct POST to dev.to via curl
&lt;/h3&gt;

&lt;p&gt;The dev.to API is simple — a single curl request does the job. Claude Code reads the API key from &lt;code&gt;.env&lt;/code&gt; and assembles the request.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://dev.to/api/articles &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"api-key: &lt;/span&gt;&lt;span class="nv"&gt;$DEVTO_API_KEY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{ "article": { "title": "...", "published": true, "body_markdown": "..." } }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Mermaid Diagrams Converted to PNG Locally
&lt;/h3&gt;

&lt;p&gt;Since dev.to doesn't render Mermaid, this plugin uses &lt;code&gt;mermaid-cli&lt;/code&gt; for local conversion into images. The converted PNG is stored in the &lt;code&gt;images/&lt;/code&gt; directory of the zenn-content repository and served via &lt;code&gt;raw.githubusercontent.com&lt;/code&gt; — no external services involved.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @mermaid-js/mermaid-cli &lt;span class="nt"&gt;-i&lt;/span&gt; diagram.mmd &lt;span class="nt"&gt;-o&lt;/span&gt; diagram.png &lt;span class="nt"&gt;-t&lt;/span&gt; default &lt;span class="nt"&gt;-b&lt;/span&gt; white
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude Code installed&lt;/li&gt;
&lt;li&gt;A Zenn account and zenn-content repository (public) on GitHub&lt;/li&gt;
&lt;li&gt;A dev.to account and API key&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Steps:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/bokuno-studio/zenn-post-cc-plugin ~/.claude/plugins/zenn-post
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Edit the environment info section in &lt;code&gt;skills/zenn-post/SKILL.md&lt;/code&gt; with your own paths, then register it with Claude Code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude plugin marketplace add ~/.claude/plugins/zenn-post
claude plugin &lt;span class="nb"&gt;install &lt;/span&gt;zenn-post
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Changelog
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;Date&lt;/th&gt;
&lt;th&gt;Changes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;v0.1.0&lt;/td&gt;
&lt;td&gt;2026-04-16&lt;/td&gt;
&lt;td&gt;Initial release (Zenn posting only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v0.2.0&lt;/td&gt;
&lt;td&gt;2026-04-17&lt;/td&gt;
&lt;td&gt;Added dev.to simultaneous posting and Mermaid CLI support&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;Claude Code's Skill feature enables a new approach to automation: defining Claude's behavior without writing code. Extracting routine tasks like article publishing into a skill means a single verbal instruction completes the whole workflow, which is genuinely satisfying.&lt;/p&gt;

&lt;p&gt;Feel free to give it a try — feedback welcome!&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>zenn</category>
      <category>devto</category>
      <category>ai</category>
    </item>
    <item>
      <title>I Built a Sales Prep AI and It Went Deeper Than Expected</title>
      <dc:creator>naoki_JPN</dc:creator>
      <pubDate>Fri, 17 Apr 2026 01:55:37 +0000</pubDate>
      <link>https://dev.to/bokuno_log/i-built-a-sales-prep-ai-and-it-went-deeper-than-expected-4bcd</link>
      <guid>https://dev.to/bokuno_log/i-built-a-sales-prep-ai-and-it-went-deeper-than-expected-4bcd</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;"Before a first sales meeting, you always research the other company. That part is kind of a pain, right?"&lt;/p&gt;

&lt;p&gt;That thought is where this started. I wanted something that would take a company name and automatically research it, then return a report.&lt;/p&gt;

&lt;p&gt;I figured I could get something working in 2–3 days. But getting it to a genuinely usable level turned out to be much deeper than expected. This is the story of that process.&lt;/p&gt;

&lt;p&gt;What I built: &lt;strong&gt;Sales Prep AI&lt;/strong&gt; (LINE bot)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pre-talk.vercel.app" rel="noopener noreferrer"&gt;https://pre-talk.vercel.app&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvz8xv2dav882571wngml.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvz8xv2dav882571wngml.png" alt="Tech Stack" width="784" height="300"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  How It Came Together
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Phase 1: Just Get It Working
&lt;/h3&gt;

&lt;p&gt;I started with a simple web form. Enter company name, department, and contact name → it searches the web → GPT-4o-mini analyzes the results → returns a report.&lt;/p&gt;

&lt;p&gt;As I worked on improving reasoning quality, I switched from GPT-4o-mini to &lt;strong&gt;Claude Sonnet&lt;/strong&gt; for the analysis layer. Light input-interpretation tasks go to &lt;strong&gt;Claude Haiku&lt;/strong&gt;; heavy analysis and OCR go to &lt;strong&gt;Claude Sonnet&lt;/strong&gt;. That division of labor stuck.&lt;/p&gt;

&lt;p&gt;For search, I started with DuckDuckGo, but the result quality wasn't great, so I switched to Tavily. That one change alone made a noticeable difference.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 2: Fighting Vercel Hobby's 10-Second Timeout
&lt;/h3&gt;

&lt;p&gt;Research involves multiple steps — search, then AI analysis — and it realistically takes 1–2 minutes. Vercel's Hobby plan times out at 10 seconds.&lt;/p&gt;

&lt;p&gt;The solution: streaming responses. By returning a response while continuing to process, you keep the function alive.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ReadableStream&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;controller&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;heartbeat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;setInterval&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;controller&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enqueue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;TextEncoder&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt; &lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="c1"&gt;// Heavy processing happens here&lt;/span&gt;
      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;runResearch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;finally&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nf"&gt;clearInterval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;heartbeat&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nx"&gt;controller&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sending a blank space every 5 seconds keeps the connection alive. Brute-force, but it works.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 3: Slack Bot → LINE Bot
&lt;/h3&gt;

&lt;p&gt;A web form creates friction — you have to actively open it when you need it. It's better to use it from a tool you already have open.&lt;/p&gt;

&lt;p&gt;I built a Slack bot first. But when I ended up canceling the paid Slack plan I was using, I migrated to LINE.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 4: Fighting Hallucinations
&lt;/h3&gt;

&lt;p&gt;This was the hardest part.&lt;/p&gt;

&lt;p&gt;The AI was confidently returning information that sounded plausible but wasn't true. Specifically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Asserting fabricated problems as "challenges faced by [department]"&lt;/li&gt;
&lt;li&gt;Returning outdated information as if it were current&lt;/li&gt;
&lt;li&gt;Filling in gaps with information not in any search result&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I approached this on two axes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Axis 1: Improve output accuracy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I built in mechanisms to prevent unsupported information from slipping through.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fact/inference separation&lt;/strong&gt;: The AI explicitly labels each piece of information as either a verified fact (from official sources) or an inference (from surrounding context). The report displays these separately.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output gate&lt;/strong&gt;: Items that fail conditions like "only contains generalities with no specifics" or "no source URL exists" are filtered out before output.&lt;/li&gt;
&lt;/ul&gt;
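&lt;p&gt;The output gate can be pictured as a simple filter stage. The sketch below is mine, assuming a tab-separated "claim, source URL" record layout purely for illustration; the real app filters structured AI output, not text lines:&lt;/p&gt;

```shell
# Hypothetical output gate: drop any report item that has no source URL.
# The tab-separated record layout here is illustrative only.
gate() {
  awk -F '\t' 'NF >= 2 { if (length($2) > 0) print $0 }'
}

printf 'Hiring for data roles\thttps://example.com/careers\nProbably expanding overseas\t\n' | gate
# prints only the line that carries a source URL
```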

&lt;p&gt;&lt;strong&gt;Axis 2: Make it human-verifiable&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Improving accuracy alone isn't enough. Whether done by humans or AI, mistakes happen. What matters is making the process transparent.&lt;/p&gt;

&lt;p&gt;So I designed each report item to include both "the facts recognized" and "the reasoning path to the conclusion." Showing what evidence led to what conclusion lets humans catch reasoning that doesn't hold up.&lt;/p&gt;

&lt;p&gt;The goal is to save prep time, not to replace human judgment, and I'm content with that scope.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 5: The Official Website Detection Rabbit Hole
&lt;/h3&gt;

&lt;p&gt;Search results mix "official company sites" with "everything else" (news, Wikipedia, etc.).&lt;/p&gt;

&lt;p&gt;I started with simple domain matching, but group companies, subsidiaries, and subdomains made that fall apart quickly.&lt;/p&gt;

&lt;p&gt;I eventually settled on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Return multiple official domain candidates&lt;/li&gt;
&lt;li&gt;Normalize to base domain (including subdomain matching)&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;.some()&lt;/code&gt; to check against the array&lt;/li&gt;
&lt;/ul&gt;
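&lt;p&gt;The matching logic above can be sketched roughly as follows. The function names are mine, and the base-domain normalization is deliberately naive; the real implementation does the equivalent in TypeScript with &lt;code&gt;.some()&lt;/code&gt;:&lt;/p&gt;

```shell
# Hypothetical sketch of official-site detection; names are illustrative.
base_domain() {
  host="${1#*://}"     # strip the scheme if present
  host="${host%%/*}"   # strip the path
  # keep the last two labels (naive; multi-part TLDs like .co.jp are ignored)
  printf '%s\n' "$host" | awk -F '.' '{ if (NF >= 2) print $(NF-1) "." $NF; else print $0 }'
}

# Succeeds when the URL's base domain matches any official candidate,
# the shell analogue of checking an array with .some().
is_official() {
  target="$(base_domain "$1")"
  shift
  for cand in "$@"; do
    if [ "$(base_domain "$cand")" = "$target" ]; then return 0; fi
  done
  return 1
}

is_official "https://careers.example.com/jobs" "example.com" "example-group.com"
echo "$?"   # prints 0: careers.example.com normalizes to example.com
```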

&lt;h3&gt;
  
  
  Phase 6: Business Card Scanning
&lt;/h3&gt;

&lt;p&gt;"Wouldn't it be great if you could start researching the moment you get someone's card?"&lt;/p&gt;

&lt;p&gt;I added business card scanning using Claude's Vision capability. Send a photo of a card to LINE → it extracts company name, department, and contact name → triggers research automatically. OCR quality mattered, so I used Claude Sonnet here.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Lesson in Agent Sprawl
&lt;/h2&gt;

&lt;p&gt;At one point I tried to improve the reasoning logic by spinning up five agents simultaneously (field-sales / info-architect / reasoning-designer / impl-designer / critic).&lt;/p&gt;

&lt;p&gt;They went into an endless loop of spec discussion, autonomously generating 154 tasks. When I told them to stop, they kept going, and I had to force-kill the session. Almost no actual code was written — "improving the spec" had become a goal in itself.&lt;/p&gt;

&lt;p&gt;The root cause: I hadn't defined what they were allowed to decide or when they were done.&lt;/p&gt;

&lt;p&gt;After that, I redesigned the agent structure. Instead of everyone chiming in freely, I cut it down to 3 roles and explicitly defined what each role was &lt;strong&gt;not&lt;/strong&gt; allowed to do.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Responsibility&lt;/th&gt;
&lt;th&gt;What they must NOT do&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;team-lead&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Routing and task management&lt;/td&gt;
&lt;td&gt;Write code, generate summaries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;product&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Decide implementation approach and implement&lt;/td&gt;
&lt;td&gt;Create tasks themselves&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;auditor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pass/fail judgment only&lt;/td&gt;
&lt;td&gt;Write improvement suggestions, act unless called&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Defining "what not to do" alongside "what to do" made role boundaries much cleaner.&lt;/p&gt;




&lt;h2&gt;
  
  
  Cost
&lt;/h2&gt;

&lt;h3&gt;
  
  
  API cost per research run (measured)
&lt;/h3&gt;

&lt;p&gt;Varies by company size and available information.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet (analysis)&lt;/td&gt;
&lt;td&gt;~$0.35&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tavily (web search)&lt;/td&gt;
&lt;td&gt;~$0.05&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$0.40/run (range: $0.24–$0.52)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At 100 runs/month that's ~$40; at 500 runs it's ~$200. It's currently free to use, so I'm entirely out of pocket. I'm in a "prove the value first" phase.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fixed costs (monthly)
&lt;/h3&gt;

&lt;p&gt;Hosting, DB, LINE, domain, etc. I've minimized these by combining free tiers, but it's not zero.&lt;/p&gt;




&lt;h2&gt;
  
  
  Current Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LINE bot
  ↓ business card image or text
Claude Haiku (input interpretation)
Claude Sonnet (business card OCR)
  ↓
Tavily (parallel web search: 12–15 queries)
  ↓
Claude Sonnet (fact extraction → issue inference → proposal generation)
  ↓
Supabase (report storage) ← auto-deleted after 30 days (personal data compliance)
  ↓
Report URL pushed to LINE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;A product that started from "this sounds fun" grew into something I could actually publish.&lt;/p&gt;

&lt;p&gt;Hallucination mitigation, timeout workarounds, official site detection — making something genuinely usable turned out to be deeper than I expected.&lt;/p&gt;

&lt;p&gt;If you're curious, add it as a friend on LINE. Just send a photo of a business card and it runs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pre-talk.vercel.app" rel="noopener noreferrer"&gt;https://pre-talk.vercel.app&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>nextjs</category>
      <category>linebot</category>
    </item>
    <item>
      <title>Three Ways to Call Codex from Claude Code — A Practical Breakdown</title>
      <dc:creator>naoki_JPN</dc:creator>
      <pubDate>Fri, 17 Apr 2026 01:54:38 +0000</pubDate>
      <link>https://dev.to/bokuno_log/three-ways-to-call-codex-from-claude-code-a-practical-breakdown-d8n</link>
      <guid>https://dev.to/bokuno_log/three-ways-to-call-codex-from-claude-code-a-practical-breakdown-d8n</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; A breakdown of the three methods for calling OpenAI Codex from within a Claude Code session — their characteristics and when to use each. Researched on 2026-04-15.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Key Discovery: The Plugin Does NOT Call &lt;code&gt;codex --full-auto&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;The most important finding: codex-plugin-cc (used in Methods 2 and 3) internally uses the &lt;strong&gt;App Server Protocol (ASP)&lt;/strong&gt; — not the raw CLI.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;CLI Mode (Method 1)&lt;/th&gt;
&lt;th&gt;ASP Mode (Methods 2 &amp;amp; 3)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Launch style&lt;/td&gt;
&lt;td&gt;One-shot process&lt;/td&gt;
&lt;td&gt;&lt;code&gt;app-server-broker.mjs&lt;/code&gt; runs as a daemon&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Protocol&lt;/td&gt;
&lt;td&gt;stdin/stdout&lt;/td&gt;
&lt;td&gt;JSON-RPC 2.0 over stdio / WebSocket&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Thread continuation&lt;/td&gt;
&lt;td&gt;Not supported&lt;/td&gt;
&lt;td&gt;Possible via threadId&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Startup cost&lt;/td&gt;
&lt;td&gt;High (every time)&lt;/td&gt;
&lt;td&gt;Low (broker stays resident)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
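&lt;p&gt;To make "JSON-RPC 2.0 over stdio" concrete, a single request frame in ASP mode would look something like the following. The method name and params are illustrative guesses, not the actual ASP schema:&lt;/p&gt;

```shell
# Hypothetical JSON-RPC 2.0 frame; the method and params are illustrative,
# not the real App Server Protocol schema.
frame='{"jsonrpc":"2.0","id":1,"method":"thread/sendUserMessage","params":{"threadId":"t-123","text":"fix src/foo.ts"}}'
printf '%s\n' "$frame"
```

&lt;p&gt;The resident broker writes frames like this to the app server's stdin and correlates replies by &lt;code&gt;id&lt;/code&gt;; the &lt;code&gt;threadId&lt;/code&gt; field is what makes thread continuation possible.&lt;/p&gt;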




&lt;h2&gt;
  
  
  Detailed Comparison of the Three Methods
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Method 1: &lt;code&gt;codex --full-auto&lt;/code&gt; (Raw CLI)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;codex &lt;span class="nt"&gt;--full-auto&lt;/span&gt; &lt;span class="s2"&gt;"fix src/foo.ts"&lt;/span&gt;
&lt;span class="c"&gt;# = syntactic sugar for --sandbox workspace-write --ask-for-approval on-request&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Details&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Internal protocol&lt;/td&gt;
&lt;td&gt;CLI (one-shot)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auth / plan&lt;/td&gt;
&lt;td&gt;ChatGPT subscription &lt;strong&gt;or&lt;/strong&gt; OpenAI API key (either works)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Job tracking&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Background&lt;/td&gt;
&lt;td&gt;Manual &lt;code&gt;&amp;amp;&lt;/code&gt; only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Thread continuation&lt;/td&gt;
&lt;td&gt;Not supported&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt optimization&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Subscription-only features&lt;/td&gt;
&lt;td&gt;Fast Mode (unavailable with API key)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API key limitations&lt;/td&gt;
&lt;td&gt;No Fast Mode; may have delayed access to new models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;Quick experiments, interactive use, CI/batch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gotcha&lt;/td&gt;
&lt;td&gt;Cannot write outside the workspace (sandbox restriction)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Method 2: &lt;code&gt;codex-companion.mjs task&lt;/code&gt; (Indirect via Bash)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;node &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;CLAUDE_PLUGIN_ROOT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/scripts/codex-companion.mjs"&lt;/span&gt; task &lt;span class="nt"&gt;--write&lt;/span&gt; &lt;span class="s2"&gt;"..."&lt;/span&gt;
node &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;CLAUDE_PLUGIN_ROOT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/scripts/codex-companion.mjs"&lt;/span&gt; task &lt;span class="nt"&gt;--background&lt;/span&gt; &lt;span class="nt"&gt;--write&lt;/span&gt; &lt;span class="s2"&gt;"..."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Details&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Internal protocol&lt;/td&gt;
&lt;td&gt;ASP (via resident broker)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auth / plan&lt;/td&gt;
&lt;td&gt;ChatGPT subscription &lt;strong&gt;or&lt;/strong&gt; OpenAI API key (either works)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Job tracking&lt;/td&gt;
&lt;td&gt;Yes (job-id / state.json persistence)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Background&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;--background&lt;/code&gt; spawns a detached process&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Thread continuation&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;--resume-last&lt;/code&gt; continues the previous thread&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt optimization&lt;/td&gt;
&lt;td&gt;None (passes raw text as-is)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Subscription-only features&lt;/td&gt;
&lt;td&gt;Fast Mode (unavailable with API key)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API key limitations&lt;/td&gt;
&lt;td&gt;No Fast Mode; delayed new model access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Subcommands&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;task&lt;/code&gt; / &lt;code&gt;review&lt;/code&gt; / &lt;code&gt;adversarial-review&lt;/code&gt; / &lt;code&gt;status&lt;/code&gt; / &lt;code&gt;result&lt;/code&gt; / &lt;code&gt;cancel&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;Long-running tasks, external job monitoring, job management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gotcha&lt;/td&gt;
&lt;td&gt;None — the most straightforward of the three&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Method 3: &lt;code&gt;codex:rescue&lt;/code&gt; Sub-agent (Agent tool)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;subagent_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;codex:codex-rescue&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;On branch feat/xxx, ...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;run_in_background&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Details&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Internal protocol&lt;/td&gt;
&lt;td&gt;ASP (via companion.mjs)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auth / plan&lt;/td&gt;
&lt;td&gt;ChatGPT subscription &lt;strong&gt;or&lt;/strong&gt; OpenAI API key (either works)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Job tracking&lt;/td&gt;
&lt;td&gt;Yes (via companion)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Background&lt;/td&gt;
&lt;td&gt;&lt;code&gt;run_in_background: true&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Thread continuation&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;--resume&lt;/code&gt; flag in prompt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt optimization&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Auto-improved via gpt-5.4-prompting skill&lt;/strong&gt; (not available in Methods 1 &amp;amp; 2)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Subscription-only features&lt;/td&gt;
&lt;td&gt;Fast Mode (unavailable with API key)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API key limitations&lt;/td&gt;
&lt;td&gt;No Fast Mode; delayed new model access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;Delegating implementation to Codex from within Claude (recommended default)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gotchas&lt;/td&gt;
&lt;td&gt;① Cannot write outside workspace  ② False positives when Bash is denied (Issue #158)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Auth &amp;amp; Plan Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Auth method&lt;/th&gt;
&lt;th&gt;Available features&lt;/th&gt;
&lt;th&gt;Unavailable features&lt;/th&gt;
&lt;th&gt;Billing&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ChatGPT Plus ($20/mo) ~ Pro ($100–$200/mo)&lt;/td&gt;
&lt;td&gt;All features, Fast Mode, latest models (GPT-5.4 / GPT-5.3-Codex), cloud integrations (GitHub, Slack)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Fixed monthly (rate limits apply)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI API key&lt;/td&gt;
&lt;td&gt;CLI, IDE, ASP execution&lt;/td&gt;
&lt;td&gt;Fast Mode, cloud integrations, immediate access to new models&lt;/td&gt;
&lt;td&gt;Token-based pay-as-you-go&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️ Important:&lt;/strong&gt; All three methods use the same authentication. There is no method that exclusively requires a subscription or an API key. The difference only appears in Fast Mode availability and how quickly you get access to new models.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Decision Flowchart
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyjt0gbu5wt0ztogk7o45.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyjt0gbu5wt0ztogk7o45.png" alt="Decision Flowchart" width="600" height="556"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Known Issues &amp;amp; Gotchas
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Issue #158: False Positives in codex:rescue
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom:&lt;/strong&gt; When the Bash tool is denied, the sub-agent silently reads files, performs its own analysis, and falsely reports that "Codex executed it."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Expected behavior:&lt;/strong&gt; Bash denied → &lt;code&gt;return nothing&lt;/code&gt; (no output at all, rather than a fabricated report)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; In environments where Bash is restricted, you cannot tell whether &lt;code&gt;codex:rescue&lt;/code&gt; output was genuinely produced by Codex.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cannot Write Outside the Workspace
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;sandbox workspace-write&lt;/code&gt; blocks writes to directories outside the launch directory. Delegating from an ops session to a dev directory will fail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workaround:&lt;/strong&gt; Call from a Claude session inside the target workspace, or switch to DEV agent + direct Edit tool.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/openai/codex-plugin-cc" rel="noopener noreferrer"&gt;GitHub: openai/codex-plugin-cc&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://deepwiki.com/openai/codex-plugin-cc/3.2-rescue-and-task-delegation" rel="noopener noreferrer"&gt;Rescue &amp;amp; Task Delegation | DeepWiki&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developers.openai.com/codex/app-server" rel="noopener noreferrer"&gt;App Server – Codex | OpenAI Developers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developers.openai.com/codex/auth" rel="noopener noreferrer"&gt;Authentication – Codex | OpenAI Developers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developers.openai.com/codex/pricing" rel="noopener noreferrer"&gt;Pricing – Codex | OpenAI Developers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/openai/codex-plugin-cc/issues/158" rel="noopener noreferrer"&gt;Issue #158: codex:rescue false success claims&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>claudecode</category>
      <category>codex</category>
      <category>openai</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
