DEV Community: port

I asked Fable 5 to build a dex. Here's how it went.

port — Thu, 11 Jun 2026 14:33:09 +0000

A few days ago I published the Gemma 4 12B test, where a free local model wrote a dapp and found zero of its own bugs. The obvious follow-up was to run the same test on a frontier model and keep the same scoring. So I asked Claude's new Fable 5 to build me a dex, full stack, contracts to frontend, in one autonomous pass. It needed me exactly once across the whole build, to send 5 testnet MON to an address it generated for itself.

The Gemma test, with the same methodology and scoring: I asked Gemma 4 12B to create a dapp. Make no mistakes.

Same test, opposite end of the scale

My prompt was the same kind of one-liner I gave Gemma. Here it is in full, this is everything the model got from me up front:

You are claude fable 5, I want to measure your web3 dapp generation capabilities. How should we go about this? Usually I ask an agent to create a simple dex. Let's plan first.

The first thing it did was go find my Gemma article and copy the methodology out of it (the same compile-test-deploy gauntlet, scored by how many times a human has to step in). Which means the scoring you're about to read was partly designed by the thing being scored. Make of that what you will.

It asked me a few planning questions before starting. I told it to verify everything locally first and then deploy to Monad testnet, asked for the full stack, and turned down milestone check-ins. Those three answers, plus the gas later, are the complete list of things I typed for the rest of the build. It also set one constraint on itself that I liked: write the AMM from scratch instead of forking Uniswap V2, because a fork only proves you can copy. Then it went off and I watched. One model the whole way through, no subagents and no fallback to anything smaller.

The contracts compiled first try, and it debugged its own tests

The Solidity came back as a proper v2-style constant product AMM (factory, pair, router, two demo tokens), no OpenZeppelin, and it compiled clean on the first attempt. The security work showed up without me asking for any of it: a reentrancy lock and a fee-adjusted k-check in the pair, plus slippage and deadline guards on every router entry point. The one that got me was the minimum-liquidity burn against the first-depositor inflation attack, because it also wrote a test that actually runs the attack and checks the attacker ends up owning 1 share out of 1001. Here it is, trimmed:

vm.startPrank(attacker);
router.addLiquidity(address(tokenA), address(tokenB), 1001, 1001, 0, 0, attacker, DEADLINE);
Pair p = Pair(factory.getPair(address(tokenA), address(tokenB)));
// Donate to inflate share price.
tokenA.transfer(address(p), 10_000e18);
tokenB.transfer(address(p), 10_000e18);
p.sync();
vm.stopPrank();

// The attacker holds 1 of 1001 total shares: >99.9% of the donation
// accrues to the locked dead shares, not the attacker.
assertEq(p.balanceOf(attacker), 1);
assertEq(p.totalSupply(), 1001);

The comments are its own. I went looking for this pattern because it's the kind of thing auditors bill for, and it was already in the test file with the attack spelled out.
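
The arithmetic behind those two assertions can be sketched in a few lines. This is a minimal model, not the contract from the repo: it assumes the pair follows Uniswap V2's first-mint rule (shares = floor(sqrt(amountA * amountB)), with 1000 shares locked forever), which is consistent with the 1-of-1001 numbers in the test. The names isqrt and firstMint are made up for illustration.

```typescript
// BigInt() calls instead of 1000n-style literals, which need an ES2020+ target.
const ZERO = BigInt(0);
const ONE = BigInt(1);
const TWO = BigInt(2);

// Assumption: same constant as Uniswap V2. The test only shows the 1 / 1001 split.
const MINIMUM_LIQUIDITY = BigInt(1000);

/** Floor of the square root of a non-negative bigint (Babylonian method). */
function isqrt(n: bigint): bigint {
  if (n < ZERO) throw new RangeError("isqrt of a negative number");
  if (n < TWO) return n;
  let x = n;
  let y = (x + ONE) / TWO;
  while (y < x) {
    x = y;
    y = (x + n / x) / TWO;
  }
  return x;
}

/** Shares minted by the very first deposit into an empty pool (hypothetical helper). */
function firstMint(amountA: bigint, amountB: bigint) {
  const minted = isqrt(amountA * amountB);
  if (minted <= MINIMUM_LIQUIDITY) throw new Error("insufficient liquidity minted");
  return {
    toDepositor: minted - MINIMUM_LIQUIDITY, // what the first depositor can redeem
    locked: MINIMUM_LIQUIDITY,               // burned to a dead address
    totalSupply: minted,
  };
}

const seed = firstMint(BigInt(1001), BigInt(1001));
// Share of the pool (and of any later donation) owned by the locked shares, in basis points.
const lockedShareBps = (seed.locked * BigInt(10000)) / seed.totalSupply;
console.log(seed.toDepositor, seed.totalSupply, lockedShareBps);
```

With a 1001/1001 seed the depositor keeps 1 share and the dead address keeps 1000, so 1000/1001 of whatever gets donated afterwards is out of the attacker's reach. That is the ">99.9%" in the test comment.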

The test suite is where the Gemma comparison gets interesting. First run: 18 of 20 green. With Gemma, every red test turned into me reading the trace and spelling out the cause. Here the model read its own failure output, worked out that both failures were bugs in the test fixtures (its attacker tried to donate more tokens than it had left after seeding the pool, and one test expected the k-check revert when an earlier input check fires first), fixed the tests, and left the contracts alone. The full suite is 21 tests including fuzz runs, all passing, and the contracts never changed after their first draft.

It read the library instead of dreaming it

Gemma's failure mode was inventing APIs. Fable's habit is checking them. wagmi 3 came out after most training data, and instead of writing imports from memory it grepped the actual .d.ts files in node_modules to confirm which hooks and connectors exist before using them. The frontend still wasn't flawless, the type checker caught two config-level mistakes (the Next.js scaffold targets ES2017, which breaks BigInt literals, and wagmi narrows chain ids in a way its first attempt didn't satisfy), but it found and fixed both from the compiler output. At one point its fix "didn't work" because TypeScript's incremental cache was serving stale errors, and it figured that out too instead of churning on correct code. That is precisely the trap the 12B fell into for three rounds.

It built its own way to click the buttons

My favorite part of the run. A headless agent can't operate MetaMask, so it gave the app a dev-only mock wallet wired to an Anvil account, wrote a Playwright script, and drove its own UI through the whole swap flow like a user, reading the numbers off the screen as it went. The quote shown in the interface for 100 WMON over a 10,000/20,000 pool was 197.431606 USDC, which matches the constant-product formula to all six displayed decimals. The script passed on its first run.
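
For reference, that quote can be reproduced by hand. The sketch below assumes the standard Uniswap V2 formula with a 0.3% fee, that the pool holds 10,000 WMON against 20,000 USDC, and that WMON has 18 decimals and USDC has 6. The fee, the token order, and the decimals are not spelled out above; they are simply the assumptions under which the number matches. getAmountOut mirrors the V2 library function of the same name, while passesKCheck and formatUsdc are hypothetical helpers.

```typescript
// BigInt() calls instead of 997n-style literals, which need an ES2020+ target.
const FEE_NUMERATOR = BigInt(997); // assumption: 0.3% swap fee, as in Uniswap V2
const FEE_DENOMINATOR = BigInt(1000);

/** Output amount for an exact input, floored the way on-chain integer division floors. */
function getAmountOut(amountIn: bigint, reserveIn: bigint, reserveOut: bigint): bigint {
  if (amountIn <= BigInt(0)) throw new RangeError("insufficient input amount");
  if (reserveIn <= BigInt(0) || reserveOut <= BigInt(0)) {
    throw new RangeError("insufficient liquidity");
  }
  const amountInWithFee = amountIn * FEE_NUMERATOR;
  return (amountInWithFee * reserveOut) / (reserveIn * FEE_DENOMINATOR + amountInWithFee);
}

/** Fee-adjusted invariant: post-swap balances, net of the fee, must not shrink k. */
function passesKCheck(
  amountIn: bigint,
  amountOut: bigint,
  reserveIn: bigint,
  reserveOut: bigint,
): boolean {
  const fee = FEE_DENOMINATOR - FEE_NUMERATOR;
  const adjustedIn = (reserveIn + amountIn) * FEE_DENOMINATOR - amountIn * fee;
  const adjustedOut = (reserveOut - amountOut) * FEE_DENOMINATOR;
  return adjustedIn * adjustedOut >= reserveIn * reserveOut * FEE_DENOMINATOR * FEE_DENOMINATOR;
}

const WMON_UNIT = BigInt("1000000000000000000"); // assumption: 18 decimals
const USDC_UNIT = BigInt(1000000);               // assumption: 6 decimals

/** Renders a raw USDC amount with six decimals. */
function formatUsdc(amount: bigint): string {
  const whole = amount / USDC_UNIT;
  const fraction = (amount % USDC_UNIT).toString().padStart(6, "0");
  return `${whole}.${fraction}`;
}

const amountIn = BigInt(100) * WMON_UNIT;
const reserveWmon = BigInt(10000) * WMON_UNIT;
const reserveUsdc = BigInt(20000) * USDC_UNIT;
const quote = getAmountOut(amountIn, reserveWmon, reserveUsdc);
console.log(formatUsdc(quote)); // 197.431606
```

The floor in getAmountOut is what makes the quote safe to execute: under these assumptions it is the largest output that still satisfies the fee-adjusted k-check, and asking for one more raw unit fails it.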

The machine fought back a little (my fault, the box is a graveyard of old benchmark sessions 😭). A leftover Anvil from some previous run was still holding port 8545 with dirty nonces, so the first local deploy landed on wrong addresses, and ports 3000 through 3002 were busy too. It killed the stale node and moved its dev server to 3010 without any of that friction reaching me.

One human, one job: gas

For the testnet deploy it generated a fresh keypair, dropped the key in a gitignored .env, and asked me to fund the address. I sent 5 MON, which is the entire human contribution to this project. It deployed, seeded the pool, then verified the deployment the paranoid way: pulled 1,000 WMON from its own faucet contract and executed a real swap on testnet. 10 WMON in, 19.920139 USDC out, again exact against the formula. All five contracts came back exact_match on Monad's Sourcify, and about 1.4 MON of the 5 got spent. The whole run, from that first prompt to verified contracts on the explorer, took about 25 minutes of wall clock.

Final score, same scale as last time. Gemma needed a human to name every bug before it could fix one. Fable's sheet reads:

contracts: compiled on the first attempt, never edited after their first draft
tests: 18/20 on the first run, both reds self-diagnosed as fixture bugs, 21/21 after one pass
frontend: two type-level mistakes, both found and fixed from compiler output
browser e2e it wrote for itself: passed first run
humans required: one, holding 5 testnet MON

One honest caveat: the browser test signs through that mock connector, so real wallet UX, the network-switch prompt, a user rejecting a transaction, never got exercised end to end. The testnet swap went through cast rather than the UI. If there's a bug left in this thing, it's hiding in that gap. Next round I'd make it drive a real wallet extension instead of the mock, and I'd hand it something nastier than a textbook AMM (fee-on-transfer tokens are the classic way to break a pool like this one).

So my verdict, same framing as last time. The 12B was a fast junior who needed me standing over its shoulder. This worked more like a contractor I hired and never met: it scoped the job, built it, checked its own work, and billed me for materials. What's left of my job on a build like this is choosing what gets built and reviewing what comes back, and the caveat above is exactly why the reviewing half hasn't gone anywhere.

The dex is live at https://fableswap.vercel.app. Connect a wallet on Monad testnet, grab demo tokens from the in-app faucet, and swap against the same pool it deployed. The contracts are verified on the explorer, with the factory at 0x514d4aD259143c4a6bE7C2399D46CBe8B1F9E2Db, and the repo with the full run log is at https://github.com/portdeveloper/fableswap (the scorecard lives in BENCHMARK.md). If you run a model through this same gauntlet, I want to see the scorecard.

Questions?

I asked Gemma 4 12B to create a dapp. Make no mistakes.

port — Thu, 04 Jun 2026 22:53:40 +0000

A free model that fits on a laptop wrote my entire dapp, contract and frontend, and then couldn't find a single one of its own bugs.

I had a question in mind: can a free, open model you run on your own machine actually build something real for an EVM chain? So I set it up as a test. A local Gemma 4 12B wrote the code, and Claude operated it, sending the prompts and pasting back whatever the compiler said. I kept every prompt and every broken file, so you can see for yourself where a 12B helps and where it falls over.

The model is the new Gemma 4 12B, out June 3rd under an Apache 2.0 license, so you can do what you like with it. It fits in about 16GB, so I ran it on my own machine with llama.cpp, no API key and nothing leaving the laptop. It managed 20 to 40 tokens a second. The thing I had it build is a game called last-clicker. You pay a tiny fee to click, and each click resets a short countdown. Whoever clicked last when the timer runs out takes the pot. I built it against Anvil, Foundry's local node.

The first draft was good code that didn't compile

I gave it one prompt:

Build a "last clicker" game in Solidity with Foundry: a pot funded by a small fee per click, a short countdown that resets on each click, and whoever clicked last when the timer ends can claim the pot. Give me the contract.

The game logic came back right on the first try, and so did the security. Its claim() clears the balance before it sends any money out:

function claim() external {
    require(block.timestamp >= gameEndTime, "Timer has not expired yet");
    require(msg.sender == lastClickListener, "You were not the last clicker");
    require(pot > 0, "Pot is empty");

    uint256 amount = pot;
    pot = 0;                              // state cleared first
    gameActive = false;
    lastClickListener = address(0);

    payable(msg.sender).transfer(amount); // then the transfer
}

That ordering, state first and the external call last, is what stops a reentrancy attack, where the recipient calls back into claim() and drains the contract before the balance updates. It is the bug behind the 2016 DAO hack, and I assumed a 12B would reach for the naive version, but it wrote the safe one.
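
To see why the ordering matters, here is a toy model of the same claim flow. It is TypeScript rather than Solidity and only a sketch: ToyPot, claimUnsafe, claimSafe, and the callback standing in for the external call are all invented for illustration, and the contract is assumed to hold 3 units of ether while owing the winner 1.

```typescript
// Toy model of an external call: the receiver gets control and may call back in.
type Receiver = (amount: number, reenter: () => void) => void;

class ToyPot {
  held: number; // ether the contract actually has
  pot: number;  // what it believes it owes the winner

  constructor(held: number, pot: number) {
    this.held = held;
    this.pot = pot;
  }

  private send(amount: number, receiver: Receiver, reenter: () => void): void {
    if (amount > this.held) throw new Error("insufficient balance");
    this.held -= amount;
    receiver(amount, reenter); // control passes to the recipient here
  }

  /** Naive ordering: pay first, clear state afterwards. */
  claimUnsafe(receiver: Receiver): void {
    if (this.pot === 0) throw new Error("Pot is empty");
    const amount = this.pot;
    this.send(amount, receiver, () => this.claimUnsafe(receiver));
    this.pot = 0; // too late: the receiver has already re-entered
  }

  /** Ordering from the generated contract: clear state first, pay last. */
  claimSafe(receiver: Receiver): void {
    if (this.pot === 0) throw new Error("Pot is empty");
    const amount = this.pot;
    this.pot = 0;
    this.send(amount, receiver, () => this.claimSafe(receiver));
  }
}

/** Runs one claim against a recipient that re-enters until the call fails. */
function attack(mode: "unsafe" | "safe"): { stolen: number; held: number } {
  const pot = new ToyPot(3, 1);
  let stolen = 0;
  const attacker: Receiver = (amount, reenter) => {
    stolen += amount;
    try {
      reenter();
    } catch {
      // the re-entrant call failed, so stop recursing
    }
  };
  if (mode === "unsafe") pot.claimUnsafe(attacker);
  else pot.claimSafe(attacker);
  return { stolen, held: pot.held };
}

console.log(attack("unsafe")); // { stolen: 3, held: 0 }
console.log(attack("safe"));   // { stolen: 1, held: 2 }
```

With the naive ordering the recipient collects the 1-unit pot three times before the balance check stops it. With state cleared first, the second call dies on "Pot is empty" and the other 2 units stay put.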

What it could not do was hand me a project that compiled. The test file opened with this:

import "hardhat"; // If using standard, but for Foundry we use:
import "../src/LastClicker.sol";

That is a Hardhat import in a Foundry project, with a half-finished comment where the model started to correct itself and gave up. The contract declared its constructor twice:

constructor() {
    gameActive = true;
    gameEndTime = block.timestamp + COUNTDOWN_DURATION;
}
// ...further down, in the same contract...
constructor() {
    owner = msg.sender;
}

And the test set itself up with a deploy helper that doesn't exist in Foundry:

game = LastClicker(deploy(LastClicker.sol));

None of it compiles, so I pasted back just the first error, the Hardhat import, and it rewrote the whole file in one pass, fixing every compile error, including the ones I hadn't pointed at. For boilerplate it can't quite remember, that's a fast way back to green.

Then it couldn't debug its own tests

The code compiled, so I ran the tests. All three reverted on the first line that moved money:

vm.prank(player1);
game.click{value: 0.001 ether}();   // reverts: player1 holds no ether

The test never funded the accounts. In Foundry you give a test address a balance with vm.deal, and that one line fixes all three. I handed it the failure. It added vm.warp, then on the next round vm.roll, convinced the problem was timing. Three rounds in, the tests were failing exactly as before, down to the gas, and it was still editing the clock while the real cause sat untouched in its own output.

So I stopped asking it to fix the tests and told it the cause instead:

The tests revert on the first click{value:} because the player accounts have a zero balance. In Foundry you fund an address with vm.deal. Fix the test.

It added vm.deal, and one of the three passed. The other two had their own bugs: a timer check that never advanced the clock, and player addresses set to address(1) and address(2), which are precompiles and can't receive ether. Each passed only after I named the exact cause. It can apply a fix you hand it, but it can't find one on its own.

The frontend looked finished and was hollow

I asked for a single-page frontend with viem. The layout it returned was genuinely good, a clean dark card with a live countdown. The web3 layer under it was invented from scratch, starting with the imports:

import {
  createPublicClient, createWalletClient, parseEther,
  publicAddress, solidityAbiInterpreter, formatEther
} from 'https://esm.sh/viem';

publicAddress and solidityAbiInterpreter are not part of viem. They sound like they should be, which is the whole problem. It then sent transactions through a method it invented:

const hash = await walletClient.sendTransaction({
  to: CONTRACT_ADDRESS,
  data: contract.writeMethods.click.encoded,   // not a real thing
});

It built the chain config with the wrong shape and called wallet_switchChain, which isn't a real wallet method (the real one is wallet_switchEthereumChain). On a library it has seen less of, it knows the silhouette of the right code and fills the specifics with confident fiction, and the glue between a contract and a UI is almost all specifics. I rewrote the wiring myself. The interface was its work, the plumbing was mine.
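
For comparison, the real wallet call is small. The sketch below builds the EIP-3326 request object without any library. switchChainRequest is a made-up helper name, and 10143 is assumed to be Monad testnet's chain id (check it against the chain's docs before relying on it).

```typescript
interface SwitchChainRequest {
  method: "wallet_switchEthereumChain";
  params: [{ chainId: string }];
}

/** Builds the EIP-3326 request; wallets expect the chain id as a 0x-prefixed hex string. */
function switchChainRequest(chainId: number): SwitchChainRequest {
  if (!Number.isSafeInteger(chainId) || chainId <= 0) {
    throw new RangeError(`invalid chain id: ${chainId}`);
  }
  return {
    method: "wallet_switchEthereumChain",
    params: [{ chainId: "0x" + chainId.toString(16) }],
  };
}

// Assumption: 10143 is Monad testnet's chain id.
const request = switchChainRequest(10143);
console.log(request.method, request.params[0].chainId); // wallet_switchEthereumChain 0x279f
```

In a browser with an injected EIP-1193 provider, this object is what gets passed to window.ethereum.request(...). The method name and the hex-string chain id are exactly the kind of specifics the model guessed at.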

The reveal: it was Monad, and it took one line

I never told the model what chain this was for, because there was nothing to tell it. Anvil is just the EVM, and every line it wrote was ordinary EVM code. Once the contract and tests were green, I pointed Foundry at one URL:

forge create src/LastClicker.sol:LastClicker --rpc-url https://testnet-rpc.monad.xyz --broadcast

Foundry read the chain id off the endpoint on its own, and the deploy went through on the first try. Verifying the source on Monad's explorer was one more API call that came back a perfect match. The chain was Monad (where I work, so grain of salt), and the model never needed to know it, because Monad runs EVM bytecode and the Solidity it already knew was correct. The only Monad-specific detail in the whole build was that one RPC URL, and even the testnet MON for gas came from an agent faucet over an API call.

One honest caveat: forge's linter flagged the timer for leaning on block.timestamp, which validators can nudge. That matters more on a one-second chain than a twelve-second one, and you would tighten it before mainnet.

The result is live at https://gemma-last-clicker.vercel.app. Connect a wallet with a little testnet MON and click.

Every click is a real transaction that confirms in about a second and costs a fraction of a cent, which is the only reason a game made of last-second clicks can live entirely on-chain.

So how usable is it?

Treat a free local model as a fast junior. It is genuinely good at the parts it has seen a thousand times, standard contract logic and clean HTML, and it reached for the right security pattern without being asked. It saves you real time on the first draft. It comes apart the moment it touches a specific library's real API or has to read a stack trace, and across this whole build it found zero of its own bugs. Every error was caught by the compiler or by me.

So a 12B gets you a working first draft of a contract and a good-looking shell of a frontend, and then you do the debugging and the integration by hand. For learning and for things you'll throw away, that's plenty. For anything you would deploy and walk away from, it needs someone next to it who can read the errors it can't.

The repo has the code and every prompt I used: https://github.com/portdeveloper/gemma-last-clicker. The file that finally got it deploying to Monad cleanly is MONAD_CONTEXT.md in there.

Simple just works: how i built puddleswap

port — Wed, 20 May 2026 11:18:48 +0000

Any problem yields to enough complexity.

Most of puddleswap was built by an AI agent, and that is most of why I want to talk about it. An agent reaches for the textbook answer by default, because the textbook is what it read. Engineers do the same, since most of us meet the general case years before we ever meet the specific one. The work that's left for a human is catching the moment a clever solution is solving a problem you don't actually have. I almost missed that moment on the routing. Here's how that went, plus the gut-check I run now before writing anything clever. If you ever feel yourself overengineering things, this is for you.

I was at a Monad Blitz event (if I'm not mistaken, the one in Ankara), watching everyone around me hack on cool stuff while I sat in the corner answering their questions. I mean, that's my job, but it felt weird not building anything while everyone else was trying their best.

So at some point I figured I should just build something (while not ignoring people at the same time lol). Something simple enough that the brag would be how little it took.

That's how puddleswap happened: a testnet dex on Monad.

Going in, I wanted the fewest moving parts I could get away with. The thing I'd be most proud of would be how little there was to maintain. (If possible, I wanted nothing to maintain at all.)

The agent did the bulk of it. It wrote the React frontend and deployed the contracts; the swap UI came together as it went. The contracts are stock Uniswap V2, audited a thousand times over the years (centuries in web3). The frontend is Vite + React with no backend. The swap accepts real Circle USDC, a mock USDT we deployed for testnet liquidity, and WMON. A small rebalancer service on Railway keeps the price pegs roughly honest.

It's live at app.puddleswap.org.

The build was mostly uneventful. The agent did its thing, I reviewed diffs, we iterated. What I want to talk about is the one decision I almost got wrong: the routing.

The thing I almost overengineered

Standard answer for "how does a DEX UI route swaps" is a graph algorithm. You have N tokens and M pools, build the liquidity graph, run shortest-path weighted by output amount, return the best route. 1inch and Matcha both work this way and every aggregator article online tells you to do the same, so I started writing it.

Then I looked at my actual data.

Three "core" tokens: USDC, USDT, WMON. Maybe ten pools, every one of them touching at least one core. I was writing a graph algorithm to solve a problem I didn't have.

The gut-check is dumber than it sounds: look at your actual data before you pick the algorithm, and count the inputs while you're in there. I had three hubs and ten pools. I was about to write code for a scale I would never see.

So I deleted it and wrote the following instead (s/o to @danielvf for the idea + the initial PRD).

The enumeration

For any swap A → B, enumerate every plausible route through the hubs:

Direct: A → B
Through one hub: A → USDC → B, A → USDT → B, A → WMON → B
Through two hubs: A → USDC → USDT → B, A → USDC → WMON → B, A → USDT → WMON → B, and reverses

That's at most ten candidate paths. Send all ten quote requests in one multicall, pick the path with the highest output, swap on that.

const routes = buildCandidateRoutes(tokenIn, tokenOut, cores);

const results = await publicClient.multicall({
  contracts: routes.map((path) => ({
    address: router,
    abi: routerAbi,
    functionName: "getAmountsOut",
    args: [amountIn, path],
  })),
  allowFailure: true,
});

const best = selectBestQuote(results);

The whole router is around 50 lines. It builds the candidate list (deduped) and returns whichever path the multicall said had the highest quote.
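
Here is a sketch of what those ~50 lines might look like, written from the description above rather than copied from the repo. The body of buildCandidateRoutes is an assumption, pickBestRoute is a hypothetical stand-in for selectBestQuote that also takes the routes, tokens are plain strings instead of addresses, and QuoteResult imitates the shape of one multicall entry when failures are allowed.

```typescript
type Token = string;
type Route = Token[];

// Imitates one entry of a multicall result when failures are allowed.
type QuoteResult =
  | { status: "success"; result: readonly bigint[] }
  | { status: "failure"; error: Error };

/** Direct route, one-hub routes, and two-hub routes in both hub orders. */
function buildCandidateRoutes(tokenIn: Token, tokenOut: Token, cores: readonly Token[]): Route[] {
  // A hub that is already an endpoint would only produce duplicate or looping paths.
  const hubs = cores.filter((core) => core !== tokenIn && core !== tokenOut);
  const routes: Route[] = [[tokenIn, tokenOut]];
  for (const hub of hubs) routes.push([tokenIn, hub, tokenOut]);
  for (const first of hubs) {
    for (const second of hubs) {
      if (first !== second) routes.push([tokenIn, first, second, tokenOut]);
    }
  }
  return routes;
}

/** Picks the route whose final hop returned the most tokens; failed quotes are skipped. */
function pickBestRoute(
  routes: Route[],
  results: QuoteResult[],
): { path: Route; amountOut: bigint } | null {
  let best: { path: Route; amountOut: bigint } | null = null;
  for (let i = 0; i < routes.length; i++) {
    const quote = results[i];
    if (quote === undefined || quote.status !== "success" || quote.result.length === 0) continue;
    const amountOut = quote.result[quote.result.length - 1];
    if (best === null || amountOut > best.amountOut) best = { path: routes[i], amountOut };
  }
  return best;
}

const cores = ["USDC", "USDT", "WMON"];
console.log(buildCandidateRoutes("A", "B", cores).length);    // 10
console.log(buildCandidateRoutes("USDC", "B", cores).length); // 5
```

Two non-core tokens give the full ten candidates (1 direct, 3 one-hub, 6 two-hub). When one side of the swap is itself a hub, the list shrinks, which is where the dedup comes from in this version.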

The agent will hand you the general solution

I'm not saying graph routing is wrong. For a mainnet aggregator routing across thousands of pools and dozens of DEXes, it's the right tool. But I wasn't building that.

The old lesson was: "a lot of code over-solves the problem."

You see it everywhere once you start looking. A sorting algorithm where the data is always ten items or fewer, when plain insertion sort would have done. A caching layer sitting in front of a database that gets hit twice a day, as if the database weren't already a cache. Or my favorite, pub/sub wired up for exactly one publisher and one subscriber, where you could have called the function. Another example you might have noticed is Claude suggesting Redis for caching in a tiny app, where a simple in-memory cache would do because the app doesn't restart often enough to justify anything more.

That Redis suggestion is the tell, and it's worth sitting with. The smart-looking solution is usually the general problem dressed up, and there are now two reasons it ends up in your editor. An engineer reaches for it because the general case is what they studied, and because the small version doesn't look like much (nobody brags about an insertion sort). An agent reaches for it because the general case is most of what it read. "Trained on" is literal for the agent and a figure of speech for the human, and the two of you ship the same overbuilt code.

And the new problem we are facing is that the interesting work has shifted from writing the solution to spotting the constraint. The agent can write the graph router faster than I can, and it will, unless I hand it the shape of what I actually have. On puddleswap that shape is:

One chain, one DEX
Three hub tokens I control (or my agent controls)
Operator-maintained liquidity
A UI so simple that my grandma could use it (rip grandma)

Give it those four lines and enumeration falls out on its own. Within those constraints it's correct (every meaningful route gets checked) and faster than graph traversal, since it's one batched RPC instead of N round-trips. It's also a fraction of the code.

When this breaks

I'd be lying if I said this scales. The enumeration is correct because of one invariant I quietly lean on: every pool touches a core token, so every route worth taking runs through a hub. The failure modes are all just that invariant giving way:

Exotic-to-exotic pools that bypass the hubs entirely. Enumeration misses them.
A hub runs dry of liquidity on one side. Router still checks routes through it and eats a bad quote.

The day that invariant stops holding is the day I bother writing the graph router.
(it'll probably do fine as it is right now)

The end

If you're building on Monad testnet and need swaps for your tests, puddleswap is live at app.puddleswap.org. The router is at puddleswap/web/src/lib/routing.ts.

So before you accept the clever thing your agent just wrote, do the part it won't do for you: look at your actual data and ask whether a smaller solution already covers it, because it usually does. And ask the agent for the simpler version out loud, since it won't offer one on its own.

Related: How to find ideas worth building - the same heuristic applied to a different problem.

You don't know how to vibe-code

port — Sun, 17 May 2026 12:30:08 +0000

It's 2026. We have AGI (or at least the ability to code almost anything thanks to models like Opus 4.5 from Anthropic and GPT 5.2 from OpenAI).

But there's one problem. What you create in minutes creates problems you spend hours trying to fix. And if you're unlucky, you end up with a spaghetti codebase that no LLM can untangle. You no longer understand the code. It doesn't even make sense to read it anymore.

So, what are you doing wrong and what could you do better, and how do some people get everything right when they are vibe-coding?

Honestly, vibe coding kinda gave people the wrong impression about using LLMs to write code. Somehow everyone ended up thinking "yeah i can do this with ONE PROMPT, without EVER LOOKING AT THE CODE".

That just won't work, unless you consider this good work:

And the code behind it is even worse. The AI's knowledge is months old, maybe a year. It doesn't know your codebase. It doesn't know what "done" means.

Alright, here's how I actually vibe-code. Or rather, how I use my current favorite tool (Claude Code) to build real projects.

I'm going to walk you through how I built execevents.xyz, a real-time execution visualizer for Monad. Blocks race across the screen as they go through consensus. Transactions stream in live. You can see state changes, call traces, gas usage.

a short glance at execevents.xyz

This isn't a toy project. Under the hood, execevents connects to Monad's Execution Events API—a Rust service that reads blockchain data directly from shared memory, HFT-style. We're talking sub-millisecond latency for real-time block and transaction data. Building something that interfaces with infrastructure this performant would normally require deep systems knowledge.

But here's the thing: I built this in HOURS, not days, not weeks. Using Claude Code and the methodology below, anyone can build high-performance applications on Monad without being a systems engineer or even a regular developer.

Below is my methodology for vibe-coding, or really just how I code.

Step 1: Think about the end goal

Visualize the most basic version of what you want to build. I usually ask Claude something like this:

I read about execution events from Monad docs and I want to build an app showing how to use them. Here is the page about execution events: (i paste the markdown here) Do not start building until I confirm. Tell me how you are planning to build this. Then ask me to confirm. Also, ask me any questions you have. Our first goal is to reach to a basic MVP.

Above is the answer I got from Claude. Notice how it told me exactly what it was going to do. I can now visualize what I'm going to get and can direct the project better. This is the point where I want to stop and think. If everything looks OK, I move on to the questions Claude asked and start answering them.

Much like real coding, you want to spend more time thinking about the code than writing it.

You might do several iterations before you even tell Claude to build. I usually tell it in every message not to build until I like the implementation plan.

I also use plan mode a lot. It is the new way of telling Claude to ask you questions, and it works really well!

Step 2: Build the MVP, then use it

Then, ask Claude to start building. When it finishes, test it. This is the part people LOVE skipping, not knowing that the problems that show up later actually stem from skipping it. When you find an issue, have Claude fix it, then go back and find the next one. Repeat until there are no issues left.

Step 3: Iterate with small, focused prompts

This is where most people mess up. They find five things wrong and try to fix them all in one massive prompt.

Don't do that.

Every time you find something broken, fix just that one thing. Here's what my prompts actually looked like:

Prompt 1: "The TPS calculation is wrong. It's counting blocks that arrive in batches over WebSocket. Make it only count consecutive block numbers."

Prompt 2: "This doesn't work on mobile. Add a responsive layout with a bottom sheet for block details."

Prompt 3: "The block state transitions are too abrupt. Add CSS transitions so blocks slide smoothly between states."

Each prompt is:

Specific -> I'm telling it exactly what's wrong
Small -> targeting one thing
Reviewable -> I can read the diff and understand what changed

Step 4: Read the Code

Or at least, take a quick glance at it. Every time Claude makes a change, I read the diff. Not because I don't trust it, but because I need to understand what I'm shipping.

Reading doesn't mean auditing every line. It usually means:

Skimming the diff
Understanding the approach
Asking yourself "does this make sense?"

By reading the code, you will catch mistakes, learn, and stay in control. The moment you stop understanding your codebase is the moment you can't fix it anymore. Do not turn your project into a mess you can't make sense of.

And if you don't understand something in the code, you can open a new terminal window and ask Claude Code to explain it to you.

What I Learned

Building execevents taught me things I wouldn't have learned from tutorials.

On the systems side: I now understand how Monad's Execution Events work at a low level, how the Rust API pulls data from shared memory, why certain event types arrive in batches, and how to handle the timing edge cases that come with real-time blockchain data. Claude didn't just write code; it explained the architecture as we built it. When the TPS calculation was wrong, debugging it meant understanding WebSocket message ordering and block finality.

On the vibe-coding side: I learned that the quality of your output directly reflects the quality of your iteration loop. The people who fail at vibe-coding aren't bad at prompting, they're bad at testing and reading diffs. They skip the boring parts.

The real unlock is this: with the right methodology, AI tools let you punch above your weight. You can build performant, production-grade applications that interface with serious infrastructure, even if you've never written Rust or worked with shared memory systems. The barrier isn't coding ability anymore. It's knowing how to guide the process.

Now, go.

And do magic, for we live in a magical era.

You are prompting GPT 5.5 wrong.

port — Sun, 17 May 2026 12:29:54 +0000

Source: OpenAI.

Prompting GPT 5.5 is A LOT different than how you prompted any model before. And GPT 5.5 itself can't write good prompts for itself! See the screenshot below from @victortaelin

So, in this short article, I will be talking about how to create good prompts for GPT 5.5 so that you can do your work better and faster.

Btw before we go any further, this guide is for using GPT 5.5 inside Codex.

So here's what changed. Older models needed you to walk them through the steps. First do this, then check that, then call this tool. GPT 5.5 reasons more efficiently and that kind of prompting actively makes it worse. It narrows the search space & you end up with mechanical answers.

The fix is the opposite of what people are doing. Describe the destination, not the route. Let the model figure out the path.

I've been changing how I prompt since 5.5 dropped. Here are the 5 moves with the highest hit rate, with examples you can paste in (or modify) directly.

1. Lead with the outcome

Stop telling the model HOW to solve the problem, instead tell it what the result should look like.

(btw the full examples are at the end)

Resolve the customer's issue end to end.

Success means:
- the eligibility decision is made from the available policy and account data
- any allowed action is completed before responding
- the final answer includes completed_actions, customer_message, and blockers
- if evidence is missing, ask for the smallest missing field

2. Kill the preamble

Codex loves to narrate. "I'll start by examining the file structure." "Let me first check the existing implementation." "Now I'll proceed to make the changes."

You don't need any of this. You can see what it's doing. The preamble is noise & it eats latency before any real work happens.

Skip preambles. Do not narrate what you are about to do before doing it. Do not announce tool calls. Do not end with "Let me know if you'd like adjustments" or "Feel free to ask if you have questions."

When you finish, report what changed in 2-4 lines. File paths, what was modified, anything I need to know to use the change. That's it.

3. Bias to action, finish what you start

Default Codex behavior on a hard task is to surface a plan and stop. We don't want that. We want action. Get action:

Bias to action. If the request is clear and the next step is reversible, just do it. Do not stop at analysis, do not stop at a plan, do not stop after the first file change.

Persist until the task is fully handled end to end in this turn:
- carry changes through implementation, verification, and a clear summary
- if you hit a blocker, try one more reasonable approach before stopping
- only stop early if the next step is irreversible, destructive, or genuinely ambiguous

Unless I explicitly ask for a plan or a question, assume I want code shipped.

(btw this is from the OpenAI Codex starter prompt)

4. Read in parallel, not one file at a time

Watch Codex on a real task. It reads package.json, waits, reads src/index.ts, waits, reads src/utils.ts, aaaand waits some more... Use this:

When you need to read multiple files, read them in parallel in a single batch, not sequentially.

Workflow:
1. Plan all the files you need before reading any
2. Issue one parallel batch of reads
3. Analyze together
4. Only do another batch if new unpredictable reads come up

Same for searches. If you need to grep for 3 patterns, run 3 searches in parallel. Sequential reads are only justified when one result genuinely determines the next.
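For a sense of what "one parallel batch" means in practice, here is a rough Python sketch using a thread pool. It is not how Codex schedules its tool calls, only the same idea in plain code. The function name and the worker cap of 8 are my assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_batch(paths: list[str]) -> dict[str, str]:
    """Read every file in one parallel batch and return {path: contents}."""
    if not paths:
        return {}
    # Cap the pool at 8 workers (arbitrary choice for this sketch).
    with ThreadPoolExecutor(max_workers=min(8, len(paths))) as pool:
        contents = pool.map(lambda p: Path(p).read_text(encoding="utf-8"), paths)
        # map() preserves input order, so zip pairs each path with its text.
        return dict(zip(paths, contents))
```

Calling read_batch(["package.json", "src/index.ts", "src/utils.ts"]) issues all three reads at once, so the total wait is roughly the slowest read instead of the sum of all of them.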

5. Make it actually verify

Run validation and tests. Don't trust "this should work":

After making changes, run the relevant validation:
- targeted tests for the behavior you changed
- typecheck and lint
- build, if the change touches anything build-time sensitive
- a quick smoke test on the running app if it's user-facing

If validation fails, fix it before reporting done. If validation can't run in this environment, say so & describe the next best check I can run myself.

"Done" means verified, not "code is written."
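The same rule is easy to script for yourself. Below is a small Python sketch that runs a set of checks and reports which ones failed. The check names and the stand-in commands are assumptions so the example runs anywhere; replace them with your project's real test, typecheck, and lint commands.

```python
import subprocess
import sys


def run_checks(checks: dict[str, list[str]]) -> list[str]:
    """Run each command and return the names of the checks that failed."""
    failed = []
    for name, command in checks.items():
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(name)
    return failed


# Stand-in commands (hypothetical): one passes, one exits with status 1.
demo_checks = {
    "tests": [sys.executable, "-c", "assert 1 + 1 == 2"],
    "lint": [sys.executable, "-c", "raise SystemExit(1)"],
}
print(run_checks(demo_checks))
```

"Done" in this sketch is an empty list. If anything comes back, the work is not finished.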

Here are 3 simple rules to follow when prompting GPT 5.5:

Add a completeness rule
Add a stop condition
Force verification
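If you reuse these rules a lot, you can template them. This is a minimal Python sketch of a prompt builder with one slot per rule. The function and parameter names are mine, and the wording only loosely mirrors the examples in this post.

```python
def build_prompt(task: str, done: str, stop_if: list[str], verify: list[str]) -> str:
    """Assemble a prompt with a completeness rule, a stop condition, and verification."""
    return "\n\n".join([
        # Completeness rule: say what "done" looks like.
        f"{task} Done = {done}.",
        # Stop condition: the only reasons to pause and ask.
        "Stop & ask only if: " + ", or ".join(stop_if) + ". Otherwise just ship it.",
        # Forced verification: what must run before reporting done.
        "Verify before reporting done: " + ", ".join(verify) + ".",
    ])


print(build_prompt(
    task="Build [feature].",
    done="it works in the running app and has at least one test",
    stop_if=["the next step is destructive", "requirements are genuinely ambiguous"],
    verify=["run affected tests", "typecheck", "lint"],
))
```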

Here are five examples you can adjust to your use case:

1. Building a feature

Build [feature]. Done = it works in the running app, has at least one test for the new behavior, types and lint clean, diff scoped to this change only.

Stop & ask only if: the next step is destructive, requirements are genuinely ambiguous, or you'd need to expand scope to 3+ unrelated files. Otherwise just ship it.

No preamble. Don't narrate before doing. When done, report changed files + what was modified in 2-4 lines.

Verify before reporting done: run affected tests, typecheck, lint. If anything fails, fix it. "Should work" is not done.

2. Fixing a bug

Fix [bug]. Done = root cause is fixed (not the symptom), a test exists that fails before the fix and passes after, no other behavior regressed, diff scoped to the fix.

Stop & ask only if: the bug isn't reproducible from what I gave you, the root cause is in unexpected scope (different module, infra, dependency), or two plausible root causes exist and the wrong fix would mask the real bug.

No preamble. Don't walk me through your hypothesis before testing it. When done, report root cause + fix + what you verified in 3-5 lines.

Verify before reporting done: run the regression test, run the affected module's full suite, confirm the original repro is gone.

3. Refactoring

Refactor [target]. Done = behavior is byte-identical before and after, all existing tests pass without modification, types and lint clean, diff scoped to the refactor.

Stop & ask if: you can't preserve behavior without changing a test (means the refactor changed semantics), the refactor naturally pulls in a 3rd+ file beyond what we discussed, or you find a real bug while refactoring (surface it separately, don't silently fix it inside the refactor diff).

No preamble. Don't explain the refactor plan before doing it. When done, report what moved, what's now where, and what was verified in 2-4 lines.

Verify before reporting done: run the FULL test suite (refactors break unexpected places), typecheck, build.

4. Migration / upgrade

Migrate [target] from [old] to [new]. Done = the codebase compiles and runs on the new version, all existing tests pass without behavior changes, deprecation warnings from the migration are resolved (not suppressed), diff is scoped to the migration only.

Stop & ask if: the new version requires a behavior change that affects users (don't make that call alone), the migration touches config, infra, or build files in ways we didn't discuss, or you find code that depends on the old version's bugs (genuinely tricky - surface it, don't paper over it).

No preamble. Don't list every breaking change in the changelog before starting - read the changelog yourself and apply what's needed. When done, report what was migrated, what was left untouched and why, and any deprecation warnings still standing.

Verify before reporting done: run the full test suite (migrations break unexpected places), typecheck, build. If the project has integration or e2e tests, run those too - unit tests pass through migrations more often than you'd think.

5. Adding tests to existing code

Add tests for [target]. Done = the tests exercise the actual behavior (not implementation details), they pass against the current code, they would fail if the behavior broke, coverage hits the meaningful branches not just the happy path.

Stop & ask if: the code is genuinely hard to test because of how it's structured (don't refactor it to make testing easier without checking), you find a real bug while writing tests (surface it separately, don't quietly fix it), or the existing tests already cover this and I missed it.

No preamble. Don't outline the test plan before writing - just write the tests. When done, report what's covered, what's intentionally not covered, and anything you found while writing them.

Verify before reporting done: run the new tests (must pass), then mutate the code under test in a small way and rerun (the tests must fail - if they don't, they're testing the wrong thing). Run the full suite to make sure nothing else broke.
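The "mutate and rerun" step is the least familiar one, so here is a tiny Python sketch of it. Everything in it is hypothetical: a toy discount function, a hand-written mutant, and a two-assertion suite. Real mutation testing tools automate the mutant generation, but the check is the same.

```python
def apply_discount(price_cents: int, percent: int) -> int:
    """Code under test: take `percent` off `price_cents`."""
    return price_cents * (100 - percent) // 100


def mutated_apply_discount(price_cents: int, percent: int) -> int:
    """Deliberately broken copy: adds the discount instead of subtracting it."""
    return price_cents * (100 + percent) // 100


def suite_passes(fn) -> bool:
    """Run the test suite against `fn` and report whether every check passed."""
    try:
        assert fn(10000, 10) == 9000
        assert fn(5000, 0) == 5000
        return True
    except AssertionError:
        return False


print(suite_passes(apply_discount))          # real code: the suite should pass
print(suite_passes(mutated_apply_discount))  # mutant: the suite should fail
```

If the second call also passed, the suite would not be checking the discount at all, which is exactly the failure mode the prompt is guarding against.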

And here are 5 things to avoid:

Telling Codex HOW to solve it instead of what done looks like
Asking GPT to create a prompt for itself
Using the same chat for more than one task
Sequential file reads on multi-file tasks (waste of latency)
Trusting "this should work" without running the tests (never do this)

Alright, if you take one thing from this: before you reach for that Extra High button, rewrite the prompt using the tips above. (and give me a follow)

Read more: developers.openai.com/api/docs/guides/prompt-guidance

Skills don't work the way we think they do

port — Sun, 17 May 2026 12:29:53 +0000

I just finished reading the SkillBench paper: https://arxiv.org/pdf/2602.12670

And the results are definitely not what most people expect.

What researchers did

They evaluated 86 real-world tasks across 11 domains, for 7,308 runs in total.

Each task was tested in three modes:

Baseline (no skills)
Curated skills (human-written)
Self-generated skills by the model

Without further ado, below are some conclusions that I found interesting in the paper.

Self-generated skills don't help

One of the most hyped ideas in agent research is:

"Let the model write its own tools / skills."

But it is mostly a wasted effort. In this research, self-generated skills produced no meaningful improvement over baseline.

In some cases, they made performance worse.

Today's models simply cannot reliably create useful reusable procedural abstractions.

This matters because a huge part of current agent research assumes models can recursively improve by generating better skills/tools. This benchmark suggests that assumption is premature.

Human-made skills work A LOT better

When Skills were carefully written by humans, performance jumped +16.2 percentage points on average.

But here's what's even more surprising:

Domain variance was extreme

Some domains saw small gains (~4-5 pp)
Others saw enormous gains (~50+ pp)

Skills don't help equally across fields. They disproportionately help in structured, procedural domains.

Smaller models + skills ≈ bigger models without skills

A smaller model with curated Skills matched or exceeded a larger model without Skills.

This is huge for cost optimization:

Local agents
Edge deployment
Open-source models

Too many skills can hurt

Overly broad or verbose skill libraries degraded performance. Focused, minimal skill modules performed better.

Pick your skills carefully. 2-3 skills work better than 4+ skills.

Here is my takeaway

If this paper is right (and I think it is, mostly because of my personal experience with skill files):

Scaling alone isn't enough
Autonomy narratives are premature
Skill architecture design is now a first-class research problem

Read the full paper: https://arxiv.org/pdf/2602.12670

so... how to create a skill that works?

port — Sun, 17 May 2026 12:29:52 +0000

In my previous article, I argued that skills don't work the way most people expect.

Related: Skills don't work the way we think they do

The data from SkillBench supports this. Attaching skills doesn't automatically guarantee better performance.

So the real question becomes:

If skills don't magically fix models... How do you engineer them properly?

To answer that, we need to understand how knowledge itself works.

I think human knowledge is like a block of cheese.

It grows over time, with holes ever-present.

When we hit something we don't know, we:

look it up
learn it
apply it
patch the hole and move forward

LLMs don't do this.

When they hit a hole, they don't say "I don't know."

They hallucinate. They lazily fill the gap with plausible-sounding but incorrect information.

Aaand that's where things break, and we, being the superior entity, come in to help.

The Two Types of Holes

Through trial and error, I've noticed there are two kinds.

1. Knowledge gaps

Example:

My OpenClaw agent tries to open a browser extension. It fails.

I tell it:

"You already have a browser. Open that."

Suddenly the dumdum understands the task and opens the freaking browser.

It wasn't incapable.

It just didn't reason through the environment correctly.

That's a hole.

2. Moldy knowledge

Sometimes it does know something, but it's outdated.

Examples:

Using useScaffoldContractRead instead of useScaffoldReadContract in Scaffold-ETH
Manually defining Monad mainnet instead of importing from viem/chains

That's stale info on the LLM's side. I call it mold.

And mold spreads silently. If you don't correct it once, it keeps reappearing in future runs. And you might never notice it.
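One way to stop mold from slipping by unnoticed is to scan generated code for the stale names you already know about. Here is a minimal Python sketch. The rename comes from the Scaffold-ETH example above; the function name and the sample snippet are made up for illustration.

```python
# Known stale identifier -> current identifier (from the example above).
STALE_TO_CURRENT = {
    "useScaffoldContractRead": "useScaffoldReadContract",
}


def find_mold(generated_code: str) -> list[str]:
    """Return a fix-it note for every stale identifier found in the code."""
    return [
        f"replace {stale} with {current}"
        for stale, current in STALE_TO_CURRENT.items()
        if stale in generated_code
    ]


# Hypothetical model output, invented for this example.
snippet = "const { data } = useScaffoldContractRead({ contractName: 'Token' });"
print(find_mold(snippet))
```

Each time you catch a new piece of mold, add it to the mapping so the next run gets flagged automatically.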

How I Create Skill Files

Here's my actual process.

1. I let the model fail

For example, when I was building the monad-development skill, I simply said:

"Create a token on Monad."

That's it. Then I watched it fail.

I didn't over-direct it.

I wanted to see where the holes were.

2. I take notes on every failure

This sounds weird, but yes, I watch it and take notes, or let it take notes after the run. Once the LLM completes its run, I ask it "What did you have problems with?" and "What did you fail to do on the first try?", then I check whether the thing I asked for is built the way I wanted.

3. I create the skill.md file

The skill file contains the patches that fill the gaps in the LLM's knowledge and remove the mold, plus whatever fills the gap left behind once the moldy part is gone.

The file is concise, specific, and clear.

4. I re-run and benchmark

I run the same prompt again with the skill attached. If it still struggles, I refine the skill.

I repeat until:

First-attempt success rate is high
Hallucinations drop (mostly)
Tool usage becomes clean and consistent
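To make "first-attempt success rate is high" concrete, here is a small Python sketch of the bookkeeping. The run logs are invented, and the 0.8 target is an assumption on my part, not a number from my actual runs.

```python
def first_attempt_success_rate(results: list[bool]) -> float:
    """Share of runs that succeeded on the first attempt, from 0.0 to 1.0."""
    return sum(results) / len(results) if results else 0.0


def needs_another_pass(results: list[bool], target: float = 0.8) -> bool:
    """True while the success rate is still below the target."""
    return first_attempt_success_rate(results) < target


# Made-up run logs: True means the run succeeded on the first try.
before_skill = [False, False, True, False, False]
after_skill = [True, True, True, False, True]
print(first_attempt_success_rate(before_skill), needs_another_pass(before_skill))
print(first_attempt_success_rate(after_skill), needs_another_pass(after_skill))
```

Keep the prompt fixed between rounds so the only thing that changes is the skill file.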

What This Really Is

This is systematic failure harvesting. Treat the LLM as a system with blind spots and engineer around them.

Prompt. Let it fail. Take notes. Create a skill file out of your notes. Rinse and repeat until you are at a desired success rate.

This is how you create a skill that actually works.

I built a copy-for-LLMs button for Docusaurus. Then Ethereum and Sui shipped it.

port — Mon, 27 Apr 2026 19:01:36 +0000

A few months ago I got tired of selecting docs pages and pasting them into Claude. Half the time the nav came along with the content. So I built docusaurus-plugin-copy-page-button: a one-line install that drops a Copy page button into your Docusaurus sidebar.

When I click the button, I get the page as clean markdown. I also added a dropdown that opens the page directly in ChatGPT, Claude, or Gemini.

Setup:

npm install docusaurus-plugin-copy-page-button

Then one line in docusaurus.config.js:

plugins: ['docusaurus-plugin-copy-page-button']

That's it.

Six months later, I see the plugin running on:

Ethereum execution-apis
Sui, Walrus, Seal, SuiNS (Mysten Labs)
Monad
Flare
Kaia
Nillion
Chronicle

Around 10k installs a month, mostly blockchain ecosystems. I didn't aim at that niche, it just landed there.

What was actually hard

Three things took most of the time.

Content extraction. Docusaurus pages come wrapped in nav, breadcrumbs, edit-this-page links, footers, and a sidebar. The plugin walks the DOM, finds the article container, drops the chrome, and hands the rest to a markdown converter that handles code blocks, tables, lists, and admonitions.
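The plugin itself runs as JavaScript in the browser, so the following Python sketch is not its code. It only illustrates the "find the article, drop the chrome" idea with the standard library. The tag list and the sample page are assumptions, and markdown conversion is left out.

```python
from html.parser import HTMLParser

# Elements treated as page chrome in this sketch (assumed list).
CHROME_TAGS = {"nav", "footer", "aside"}


class ArticleText(HTMLParser):
    """Collect text inside <article>, skipping nested nav, footer, and aside."""

    def __init__(self) -> None:
        super().__init__()
        self.article_depth = 0  # greater than 0 while inside <article>
        self.chrome_depth = 0   # greater than 0 while inside a chrome element
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif tag in CHROME_TAGS:
            self.chrome_depth += 1

    def handle_endtag(self, tag):
        if tag == "article":
            self.article_depth -= 1
        elif tag in CHROME_TAGS:
            self.chrome_depth -= 1

    def handle_data(self, data):
        # Keep text only when inside the article and outside any chrome.
        if self.article_depth > 0 and self.chrome_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_article(html: str) -> str:
    """Return the article's text, one chunk per line, with chrome removed."""
    parser = ArticleText()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical page, invented for this example.
page = (
    "<nav>Docs / Guides</nav>"
    "<article><h1>Install</h1><nav>Edit this page</nav>"
    "<p>Run the installer.</p></article>"
    "<footer>Copyright</footer>"
)
print(extract_article(page))
```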

Then SPA route changes. Docusaurus uses client-side navigation. Inject the button on first load and it vanishes when the user clicks a link. The plugin watches popstate, Docusaurus's own events, and URL changes, then re-injects on each route.

And mobile. Docusaurus collapses the TOC sidebar on small screens. The button needs to live somewhere visible without breaking the layout. Took a few iterations.

Try it

If you run a Docusaurus site, install it. If something's missing, open an issue.
