Andrei Mashukov

Posted on May 26

The Missing POP: How I Ported a Yul Contract to Huff by Reading Every Opcode

#solidity #ethereum #evm #mev

A war story about hand-managing the EVM stack, two words of litter left behind a CALL, and the debug trace that finally made the drift visible.

I had a contract that worked. Then I rewrote it in a second language so it would behave the same way, faster — and it didn't. The bug wasn't a crash. It was worse: the contract ran to completion, returned status 1, and the value it produced was quietly wrong.

This is the story of that bug, the tool that caught it, and what porting an opcode dispatcher from Yul to Huff taught me about the one thing Yul had been doing for me the whole time without ever saying so.

Two implementations of the same thing

The contract is an opcode dispatcher — it reads a packed byte-stream of commands and routes funds through multi-hop swap paths across seven DEX families. The whole thing is open source (adaptive-mev-router on GitHub), so every snippet below can be read in full context. The detail that matters for this story is that it ships twice: once in Yul as the reference implementation (MEV_V2.yul), and once in Huff as a hand-optimised port (MEV_V2.huff) with an O(1) jump-table dispatch.

Why two? Yul is readable and gives me a reference I can trust. Huff lets me shave the contract down opcode by opcode and control the dispatch path exactly. The deal is that both must behave identically — and the test suite enforces it. The harness loads both builds as two variants and runs every scenario against each:

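const fs = require("fs");

// huffBinPath and wrapRuntimeBytecode are defined elsewhere in the harness (not shown).
function loadVariants() {
  const MEVJson = require("../artifacts/contracts/MEV_V2.yul/MEV_V2.json");
  const huffRuntime = fs.readFileSync(huffBinPath, "utf8").trim();
  return [
    // Both variants share the ABI from the Yul artifact; only the bytecode differs.
    { name: "Yul",  abi: MEVJson.abi, bytecode: MEVJson.bytecode },
    { name: "Huff", abi: MEVJson.abi, bytecode: wrapRuntimeBytecode(huffRuntime) },
  ];
}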

loadVariants().forEach((variant) => {
  it(`MEV_V2 [${variant.name}]: ...`, async function () {
    // every scenario runs once for Yul, once for Huff
  });
});

If Yul and Huff ever disagree, CI goes red. That harness is the hero of this story. It turned a vague "something feels off" into a precise, reproducible failure. But before it could do that, I had to write the Huff version. And the Huff version is where I met the stack.
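One way to make "the two must agree" explicit is to collect each variant's observable result for a scenario and compare them directly. The sketch below is an assumption about how such a check could look, not code from the repository: findDisagreements and the result shape (plain objects with amounts as strings) are hypothetical, and the values are made up.

```javascript
// Hypothetical helper: given { variantName: result } for one scenario,
// return the names of variants whose result differs from the first one.
// Results are assumed to be plain JSON-serialisable objects with the same
// key order (amounts kept as strings).
function findDisagreements(resultsByVariant) {
  const [refName, ...others] = Object.keys(resultsByVariant);
  const reference = JSON.stringify(resultsByVariant[refName]);
  return others.filter(
    (name) => JSON.stringify(resultsByVariant[name]) !== reference
  );
}

// Made-up values for illustration only.
const agreeing = findDisagreements({
  Yul:  { status: 1, amountOut: "1000" },
  Huff: { status: 1, amountOut: "1000" },
});
const diverging = findDisagreements({
  Yul:  { status: 1, amountOut: "1000" },
  Huff: { status: 1, amountOut: "997" },
});
console.log(agreeing, diverging);
```

Note that both results in the second call carry status 1. A check on status alone would pass; only comparing the produced value exposes the disagreement.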

What Yul never told me

Here is the thing I didn't fully appreciate until I left it behind: Yul manages the stack for you.

Look at one swap handler in the Yul reference. This is the entire V2 adaptive swap, zero-for-one:

function swap_v2_adaptive_zfo(cursor) {
    let sig      := shr(224, calldataload(cursor))
    let feeBps   := and(shr(240, calldataload(add(cursor, 4))), 0xFFFF)
    let pair     := shr(96,  calldataload(add(cursor, 6)))
    let tokenIn  := shr(96,  calldataload(add(cursor, 26)))
    let amountIn := shr(144, calldataload(add(cursor, 46)))

    amountIn := resolve_amount(amountIn, tokenIn)            // 0 -> balanceOf(this, tokenIn)
    let amountOut := v2_compute_amount_out(pair, amountIn, feeBps, 1)

    transfer_token(tokenIn, pair, amountIn)
    swap_v2(sig, pair, 0, amountOut, address())
}

Six named values: five decoded from calldata, plus the computed amountOut. When I write swap_v2(sig, pair, 0, amountOut, address()), the compiler decides where sig, pair, amountOut live, in what order they get pushed, which DUP retrieves each one, and when they get cleaned up. I think in named values. The compiler thinks in stack slots. I never have to know the translation.

Huff removes that layer. In Huff you are the compiler's stack allocator. There are no names — there is a column of 32-byte words, and you address them by how far down they sit right now. Here is the same swap in the Huff port — and notice the comments running down the right side, the stack diagram after every single instruction:

// SWAP_V2 expects [sig, pair, amount0, amount1, to]
// zfo: amount0=0, amount1=amountOut
address                                  // [to, cursor, limit]
0x620 mload                              // [amountOut, to, cursor, limit]
0x00                                     // [0, amountOut, to, cursor, limit]
dup4 0x06 add calldataload 0x60 shr      // [pair, 0, amountOut, to, cursor, limit]
dup5 calldataload 0xe0 shr               // [sig, pair, 0, amountOut, to, cursor, limit]
SWAP_V2()                                // [cursor, limit]

Those comments are not decoration. In Huff the stack layout is the program state, and dup4 only fetches the right value if cursor is genuinely the fourth word from the top at that exact instruction. There is one more tell in that snippet, and it's the heart of this whole story: 0x620 mload.

When the stack isn't enough — and why that's the warning sign

In Yul, amountIn and amountOut are just locals; the compiler keeps them alive across the whole function for free. In the Huff port I couldn't do that. The V2 swap has to call getReserves() on the pair halfway through to compute amountOut — and that staticcall writes its result into scratch memory, and the reserve math needs a deep working stack of its own. Trying to also balance amountIn and amountOut on top of all that, reachable by dup, across dozens of intervening opcodes, is exactly the kind of bookkeeping that breaks.

So in Huff I spilled them to memory on purpose:

dup1 0x600 mstore     // save amountIn  — getReserves() is about to clobber scratch memory
...
dup1 0x620 mstore     // save amountOut — survive until the transfer + swap at the end

That decision — this value lives too long and travels too deep, put it in memory — is one Yul made silently for me every time. In Huff it's a conscious call, and getting it wrong is a real bug. Which brings me to the bug.

The macro that leaves litter on the stack

Most of the swap macros in the contract are clean: they take their inputs, push them into the right memory slots, and consume everything. SWAP_CURVE_EXEC is the textbook case — five inputs in, every one of them spent, nothing left behind.

Then there's the native-ETH variant. A Curve swap that sends ETH has to pass the amount twice: once written into memory as a call argument, and once as the actual msg.value of the call. Which means, unlike every other swap macro, it cannot just consume pool and amount — it has to keep them alive on the stack until the call itself. Here is the real macro:

#define macro SWAP_CURVE_ETH_EXEC() = takes(5) returns(0) {
    0xe0 shl 0x00 mstore   // sig<<224 at mem[0]. stack: [pool, sellId, buyId, amount]
    swap1 0x04 mstore      // sellId at mem[4]. stack: [pool, buyId, amount]
    swap1 0x24 mstore      // buyId at mem[36]. stack: [pool, amount]
    dup2 0x44 mstore       // amount at mem[68], keep both on stack. stack: [pool, amount]
    0x00 0x64 mstore       // minOut = 0. stack: [pool, amount]
    // call(gas, pool, amount, 0, 132, 0, 32) — amount as msg.value
    0x20 0x00 0x84 0x00    // stack: [0x00, 0x84, 0x00, 0x20, pool, amount]
    dup6 dup6              // stack: [pool, amount, 0x00, 0x84, 0x00, 0x20, pool, amount]
    gas call
    iszero err jumpi
    pop pop                // drop the original pool and amount that dup6 dup6 copied from
}

Read the last four lines slowly, because they are the whole problem in miniature.

dup6 dup6 reaches deep down the stack and copies pool and amount to the top, because call needs them there as arguments. call consumes its seven inputs and pushes one result. But the originals — the pool and amount that dup6 dup6 copied from — are still sitting down there. The call didn't touch them. They are litter. And the macro is declared returns(0): it promises to leave the stack exactly as deep as it found it. So the macro has to end with pop pop to sweep that litter away by hand.

That pop pop is not optional and it is not obvious. It exists only because of a dup that happened a handful of instructions earlier. Forget it, and the macro returns two words heavier than it claims to. Nothing reverts. The next opcode in the dispatcher just finds a stack two slots deeper than its comments assume, and every dup and swap it does from that point on reaches for the wrong neighbour.
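The takes/returns arithmetic can be checked mechanically. The sketch below is a hypothetical helper, not part of the repository: it sums the net stack effect of each token in the macro body (CALL pops seven and pushes one, MSTORE and JUMPI pop two, and so on) and reports the final depth. It assumes every hex literal is a push and that err is a jump label pushed as a destination.

```javascript
// Net stack effect (pushes minus pops) for the opcodes in SWAP_CURVE_ETH_EXEC.
const NET_EFFECT = new Map([
  ["shl", -1], ["mstore", -2], ["gas", 1], ["call", -6],
  ["iszero", 0], ["jumpi", -2], ["pop", -1],
]);
// Assumption: these bare words are jump labels, each pushed as a destination.
const LABELS = new Set(["err"]);

function netEffect(token) {
  if (/^0x[0-9a-f]+$/i.test(token)) return 1; // hex literal: a push
  if (/^dup\d+$/.test(token)) return 1;        // DUPn copies, it never moves
  if (/^swap\d+$/.test(token)) return 0;       // SWAPn only reorders
  if (LABELS.has(token)) return 1;
  if (NET_EFFECT.has(token)) return NET_EFFECT.get(token);
  throw new Error(`unknown token: ${token}`);
}

// Depth of the stack after the body, starting from `takes` input words.
function finalDepth(takes, tokens) {
  let depth = takes;
  for (const token of tokens) {
    depth += netEffect(token);
    if (depth < 0) throw new Error(`stack underflow at ${token}`); // crude guard
  }
  return depth;
}

// The macro body from the article, without its trailing cleanup.
const body = `
  0xe0 shl 0x00 mstore
  swap1 0x04 mstore
  swap1 0x24 mstore
  dup2 0x44 mstore
  0x00 0x64 mstore
  0x20 0x00 0x84 0x00
  dup6 dup6
  gas call
  iszero err jumpi
`.trim().split(/\s+/);

const withCleanup = finalDepth(5, [...body, "pop", "pop"]);
const withoutCleanup = finalDepth(5, body);
console.log({ withCleanup, withoutCleanup });
```

With the trailing pop pop the body ends at depth 0, which is what returns(0) promises. Without it the body ends at depth 2: the two stranded originals.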

The bug that didn't crash

That is exactly the bug I shipped.

Not in SWAP_CURVE_ETH_EXEC itself — that one I'd already gotten right, and the pop pop comment is me having learned the lesson once. The bug was in a different macro, one I wrote later, where I did the same dup-deep-then-call pattern and simply did not realise it had left two originals stranded. I'd internalised "call consumes its arguments" and stopped there. But call consumes the copies dup puts on top. It has nothing to say about the originals dup copied from. Those are mine to clean up, and that time I didn't.

Here is what made it vicious: nothing reverted.

The EVM doesn't know a leftover pool from a meaningful value from any other 32-byte word. It's all just words. The macro returned two words heavier than its returns(0) signature claimed. The dispatcher continued, every stack comment from that point on now describing a stack two slots shallower than reality, and the next dup fetched a word two places off from the one I wanted — a different address entirely. The swap was issued with a wrong argument, the transaction ran to the end, and it returned status 1. Success.
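To make "two places off" concrete, here is a small symbolic sketch. It borrows the stack layout from the V2 swap snippet earlier purely as an illustration; the article does not say which macro or which dup was actually involved, so the drifted layout below is hypothetical.

```javascript
// Symbolic stack with the top at index 0, matching the article's diagrams.
// DUPn copies the n-th word from the top and pushes the copy.
function dup(stack, n) {
  if (n < 1 || n > stack.length) throw new Error(`dup${n}: stack too shallow`);
  return [stack[n - 1], ...stack];
}

// What the comments say the stack is just before `dup4`.
const documented = ["0", "amountOut", "to", "cursor", "limit"];

// The same code running after an earlier macro left two words behind.
// The litter sits below the three fresh pushes and above cursor.
const drifted = ["0", "amountOut", "to", "pool", "amount", "cursor", "limit"];

const intended = dup(documented, 4)[0];
const actual = dup(drifted, 4)[0];
console.log({ intended, actual });
```

The instruction is identical in both runs. Only the stack underneath it changed, so dup4 returns the leftover pool where the comments promise cursor, and nothing in the EVM objects.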

The Yul variant of the same scenario returned the correct result. The Huff variant returned a different one. The forEach harness caught the divergence and turned CI red — but all it could tell me was that the two disagreed, not where. I had a contract producing a wrong answer with no revert, no error, no line number.

Reading every opcode

You cannot reason your way out of this from the source. The whole problem is that your reasoning about the stack is what's broken — re-reading the macro just reproduces the same wrong mental model. You need ground truth.

Ground truth is the execution trace. I ran the failing Huff scenario under a debug tracer and dumped the step-by-step opcode log: every instruction executed, and crucially, the stack contents after each one.

Then I did something tedious and completely worth it. I walked the trace one opcode at a time, and beside each line I wrote what I expected the stack to be. Two columns: what the trace said, what I thought.

For a while the columns matched — PUSH, PUSH, CALLDATALOAD, fine. Then I reached the CALL inside the offending macro. On the line after it, the trace still carried two words my map had already discarded. The columns diverged by exactly two slots, and they never re-converged — every subsequent line was off by the same two.

That was the whole bug, sitting in the diff between two columns: a missing pop pop. Two opcodes. The fix took seconds. Finding it took the trace.
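The two-column walk can be sketched in code. This is a minimal illustration under stated assumptions, not the tooling I used: each trace step is assumed to look like { op, stack } with the stack captured after the instruction ran, and the expected column is reduced to a list of depths. Real tracers differ in field names and in whether they report the stack before or after a step, so adapt accordingly. The toy trace is made up.

```javascript
// Return the first step where the traced stack depth disagrees with the
// depth I expected, or null if the two columns match all the way down.
function firstDivergence(trace, expectedDepths) {
  const steps = Math.min(trace.length, expectedDepths.length);
  for (let i = 0; i < steps; i++) {
    const actual = trace[i].stack.length;
    const expected = expectedDepths[i];
    if (actual !== expected) {
      return { step: i, op: trace[i].op, expected, actual, drift: actual - expected };
    }
  }
  return null;
}

// Placeholder stack of n words; the contents do not matter for a depth check.
const words = (n) => Array.from({ length: n }, (_, i) => `w${i}`);

// Made-up trace of a dup-deep-then-call sequence.
const trace = [
  { op: "DUP6",   stack: words(7) },
  { op: "DUP6",   stack: words(8) },
  { op: "GAS",    stack: words(9) },
  { op: "CALL",   stack: words(3) }, // success flag plus the two originals
  { op: "ISZERO", stack: words(3) },
];
// My (wrong) column: I had already discarded the originals after CALL.
const expectedDepths = [7, 8, 9, 1, 1];

const divergence = firstDivergence(trace, expectedDepths);
console.log(divergence);
```

Comparing depths rather than values is deliberate. Every value is just a 32-byte word, but a depth that is off by a constant from one step onward points straight at the instruction where the model broke.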

The habit that was already half there

Here's the part I'm slightly embarrassed by: the fix wasn't a new technique. It was doing the thing I was already half-doing, properly.

My Huff already had stack comments. Some of them were even notes-to-self mid-calculation — at one point in the V2 amountOut math I'd literally written, inline:

0x2710 sub   // stack was [feeBps, ...], push 0x2710 -> [0x2710, feeBps, ...],
             // sub -> 0x2710-feeBps. Correct!

That comment is me reverse-engineering my own stack in real time and verifying it. I had the discipline in places. What I lacked was the discipline everywhere — and a missing pop pop is precisely what a lapse looks like. In the macro that bit me, I'd written the stack diagram down the right margin, then changed the instructions during a later edit and didn't re-derive the diagram beneath the call. The comment said one thing; the opcodes did another.

So the lesson wasn't "start commenting the stack." It was: the stack comment is code. It is the source of truth your dup/swap indices are read from, and it has to be updated with the same discipline as the instructions themselves. An out-of-date stack comment is exactly as dangerous as an out-of-date mental model — because it is one, just written down.

Concretely, the rules I now hold myself to:

Every line that touches the stack updates the diagram on that line. Not the top of the macro — every line. Top-of-stack on the left.
When you need a value, read its depth off the current comment and count. Never count from memory. The comment is authoritative; the dup index just obeys it.
dup copies; it does not move. Every dup-deep-then-call pattern leaves the originals stranded below — call only consumes the copies on top. If you dup to reach call arguments, you almost certainly owe a matching pop afterwards. SWAP_CURVE_ETH_EXEC's trailing pop pop is that debt, paid.
A macro's takes/returns signature is a contract — verify it. returns(0) means the stack must be exactly as deep on exit as on entry. Walk the macro and prove it. A macro that secretly returns two words heavy corrupts every caller downstream.
When a value lives long or travels deep, spill it to memory — like 0x600/0x620 for amountIn/amountOut. If keeping it on the stack feels fragile, that feeling is correct; that's the Yul compiler's job knocking, and in Huff the job is yours.
A wrong answer with clean execution? Suspect the stack first. A revert usually means a bad jump or a failed call. A wrong result with status 1 is the signature of a stack that drifted — leftover litter, or a dup/swap that grabbed the wrong neighbour.
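The first rule can be partly mechanised. The sketch below is a hypothetical linter, not part of the repository. It reads lines of the form "ops // [stack]", computes each line's net stack effect from a small opcode table, and flags any line whose written diagram depth disagrees with the computed one. The table covers only the opcodes in the V2 swap snippet, and the SWAP_V2() entry (five inputs consumed) is taken from that snippet's own comment.

```javascript
// Net stack effect for the opcodes used in the V2 swap snippet.
const OPCODE_EFFECT = new Map([
  ["address", 1], ["mload", 0], ["add", -1], ["calldataload", 0], ["shr", -1],
]);
// Assumption from the snippet's comment: SWAP_V2 consumes its five inputs.
const MACRO_EFFECT = new Map([["SWAP_V2()", -5]]);

function effectOf(token) {
  if (/^0x[0-9a-f]+$/i.test(token)) return 1; // hex literal: a push
  if (/^dup\d+$/.test(token)) return 1;
  if (/^swap\d+$/.test(token)) return 0;
  if (OPCODE_EFFECT.has(token)) return OPCODE_EFFECT.get(token);
  if (MACRO_EFFECT.has(token)) return MACRO_EFFECT.get(token);
  throw new Error(`unknown token: ${token}`);
}

// Returns one entry per line whose diagram depth does not follow from the
// previous diagram plus the opcodes on that line.
function lintStackComments(source) {
  const problems = [];
  let prevDepth = null;
  source.split("\n").forEach((line, index) => {
    const m = line.match(/^(.*?)\/\/\s*\[(.*)\]\s*$/);
    if (!m) return; // no stack diagram on this line
    const ops = m[1].trim().split(/\s+/).filter(Boolean);
    const written = m[2].trim() === "" ? 0 : m[2].split(",").length;
    const delta = ops.reduce((sum, token) => sum + effectOf(token), 0);
    if (prevDepth !== null && prevDepth + delta !== written) {
      problems.push({ line: index + 1, computed: prevDepth + delta, written });
    }
    prevDepth = written;
  });
  return problems;
}

const clean = [
  "address                                  // [to, cursor, limit]",
  "0x620 mload                              // [amountOut, to, cursor, limit]",
  "0x00                                     // [0, amountOut, to, cursor, limit]",
  "dup4 0x06 add calldataload 0x60 shr      // [pair, 0, amountOut, to, cursor, limit]",
  "dup5 calldataload 0xe0 shr               // [sig, pair, 0, amountOut, to, cursor, limit]",
  "SWAP_V2()                                // [cursor, limit]",
].join("\n");

// Simulate a comment that was not re-derived after an edit.
const stale = clean.replace("// [cursor, limit]", "// [limit]");

const cleanProblems = lintStackComments(clean);
const staleProblems = lintStackComments(stale);
console.log(cleanProblems, staleProblems);
```

This checks depth only, so a diagram with the right length but the wrong labels would still pass. It is a safety net under the habit, not a replacement for it.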

Why the trace beat everything else

The bug was an invisible disagreement between my model of the stack and the EVM's actual stack. Source review can't fix that — the review is done by the same broken model that wrote the bug. A debugger that shows only values doesn't help much either, because every value is a 32-byte word and they all look alike; a wrong address is indistinguishable from a right one until you know which slot it should have come from.

What the opcode-level trace gives you is the shape of the stack at every step, independent of your assumptions. It's the one artifact in the toolchain that doesn't share your mental model. Line it up against your expectations and the divergence point is the bug — not "near the bug," the bug, the exact instruction where reality and intention split.

What I'd tell anyone starting with Huff

Huff is wonderful for what it is: total control, no compiler between you and the bytecode, every opcode chosen by you. But "no compiler between you and the bytecode" means the compiler's stack allocator is now a job on your desk, and it is a real job with a real failure mode.

So:

Respect what the high-level language was doing for you. Yul's stack management isn't a convenience — it's an entire correctness layer. Take it over deliberately.
Maintain the stack diagram as code. Inline, every line, updated as rigorously as the instructions. Your dup/swap indices are reads from that diagram.
When behaviour diverges and nothing crashes, go straight to the opcode trace. Don't re-read the source. Walk the trace beside your expectations and find the slot where they part.
Keep a reference implementation and test against it relentlessly. The Yul-vs-Huff forEach harness didn't find the bug for me, but it's the reason I knew there was one. An executable specification you can't argue with beats any amount of careful reading.

The payoff: Yul vs Huff, measured

Once the two implementations agreed on every scenario, the harness handed me something else for free. Every scenario runs against both variants and logs receipt.gasUsed, so I got a direct, apples-to-apples gas comparison: same test, same calldata, two compilers.

Huff's hand-built O(1) jump table wins consistently on dispatcher-heavy opcodes; on I/O-dominated swaps the two land within a handful of gas. A selection of measured numbers (Solidity 0.8.28, EVM Cancun, viaIR, optimizer runs=200):

Operation	Yul (gas)	Huff (gas)	Δ (Huff − Yul, gas)
`0x02` V3 swap zfo (`amount=0`)	56 851	56 800	−51
`0x04` Balancer V2 zfo (`amount=0`)	81 794	81 644	−150
`0x08` `wrap_weth` (adaptive)	37 907	37 782	−125
`0x0A` `unwrap_weth` (adaptive)	34 097	33 921	−176
`0x0B` `transfer_eth`	31 454	31 290	−164
`0x0C` `transfer_erc20`	50 249	50 009	−240
`0x0D` `balance_check`	27 756	27 504	−252
`0x0E` `sweep`	33 775	33 458	−317
`0x19` Balancer V1 zfo (`amount=0`)	82 730	82 257	−473
`0x1A` Balancer V1 ofz	82 880	82 313	−567
`0x1B` Fluid zfo (`amount=0`)	72 234	71 624	−610
`0x1C` Fluid ofz (`amount=0`)	55 171	54 535	−636
`0x1D` DODO zfo (`amount=0`)	86 338	85 696	−642
`0x1E` DODO ofz (`amount=0`)	69 298	68 640	−658
V2 flash 3-hop chain	146 486	145 781	−705
Sandwich backrun (resolve + check + sweep)	89 950	89 152	−798

Negative Δ means Huff is cheaper. The gap widens exactly where you'd expect: the more dispatching a scenario does relative to actual I/O, the more the hand-built jump table pulls ahead. The 3-hop flash chain and the sandwich backrun — the dispatcher-heaviest scenarios — show the biggest savings, around 700–800 gas.
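The Δ column is plain subtraction over the logged gasUsed values. A minimal sketch of that step, using three rows from the table above; the gasDelta helper and the relative-percentage field are my own additions for illustration, not something the harness is known to compute.

```javascript
// Difference in gas between the two variants for one scenario.
// Negative delta means the Huff build used less gas.
function gasDelta(yulGas, huffGas) {
  const delta = huffGas - yulGas;
  const percent = Number(((delta / yulGas) * 100).toFixed(2));
  return { delta, percent };
}

// Three rows copied from the table above.
const rows = [
  { name: "0x02 V3 swap zfo", yul: 56851, huff: 56800 },
  { name: "0x0E sweep", yul: 33775, huff: 33458 },
  { name: "Sandwich backrun", yul: 89950, huff: 89152 },
];

const report = rows.map((row) => ({
  name: row.name,
  ...gasDelta(row.yul, row.huff),
}));
console.log(report);
```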

But notice what every row in that table depends on. The numbers are only meaningful because the behaviour column is identical first. A faster implementation that returns a different answer isn't an optimisation — it's the bug I spent this whole article describing. The gas win is real, but it's a footnote to the actual achievement: two independent implementations, in two languages, that a test suite cannot tell apart.

The fix to my bug was two opcodes: a pop pop that should have been there and wasn't. I think about that a lot. The cost of the mistake and the cost of the fix were wildly mismatched, and the only thing that closed the gap between them was being willing to read every single opcode until the stack told me the truth.

In Huff, the stack always tells you the truth. You just have to be looking at it instead of at your idea of it.

The full contract — both the Yul reference and the Huff port, the forEach parity harness, the fork tests, and the gas-diff CI — is on GitHub: github.com/AndreyMashukov/adaptive-mev-router. The SWAP_CURVE_ETH_EXEC macro in this article lives in contracts/MEV_V2.huff; its Yul counterpart is in contracts/MEV_V2.yul. Stars and issues welcome — and if you spot a stack comment that's drifted, you now know exactly what to look for.
