
Obinna Duru

Stop Guessing, Start Proving: A Guide to Stateful Fuzzing in Foundry

We are building MilestoneCrowdfundUpgradeable: a smart contract that holds donor funds in escrow and releases them only when real-world milestones are verified. In our last post, we designed the engine for this protocol. We built a theoretical fortress guarded by strict mathematical laws.

But in Web3, a theoretical fortress isn't good enough. If you write a web app and it has a bug, a button doesn't work. If you write a smart contract and it has a bug, people lose their life savings. Testing isn't an afterthought; it is a matter of life or death for your protocol.

To test this protocol, I didn't just write a few scripts to see if the functions worked. I built a mathematical proving ground. In this post, I want to show you exactly how I did that. We are going to talk about why I switched to Foundry, how to use a "fuzzer" to find your blind spots, the magic of Ghost Variables, and the hardest lesson I learned about testing state machines.

The Shift: Why I Chose Foundry Over Hardhat

When I write tests in Hardhat, I am writing JavaScript that talks to a Solidity contract. There is a translation layer between my brain and the blockchain. I find myself fighting JavaScript's async/await, weird big-number edge cases, and ABI encoding at the exact same time I am trying to think about protocol correctness. It's exhausting.

Foundry breaks that barrier. In Foundry, your tests are written in Solidity. The EVM (the actual engine that runs Ethereum) is the test runtime. That means there is no JavaScript middleman translating your code. If I want to pretend to be a user named Alice, I just write vm.prank(alice). If I want to fast-forward time by 30 days, I write vm.warp(block.timestamp + 30 days). It reads like pseudocode.
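
To make that concrete, here is a minimal sketch of what a Foundry test reads like. The `ICrowdfund` interface and its `pledgeETH`/`refund` functions are stand-ins for this post's contract, not its real API:

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

import {Test} from "forge-std/Test.sol";

// Assumed minimal surface of the protocol for this sketch.
interface ICrowdfund {
    function pledgeETH() external payable;
    function refund() external;
}

contract CheatcodeSketch is Test {
    ICrowdfund crowdfund; // deployed in setUp() in a real suite
    address alice = makeAddr("alice");

    function test_refundAfterDeadline() public {
        vm.deal(alice, 1 ether);            // fund Alice's account

        vm.prank(alice);                    // the next call is made as Alice
        crowdfund.pledgeETH{value: 1 ether}();

        vm.warp(block.timestamp + 30 days); // jump past the campaign deadline

        vm.prank(alice);
        crowdfund.refund();                 // claim the refund as Alice
    }
}
```

Every line maps directly to the scenario in your head: fund Alice, act as Alice, skip a month, act as Alice again. No async, no ABI plumbing.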

But the single biggest reason I switched, the one that changed how I think about testing entirely, is exactness. In Hardhat, you approximate gas. In Foundry, the gas numbers in your test output are the exact gas numbers you will see on mainnet. When you're designing a crowdfunding protocol where everyday donors are paying for every pledge call, that exactness matters enormously.

Hardhat is an incredible tool for deployment and scripting, but Foundry was built from the ground up for testing. That is not a knock on Hardhat; it is just not the same tool.

Stateless Fuzzing: Why Random Beats Hardcoded

Here is the problem with writing pledge(100) in your test suite. Your brain picked the number 100. Your brain also wrote the smart contract. So your test is testing your own assumptions, not the contract's actual behavior. You have the same blind spots on both sides of the assertion.

Stateless Fuzzing breaks that loop. A fuzzer is a robot that throws thousands of random, chaotic inputs at your contract to see what breaks.

For example, I wrote a function to calculate the platform fee (_calculateFeeAndNet). The obvious test is: gross = 1000, fee = 5%, expect net = 950. That passes. Fine.

But what does the Foundry fuzzer actually throw at it?

  • gross = 1, fee = 4.99%. (Does integer division correctly floor the fee to zero, leaving the full amount as the net?)
  • gross = type(uint256).max. (Does it trigger an overflow crash before the math finishes?)
  • fee = 0. (Does the contract actually return (gross, 0), or does it accidentally return (0, 0)?)

None of those are inputs I would naturally write. The fuzzer doesn't have my assumptions. That's its superpower.
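
Here is a hedged sketch of what that fuzz test looks like. The fee helper is inlined for illustration; the real `_calculateFeeAndNet` may differ in signature and rounding rules:

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

import {Test} from "forge-std/Test.sol";

contract FeeMathFuzz is Test {
    uint256 constant BPS_DENOMINATOR = 10_000;

    // Illustrative reimplementation of the fee math.
    function calculateFeeAndNet(uint256 gross, uint256 feeBps)
        internal
        pure
        returns (uint256 net, uint256 fee)
    {
        fee = (gross * feeBps) / BPS_DENOMINATOR; // integer division floors toward zero
        net = gross - fee;
    }

    /// Foundry fuzzes `gross` and `feeBps` with fresh random values on every run.
    function testFuzz_feeAndNetConserveValue(uint256 gross, uint256 feeBps) public {
        feeBps = bound(feeBps, 0, BPS_DENOMINATOR);                   // fee can never exceed 100%
        gross = bound(gross, 0, type(uint256).max / BPS_DENOMINATOR); // rule out mul overflow

        (uint256 net, uint256 fee) = calculateFeeAndNet(gross, feeBps);

        assertEq(net + fee, gross, "value must be conserved");
        assertLe(fee, gross, "fee can never exceed gross");
    }
}
```

Notice the test never asserts a specific number like 950. It asserts a property that must hold for every input: nothing is created, nothing is lost.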

This was crucial for my Basis Points (BPS) math. I wrote a helper to divide 10,000 BPS across n milestones. I could have hardcoded an array of 3 milestones and called it done. Instead, the fuzzer randomly varies n on every single run. Eventually, it tries n=1 (a single milestone getting 100%) and n=20 (20 milestones, each getting ~500 BPS, with the last one absorbing all the mathematical rounding dust). Those extreme edge cases stress the math in ways a handwritten array never would.
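
The property test for that helper can be sketched like this. `splitBps` here is an illustrative reimplementation, not the post's actual helper:

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

import {Test} from "forge-std/Test.sol";

contract BpsSplitFuzz is Test {
    uint256 constant TOTAL_BPS = 10_000;

    // Split TOTAL_BPS evenly across n milestones; the last one absorbs the dust.
    function splitBps(uint256 n) internal pure returns (uint256[] memory bps) {
        bps = new uint256[](n);
        uint256 even = TOTAL_BPS / n;
        for (uint256 i = 0; i < n; i++) bps[i] = even;
        bps[n - 1] += TOTAL_BPS - even * n; // rounding dust goes to the last milestone
    }

    /// The fuzzer varies n on every run, eventually hitting n = 1 and n = 20.
    function testFuzz_bpsAlwaysSumToTenThousand(uint256 n) public {
        n = bound(n, 1, 20);
        uint256[] memory bps = splitBps(n);

        uint256 sum;
        for (uint256 i = 0; i < bps.length; i++) sum += bps[i];

        assertEq(sum, TOTAL_BPS, "BPS must always sum to 10,000");
    }
}
```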

Stateful Fuzzing & Ghost Variables: The Bank Auditor

Stateless fuzzing tests one function at a time. But a crowdfunding contract isn't a calculator, it's a state machine. The bug is almost never in a single function in isolation. It's in the sequence. pledgeETH is fine. withdrawMilestone is fine. But what happens if the sequence is pledgeETH -> setFee -> pledgeETH -> finalize -> withdrawMilestone? That is where the accounting drifts.

To test sequences, we use Stateful Fuzzing.

To explain how this works, imagine you are a Bank Auditor. You want to verify that a bank teller never gives out more money than was deposited. You could stand at the counter and check after every single transaction. But the teller's official ledger is locked inside a vault you can't open. You can only see the cash they hand to customers.

So, you bring your own notepad. Every time someone deposits $100, you write "+$100" on your notepad. Every time someone withdraws, you write "-$50". At the end of the day, your notepad says the vault should hold exactly $50. If the vault actually holds $40, something went wrong in the sequence of the day's transactions.

This diagram illustrates a fuzzing-based testing workflow for a smart contract acting as a vault.

In my Foundry tests, that "notepad" is called a Ghost Variable.

The actual pledge ledger inside my smart contract (_pledges) is private. The test suite can't read it easily. So, I built a Handler contract that acts as the Auditor. It carries ghost_netPledge, its own running tally of what the contract should contain. Every time a random pledge succeeds, the handler writes it down. Every time a refund succeeds, it zeroes that entry.

At the end of thousands of random actions, the test asks: Does the real smart contract match my notepad? Imagine a scenario where Alice pledges 1 ETH, but a bug prevents the platform fee from being deducted. My Ghost Variable notepad records 0.95 ETH (the net), but the contract's getPledge() returns 1.00 ETH. The moment they diverge, the invariant fires. The fuzzer stops and prints the exact 12-call sequence that broke the math. That sequence is my counterexample. That's the bug report.
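
Here is a hedged sketch of that Handler pattern. The `ICrowdfund` interface, the `feeBps()` accessor, and the fee formula are assumptions standing in for the real contract:

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

import {Test} from "forge-std/Test.sol";

// Assumed surface of the protocol for this sketch.
interface ICrowdfund {
    function pledgeETH() external payable;
    function feeBps() external view returns (uint256);
}

// The auditor: the fuzzer calls this handler, never the contract directly,
// and the handler keeps its own notepad of what the contract should hold.
contract Handler is Test {
    ICrowdfund public crowdfund;

    mapping(address => uint256) public ghost_netPledge; // per-donor notepad
    uint256 public ghost_totalNet;                      // running total

    constructor(ICrowdfund _crowdfund) {
        crowdfund = _crowdfund;
    }

    function pledge(uint96 amount) external {
        amount = uint96(bound(amount, 1 wei, 100 ether));
        vm.deal(msg.sender, amount);

        vm.prank(msg.sender);
        crowdfund.pledgeETH{value: amount}();

        // Mirror the contract's fee math on our own notepad.
        uint256 fee = (uint256(amount) * crowdfund.feeBps()) / 10_000;
        ghost_netPledge[msg.sender] += amount - fee;
        ghost_totalNet += amount - fee;
    }
}
```

The invariant test then points the fuzzer at the handler and, after every random sequence, compares the contract's real state against `ghost_totalNet`. The notepad and the vault must agree.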

The Struggle: 4,495 Seconds of Patience

Test Results Summary: The fuzzing campaign completed successfully, 163 tests passed, 0 failed

Here is the part nobody talks about in tutorials.

You can see it in the screenshot from my terminal. 4,495 seconds. That is over an hour and fifteen minutes of my laptop running at full capacity just to complete one single test run.

That number tells the real story of stateful fuzzing. My fuzzer ran this long because of the rigorous constraints I set. Here is exactly how I configured it. runs is how many random sequences the fuzzer invents, depth is how many calls deep each sequence goes, and the seed ensures the run is mathematically reproducible so I can track down bugs:

```toml
[invariant]
runs = 1000
depth = 300
fail_on_revert = false
call_override_addr = false

[fuzz]
runs = 1000
max_test_rejects = 65536
seed = "0x63726f776466756e64"
```

The fuzzer is not running one test, it is inventing thousands of random call sequences, executing them, and checking whether the math breaks. That computational cost is the point. You are paying for thoroughness with time.

But the truly humbling part is the feedback loop.

When you write a normal unit test and something breaks, you fix it and rerun it in seconds. With a stateful fuzz suite, the cycle feels like this:
Find a bug -> fix the contract -> wait 75 minutes -> find another bug -> fix -> wait another 75 minutes.

Now, to be fair to Foundry, it does provide a lifesaver here. When a stateful fuzz run fails, Foundry gives you the exact seed and call sequence, allowing you to replay just that one failed scenario in seconds without having to rerun the entire 75-minute suite. But the broader lesson remains.

You learn very quickly to think carefully before you type. Every structural change carries a heavy price tag in computational validation. That constraint made me a more deliberate engineer. I stopped making small speculative fixes and started reasoning through the problem fully on paper before touching the code.

The Coverage Illusion

Test Results Summary: The fuzzing campaign completed successfully, 163 tests passed, 0 failed


Now look at the numbers in that screenshot more carefully.

MilestoneCrowdfundUpgradeable.sol - 100% line coverage, 100% function coverage.

A junior developer sees that and thinks: We are done, the protocol is safe. I want to push back on that directly. 100% coverage does not mean 100% secure. Not even close.

Coverage only tells you that the fuzzer visited every line. It does not tell you whether the mathematical rules governing those lines can be broken by a clever sequence of calls. A line can be executed ten thousand times and still contain an accounting bug that only surfaces on the ten-thousand-and-first call in a specific order.

This is why the invariants matter more than the coverage number. The invariants are the rules that must never break:

  • totalWithdrawn must never exceed totalRaised; otherwise the creator is stealing from future refund claimants.
  • sum(milestoneBps) must always equal 10,000; otherwise the math will trap dust in the contract or fail to release the full amount.
  • totalWithdrawn + totalRefunded must never exceed totalRaised; otherwise the contract is paying out money that was never deposited, meaning it has become insolvent.

If those rules hold across thousands of random sequences, the protocol math is battle-tested. If coverage is 100% but an invariant breaks, you have a vulnerable protocol with a false sense of security. I would rather have 80% coverage and unbreakable invariants than 100% coverage and uninspected state transitions.
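
Those three rules translate almost word for word into Foundry invariant functions. In this sketch the read-only interface (`totalRaised`, `totalWithdrawn`, `totalRefunded`, `milestoneCount`, `milestoneBps`) is an assumption about the contract's accessors:

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

import {Test} from "forge-std/Test.sol";

// Assumed read-only surface of the protocol; the real accessor names may differ.
interface ICrowdfundView {
    function totalRaised() external view returns (uint256);
    function totalWithdrawn() external view returns (uint256);
    function totalRefunded() external view returns (uint256);
    function milestoneCount() external view returns (uint256);
    function milestoneBps(uint256 i) external view returns (uint256);
}

contract CrowdfundInvariants is Test {
    ICrowdfundView crowdfund; // wired up in setUp() in the real suite

    // The creator can never take out more than was put in.
    function invariant_withdrawNeverExceedsRaised() public {
        assertLe(crowdfund.totalWithdrawn(), crowdfund.totalRaised());
    }

    // Milestone percentages always account for exactly 100% of the funds.
    function invariant_milestoneBpsSumToTenThousand() public {
        uint256 sum;
        for (uint256 i = 0; i < crowdfund.milestoneCount(); i++) {
            sum += crowdfund.milestoneBps(i);
        }
        assertEq(sum, 10_000);
    }

    // Withdrawals plus refunds can never exceed deposits: solvency.
    function invariant_solvency() public {
        assertLe(
            crowdfund.totalWithdrawn() + crowdfund.totalRefunded(),
            crowdfund.totalRaised()
        );
    }
}
```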

The Advice I Would Give My Fellow Dev

Stop writing tests that only ask "did this function return the right number?"

Start writing tests that ask "can this sequence of 50 random actions, executed by strangers in any order, break the rules this protocol was built on?"

Unit tests are essential, do not skip them. But they test your assumptions. The fuzzer tests your blind spots. Once you understand state machines and invariants, you stop guessing whether your protocol is safe and start proving it. That shift in thinking from hoping the code is correct to mathematically verifying it is the difference between a developer and an engineer.

The argument for trusting your test suite is never complete until you prove it works. Before I trusted my invariants, I deliberately broke my own contract. I went into pledgeETH and changed c.totalRaised += net to c.totalRaised += gross. I ran the suite. Instantly, Invariant 6 fired and the test failed. That is the difference between claiming your alarm system works and proving it by tripping it yourself.

The 4,495 seconds were worth every one.

Your protocol can pass every invariant and still be drained in a single transaction. In the next post, we look at how.
