The Bug That Silently Broke My Entire Blockchain

How a single function rejected "trailing bytes" and made every block commit with zero transactions

#ai #programming #opensource #blockchain

I spent two days debugging why my from-scratch Layer 1 blockchain committed every block with zero transactions, despite the mempool accepting them perfectly.

This is the story of how I found it, what it taught me, and why silent failures are the hardest bugs in distributed systems.

The setup

I'm building NOVAI, a Layer 1 blockchain in Rust with HotStuff BFT consensus. No forks, no frameworks. Every crate written from scratch. Four validators running on a local devnet, producing blocks at around 75 per second.

The chain worked perfectly. Blocks committed. QCs formed. Validators voted. Everything was green.

Then I added transaction support. And every single block committed with tx_count=0.

The symptoms

The RPC endpoint accepted every transaction I submitted; none were rejected. The mempool inserted them. drain_ready pulled them into proposed blocks. But every committed block contained zero transactions.

No error logs. No panics. No warnings. The chain just kept producing empty blocks as if nothing was wrong.

The investigation

I started with the obvious: is the mempool shared between the RPC thread and the consensus loop? I compared Arc pointer addresses. Same instance. Same mempool.
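The pointer comparison can be done with the standard library's Arc::ptr_eq rather than by printing addresses. A minimal sketch, assuming the mempool is shared as Arc<Mutex<...>>; the Mempool struct and the same_mempool helper are hypothetical stand-ins, not NOVAI's real types:

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the real mempool type.
#[derive(Default)]
pub struct Mempool {
    pub pending: Vec<Vec<u8>>,
}

/// Returns true when both handles point at the same allocation,
/// i.e. the RPC side and the consensus side share one mempool.
pub fn same_mempool(a: &Arc<Mutex<Mempool>>, b: &Arc<Mutex<Mempool>>) -> bool {
    Arc::ptr_eq(a, b)
}
```

Two handles cloned from the same Arc compare equal; two separately constructed mempools do not, which is the "dual instance" case this check rules out.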

Then I checked timing. Maybe the leader's block was being replaced by an empty one from a different leader after a timeout. I found a race condition in the timeout handler where round_start_time could be read before the state lock was acquired. Fixed it. Blocks still empty.

Next hypothesis: only node 0 had the RPC endpoint. The other three validators had empty mempools. When they were leader (75% of the time), they proposed empty blocks. I added transaction gossip so all validators share transactions over P2P. Still empty.

I added diagnostic logging at every stage of the pipeline. PROPOSE_DIAG, COMMIT_DIAG, VERIFY_DIAG, QC_DIAG. Every block showed tx_count=0 at the proposal stage. Transactions were being drained from the mempool into the proposed block, but somehow vanishing before the block reached other validators.

The pattern

Then I noticed something. The chain ran perfectly at around 75 blocks per second with empty blocks. But the moment a block contained even one transaction, the chain stalled. Timeouts fired. Round numbers escalated. No recovery.

This wasn't a "transactions get lost" bug. This was a "transactions kill consensus" bug.

The root cause

Deep in the codec layer, there was a function called decode_tx_v1_signed(). It decoded a single transaction from a byte buffer, then checked if there were any remaining bytes. If so, it rejected the input as "trailing bytes."

This is correct behavior when decoding a standalone transaction from an RPC call. One transaction, one buffer, no leftovers.

But inside decode_block_v1(), the buffer contains multiple transactions concatenated together. The decoder would parse the first transaction, treat the transactions that followed as "trailing bytes," and return an error that never showed up in any log.
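To make the failure mode concrete, here is a toy reproduction. It is a sketch under stated assumptions, not NOVAI's codec: the wire format (a 4-byte little-endian length followed by the payload) and every name in it are made up for illustration, and the real block layout is not shown in this post.

```rust
#[derive(Debug, PartialEq)]
pub enum DecodeError {
    UnexpectedEof,
    TrailingBytes,
}

#[derive(Debug, PartialEq)]
pub struct Tx {
    pub payload: Vec<u8>,
}

/// Toy encoding (an assumption): 4-byte little-endian length, then the payload.
pub fn encode_tx(tx: &Tx, out: &mut Vec<u8>) {
    out.extend_from_slice(&(tx.payload.len() as u32).to_le_bytes());
    out.extend_from_slice(&tx.payload);
}

/// Strict decoder: the buffer must hold exactly one transaction.
pub fn decode_tx_strict(buf: &[u8]) -> Result<Tx, DecodeError> {
    let header = buf.get(..4).ok_or(DecodeError::UnexpectedEof)?;
    let len = u32::from_le_bytes([header[0], header[1], header[2], header[3]]) as usize;
    let end = 4usize.checked_add(len).ok_or(DecodeError::UnexpectedEof)?;
    let payload = buf.get(4..end).ok_or(DecodeError::UnexpectedEof)?.to_vec();
    if end != buf.len() {
        // Correct for a standalone message, wrong inside a container.
        return Err(DecodeError::TrailingBytes);
    }
    Ok(Tx { payload })
}

/// Buggy container decoder: hands the whole remaining buffer to the strict decoder.
pub fn decode_block_body_buggy(buf: &[u8]) -> Result<Vec<Tx>, DecodeError> {
    let mut txs = Vec::new();
    let mut rest = buf;
    while !rest.is_empty() {
        // Fails as soon as anything follows the transaction being decoded.
        let tx = decode_tx_strict(rest)?;
        rest = &rest[4 + tx.payload.len()..];
        txs.push(tx);
    }
    Ok(txs)
}
```

In this toy layout a body holding two transactions fails with TrailingBytes on the very first one, because the second transaction is still sitting in the buffer when the strict check runs.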

Every block with one or more transactions failed to decode at the network layer. Validators never received the proposal. They never voted. No quorum certificate formed. The chain stalled.

The fix was one new function: decode_tx_v1_signed_streaming(). It advances the cursor without checking for trailing bytes. Used exclusively inside block decoding. The original function is preserved for standalone transaction decoding where the trailing bytes check is correct.
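The shape of the fix looks roughly like this. Again a sketch, assuming a toy format of a 4-byte little-endian length followed by the payload; the function names are illustrative and not the real decode_tx_v1_signed_streaming() signature. One difference from what I actually did: here the strict decoder is written as a thin wrapper over the streaming one, which is one way to avoid duplicating the parsing logic.

```rust
#[derive(Debug, PartialEq)]
pub enum DecodeError {
    UnexpectedEof,
    TrailingBytes,
}

#[derive(Debug, PartialEq)]
pub struct Tx {
    pub payload: Vec<u8>,
}

/// Streaming decoder: reads one transaction starting at `*cursor` and
/// advances the cursor past it. Leftover bytes are the caller's business.
/// On error the cursor is left unchanged.
pub fn decode_tx_streaming(buf: &[u8], cursor: &mut usize) -> Result<Tx, DecodeError> {
    let start = *cursor;
    let body = start.checked_add(4).ok_or(DecodeError::UnexpectedEof)?;
    let header = buf.get(start..body).ok_or(DecodeError::UnexpectedEof)?;
    let len = u32::from_le_bytes([header[0], header[1], header[2], header[3]]) as usize;
    let end = body.checked_add(len).ok_or(DecodeError::UnexpectedEof)?;
    let payload = buf.get(body..end).ok_or(DecodeError::UnexpectedEof)?.to_vec();
    *cursor = end;
    Ok(Tx { payload })
}

/// Strict decoder for standalone input (e.g. an RPC submission):
/// streaming decode plus the "consumed everything" check.
pub fn decode_tx_strict(buf: &[u8]) -> Result<Tx, DecodeError> {
    let mut cursor = 0;
    let tx = decode_tx_streaming(buf, &mut cursor)?;
    if cursor != buf.len() {
        return Err(DecodeError::TrailingBytes);
    }
    Ok(tx)
}

/// Container decoder: keeps pulling transactions until the buffer is consumed.
pub fn decode_block_body(buf: &[u8]) -> Result<Vec<Tx>, DecodeError> {
    let mut cursor = 0;
    let mut txs = Vec::new();
    while cursor < buf.len() {
        txs.push(decode_tx_streaming(buf, &mut cursor)?);
    }
    Ok(txs)
}
```

The trailing-bytes check does not disappear. It moves up a level: the container decides when the buffer is fully consumed, and the per-item decoder only reports how far it got.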

What happened after the fix

The chain immediately started committing blocks with transactions. All four validators reached consensus, and transaction gossip worked across the network. The chain has since committed over 16 million blocks.

What I learned

Silent failures are the hardest bugs in distributed systems. There were no error logs, no panics, no stack traces. The chain just produced empty blocks and looked healthy. Every metric was green except the one that mattered.
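One way to make this class of failure loud is to count and log every inbound message that fails to decode, instead of dropping it. The sketch below is hypothetical: NetMetrics, on_proposal_bytes, and the counter names are invented for illustration and are not taken from the NOVAI codebase.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Minimal error type for the sketch.
#[derive(Debug)]
pub struct DecodeError(pub &'static str);

/// Hypothetical counters a node could export next to block height.
#[derive(Default)]
pub struct NetMetrics {
    pub proposals_decoded: AtomicU64,
    pub proposal_decode_failures: AtomicU64,
}

/// Decodes an inbound proposal and records the outcome either way.
/// A decode failure is logged and counted rather than silently dropped.
pub fn on_proposal_bytes<T>(
    bytes: &[u8],
    decode: impl Fn(&[u8]) -> Result<T, DecodeError>,
    metrics: &NetMetrics,
) -> Option<T> {
    match decode(bytes) {
        Ok(proposal) => {
            metrics.proposals_decoded.fetch_add(1, Ordering::Relaxed);
            Some(proposal)
        }
        Err(err) => {
            metrics.proposal_decode_failures.fetch_add(1, Ordering::Relaxed);
            eprintln!("proposal decode failed ({} bytes): {:?}", bytes.len(), err);
            None
        }
    }
}
```

With a counter like this, a failure counter climbing in step with the round number would have pointed at the codec on the first run.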

Systematic elimination is the only approach that works. I ruled out dual mempool instances, lock contention, timeout races, leader rotation, cache eviction, and codec round-trip failures, one by one. Each hypothesis was tested, disproved, and crossed off.

The fix was 20 lines of code. The investigation was two days. The ratio between understanding and implementation is always lopsided in distributed systems, and that's fine. The understanding is the hard part.

The technical details

For anyone working on custom binary codecs in Rust: be careful with "trailing bytes" checks in your decoders. They're correct for standalone message parsing but catastrophically wrong when the same decoder is reused inside a container format where multiple messages are concatenated. The streaming pattern (advance cursor, don't check for leftovers) is the right approach for container decoding.
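The container side of that pattern can be written once and reused. A minimal sketch, assuming item decoders take a buffer plus a cursor; decode_all, ContainerError, and the one-length-byte toy item format are all hypothetical names and choices made for this example:

```rust
#[derive(Debug, PartialEq)]
pub enum ContainerError<E> {
    /// The item at `index` failed to decode.
    Item { index: usize, source: E },
    /// The item decoder returned Ok without advancing the cursor.
    NoProgress { index: usize },
}

/// Decodes concatenated items until the buffer is fully consumed.
/// The container, not the item decoder, decides when the input is done.
pub fn decode_all<T, E>(
    buf: &[u8],
    decode_one: impl Fn(&[u8], &mut usize) -> Result<T, E>,
) -> Result<Vec<T>, ContainerError<E>> {
    let mut cursor = 0;
    let mut items = Vec::new();
    while cursor < buf.len() {
        let before = cursor;
        let item = decode_one(buf, &mut cursor)
            .map_err(|source| ContainerError::Item { index: items.len(), source })?;
        if cursor <= before {
            // Guards against an infinite loop on a decoder that consumes nothing.
            return Err(ContainerError::NoProgress { index: items.len() });
        }
        items.push(item);
    }
    Ok(items)
}

/// Toy item decoder: one length byte followed by that many payload bytes.
pub fn decode_short_item(buf: &[u8], cursor: &mut usize) -> Result<Vec<u8>, &'static str> {
    let len = *buf.get(*cursor).ok_or("missing length byte")? as usize;
    let start = *cursor + 1;
    let end = start + len;
    let payload = buf.get(start..end).ok_or("truncated payload")?.to_vec();
    *cursor = end;
    Ok(payload)
}
```

A regression test worth keeping next to any container decoder: decode bodies with zero, one, and several items. The empty case is the only one my devnet was exercising before transactions existed.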

The codebase is on GitHub: github.com/0x-devc/NOVAI-node

65,000+ lines of Rust. 4,000+ tests. Zero unsafe code.