<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jayanth Kumar</title>
    <description>The latest articles on DEV Community by Jayanth Kumar (@jayanthkumarmorem).</description>
    <link>https://dev.to/jayanthkumarmorem</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3870589%2Fc9ff904b-7fac-4e81-b0fe-15d45d2510d8.jpeg</url>
      <title>DEV Community: Jayanth Kumar</title>
      <link>https://dev.to/jayanthkumarmorem</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jayanthkumarmorem"/>
    <language>en</language>
    <item>
      <title>I Made a Single CUDA Kernel Speak: Streaming Qwen3-TTS at 50ms Latency on an RTX 5090</title>
      <dc:creator>Jayanth Kumar</dc:creator>
      <pubDate>Thu, 09 Apr 2026 21:49:30 +0000</pubDate>
      <link>https://dev.to/jayanthkumarmorem/i-made-a-single-cuda-kernel-speak-streaming-qwen3-tts-at-50ms-latency-on-an-rtx-5090-53if</link>
      <guid>https://dev.to/jayanthkumarmorem/i-made-a-single-cuda-kernel-speak-streaming-qwen3-tts-at-50ms-latency-on-an-rtx-5090-53if</guid>
      <description>&lt;p&gt;My first measurement said &lt;strong&gt;35,932 milliseconds&lt;/strong&gt;. The &lt;strong&gt;target was 90.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's not a typo. &lt;strong&gt;Thirty-five &lt;em&gt;seconds&lt;/em&gt; to produce the first chunk of audio&lt;/strong&gt; from a text-to-speech system that was supposed to feel like a natural conversation. I was off by a factor of 400. And I had less than a day to fix it.&lt;/p&gt;

&lt;p&gt;Here's how I went from knowing nothing about CUDA megakernels, nothing about TTS pipelines, and nothing about Pipecat — to streaming real-time speech synthesis at 50ms TTFC and 0.17 RTF on a single RTX 5090. With 3 lines of kernel code changed.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Challenge
&lt;/h2&gt;

&lt;p&gt;The task was deceptively simple on paper: take &lt;a href="https://blog.alpindale.net/posts/5090_decode_optimization/" rel="noopener noreferrer"&gt;AlpinDale's qwen_megakernel&lt;/a&gt;, a ~1,200-line CUDA program that runs Qwen3-0.6B text generation at 1,000 tokens/second on an RTX 5090, and make it run &lt;a href="https://huggingface.co/Qwen/Qwen3-TTS" rel="noopener noreferrer"&gt;Qwen3-TTS&lt;/a&gt; speech synthesis inside a &lt;a href="https://docs.pipecat.ai" rel="noopener noreferrer"&gt;Pipecat&lt;/a&gt; voice agent pipeline.&lt;/p&gt;

&lt;p&gt;Two hard targets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TTFC (time to first audio chunk): &amp;lt; 90 ms&lt;/strong&gt; — how long before the user hears anything&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;RTF (real-time factor): &amp;lt; 0.3&lt;/strong&gt; — generating 1 second of speech must take less than 300ms&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;And one non-negotiable constraint:&lt;/strong&gt; the audio must stream frame-by-frame to Pipecat. No buffering the full utterance. The user should start hearing the response almost immediately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I'd never read a CUDA kernel before. I'd never built a TTS pipeline. I'd never used Pipecat. Let's go.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What Even Is a Megakernel?
&lt;/h2&gt;

&lt;p&gt;Before I could adapt something, &lt;strong&gt;I needed to understand what it does.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Normally, when &lt;strong&gt;PyTorch runs a transformer model, it launches hundreds of separate GPU operations&lt;/strong&gt;, one for each matrix multiply, one for each attention computation, one for each normalization. &lt;strong&gt;Each launch has overhead&lt;/strong&gt;: the &lt;strong&gt;CPU tells the GPU what to do&lt;/strong&gt;, the &lt;strong&gt;GPU starts up, does the work, finishes, and waits for the next instruction.&lt;/strong&gt; For a 28-layer transformer, that's hundreds of round trips per token.&lt;/p&gt;

&lt;p&gt;A megakernel eliminates all of that. It packs the &lt;em&gt;entire&lt;/em&gt; forward pass, all 28 layers, all the matrix multiplies, all the attention, all the normalization, into a single GPU program. 128 persistent thread blocks, each with 512 threads, launched once and running continuously. Data flows through shared memory and L2 cache instead of being written to and read from global DRAM between operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The result:&lt;/strong&gt; 1,000 tokens per second on a single RTX 5090. That's 0.97 milliseconds per decode step. The kernel hits 71% of the theoretical memory bandwidth of GDDR7. It's fast.&lt;/p&gt;
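&lt;p&gt;That per-step time is consistent with a memory-bandwidth-bound decode: every step has to stream essentially all of the model's weights from VRAM. A back-of-envelope sketch (the ~1,792 GB/s peak-bandwidth figure for the RTX 5090 is my assumption, not from the kernel's docs):&lt;/p&gt;

```python
# Rough check: for a memory-bound decoder, step time is approximately
# (bytes of weights streamed per token) / (achieved memory bandwidth).
params = 0.6e9                # Qwen3-0.6B parameter count (approximate)
bytes_per_param = 2           # bf16 weights
peak_bw = 1.792e12            # RTX 5090 GDDR7 peak, bytes/s (assumed spec)
achieved_bw = 0.71 * peak_bw  # 71% of theoretical, as measured in the post

step_time_ms = params * bytes_per_param / achieved_bw * 1e3
print(f"{step_time_ms:.2f} ms/step")  # lands near the measured 0.97 ms
```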

&lt;p&gt;But it was built for text generation — Qwen3-0.6B, vocabulary of 151,936 tokens, standard RoPE. I needed it to generate audio codes — Qwen3-TTS's talker decoder, vocabulary of 3,072, different RoPE frequency, and a completely different input/output format.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Lucky Break: Same Architecture, Different Vocabulary
&lt;/h2&gt;

&lt;p&gt;My first real breakthrough came from comparing the two model configs side by side. The Qwen3-TTS talker decoder is, architecturally, the &lt;em&gt;exact same model&lt;/em&gt; as Qwen3-0.6B. Same hidden dimension (1024). Same 28 layers. Same 16 query heads, 8 KV heads, head dimension 128, intermediate size 3072. The transformer backbone is identical.&lt;/p&gt;

&lt;p&gt;The differences were cosmetic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Vocabulary&lt;/strong&gt;: 3,072 audio codes vs. 151,936 text tokens&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;RoPE frequency&lt;/strong&gt;: 1,000,000 vs. 10,000&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Output layer&lt;/strong&gt;: Separate weight matrix instead of tied with input embeddings&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This meant I didn't need to rewrite the kernel. I needed to change &lt;em&gt;parameters&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The vocabulary change was trivial — the kernel reads &lt;code&gt;LDG_VOCAB_SIZE&lt;/code&gt; as a compile-time constant. I created a new build script that passes &lt;code&gt;-DLDG_VOCAB_SIZE=3072&lt;/code&gt; instead of &lt;code&gt;151936&lt;/code&gt;, and reduced the LM head thread blocks from 1,280 to 16 (~49x fewer tokens to scan). No kernel code changes.&lt;/p&gt;

&lt;p&gt;The RoPE frequency only affects precomputed tables in Python. No kernel changes.&lt;/p&gt;
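&lt;p&gt;Those tables are just a different base in the standard RoPE frequency formula. A minimal sketch (head dimension 128, as in both models; the helper name is mine):&lt;/p&gt;

```python
def rope_inv_freq(base, head_dim=128):
    """Inverse frequency per rotary pair: base**(-2i/head_dim).
    The cos/sin tables the kernel consumes are built from
    position * inv_freq in Python, so no kernel change is needed."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

text_freqs = rope_inv_freq(10_000)     # Qwen3-0.6B text decoder
tts_freqs = rope_inv_freq(1_000_000)   # Qwen3-TTS talker decoder

# The larger base makes the slowest pairs rotate even more slowly,
# stretching the usable position range.
print(tts_freqs[-1], text_freqs[-1])
```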

&lt;p&gt;The untied output layer just meant loading a separate weight tensor. No kernel changes.&lt;/p&gt;

&lt;p&gt;So far: zero lines of CUDA modified.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 3-Line Patch That Made Everything Possible
&lt;/h2&gt;

&lt;p&gt;Here's where I needed to actually touch the kernel.&lt;/p&gt;

&lt;p&gt;In text generation, each decode step feeds the model a single token ID. The kernel looks up that token's embedding from a table. Simple.&lt;/p&gt;

&lt;p&gt;But in TTS, the input to each decode step is a &lt;em&gt;sum of 17 different embeddings&lt;/em&gt;: 16 codebook group embeddings from the previous audio frame, plus a trailing text token. There's no single token ID that corresponds to a sum of 17 vectors.&lt;/p&gt;

&lt;p&gt;I could have launched a separate GPU operation to compute this embedding before each decode step. But that adds launch overhead every single frame — and when you're generating 12.5 frames per second and each frame needs to be fast, those microseconds add up.&lt;/p&gt;

&lt;p&gt;Instead, I added an &lt;strong&gt;embedding sentinel&lt;/strong&gt; — 3 lines in &lt;code&gt;kernel.cu&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;__nv_bfloat16&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;embed_row&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_token_id&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="n"&gt;embed_weight&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;input_token_id&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;HIDDEN_SIZE&lt;/span&gt;
                          &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;hidden_buffer&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;token_id &amp;gt;= 0&lt;/code&gt;: normal embedding lookup, exactly as before. When &lt;code&gt;token_id == -1&lt;/code&gt;: skip the lookup, read from a pre-filled buffer where Python has already written the combined embedding. No extra kernel launch. No synchronization. Fully backward-compatible.&lt;/p&gt;
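&lt;p&gt;On the Python side, the driver fills the buffer and then passes the sentinel. A torch-free sketch of the idea (the helper and the &lt;code&gt;decode_step&lt;/code&gt; call below are illustrative, not the repo's actual API):&lt;/p&gt;

```python
# Hypothetical Python-side use of the embedding sentinel.
HIDDEN_SIZE = 1024
EMBED_SENTINEL = -1   # tells the kernel to read hidden_buffer instead

def combined_input_embedding(codebook_embeds, text_embed):
    """Sum the 16 codebook-group embeddings from the previous audio frame
    with the trailing text-token embedding (17 vectors total)."""
    total = [0.0] * HIDDEN_SIZE
    for vec in codebook_embeds + [text_embed]:
        for i, x in enumerate(vec):
            total[i] += x
    return total

codebooks = [[0.01] * HIDDEN_SIZE for _ in range(16)]
text_tok = [0.02] * HIDDEN_SIZE
hidden_buffer = combined_input_embedding(codebooks, text_tok)
# kernel.decode_step(token_id=EMBED_SENTINEL)  # would read hidden_buffer
print(round(hidden_buffer[0], 2))  # 16*0.01 + 0.02 = 0.18
```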

&lt;p&gt;That's it. Three lines. The only CUDA code I changed in the entire project.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Discovery That Saved Everything: &lt;code&gt;num_layers&lt;/code&gt; Is Runtime
&lt;/h2&gt;

&lt;p&gt;This is the part of the story where I went from "this might work" to "this is going to crush the targets."&lt;/p&gt;

&lt;p&gt;Qwen3-TTS doesn't just have a talker decoder. After each decode step, a separate &lt;strong&gt;code predictor&lt;/strong&gt; — a smaller, 5-layer transformer — takes the talker's hidden state and expands it into 15 additional codebook groups. Each audio frame needs all 16 groups: 1 from the talker, 15 from the code predictor.&lt;/p&gt;

&lt;p&gt;My first code predictor implementation used standard PyTorch. It worked. It also took &lt;strong&gt;179 milliseconds per frame&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Let me do the math on that. At 12.5 frames per second, generating 1 second of audio takes 179ms × 12.5 = 2,237ms. That's an RTF of 2.24 — &lt;strong&gt;seven and a half times over the target&lt;/strong&gt;. The code predictor alone made the entire project impossible.&lt;/p&gt;

&lt;p&gt;I started looking at &lt;code&gt;torch.compile&lt;/code&gt;, CUDA graphs, and custom attention kernels. Then I noticed something while re-reading the megakernel's main loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;layer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;layer&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;num_layers&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;layer&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;num_layers&lt;/code&gt; isn't a constant. It's a runtime parameter. The kernel doesn't care if it's 28 or 5 or 1 — it just loops that many times.&lt;/p&gt;

&lt;p&gt;And the code predictor has the &lt;em&gt;exact same architecture&lt;/em&gt; as the talker — same hidden size, same attention heads, same everything. Just 5 layers instead of 28.&lt;/p&gt;

&lt;p&gt;What if I just... call the same kernel with &lt;code&gt;num_layers=5&lt;/code&gt;?&lt;/p&gt;

&lt;p&gt;I packed the code predictor's weights into the same struct format the kernel expects. Allocated a separate, smaller KV cache. Called the same compiled binary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;179ms → 10.9ms.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An 18x speedup. Zero kernel code changes. RTF dropped from 2.24 to 0.175. Target obliterated.&lt;/p&gt;
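&lt;p&gt;The arithmetic behind those RTF numbers, as a sanity check (the ~1 ms and ~2 ms per-frame costs for the talker, embedding, and vocoder are taken from the per-frame breakdown later in the post; the budget comes out slightly above the measured end-to-end 0.175):&lt;/p&gt;

```python
frames_per_sec = 12.5   # one codec frame = 80 ms of audio
cp_before_ms = 179.0    # code predictor per frame, PyTorch
cp_after_ms = 10.9      # same weights run through the megakernel

rtf_before = cp_before_ms * frames_per_sec / 1000   # ~2.24, 7.5x over target
# Post-fix per-frame budget:
# ~1 ms talker + ~10.9 ms code predictor + ~1 ms embedding + ~2 ms vocoder
rtf_after = (1.0 + cp_after_ms + 1.0 + 2.0) * frames_per_sec / 1000
print(rtf_before, rtf_after)
```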

&lt;p&gt;This was the single most important insight of the entire project: the megakernel isn't just for the talker decoder. It's a &lt;strong&gt;general-purpose transformer accelerator&lt;/strong&gt; that you can reuse for any model with the same architecture, at any layer count, by just swapping the weights and adjusting one parameter.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 35-Second TTFC and the One-Word Fix
&lt;/h2&gt;

&lt;p&gt;With the kernel adapted and the code predictor running fast, I wired up the full pipeline: text tokenization → embedding projection → talker prefill → autoregressive decode → code predictor → vocoder → audio chunks. End to end.&lt;/p&gt;

&lt;p&gt;First measurement: &lt;strong&gt;TTFC = 35,932ms. RTF = 0.605.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The RTF I expected — the code predictor was still on PyTorch at that point. But 35 seconds for TTFC? The system was generating the &lt;em&gt;entire utterance&lt;/em&gt; before sending a single audio chunk.&lt;/p&gt;

&lt;p&gt;I looked at my frame generation function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_generate_codec_frames&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;frames&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_frames&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;frames&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;generate_one_frame&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;frames&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It returns a list. The streaming wrapper iterates over that list. But by the time it gets the list, all 2,048 frames have already been generated.&lt;/p&gt;

&lt;p&gt;The fix was one keyword:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_generate_codec_frames&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_frames&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="nf"&gt;generate_one_frame&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;yield&lt;/code&gt; instead of &lt;code&gt;append&lt;/code&gt; + &lt;code&gt;return&lt;/code&gt;. A Python generator hands each frame to the consumer the moment it's produced. TTFC: &lt;strong&gt;35,932ms → 1,096ms.&lt;/strong&gt; A 33x improvement from changing one word.&lt;/p&gt;
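&lt;p&gt;The effect is easy to reproduce in miniature with toy timings (the frame cost and count below are illustrative, not the pipeline's):&lt;/p&gt;

```python
import time

FRAME_COST = 0.002  # pretend each frame takes 2 ms to generate
N_FRAMES = 50

def frames_list():
    out = []
    for _ in range(N_FRAMES):
        time.sleep(FRAME_COST)
        out.append(b"\x00" * 160)  # fake audio chunk
    return out

def frames_gen():
    for _ in range(N_FRAMES):
        time.sleep(FRAME_COST)
        yield b"\x00" * 160

t0 = time.perf_counter()
next(iter(frames_gen()))   # first chunk arrives after ~1 frame
ttfc_gen = time.perf_counter() - t0

t0 = time.perf_counter()
frames_list()[0]           # first chunk arrives after ALL frames
ttfc_list = time.perf_counter() - t0

print(ttfc_list > ttfc_gen)  # True, by roughly a factor of N_FRAMES
```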




&lt;h2&gt;
  
  
  Chasing the Last 1,000 Milliseconds
&lt;/h2&gt;

&lt;p&gt;At 1,096ms, I could see the real bottleneck: the vocoder's first decode call took &lt;strong&gt;834ms&lt;/strong&gt;. CUDA is deeply lazy — the first time you run an operation, it allocates memory, compiles internal kernels, and sets up buffers. The second time, it's instant.&lt;/p&gt;

&lt;p&gt;I added warmup: three dummy vocoder decode calls during engine initialization, with different input sizes. Cold start: 834ms. After warmup: 38ms. &lt;strong&gt;TTFC: 1,096ms → 192ms.&lt;/strong&gt;&lt;/p&gt;
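&lt;p&gt;The pattern in miniature, with a pure-Python stand-in for CUDA's lazy per-shape initialization (the 50 ms setup cost here is simulated, not measured):&lt;/p&gt;

```python
import time

class FakeVocoder:
    """Stand-in for a CUDA module whose first call per input shape pays a
    one-time setup cost (allocation, kernel compilation), simulated here."""
    def __init__(self):
        self._warm_shapes = set()

    def decode(self, n_frames):
        if n_frames not in self._warm_shapes:
            time.sleep(0.05)        # simulated one-time setup
            self._warm_shapes.add(n_frames)
        return [0.0] * n_frames     # fake audio

voc = FakeVocoder()
for shape in (1, 10, 50):           # warm up the shapes the pipeline uses
    voc.decode(shape)

t0 = time.perf_counter()
voc.decode(10)                      # first "real" call is now fast
warm_ms = (time.perf_counter() - t0) * 1000
print(50.0 > warm_ms)  # True: the setup cost was paid during init
```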

&lt;p&gt;Still over target. The code predictor's first call took 107ms instead of 13ms. Same story — but trickier. My warmup code ran with &lt;code&gt;do_sample=False&lt;/code&gt; (argmax — pick the most likely token). But the real pipeline uses sampling (randomly pick from top candidates). &lt;code&gt;torch.multinomial&lt;/code&gt;, &lt;code&gt;torch.softmax&lt;/code&gt;, and &lt;code&gt;torch.topk&lt;/code&gt; each have their own first-call overhead, independent of the argmax path.&lt;/p&gt;

&lt;p&gt;I added sampling warmup. &lt;strong&gt;TTFC: 192ms → 92ms.&lt;/strong&gt; Almost there.&lt;/p&gt;

&lt;p&gt;The last 12ms came from batching. I was computing text embeddings in 5 separate calls (role tokens, content tokens, special tokens...), each launching its own chain of GPU operations. Combining them into a single call cut embedding time from 14ms to 7ms. Then I precomputed every embedding that doesn't change between utterances — role tokens, TTS special tokens, codec tags — and cached them during initialization. Another 6ms saved.&lt;/p&gt;
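&lt;p&gt;A toy cost model of why batching the embedding calls helped; the overhead and work figures below are chosen to mirror the 14ms → 7ms measurement, not measured themselves:&lt;/p&gt;

```python
# Each GPU launch chain pays a roughly fixed overhead on top of the
# actual embedding work, so 5 calls cost more than 1 batched call.
LAUNCH_OVERHEAD_MS = 1.75   # illustrative fixed cost per call
EMBED_WORK_MS = 5.25        # total work is the same either way

def embed_time(num_calls):
    return num_calls * LAUNCH_OVERHEAD_MS + EMBED_WORK_MS

print(embed_time(5), embed_time(1))  # 14.0 7.0
```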

&lt;p&gt;&lt;strong&gt;Final TTFC: 50.5ms&lt;/strong&gt; (non-streaming pipeline test). &lt;strong&gt;81.6ms&lt;/strong&gt; with vocoder and streaming overhead.&lt;/p&gt;

&lt;p&gt;Target was 90ms. We got there with room to spare.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Prefill Format: Reading 1,800 Lines of Source Code to Find 3 Token IDs
&lt;/h2&gt;

&lt;p&gt;Not everything was about performance. I spent an embarrassing amount of time debugging poor audio quality before realizing the problem was in the &lt;em&gt;prefill format&lt;/em&gt; — the sequence of tokens that primes the decoder before it starts generating audio.&lt;/p&gt;

&lt;p&gt;The Qwen3-TTS decoder expects an 8-step conditioning sequence. Three of those steps use "thinking tokens" — special IDs (2155, 2156, 2157) that represent a compressed thinking phase. These aren't mentioned in the model card. They're not in the config file. They're not in any documentation.&lt;/p&gt;

&lt;p&gt;I found them by reading line 2136 of &lt;code&gt;modeling_qwen3_tts.py&lt;/code&gt; — the 1,800-line generation script that ships with the model. My initial implementation used padding tokens in those positions. The model generated audio, but it sounded off. Once I matched the exact official format, quality improved immediately.&lt;/p&gt;

&lt;p&gt;The lesson: when you're integrating with a specific model, the source code is the only reliable documentation.&lt;/p&gt;




&lt;h2&gt;
  
  
  The One Thing I Couldn't Fix
&lt;/h2&gt;

&lt;p&gt;The talker decoder uses M-RoPE (Multimodal RoPE) — a variant of positional encoding that splits the 64 head-dimension pairs into three groups of [24, 20, 20], each potentially tracking a different position.&lt;/p&gt;

&lt;p&gt;The megakernel implements standard 1D RoPE. All pairs use the same position.&lt;/p&gt;
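&lt;p&gt;The difference is only in which position index each rotary pair sees. A sketch (the [24, 20, 20] split is from the model; the helper names are mine):&lt;/p&gt;

```python
def mrope_positions(t, h, w, sections=(24, 20, 20)):
    """M-RoPE: the 64 rotary pairs split into three sections (temporal,
    height, width), each indexed by its own position counter."""
    t_n, h_n, w_n = sections
    return [t] * t_n + [h] * h_n + [w] * w_n

def rope_positions(t, n_pairs=64):
    """Standard 1D RoPE, as implemented in the megakernel: every pair
    shares the same position."""
    return [t] * n_pairs

# With text-only input, all three counters advance together, so the two
# schemes agree -- which is why shipping without M-RoPE mostly works:
print(mrope_positions(5, 5, 5) == rope_positions(5))  # True
```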

&lt;p&gt;Implementing M-RoPE would mean modifying the attention computation — the most performance-critical, most carefully tuned part of the entire kernel. AlpinDale spent months optimizing those lines. A subtle bug would silently corrupt every output.&lt;/p&gt;

&lt;p&gt;I chose to ship without it. For text-only TTS (no vision input), the three M-RoPE positions are identical anyway, so the immediate impact is minimal. The practical consequence: the model doesn't reliably emit an end-of-sequence token, so I use a word-count-based heuristic to estimate when to stop generating. It works well enough — "Hello, how are you?" produces 4 seconds of audio, a 52-word paragraph produces 42 seconds — but it's not perfect.&lt;/p&gt;

&lt;p&gt;I document this honestly as a known limitation, with a clear description of what would fix it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Final Numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;th&gt;Target&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;TTFC (non-streaming)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;50.5 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt; 90 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TTFC (streaming + vocoder)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;81.6 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt; 90 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTF (non-streaming)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.175&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt; 0.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTF (streaming)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.234&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt; 0.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code predictor (megakernel)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;10.9 ms/frame&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Talker decode&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~1 ms/step&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;TTFC breakdown&lt;/strong&gt; (50.5ms):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Tokenize: 2.3ms&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Embed text: 7.2ms&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Prefill (8 megakernel steps): 24.9ms&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;First talker decode: 3.1ms&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;First code predictor: 13.0ms&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Per-frame cost&lt;/strong&gt; (~15ms for 80ms of audio):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Talker decode: ~1ms&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Code predictor: ~11ms&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Embedding computation: ~1ms&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Vocoder (amortized): ~2ms&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every number measured with &lt;code&gt;torch.cuda.synchronize()&lt;/code&gt; barriers. No hand-waving.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Actually Changed in the Kernel
&lt;/h2&gt;

&lt;p&gt;Let me be precise:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3 lines in&lt;/strong&gt; &lt;code&gt;kernel.cu&lt;/code&gt;: The embedding sentinel (&lt;code&gt;token_id &amp;lt; 0&lt;/code&gt; → read from buffer instead of embedding table).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2 constants in the build script&lt;/strong&gt;: &lt;code&gt;LDG_VOCAB_SIZE=3072&lt;/code&gt; (was 151936), &lt;code&gt;LDG_LM_NUM_BLOCKS=16&lt;/code&gt; (was 1280).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;0 lines for the biggest optimization&lt;/strong&gt;: The &lt;code&gt;num_layers&lt;/code&gt; runtime parameter was already there. I just called the same kernel with &lt;code&gt;num_layers=5&lt;/code&gt; for the code predictor.&lt;/p&gt;

&lt;p&gt;Everything else — weight loading, the TTS pipeline, the streaming engine, the Pipecat integration, the vocoder, the warmup strategy, the prefill format — was Python. About 4,500 lines of it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pipecat Integration: The Clean Part
&lt;/h2&gt;

&lt;p&gt;With the engine producing streaming audio, plugging it into &lt;a href="https://docs.pipecat.ai" rel="noopener noreferrer"&gt;Pipecat&lt;/a&gt; was the cleanest part of the project. Pipecat has a well-designed TTS service interface: subclass &lt;code&gt;TTSService&lt;/code&gt;, implement &lt;code&gt;run_tts&lt;/code&gt; as an async generator that yields audio frames, and the framework handles routing, turn management, and interruptions.&lt;/p&gt;

&lt;p&gt;The full voice agent pipeline flows like a conversation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User speaks → Deepgram STT → OpenAI LLM → Megakernel TTS → User hears
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Audio streams frame-by-frame — the first chunk (80ms of audio) ships as soon as it's decoded, with subsequent chunks batching 10 frames (~800ms) for efficiency. There's also a text-only mode for testing without a microphone or API keys.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Optimization Journey, Visualized
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Metric          Start          →  End
──────────────────────────────────────────
TTFC            35,932 ms      →  50.5 ms       (711x improvement)
RTF             0.605          →  0.175         (3.5x improvement)
Code predictor  179 ms/frame   →  10.9 ms/frame (18x improvement)
Kernel changes  -              →  3 lines
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The 711x TTFC improvement came from five things stacking:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Generator streaming (&lt;code&gt;yield&lt;/code&gt; vs &lt;code&gt;return&lt;/code&gt;) — 33x&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Vocoder warmup — 6x&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Sampling path warmup — 2x&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Batched embedding — 1.1x&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Precomputed constants — 1.15x&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The RTF improvement came from one thing: running the code predictor through the megakernel instead of PyTorch. That single insight — noticing that &lt;code&gt;num_layers&lt;/code&gt; is runtime, not compile-time — was worth an 18x speedup.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd Do Next
&lt;/h2&gt;

&lt;p&gt;Two things would make the biggest difference with more time:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;M-RoPE in the kernel.&lt;/strong&gt; Modify the RoPE rotation in &lt;code&gt;ldg_attention&lt;/code&gt; to split the 64 head-dimension pairs into three groups with independent position counters. This would fix EOS detection, eliminate the frame limit heuristic, and likely improve audio quality. It's a surgical change — maybe 20 lines — but it touches the hottest path in the kernel, so it needs very careful testing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token suppression.&lt;/strong&gt; The official Qwen3-TTS implementation suppresses tokens 2048-3071 (except EOS) during talker decode. Without this, the model can occasionally generate meaningless special tokens. Adding a suppression mask to the logits before argmax/sampling is straightforward in Python and would improve output consistency.&lt;/p&gt;
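&lt;p&gt;A minimal sketch of such a mask, torch-free for clarity (the EOS id used here is a placeholder, not the model's actual value):&lt;/p&gt;

```python
import math

VOCAB_SIZE = 3072
EOS_ID = 2150          # placeholder id, for illustration only
SUPPRESS_START = 2048  # official impl suppresses 2048-3071 except EOS

def suppress_special_tokens(logits):
    """Mask the special-token range so argmax/sampling can only pick an
    ordinary audio code (or EOS)."""
    for tok in range(SUPPRESS_START, VOCAB_SIZE):
        if tok != EOS_ID:
            logits[tok] = -math.inf
    return logits

logits = [0.0] * VOCAB_SIZE
logits[3000] = 10.0    # a meaningless special token with the top logit
masked = suppress_special_tokens(logits)
best = max(range(VOCAB_SIZE), key=lambda t: masked[t])
print(best)  # 0 -- the boosted special token can no longer win argmax
```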




&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;p&gt;If you're doing inference engineering and this kind of work interests you:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Read the kernel, even if you don't understand all of it.&lt;/strong&gt; I didn't understand every line of &lt;code&gt;kernel.cu&lt;/code&gt;. But I understood enough to notice that &lt;code&gt;num_layers&lt;/code&gt; was runtime, that the embedding lookup was a clean injection point, and that the RoPE tables were precomputed in Python. Those three observations drove every optimization in the project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Profile before you optimize, but also before you despair.&lt;/strong&gt; When I saw 35,932ms TTFC, I could have panicked. Instead I profiled, found that 99.7% of the time was spent generating frames that hadn't been yielded yet, and fixed it with one keyword. The real bottleneck was always hiding behind a bigger, more obvious one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The boring optimizations compound.&lt;/strong&gt; Generator streaming, warmup routines, batched embeddings, precomputed constants — none of these are flashy. But 33x × 6x × 2x × 1.1x × 1.15x ≈ 500x. The flashy optimization (megakernel code predictor) was "only" 18x. The boring ones, stacked, were 28x more impactful.&lt;/p&gt;
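&lt;p&gt;The compounding is worth checking explicitly:&lt;/p&gt;

```python
# Independent speedups on the same path multiply rather than add.
factors = [33, 6, 2, 1.1, 1.15]  # yield, vocoder warmup, sampling warmup,
                                 # batched embedding, precomputed constants
total = 1.0
for f in factors:
    total *= f
print(round(total))  # ~501x combined, vs 18x for the flashy one
```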

&lt;p&gt;&lt;strong&gt;Ship what works, document what doesn't.&lt;/strong&gt; I could have spent days trying to implement M-RoPE in the kernel. Instead, I shipped with a workaround, documented the limitation clearly, hit both performance targets, and moved on. Perfect is the enemy of shipped.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The full source code is at&lt;/em&gt; &lt;a href="https://github.com/jayanth-kumar-morem/qwen-megakernel-tts" rel="noopener noreferrer"&gt;&lt;em&gt;github.com/jayanth-kumar-morem/qwen-megakernel-tts&lt;/em&gt;&lt;/a&gt;&lt;em&gt;. Built on&lt;/em&gt; &lt;a href="https://github.com/AlpinDale/qwen_megakernel" rel="noopener noreferrer"&gt;&lt;em&gt;AlpinDale's qwen_megakernel&lt;/em&gt;&lt;/a&gt;&lt;em&gt;,&lt;/em&gt; &lt;a href="https://huggingface.co/Qwen/Qwen3-TTS" rel="noopener noreferrer"&gt;&lt;em&gt;Qwen3-TTS&lt;/em&gt;&lt;/a&gt;&lt;em&gt;, and&lt;/em&gt; &lt;a href="https://docs.pipecat.ai" rel="noopener noreferrer"&gt;&lt;em&gt;Pipecat&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>nlp</category>
      <category>performance</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Hedgehog-Enabled Verifiable Instant Runoff Voting with Extreme Coercion Resistance on Solana</title>
      <dc:creator>Jayanth Kumar</dc:creator>
      <pubDate>Thu, 09 Apr 2026 21:48:33 +0000</pubDate>
      <link>https://dev.to/jayanthkumarmorem/hedgehog-enabled-verifiable-instant-runoff-voting-with-extreme-coercion-resistance-on-solana-38c6</link>
      <guid>https://dev.to/jayanthkumarmorem/hedgehog-enabled-verifiable-instant-runoff-voting-with-extreme-coercion-resistance-on-solana-38c6</guid>
      <description>&lt;h3&gt;
  
  
  &lt;strong&gt;Abstract&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Electronic voting systems deployed on public blockchains offer the promise of unprecedented transparency and verifiability, yet contemporary solutions often present a stark trade-off between the expressiveness of the supported voting schemes and the robustness of their protection against sophisticated coercion tactics. This paper introduces a novel, fully-decentralized electronic voting system architected on the Solana blockchain that, for the first time, synthesizes verifiable Instant Runoff Voting (IRV) with a cryptographically-enforced mechanism for "extreme coercion resistance." Our system extends a baseline zero-knowledge voting architecture by incorporating two pivotal innovations: (1) a matrix-based ballot representation and a universally verifiable shifting procedure for IRV tallying, which enables complex, multi-round election computations to be conducted with full integrity; and (2) a "hedgehog/nullification" protocol, which empowers voters or their designated agents to irrevocably and anonymously cancel a coerced vote by submitting a zero-knowledge proof of key ownership. We provide a detailed specification of the requisite modifications to the on-chain smart contract logic, the off-chain decentralized storage architecture, and the formal design of three new, highly-optimized Circom circuits for proving ballot well-formedness, verifying state transitions between IRV rounds, and executing vote nullification. A rigorous security analysis is presented, formally demonstrating ballot secrecy, end-to-end verifiability, and resilience against the extreme coercion model. Finally, we conduct a comprehensive performance analysis, illustrating how the strategic application of advanced Arithmetization-Oriented (AO) hashing primitives and the inherent parallelism of the Solana network render our expressive and secure voting system computationally practical for real-world deployment.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Introduction&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The escalating global demand for accessible, transparent, and secure democratic mechanisms has catalyzed significant research and development into blockchain-based electronic voting solutions. Such systems hold the potential to deliver end-to-end (E2E) verifiability, a critical property wherein any external observer can mathematically confirm that the final election outcome was computed correctly from the set of valid ballots, all without compromising the fundamental right to voter privacy.&lt;/p&gt;

&lt;p&gt;While current zero-knowledge (ZK) voting platforms, exemplified by foundational systems like the "Cloak-minster Ballot," have demonstrated considerable success in achieving anonymity and verifiability for simple plurality elections through the sophisticated integration of ZK-SNARKs, smart contracts, and decentralized storage, they exhibit critical deficiencies in two areas essential for achieving broader democratic legitimacy and security.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Limited Expressiveness:&lt;/strong&gt; These systems are architecturally constrained to support only basic electoral models. They lack the inherent capability to handle more sophisticated schemes like Instant Runoff Voting (IRV). IRV is of paramount importance in modern democracies for its ability to minimize vote splitting and elect more broadly representative candidates, but its complex multi-round, vote-transfer logic is fundamentally incompatible with the simple additive tallying mechanisms found in existing platforms.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Vulnerability to Coercion:&lt;/strong&gt; Existing platforms lack robust mechanisms to protect voters from sophisticated forms of vote-buying and coercion. This vulnerability is particularly acute under the "extreme coercion" model, a realistic and potent threat scenario where an adversary successfully acquires all of a voter's secret cryptographic keys and can thereby compel them to cast a specific vote under duress.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This paper directly confronts these deficiencies by proposing a novel, authority-free e-voting architecture on the high-performance Solana blockchain. We make the following specific and substantial contributions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A Synthesized Protocol for Expressive and Coercion-Resistant Voting:&lt;/strong&gt; We design and specify a complete, integrated protocol that combines verifiable IRV tallying with an unstoppable nullification mechanism, achieving a synthesis of features not previously seen in the existing academic literature.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Novel Cryptographic Constructions:&lt;/strong&gt; We formally specify three new zero-knowledge circuits essential for the system's operation: &lt;code&gt;P_IRV_WF&lt;/code&gt; for proving the well-formedness of a matrix-based IRV ballot, &lt;code&gt;P_Shift&lt;/code&gt; for verifying the state transition between successive IRV rounds, and &lt;code&gt;P_Nullify&lt;/code&gt; for enabling the privacy-preserving, anonymous nullification of a vote.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A Concrete Solana-based Architecture:&lt;/strong&gt; We provide a detailed architectural blueprint, outlining the necessary modifications to the on-chain Anchor smart contracts and the off-chain IPFS data structures of the Cloak-minster Ballot system to support our advanced features.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rigorous Security and Performance Analysis:&lt;/strong&gt; We furnish a formal security model and proofs for the system's core properties, including the critical guarantee of extreme coercion resistance. Furthermore, we analyze the computational costs and demonstrate the system's practical feasibility, highlighting significant performance enhancements achieved through the use of optimized Arithmetization-Oriented (AO) hashing primitives.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;2. Preliminaries&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Cryptographic Building Blocks&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Our system is constructed upon a robust foundation of well-established cryptographic primitives, each selected for its specific security and performance characteristics.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Zero-Knowledge SNARKs:&lt;/strong&gt; We leverage Zero-Knowledge Succinct Non-Interactive Arguments of Knowledge (ZK-SNARKs), specifically the Groth16 proving system. This system is distinguished by its constant-size proofs, which do not grow with the complexity of the computation being proven, and its extremely fast verification times, making it ideal for on-chain applications. A SNARK protocol involves a prover, who generates a proof demonstrating knowledge of a secret witness for a given computation, and a verifier, who checks the proof's validity. The security of this process relies on a Common Reference String (CRS), a set of public parameters generated during a one-time, multi-party trusted setup ceremony, which produces the proving and verification keys for a specific arithmetic circuit.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Elliptic Curve Cryptography &amp;amp; Pedersen Commitments:&lt;/strong&gt; Standard Elliptic Curve Cryptography (ECC) forms the basis of the digital signature schemes and public-private key pairs used for authentication and authorization throughout the system. We also make extensive use of Pedersen commitments. These commitments are homomorphic, allowing for computations on committed values, and more importantly, they offer perfect (information-theoretic) hiding. This means the commitment reveals absolutely no information about the committed value, a property that is crucial for achieving everlasting privacy, as even future computational breakthroughs cannot compromise the secrecy of past votes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Merkle Trees and Sparse Merkle Trees (SMTs):&lt;/strong&gt; Data integrity and efficient set membership proofs are managed using Merkle trees. Following the model of the baseline Cloak-minster system, standard Merkle trees are employed for proofs of inclusion (e.g., to prove that a voter's identity is part of the official registration roster). In contrast, Sparse Merkle Trees (SMTs) are used for proofs of non-inclusion. This is essential for the spent nullifier list, which prevents double-voting by allowing a voter to prove that their unique nullifier has &lt;em&gt;not&lt;/em&gt; yet been used.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
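&lt;p&gt;To make the inclusion/non-inclusion distinction concrete, the following minimal sketch builds a Merkle tree over a voter roster and verifies a proof of inclusion. SHA-256 stands in for the ZK-friendly Poseidon hash used by the real system, and a power-of-two leaf count is assumed for brevity:&lt;/p&gt;

```python
import hashlib

def h(*parts: bytes) -> bytes:
    """Hash helper; SHA-256 stands in for the ZK-friendly Poseidon hash."""
    return hashlib.sha256(b"".join(parts)).digest()

def merkle_root(leaves):
    """Compute the root of a (power-of-two) Merkle tree."""
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        level = [h(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(leaves, index):
    """Collect the sibling hashes needed to prove inclusion of leaves[index]."""
    level = [h(leaf) for leaf in leaves]
    path = []
    while len(level) > 1:
        sibling = index ^ 1  # the sibling sits next to us at this level
        path.append((level[sibling], sibling < index))
        level = [h(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return path

def verify_inclusion(leaf, path, root):
    """Recompute the root from a leaf and its sibling path."""
    node = h(leaf)
    for sibling, sibling_is_left in path:
        node = h(sibling, node) if sibling_is_left else h(node, sibling)
    return node == root

voters = [b"voter-0", b"voter-1", b"voter-2", b"voter-3"]
root = merkle_root(voters)
proof = merkle_proof(voters, 2)
assert verify_inclusion(b"voter-2", proof, root)
```

&lt;p&gt;The registration tree uses exactly this inclusion pattern; the spent-nullifier SMT inverts it, proving that a recomputed path leads to an &lt;em&gt;empty&lt;/em&gt; leaf.&lt;/p&gt;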

&lt;h4&gt;
  
  
  &lt;strong&gt;Arithmetization-Oriented (AO) Hashing&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The performance of ZK-SNARKs is profoundly dependent on the efficiency of the cryptographic hash functions used within their arithmetic circuits.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Need for ZK-Friendly Hashing:&lt;/strong&gt; Traditional hash functions like SHA-2, which are designed for optimal performance on bit-oriented CPU architectures, are prohibitively expensive to implement within arithmetic circuits defined over large prime fields. The process of "arithmetization"—translating a computation into a system of polynomial equations, such as a Rank-1 Constraint System (R1CS)—reveals that the dominant cost metric is the circuit's multiplicative complexity, i.e., the number of multiplication gates. The high multiplicative complexity of standard hashes results in enormous circuits, leading to impractically long proof generation times and high computational costs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Poseidon:&lt;/strong&gt; To overcome this bottleneck, Arithmetization-Oriented (AO) hash functions were developed. The Cloak-minster system utilizes Poseidon, a widely adopted AO hash function built using a sponge construction. It is specifically designed to have a low number of multiplication gates when expressed as an arithmetic circuit, making it highly efficient for in-circuit operations such as Merkle tree construction and nullifier generation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Advanced AO Primitives (PGV-ELC):&lt;/strong&gt; While Poseidon offers significant efficiency gains, recent cryptographic research has yielded even more optimized primitives. The PGV-ELC mode, a blockcipher-based construction, demonstrates superior performance. When instantiated with an AO-friendly blockcipher like HADES (creating a construction we refer to as "POSEIDON-DM"), it can reduce the number of R1CS constraints by up to 50% and improve native Merkle tree construction time by up to 9x compared to the standard sponge-based Poseidon. This substantial performance gain is achieved by leveraging a more efficient modular design and optimizing for wider-arity Merkle trees. This paper leverages these advanced AO primitives to ensure the computational feasibility of our more complex circuits, particularly the state transition proof.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
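&lt;p&gt;A toy example makes the multiplicative-complexity metric concrete. Poseidon's S-box raises each state element to the fifth power, which costs only three field multiplications (roughly three R1CS constraints) per application, whereas a bit-oriented hash like SHA-256 is commonly cited at tens of thousands of constraints per compression. The modulus below is the widely published BN254 scalar field prime used by Circom/Groth16; treat the specific constant as an assumption of this sketch:&lt;/p&gt;

```python
# BN254 scalar field modulus, the field used by Circom/Groth16 in the
# baseline system (widely published constant; an assumption of this sketch).
P = 21888242871839275222246405745257275088548364400416034343698204186575808495617

def sbox_x5(x):
    """Poseidon-style S-box: x^5 via only 3 field multiplications,
    i.e. roughly 3 R1CS constraints per S-box application."""
    x2 = (x * x) % P    # multiplication gate 1
    x4 = (x2 * x2) % P  # multiplication gate 2
    return (x4 * x) % P # multiplication gate 3

x = 123456789
assert sbox_x5(x) == pow(x, 5, P)
```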

&lt;h4&gt;
  
  
  &lt;strong&gt;The Solana Programming Model&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Our system is architected for deployment on the Solana blockchain, capitalizing on its unique high-performance characteristics.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Anchor Framework:&lt;/strong&gt; We utilize the Anchor framework, a powerful development suite that significantly simplifies the creation of secure and efficient Solana programs (smart contracts). It provides a Rust-based domain-specific language that abstracts away much of the complexity of the Solana runtime, handling boilerplate for serialization, deserialization, and instruction processing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Key Concepts:&lt;/strong&gt; The architecture is built upon core Solana concepts, including &lt;strong&gt;Program Derived Addresses (PDAs)&lt;/strong&gt;. PDAs are special accounts that can be programmatically controlled, allowing a smart contract to sign for and own other accounts. This enables the creation of unique, deterministically addressable state for each election, managed entirely by the on-chain logic. The system operates within Solana's account model and is designed to be highly efficient with respect to transaction costs, which are measured in computational units.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
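&lt;p&gt;The deterministic addressing of per-election state can be sketched as follows. The hash layout mirrors Solana's published PDA derivation (seeds, a bump byte, the program id, and a domain-separation marker), but the runtime's rejection of candidates that lie on the ed25519 curve is omitted, and the seed values are hypothetical:&lt;/p&gt;

```python
import hashlib

def derive_pda_candidate(seeds, bump, program_id):
    """Mimic Solana's PDA derivation: sha256 over the seeds, a bump byte,
    the program id, and a domain-separation marker. The real runtime also
    rejects candidates that lie on the ed25519 curve; that check is
    omitted in this sketch."""
    data = b"".join(seeds) + bytes([bump]) + program_id + b"ProgramDerivedAddress"
    return hashlib.sha256(data).digest()

program_id = bytes(32)                  # placeholder 32-byte program id
seeds = [b"election", b"2025-general"]  # hypothetical per-election seeds
pda = derive_pda_candidate(seeds, 255, program_id)
assert len(pda) == 32
# Same seeds always yield the same address: deterministic election state.
assert pda == derive_pda_candidate(seeds, 255, program_id)
```

&lt;p&gt;Because the derivation is deterministic, any client can locate an election's &lt;code&gt;Election&lt;/code&gt; account from its seeds alone, while only the owning program can sign for it.&lt;/p&gt;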

&lt;h4&gt;
  
  
  &lt;strong&gt;Formal Definitions of Advanced Voting Concepts&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Instant Runoff Voting (IRV):&lt;/strong&gt; As formally defined in related work, IRV is a ranked-choice voting system where voters rank candidates in order of preference. If no single candidate secures an absolute majority of first-preference votes, the candidate with the fewest first-preference votes is eliminated. The votes cast for the eliminated candidate are then transferred to the next-ranked preference indicated on those ballots. This iterative process of elimination and vote redistribution continues in rounds until one candidate achieves a majority of the remaining votes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Extreme Coercion Resistance:&lt;/strong&gt; This is a strong, formally defined security property designed to model realistic and severe threats. It assumes an adversary can learn &lt;em&gt;all&lt;/em&gt; of a voter's secret keys and can observe all of their interactions with the voting system. The only capability the adversary is assumed to lack is access to a private, untappable communication channel between the voter and a trusted agent, or "hedgehog." A system possessing extreme coercion resistance must provide the voter with an unstoppable and unattributable mechanism to nullify a vote that was cast under duress, even after the fact.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
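&lt;p&gt;The iterative elimination procedure defined above can be modeled directly. The following illustrative sketch (plaintext only, with an arbitrary tie-break; the deployed system performs this over committed ballots) runs IRV rounds until a candidate holds an absolute majority of the remaining votes:&lt;/p&gt;

```python
from collections import Counter

def irv_winner(ballots):
    """Run IRV rounds over ballots given as ranked candidate lists.
    Returns (winner, round_log). Ties are broken arbitrarily here;
    a real election would need an explicit tie-break rule."""
    eliminated = set()
    rounds = []
    while True:
        # Count each ballot for its highest-ranked surviving candidate.
        tally = Counter()
        for ranking in ballots:
            for candidate in ranking:
                if candidate not in eliminated:
                    tally[candidate] += 1
                    break
        rounds.append(dict(tally))
        total = sum(tally.values())
        leader, votes = tally.most_common(1)[0]
        if votes * 2 > total:  # absolute majority of remaining votes
            return leader, rounds
        # Otherwise eliminate the candidate with the fewest first preferences.
        eliminated.add(min(tally, key=tally.get))

ballots = [["A", "B", "C"], ["A", "C", "B"], ["B", "C", "A"],
           ["C", "B", "A"], ["C", "B", "A"]]
winner, rounds = irv_winner(ballots)
# B is eliminated in round 1; its ballot transfers to C, who then wins.
assert winner == "C"
```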




&lt;h3&gt;
  
  
  &lt;strong&gt;3. Architectural Foundation: An Analysis of the Cloak-minster Ballot&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To ground our proposed enhancements in a concrete context, we first conduct a thorough technical analysis of the "Cloak-minster Ballot," a representative baseline ZK-voting system that encapsulates the current state of the art for simple, private elections.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;System Overview&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The &lt;a href="https://github.com/RippnerLabs/cloak-minster-ballot/" rel="noopener noreferrer"&gt;Cloak-minster Ballot&lt;/a&gt; is a decentralized voting platform architected across four distinct, interacting layers. The &lt;strong&gt;Frontend Layer&lt;/strong&gt;, built with the Next.js framework, provides the user interface for voter registration and ballot casting. The &lt;strong&gt;Blockchain Layer&lt;/strong&gt; employs a Solana smart contract, developed using the Anchor framework, to manage the on-chain election state and verify cryptographic proofs. The &lt;strong&gt;Storage Layer&lt;/strong&gt; leverages the InterPlanetary File System (IPFS) for scalable, decentralized, and content-addressed off-chain storage of large data sets, such as the complete lists of voter and spent nullifiers. Finally, the &lt;strong&gt;Zero-Knowledge Layer&lt;/strong&gt;, utilizing Circom for circuit definition and SnarkJS for proof generation, produces the cryptographic proofs that are the cornerstone of the system's privacy and integrity guarantees.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Component Deep Dive&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Election Lifecycle:&lt;/strong&gt; The system enforces a rigidly defined lifecycle managed by on-chain instructions. An administrator initiates the process by calling an instruction to create the election account, which is a Program Derived Address (PDA). The election then enters a registration phase, where instructions for registering voters and updating the registration Merkle root handle the enrollment of anonymous identities. A specific instruction closes the registration period and transitions the election to the voting phase. During this phase, the vote instruction processes incoming ballots. Finally, a concluding instruction closes the election and finalizes the results. These state transitions are strictly controlled by on-chain boolean flags (&lt;code&gt;is_registration_open&lt;/code&gt;, &lt;code&gt;is_voting_open&lt;/code&gt;) stored within the main &lt;code&gt;Election&lt;/code&gt; account data structure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Nullifier Model:&lt;/strong&gt; The dual goals of anonymity and double-vote prevention are achieved through a sophisticated dual-nullifier scheme. During registration, each voter generates a unique &lt;code&gt;identity_nullifier&lt;/code&gt; by executing a specific ZK circuit. This nullifier is added to a Merkle tree of all registered voters, and the root of this tree is stored on-chain. When casting a vote, the voter must provide a ZK proof of membership in this registration tree. To prevent the same voter from casting multiple ballots, they also generate a &lt;code&gt;spent_nullifier&lt;/code&gt; (which is unique to the voter and the specific election). This spent nullifier is checked against a Sparse Merkle Tree (SMT) containing all previously spent nullifiers for that election. If the proof of non-membership in the SMT is valid, the vote is accepted, and the new spent nullifier is added to the tree, preventing its reuse.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
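&lt;p&gt;The dual-nullifier flow can be sketched as below. The exact derivation formulas are not given in the text, so this is an illustrative scheme: SHA-256 stands in for Poseidon, a hash of a voter secret stands in for the circuit-generated nullifiers, and a Python set stands in for the on-chain SMT:&lt;/p&gt;

```python
import hashlib

def poseidon_stub(*parts):
    """SHA-256 stands in for the Poseidon hash used by the real circuits."""
    return hashlib.sha256(b"|".join(parts)).hexdigest()

def identity_nullifier(voter_secret):
    # Registered once; joins the Merkle tree of registered voters.
    return poseidon_stub(b"identity", voter_secret)

def spent_nullifier(voter_secret, election_id):
    # Unique per (voter, election): reveals nothing about the voter,
    # but any second vote in the same election reproduces it exactly.
    return poseidon_stub(b"spent", voter_secret, election_id)

spent = set()  # stand-in for the on-chain Sparse Merkle Tree

def cast_vote(voter_secret, election_id):
    n = spent_nullifier(voter_secret, election_id)
    if n in spent:
        return False  # non-membership proof would fail: double vote rejected
    spent.add(n)
    return True

assert cast_vote(b"alice-secret", b"election-1")
assert not cast_vote(b"alice-secret", b"election-1")  # double vote blocked
assert cast_vote(b"alice-secret", b"election-2")      # new election, new nullifier
```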

&lt;h4&gt;
  
  
  &lt;strong&gt;Critical Evaluation and Research Gaps&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;While the &lt;a href="https://github.com/RippnerLabs/cloak-minster-ballot/" rel="noopener noreferrer"&gt;Cloak-minster&lt;/a&gt; system provides a robust and well-architected foundation for simple, private voting, it suffers from two significant and fundamental limitations that prevent its application in more demanding democratic contexts.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt; The architecture is cryptographically sound, providing strong E2E verifiability and anonymity for plurality elections. Its use of the high-performance Solana blockchain and the decentralized IPFS storage layer makes it both scalable and resilient.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Limitation 1: Lack of Expressiveness:&lt;/strong&gt; The system is fundamentally and architecturally designed for plurality voting. This is most evident in the on-chain data structure of the &lt;code&gt;Election&lt;/code&gt; account, which contains a simple array of big numbers (&lt;code&gt;tallies: BN&lt;/code&gt;) to count the votes for each option. This flat data structure is incapable of representing the ranked preferences required by IRV or handling the complex, multi-round vote transfer logic that is central to its tallying process.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Limitation 2: Lack of Coercion Resistance:&lt;/strong&gt; The system's security model does not address the threat of voter coercion. The cryptographic credential that authorizes a vote, referred to as a "voucher," bundles the Merkle proof of registration and the identity nullifier. An adversary who obtains a voter's secret key can generate this voucher and cast a vote on their behalf. The system possesses no mechanism to detect or remediate this form of attack, rendering it vulnerable under the extreme coercion threat model where such key theft is assumed to be a feasible attack vector. The design of this voucher system crystallizes the vulnerability; it becomes the single point of failure under coercion. An adversary's objective simplifies to controlling the generation or transmission of this single credential. Consequently, any robust defense cannot merely focus on protecting the secrecy of the inputs to the voucher; it must assume the voucher itself can be compromised. This critical insight directly motivates the introduction of a separate, uncoercible action that can be taken &lt;em&gt;after&lt;/em&gt; a coerced vote has been cast, which is precisely the role of the post-facto nullification mechanism facilitated by the out-of-band hedgehog model.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;4. A Novel Architecture for Coercion-Resistant IRV&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To address the identified limitations of existing systems, we propose a new architecture that extends the Cloak-minster baseline. Our design introduces new on-chain state representations, new ZK circuits, and a modified election lifecycle to natively support both the expressive power of IRV and the robust security of the hedgehog/nullification mechanism for extreme coercion resistance.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Modified State Representation&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;IRV Ballot Matrix:&lt;/strong&gt; The fundamental concept of a vote is redefined. Instead of representing a single choice, a voter's ballot is encoded as an encrypted n×n permutation matrix, where n is the number of candidates. In this matrix, each row corresponds to a preference level (1st, 2nd, etc.), and each column corresponds to a specific candidate. A '1' at position (i,j) signifies that candidate j is the voter's i-th preference. To manage on-chain costs and maintain scalability, this matrix is stored off-chain on IPFS, with only its cryptographic commitment being recorded on the Solana blockchain.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;On-Chain State Modifications (Anchor):&lt;/strong&gt; The &lt;code&gt;Election&lt;/code&gt; account PDA, managed by the on-chain Anchor program, is extended with several new fields to handle the increased complexity of IRV and the nullification mechanism:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;election_type: ElectionType&lt;/code&gt;: An enum to differentiate between &lt;code&gt;Plurality&lt;/code&gt; and &lt;code&gt;IRV&lt;/code&gt; election types, allowing the contract to enforce the correct logic.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;current_round: u8&lt;/code&gt;: A counter to track the current round of the IRV tallying process.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;eliminated_candidates: Vec&amp;lt;PublicKey&amp;gt;&lt;/code&gt;: A dynamically sized list that stores the public keys of candidates who have been eliminated in previous rounds.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;round_tallies: Vec&amp;lt;Vec&amp;lt;BN&amp;gt;&amp;gt;&lt;/code&gt;: A nested vector that stores the vote tallies for each candidate in each round of the election.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;nullified_ballots: MerkleTree&lt;/code&gt;: A new on-chain Merkle tree root to immutably and verifiably track the identities of voters whose ballots have been nullified.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Voter Identity for Nullification:&lt;/strong&gt; To enable the hedgehog mechanism, the voter registration process is adapted from the Votexx model. Each voter is now required to generate and register two distinct public-private key pairs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;pk_vote&lt;/code&gt;: The public key used to authorize the submission of the voter's ballot.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;pk_nullify&lt;/code&gt;: The public key whose corresponding secret key, &lt;code&gt;sk_nullify&lt;/code&gt;, is required to authorize the nullification of the vote. This "opposite key" dynamic creates a crucial separation of concerns: one key is used for the act of voting, while the other, which can be securely shared with a trusted hedgehog agent, is reserved for remediation in the event of coercion.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
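&lt;p&gt;Under the convention above (row i = i-th preference, column j = candidate j), encoding a ranked ballot as a permutation matrix is straightforward. A minimal plaintext sketch, before encryption and commitment:&lt;/p&gt;

```python
def ballot_matrix(ranking, candidates):
    """Encode a ranked ballot as an n x n permutation matrix:
    row i = i-th preference, column j = candidate j."""
    n = len(candidates)
    col = {c: j for j, c in enumerate(candidates)}
    matrix = [[0] * n for _ in range(n)]
    for pref, candidate in enumerate(ranking):
        matrix[pref][col[candidate]] = 1
    return matrix

candidates = ["A", "B", "C"]
m = ballot_matrix(["B", "C", "A"], candidates)
# B is 1st preference, C is 2nd, A is 3rd:
assert m == [[0, 1, 0],
             [0, 0, 1],
             [1, 0, 0]]
```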

&lt;h4&gt;
  
  
  &lt;strong&gt;The Complete Election Lifecycle (Modified)&lt;/strong&gt;
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Election Initialization:&lt;/strong&gt; The instruction to initialize an election is updated to accept an &lt;code&gt;election_type&lt;/code&gt; parameter and to initialize the new IRV-specific state variables on the &lt;code&gt;Election&lt;/code&gt; account.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Voter Registration:&lt;/strong&gt; The voter registration process is modified. A voter now commits to both &lt;code&gt;pk_vote&lt;/code&gt; and &lt;code&gt;pk_nullify&lt;/code&gt; on-chain, establishing a verifiable and unbreakable cryptographic link between their voting identity and their nullification identity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ballot Casting:&lt;/strong&gt; Voters generate their IRV ballot matrix, encrypt it, and then generate a ZK proof (&lt;code&gt;P_IRV_WF&lt;/code&gt;) demonstrating that the encrypted data corresponds to a valid permutation matrix. This proof, along with the commitment to the ballot, is submitted in a single transaction to the blockchain.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;IRV Tallying and Elimination Rounds:&lt;/strong&gt; This new phase is managed by a designated but fully verifiable tallying authority (e.g., a DAO or a set of decentralized trustees). It proceeds in discrete rounds:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;*   **Tally Round:** The authority tallies the first-preference votes from the current set of active ballot matrices.

*   **Eliminate Candidate:** The candidate with the fewest votes is identified, and their public key is added to the on-chain `eliminated_candidates` list.

*   **Verifiable Shift:** This is the most cryptographically intensive and critical step. The authority executes the shifting procedure described in the Camel protocol for all active ballots, which effectively removes the eliminated candidate and promotes lower-ranked candidates on each ballot. To ensure verifiability without incurring prohibitive on-chain costs, the authority computes this global state transition off-chain and generates a single, comprehensive ZK proof (`P_Shift`). This proof, which attests to the correctness of the entire round's state transition, is then posted to and verified by the Solana smart contract. This process repeats until a candidate achieves a majority.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol start="5"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Nullification Phase:&lt;/strong&gt; A dedicated time window, typically opened after voting closes but before the final tally is certified, is designated for nullification. During this phase, hedgehogs (or the voters themselves) can submit a nullification transaction. This transaction contains a ZK proof (&lt;code&gt;P_Nullify&lt;/code&gt;) of knowledge of a secret key &lt;code&gt;sk_nullify&lt;/code&gt; that corresponds to a registered voter. Upon successful on-chain verification of this proof, the voter's identity is added to the &lt;code&gt;nullified_ballots&lt;/code&gt; Merkle tree.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Finalization:&lt;/strong&gt; The final tally calculation is performed by the authority, which must discount any ballots cast by voters whose identities appear in the &lt;code&gt;nullified_ballots&lt;/code&gt; tree. The final, verified result is then published.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The verifiable shift mechanism is a cornerstone of this architecture's practicality. A naive implementation where each voter's ballot matrix is shifted and verified in a separate on-chain transaction would be computationally and financially infeasible due to the immense aggregate costs. The architectural decision to delegate the global state transition to a centralized-but-verifiable authority, who then proves the correctness of the entire operation with a single ZK-SNARK, is what makes the system viable. The state of all ballots can be represented by a single Merkle root of their commitments. The IRV round transition is thus a deterministic function: &lt;code&gt;(Old_State_Root, Eliminated_Candidate) -&amp;gt; New_State_Root&lt;/code&gt;. This entire function can be encoded into the &lt;code&gt;P_Shift&lt;/code&gt; circuit. While this proof is large and computationally intensive to &lt;em&gt;generate&lt;/em&gt; off-chain, it is constant-size and efficient to &lt;em&gt;verify&lt;/em&gt; on-chain, perfectly aligning with the ideal model for blockchain scalability.&lt;/p&gt;
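&lt;p&gt;A minimal sketch of this deterministic transition function follows. A flat SHA-256 hash chain stands in for the Merkle tree of commitments, the hypothetical &lt;code&gt;toy_commit&lt;/code&gt; stands in for the real Pedersen commitment, and ballots are plain rank lists rather than encrypted matrices:&lt;/p&gt;

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def toy_commit(ballot) -> bytes:
    """Hypothetical stand-in for the Pedersen commitment to a ballot."""
    return h(repr(ballot).encode())

def root_of(commitments) -> bytes:
    """Fold ballot commitments into one state root. A flat hash chain
    stands in for the Merkle tree of commitments used on-chain."""
    acc = b"\x00" * 32
    for c in commitments:
        acc = h(acc + c)
    return acc

def round_transition(ballots, eliminated):
    """The deterministic function that P_Shift attests to:
    (state at round m-1, eliminated candidate) -> state for round m."""
    new_ballots = [[c for c in ranking if c != eliminated] for ranking in ballots]
    return new_ballots, root_of([toy_commit(b) for b in new_ballots])

ballots = [["A", "B", "C"], ["B", "A", "C"]]
after, new_root = round_transition(ballots, "B")
assert after == [["A", "C"], ["A", "C"]]
# Anyone re-running the transition reproduces the same root, which is
# exactly what the on-chain verifier checks via the P_Shift proof.
assert new_root == round_transition(ballots, "B")[1]
```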




&lt;h3&gt;
  
  
  &lt;strong&gt;5. Core Cryptographic Protocols and ZK Circuits&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This section provides the detailed technical specifications for the three new zero-knowledge circuits that form the cryptographic core of our proposed system. These circuits are designed for implementation in a framework like Circom, which is compatible with the Groth16 proving system.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;P_IRV_WF&lt;/code&gt;: Proof of Well-Formed IRV Ballot
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Purpose:&lt;/strong&gt; To allow a voter to prove that their submitted encrypted ballot corresponds to a valid n×n permutation matrix without revealing their ranked preferences. This protocol adapts the well-formedness proof from the Camel protocol.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Public Inputs:&lt;/strong&gt; A cryptographic commitment to the voter's ballot matrix.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Private Inputs:&lt;/strong&gt; The plaintext n×n ballot matrix V_k, and the randomness used to generate the commitment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Circuit Logic:&lt;/strong&gt; The circuit enforces the two fundamental mathematical properties of a permutation matrix:&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1.  &lt;strong&gt;Binary Constraint:&lt;/strong&gt; Each element (vij​)k​ in the matrix must be either 0 or 1. This is enforced within the arithmetic circuit with the constraint vij​⋅(vij​−1)=0 for all matrix elements i,j.

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Sum Constraint:&lt;/strong&gt; The sum of the elements in each row and each column must be exactly 1. This ensures that each preference rank is assigned to exactly one candidate, and each candidate is assigned exactly one preference rank, preventing invalid ballots.
&lt;/li&gt;
&lt;/ol&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
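&lt;p&gt;The two constraints can be checked in the clear with a few lines; the circuit enforces the same algebra over committed values. An illustrative sketch:&lt;/p&gt;

```python
def is_well_formed(matrix):
    """Check the two constraints the P_IRV_WF circuit enforces on an
    n x n ballot matrix: every entry is binary, and every row and every
    column sums to exactly 1 (i.e. it is a permutation matrix)."""
    n = len(matrix)
    for row in matrix:
        for v in row:
            if v * (v - 1) != 0:  # binary constraint: v in {0, 1}
                return False
    rows_ok = all(sum(row) == 1 for row in matrix)
    cols_ok = all(sum(matrix[i][j] for i in range(n)) == 1 for j in range(n))
    return rows_ok and cols_ok

assert is_well_formed([[0, 1], [1, 0]])
assert not is_well_formed([[1, 1], [0, 0]])  # ranks two candidates first
assert not is_well_formed([[2, 0], [0, 1]])  # non-binary entry
```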
&lt;h4&gt;


&lt;code&gt;P_Shift&lt;/code&gt;: Proof of Correct Elimination Round
&lt;/h4&gt;


&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Purpose:&lt;/strong&gt; To enable the tallying authority to generate a single, verifiable proof that it correctly transitioned the global state of all ballots from one IRV round to the next by correctly applying the elimination and shifting logic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Public Inputs:&lt;/strong&gt; The Merkle root of all ballot commitments from round m−1, the public key of the candidate α(m−1) eliminated in that round, and the new Merkle root of ballot commitments for round m.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Private Inputs:&lt;/strong&gt; The complete set of plaintext ballot matrices from round m−1, along with their corresponding Merkle proofs of inclusion in the old state root.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Circuit Logic:&lt;/strong&gt; This is the most complex circuit in the system. Its logic can be deconstructed as follows:&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1.  For each ballot in the set, the circuit first verifies its Merkle proof against the old state root.

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;It then simulates the shifting procedure from the Camel protocol. It identifies the row corresponding to the eliminated candidate α(m−1) and computationally "removes" it, shifting all subsequent rows up by one position.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It computes a new cryptographic commitment to the newly formed ballot matrix for round m.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Finally, it uses all the new ballot commitments to compute the new global Merkle root and asserts that this computed root matches the public input root for round m.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
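&lt;p&gt;The per-ballot shifting step can be sketched in plaintext. The version below follows one consistent reading under the row = preference, column = candidate convention from Section 4: the eliminated candidate's column is zeroed, the preference row that pointed at it is removed, lower rows shift up, and a zero row pads the bottom; the Camel protocol fixes the exact convention:&lt;/p&gt;

```python
def shift_ballot(matrix, eliminated_col):
    """Apply one elimination round to a ballot matrix (rows = preference
    levels, columns = candidates). One consistent reading of the Camel
    shifting step: zero the eliminated candidate's column, drop the
    now-empty preference row, shift lower rows up, pad with zero rows."""
    n = len(matrix)
    kept = []
    for row in matrix:
        row = row[:eliminated_col] + [0] + row[eliminated_col + 1:]
        if any(row):  # drop the preference row that pointed at the loser
            kept.append(row)
    return kept + [[0] * n] * (n - len(kept))

# Ballot ranking B > C > A; eliminate B (column 1):
before = [[0, 1, 0],
          [0, 0, 1],
          [1, 0, 0]]
after = shift_ballot(before, 1)
# C is promoted to 1st preference, A to 2nd:
assert after == [[0, 0, 1],
                 [1, 0, 0],
                 [0, 0, 0]]
```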
&lt;h4&gt;
  
  
  &lt;code&gt;P_Nullify&lt;/code&gt;: Proof of Nullification Authority
&lt;/h4&gt;


&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Purpose:&lt;/strong&gt; To allow a hedgehog agent to anonymously prove they possess the authority to nullify a specific voter's ballot, without revealing which voter they are acting for. This protocol adapts concepts from Votexx (proof of key ownership) and linkable ring signatures (anonymous proofs of set membership).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Public Inputs:&lt;/strong&gt; The public key of the voter whose ballot is being nullified (&lt;code&gt;pk_vote&lt;/code&gt;), and the Merkle root of all registered nullification keys (&lt;code&gt;pk_nullify&lt;/code&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Private Inputs:&lt;/strong&gt; The secret nullification key &lt;code&gt;sk_nullify&lt;/code&gt;, and the Merkle proof demonstrating that the corresponding &lt;code&gt;pk_nullify&lt;/code&gt; is part of the registered set of nullification keys.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Circuit Logic:&lt;/strong&gt; The circuit performs three critical cryptographic checks:&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1.  &lt;strong&gt;Membership Verification:&lt;/strong&gt; It takes the public &lt;code&gt;pk_nullify&lt;/code&gt; (derived from the private &lt;code&gt;sk_nullify&lt;/code&gt;) and the private Merkle proof, and verifies the proof against the public Merkle root of the nullification key roster.

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Key Ownership:&lt;/strong&gt; It performs the elliptic curve scalar multiplication inside the circuit (pknullify​=sknullify​⋅G) to derive the public key from the private key and asserts that it matches the public key used in the membership proof.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Linkage Verification:&lt;/strong&gt; It verifies the cryptographic link between &lt;code&gt;pk_vote&lt;/code&gt; and &lt;code&gt;pk_nullify&lt;/code&gt; that was established and committed to during the registration phase. This ensures that the key being used for nullification is legitimately authorized to act on behalf of the key that cast the original vote.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
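&lt;p&gt;As a sketch of those three checks, the following Python mock-up replaces the in-circuit primitives with toy stand-ins: a multiplicative group instead of an elliptic curve, SHA-256 instead of an AO hash, and a hash-based linkage commitment. The names and parameters (&lt;code&gt;nullify_circuit&lt;/code&gt;, &lt;code&gt;G&lt;/code&gt;, &lt;code&gt;P&lt;/code&gt;, the commitment layout) are illustrative assumptions, not the real circuit.&lt;/p&gt;

```python
import hashlib

def H(*parts: bytes) -> bytes:
    """SHA-256 stand-in for the circuit's AO hash (e.g. Poseidon)."""
    return hashlib.sha256(b"".join(parts)).digest()

# Toy multiplicative group standing in for elliptic curve scalar multiplication.
P = 2**127 - 1   # illustrative prime modulus
G = 5            # illustrative generator

def derive_pk(sk: int) -> int:
    return pow(G, sk, P)

def merkle_root(leaves):
    # Assumes a power-of-two number of leaves for brevity.
    level = list(leaves)
    while len(level) > 1:
        level = [H(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_verify(leaf, proof, root):
    node = leaf
    for sibling, sibling_on_left in proof:
        node = H(sibling, node) if sibling_on_left else H(node, sibling)
    return node == root

def nullify_circuit(pk_vote, root, linkage_commitment, sk_nullify, merkle_proof):
    # 1. Key ownership: derive pk_nullify from the private key inside the circuit.
    pk_nullify = derive_pk(sk_nullify)
    leaf = H(pk_nullify.to_bytes(16, "big"))
    # 2. Membership: the derived key's leaf must sit under the public Merkle root.
    assert merkle_verify(leaf, merkle_proof, root), "not in nullification roster"
    # 3. Linkage: the registration-time commitment must bind pk_vote to pk_nullify.
    expected = H(pk_vote.to_bytes(16, "big"), pk_nullify.to_bytes(16, "big"))
    assert expected == linkage_commitment, "key not authorized for this voter"
    return True
```

&lt;p&gt;In the actual SNARK, each &lt;code&gt;assert&lt;/code&gt; becomes a constraint set, while &lt;code&gt;sk_nullify&lt;/code&gt; and the Merkle path remain private witnesses.&lt;/p&gt;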
&lt;h4&gt;
  
  
  &lt;strong&gt;Table: Cryptographic Primitive Mapping&lt;/strong&gt;
&lt;/h4&gt;


&lt;p&gt;The following table maps the high-level concepts from the source research to their concrete implementations in our proposed system, demonstrating the synthesis of these ideas.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;colgroup&gt;
&lt;col&gt;
&lt;col&gt;
&lt;col&gt;
&lt;col&gt;
&lt;/colgroup&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td colspan="1" rowspan="1"&gt;&lt;p&gt;System Component&lt;/p&gt;&lt;/td&gt;
&lt;td colspan="1" rowspan="1"&gt;&lt;p&gt;Core Concept&lt;/p&gt;&lt;/td&gt;
&lt;td colspan="1" rowspan="1"&gt;&lt;p&gt;Source Paper&lt;/p&gt;&lt;/td&gt;
&lt;td colspan="1" rowspan="1"&gt;&lt;p&gt;Proposed Implementation&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td colspan="1" rowspan="1"&gt;&lt;p&gt;&lt;strong&gt;Ballot Structure&lt;/strong&gt;&lt;/p&gt;&lt;/td&gt;
&lt;td colspan="1" rowspan="1"&gt;&lt;p&gt;Matrix-based IRV Ballot&lt;/p&gt;&lt;/td&gt;
&lt;td colspan="1" rowspan="1"&gt;&lt;p&gt;Camel&lt;/p&gt;&lt;/td&gt;
&lt;td colspan="1" rowspan="1"&gt;&lt;p&gt;Encrypted n×n permutation matrix stored on IPFS, with its commitment stored on-chain.&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td colspan="1" rowspan="1"&gt;&lt;p&gt;&lt;strong&gt;IRV Round Logic&lt;/strong&gt;&lt;/p&gt;&lt;/td&gt;
&lt;td colspan="1" rowspan="1"&gt;&lt;p&gt;Verifiable Shifting&lt;/p&gt;&lt;/td&gt;
&lt;td colspan="1" rowspan="1"&gt;&lt;p&gt;Camel&lt;/p&gt;&lt;/td&gt;
&lt;td colspan="1" rowspan="1"&gt;&lt;p&gt;&lt;code&gt;P_Shift&lt;/code&gt; ZK circuit proving correct global state transition between rounds.&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td colspan="1" rowspan="1"&gt;&lt;p&gt;&lt;strong&gt;Coercion Resistance&lt;/strong&gt;&lt;/p&gt;&lt;/td&gt;
&lt;td colspan="1" rowspan="1"&gt;&lt;p&gt;Hedgehog/Nullification&lt;/p&gt;&lt;/td&gt;
&lt;td colspan="1" rowspan="1"&gt;&lt;p&gt;Votexx&lt;/p&gt;&lt;/td&gt;
&lt;td colspan="1" rowspan="1"&gt;&lt;p&gt;&lt;code&gt;P_Nullify&lt;/code&gt; ZK circuit proving knowledge of &lt;code&gt;sk_nullify&lt;/code&gt; for anonymous nullification.&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td colspan="1" rowspan="1"&gt;&lt;p&gt;&lt;strong&gt;ZK Optimization&lt;/strong&gt;&lt;/p&gt;&lt;/td&gt;
&lt;td colspan="1" rowspan="1"&gt;&lt;p&gt;Arithmetization-Oriented Hashing&lt;/p&gt;&lt;/td&gt;
&lt;td colspan="1" rowspan="1"&gt;&lt;/td&gt;
&lt;td colspan="1" rowspan="1"&gt;&lt;p&gt;Replace Poseidon with the more efficient POSEIDON-DM (HADES in PGV-ELC mode) in all circuits to reduce constraints.&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td colspan="1" rowspan="1"&gt;&lt;p&gt;&lt;strong&gt;Privacy Model&lt;/strong&gt;&lt;/p&gt;&lt;/td&gt;
&lt;td colspan="1" rowspan="1"&gt;&lt;p&gt;Everlasting Privacy&lt;/p&gt;&lt;/td&gt;
&lt;td colspan="1" rowspan="1"&gt;&lt;/td&gt;
&lt;td colspan="1" rowspan="1"&gt;&lt;p&gt;Use perfectly hiding Pedersen commitments for the public bulletin board, keeping computationally secure ElGamal ciphertexts private to the tallying authority.&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td colspan="1" rowspan="1"&gt;&lt;p&gt;&lt;strong&gt;General Compliance&lt;/strong&gt;&lt;/p&gt;&lt;/td&gt;
&lt;td colspan="1" rowspan="1"&gt;&lt;p&gt;Polynomial Consistency Checks&lt;/p&gt;&lt;/td&gt;
&lt;td colspan="1" rowspan="1"&gt;&lt;p&gt;CONAN&lt;/p&gt;&lt;/td&gt;
&lt;td colspan="1" rowspan="1"&gt;&lt;p&gt;(Future Work) A framework for extending the system beyond IRV to other complex voting rules&lt;/p&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;6. Security and Performance Analysis&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Security Guarantees&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Verifiability:&lt;/strong&gt; The system achieves robust end-to-end verifiability. The integrity of the entire election is anchored to the public bulletin board on the Solana blockchain. Any external observer can download the &lt;code&gt;P_Shift&lt;/code&gt; proof for each IRV round and independently verify it, confirming that the tallying process was executed precisely according to the specified rules. Similarly, the &lt;code&gt;nullified_ballots&lt;/code&gt; Merkle root provides a transparent, immutable, and verifiable record of all discounted votes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ballot Privacy:&lt;/strong&gt; Ballot secrecy is computational, relying on the hardness of the discrete logarithm problem for the ElGamal encryption scheme and the zero-knowledge property of the SNARKs. To elevate this guarantee to &lt;em&gt;everlasting privacy&lt;/em&gt;, we adopt the model proposed by Pointcheval, which makes a critical distinction between public and private data. The public bulletin board would only store perfectly hiding commitments (e.g., Pedersen commitments) of the ballots. The computationally secure ElGamal ciphertexts, which are needed for the homomorphic tally, would be shared privately with the tallying authority and can be securely deleted after the election. This ensures that even a future break in the encryption scheme (e.g., due to quantum computers) would not compromise the privacy of past votes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Extreme Coercion Resistance:&lt;/strong&gt; The system provides demonstrable resistance against the extreme coercion model. A formal security argument, following the Universal Composability (UC) framework utilized in the Votexx paper, can be constructed. The security of the mechanism hinges on the core assumption of an untappable channel between the voter and their hedgehog, which exists outside the blockchain protocol itself. The protocol's role is to ensure that once a hedgehog is activated through this external channel, their nullification action on the blockchain is unstoppable, irrevocable, and anonymous. An adversary who knows both &lt;code&gt;sk_vote&lt;/code&gt; and &lt;code&gt;sk_nullify&lt;/code&gt; can cast a vote but cannot prevent the hedgehog from nullifying it, as they cannot block the external communication channel or forge a nullification proof for a different voter without the correct secret key.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
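&lt;p&gt;To make the everlasting-privacy argument concrete, here is a minimal Pedersen commitment sketch over a toy prime-order group. The modulus and generators are illustrative placeholders, not a secure instantiation; it checks the additive homomorphism that lets a public bulletin board aggregate perfectly hiding commitments without ever opening them.&lt;/p&gt;

```python
import secrets

# Toy Pedersen commitment C(m, r) = G^m * H^r mod P. The parameters below are
# illustrative stand-ins for a real elliptic-curve group, not secure choices.
P = 2**127 - 1          # toy prime modulus
Q = P - 1               # order of the multiplicative group
G, H = 3, 7             # generators with (assumed) unknown discrete-log relation

def commit(m: int, r: int) -> int:
    return (pow(G, m, P) * pow(H, r, P)) % P

# Homomorphism: the product of two commitments commits to the sum of the
# messages, so tallies can be checked against the board without decryption.
m1, m2 = 5, 9
r1, r2 = secrets.randbelow(Q), secrets.randbelow(Q)
assert (commit(m1, r1) * commit(m2, r2)) % P == commit(m1 + m2, (r1 + r2) % Q)
```

&lt;p&gt;Because the commitment is perfectly hiding, even an adversary with unbounded computation learns nothing about &lt;code&gt;m&lt;/code&gt; from the board; only the privately held ElGamal ciphertexts are computationally protected, and those can be deleted after the tally.&lt;/p&gt;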

&lt;h4&gt;
  
  
  &lt;strong&gt;Performance Analysis and Optimization&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Constraint Complexity:&lt;/strong&gt; The asymptotic R1CS constraint complexity of the new circuits is a key performance indicator for ZK-proof generation time.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;P_IRV_WF&lt;/code&gt;: The complexity is dominated by the row and column sum checks, resulting in O(n²) constraints for an n-candidate election.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;P_Shift&lt;/code&gt;: This is the largest and most computationally intensive circuit. Its complexity is approximately O(N·n²), where N is the number of voters, as it must process every ballot in the state transition.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;P_Nullify&lt;/code&gt;: The complexity is dominated by the Merkle proof verification, resulting in O(log R) constraints, where R is the total number of registered voters.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Impact of AO Hashing:&lt;/strong&gt; The use of optimized AO hashing is critical for making the &lt;code&gt;P_Shift&lt;/code&gt; circuit practical. Based on the performance analysis in the source research, replacing a standard sponge-based Poseidon with a more efficient primitive like POSEIDON-DM can reduce the number of constraints in the hashing-intensive parts of the circuit by up to 50%. For a large, complex circuit like &lt;code&gt;P_Shift&lt;/code&gt;, where Merkle roots are repeatedly computed, this could translate into an overall proof generation time reduction of 20-30% or more.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Solana Transaction Costs:&lt;/strong&gt; The on-chain costs are primarily driven by data storage and proof verification.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Voter Transactions:&lt;/strong&gt; Casting a ballot involves storing a commitment and verifying the &lt;code&gt;P_IRV_WF&lt;/code&gt; proof, which is a relatively low-cost transaction.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Authority Transactions:&lt;/strong&gt; The most expensive on-chain operations are performed by the tallying authority (posting the &lt;code&gt;P_Shift&lt;/code&gt; proof for each IRV round) and by hedgehogs (submitting &lt;code&gt;P_Nullify&lt;/code&gt; proofs). The cost is dominated by the on-chain verification of the Groth16 proofs, which, while efficient, still consumes computational resources.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Parallelism:&lt;/strong&gt; Solana's Sealevel architecture, which allows for the parallel processing of non-overlapping transactions, is highly advantageous for the nullification phase. A large number of hedgehogs can submit their nullification proofs simultaneously, and the network can process them in parallel, preventing a potential bottleneck and ensuring the system remains responsive even under high load.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
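&lt;p&gt;To make these asymptotics tangible, here is a back-of-envelope constraint estimate. The per-primitive &lt;code&gt;*_COST&lt;/code&gt; figures are assumed placeholders, not benchmarks, and the election sizes are arbitrary examples.&lt;/p&gt;

```python
import math

# Back-of-envelope R1CS constraint counts for the asymptotics above.
# All per-primitive costs are assumed placeholders, not measured figures.
HASH_COST = 300        # constraints per AO hash invocation (assumed)
EC_MUL_COST = 3_000    # constraints per in-circuit scalar multiplication (assumed)
CELL_COST = 10         # constraints per matrix-cell check (assumed)

n, N, R = 5, 10_000, 2**20   # candidates, voters, registered nullification keys

p_irv_wf  = CELL_COST * n * n                            # O(n^2) row/column checks
p_shift   = CELL_COST * N * n * n + 2 * N * HASH_COST    # O(N*n^2) plus Merkle rebuild
p_nullify = EC_MUL_COST + HASH_COST * int(math.log2(R))  # O(log R) Merkle path

# Halving the per-hash cost (the AO-hashing optimization) shrinks the
# hash-bound share of P_Shift:
p_shift_fast = CELL_COST * N * n * n + 2 * N * (HASH_COST // 2)
saving = 1 - p_shift_fast / p_shift

print(p_irv_wf, p_shift, p_nullify, round(saving, 2))
```

&lt;p&gt;Under these assumed costs, halving the hash cost removes roughly a third of the &lt;code&gt;P_Shift&lt;/code&gt; constraints, consistent with the 20-30%+ savings estimated above, while &lt;code&gt;P_Nullify&lt;/code&gt; stays small enough for a hedgehog to prove on commodity hardware.&lt;/p&gt;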




&lt;h3&gt;
  
  
  &lt;strong&gt;7. Conclusion and Future Work&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Summary of Contributions&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;We have presented the complete and detailed design for a novel electronic voting system that advances the state of the art by simultaneously providing expressive voting capabilities (Instant Runoff Voting) and robust security against extreme coercion. By synthesizing and extending concepts from multiple leading research papers (matrix-based ballots from the Camel protocol, the hedgehog/nullification model from Votexx, and performance optimizations from Arithmetization-Oriented hashing research), and by grounding them in a concrete, high-performance architecture on the Solana blockchain, we have demonstrated a practical path toward more secure, flexible, and democratic online elections. Our work bridges the critical gap between theoretical cryptographic possibilities and practical, deployable systems ready for real-world application.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Future Work&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Generalized Compliance (CONAN):&lt;/strong&gt; The IRV matrix representation is a specific method for enforcing a complex voting rule. A promising direction for future work is to generalize this framework to support a wider range of rules. By incorporating the polynomial-based consistency checks and decoy-term methods from the CONAN protocol, the system could be transformed into a generic "verifiable compliance engine." This would allow it to support a diverse array of voting schemes, such as quadratic voting or Condorcet methods, by simply defining a new compliance predicate circuit, thereby making the platform significantly more versatile.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Decentralized Tallying Authority:&lt;/strong&gt; The current model relies on a designated, albeit fully verifiable, tallying authority to execute the IRV rounds. A significant step towards complete decentralization would be to distribute this role among a network of nodes that collaboratively compute the round transitions and the &lt;code&gt;P_Shift&lt;/code&gt; proof using secure Multi-Party Computation (MPC) protocols.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Post-Quantum Security:&lt;/strong&gt; The system, as currently designed, relies on elliptic curve cryptography, which is known to be vulnerable to attacks from large-scale quantum computers. A critical avenue for future research is to investigate and integrate post-quantum cryptographic primitives. This would involve replacing the current signatures, commitments, and SNARKs with post-quantum alternatives, such as lattice-based blind signatures, to ensure the long-term security and viability of the voting system in a post-quantum world.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>solana</category>
      <category>zk</category>
      <category>blockchain</category>
      <category>web3</category>
    </item>
  </channel>
</rss>
