I built an AI agent that proves a crypto wallet is hacked — on Qwen Cloud

#ai #security #blockchain #qwen

Building ChainSleuth for the Global AI Hackathon Series with Qwen Cloud (Track 4: Autopilot Agent).

There's a single mistake a crypto wallet can make that hands its private key to the entire world. Not a leaked seed phrase, not a phishing link — a subtle cryptographic slip called nonce reuse. When it happens, anyone looking at the public blockchain can recover the key with a few lines of arithmetic.

It broke the PlayStation 3's code-signing in 2010. It drained Android Bitcoin wallets in 2013. And it still shows up on-chain today.

I wanted to build an AI agent that hunts for it — and, crucially, one you can actually trust. That became ChainSleuth, and I built it on Qwen Cloud. Here's the journey.

The flaw, in one paragraph

Every ECDSA signature is a pair (r, s), where r depends only on a random number k (the "nonce") chosen per signature. If a wallet ever signs two different transactions with the same k, both signatures get the same r — a glaring fingerprint. From two such signatures you can solve two equations for two unknowns and pop out the private key. One reused nonce, and the wallet is gone.

The hard part isn't the math — it's trust

Here's the problem with pointing an LLM at security: LLMs hallucinate. The last thing the world needs is an "AI security tool" that confidently declares a wallet compromised when it isn't. A forensic verdict has to be true, not plausible.

So I built ChainSleuth around one non-negotiable rule:

The LLM proposes and explains. Deterministic math proves.

The architecture is a five-agent pipeline:

Planner → Collector → SignatureAnalyst → CryptoVerifier → Reporter
 (Qwen)               (Qwen triage)     (pure math)      (Qwen)

The Planner, Analyst, and Reporter are Qwen agents — they scope the audit, triage the signatures, and write the human-readable forensic report. But the CryptoVerifier has no LLM in it at all. It runs the real secp256k1 recovery and then verifies the recovered key against the on-chain signatures. A finding is only ever reported if the key mathematically checks out. An agent's narrative can never become the verdict.

That single design decision is what makes the output credible instead of a confident guess.

Building the agents on Qwen Cloud

The agents run on qwen-max via Alibaba Cloud Model Studio (DashScope). A few things that stood out:

The OpenAI-compatible endpoint made wiring trivial. I didn't even need an SDK — a plain stdlib HTTP call to the DashScope compatible-mode endpoint was enough to get the whole agent loop running. Flip one env var and the mock pipeline became a live one.
qwen-max is genuinely good at structured reasoning. The Planner produces concrete, well-organized audit plans, and the Reporter writes incident reports with severity and remediation that read like a human analyst wrote them — not generic filler.
It's all Alibaba Cloud. Inference runs on Alibaba Cloud Model Studio; reports archive to Alibaba Cloud OSS.

The best part: because the crypto is deterministic, I developed the entire pipeline offline in a free "mock" mode and only flipped to live Qwen at the end. The findings were identical — Qwen just added the reasoning and the prose on top.

Watching it work on the real chain

A demo on a planted sample is one thing. So I pointed it at live Ethereum. ChainSleuth pulled real transactions, reconstructed each signing hash with pure-Python keccak-256 + RLP decoding, and scanned thousands of real signatures.

I scanned 5,320 live signatures across 20 recent blocks. Reused nonces found: zero.

That's not a failure — that's the lesson. Modern wallets use deterministic nonces (RFC 6979), so they physically can't reuse one. A healthy chain looks clean. ChainSleuth correctly clears a safe wallet, which is most of what a real audit does.

Then, for fun, I aimed it at the most famous coins in existence: Satoshi Nakamoto's earliest signatures. Could the legendary ~1.1M BTC be exposed by a nonce slip in 2009? I indexed those early signatures and checked.

Every one used a distinct nonce. Cryptographically flawless. Fifteen years of the entire world scrutinizing those coins, independently re-confirmed in an afternoon: Satoshi's signing was meticulous. No door, just a very solid wall.

The ethics are part of the architecture

A tool that recovers private keys is dual-use, and I took that seriously. ChainSleuth only ever produces a key for a wallet that already leaked it on the public chain — it cannot touch a correctly-signing wallet. And recovered keys are masked by default, revealed only after a human analyst confirms at a checkpoint. It's built for responsible disclosure — detect exposed wallets so owners can be warned, never to move funds. The line is simple: audit what you own or are authorized to check; never go fishing in other people's wallets.

What I learned

The biggest takeaway is bigger than crypto: the most trustworthy way to use an LLM is to wrap it around a deterministic core. Let it plan, triage, and explain — the things it's genuinely great at — and never let it be the source of truth for a claim that has to be correct. Qwen made the agents feel like a real forensic team; the math made them honest.

Try it

ChainSleuth is open source: https://github.com/Vinny010/chainsleuth

Built solo for the Global AI Hackathon Series with Qwen Cloud — Track 4: Autopilot Agent, on qwen-max + Alibaba Cloud Model Studio + OSS.

When a wallet reuses a nonce, its private key is already gone. ChainSleuth turns that subtle cryptographic flaw into a one-click, provable forensic verdict — and it runs on Qwen.