DEV Community

Pragadeesh

I built a Graph database to catch money launderers. Here's what I actually learned.

I want to say upfront: I have not caught any money launderers. I built a database. Whether it would actually catch money launderers in production is a question I can't answer yet, because I have zero production users. That caveat matters and I'll come back to it.

Here's what happened.


The problem I kept reading about

Every AML compliance team I could find publicly describing their stack was running some version of the same setup: a graph database for relationship traversal, a vector database or fuzzy matching library for name similarity, and a service layer stitching them together. Quantexa runs Spark plus Elasticsearch plus PostgreSQL plus a graph layer. ComplyAdvantage built a transformer-based name embedding model and runs it against FAISS for sanctions screening, while keeping a separate proprietary graph database for entity relationships. Neo4j has published architecture diagrams explicitly recommending you pair their graph database with Pinecone for the vector part.

These are not small companies running shoddy systems. These are well-funded teams with smart engineers. They built this way because no single system did both things natively. So every team independently arrived at the same two-component architecture.

I wanted to know if that was actually necessary.


The core idea

Vector Symbolic Architecture is a field from cognitive computing that represents concepts as high-dimensional binary vectors and uses simple bitwise operations to encode relationships. XOR two vectors and you get a binding that associates them. Permute a vector and you get a role-encoded version of it. Bundle vectors together and you get a superposition that's close to all of them.
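Those three operations are small enough to sketch in a few lines. Here's a Python toy (the engine itself is JVM code; the tiny `DIM`, the rotation-based permutation, and all the names here are mine, not the library's):

```python
import random

DIM = 256  # small for illustration; the post's vectors are 10,048 bits

def rand_hv(rng):
    # a random dense binary hypervector, stored as a Python int bit-field
    return rng.getrandbits(DIM)

def bind(a, b):
    # XOR binding: associates two vectors, and is its own inverse
    return a ^ b

def permute(a, k=1):
    # cyclic bit rotation: a cheap, invertible role-encoding permutation
    k %= DIM
    mask = (1 << DIM) - 1
    return ((a << k) | (a >> (DIM - k))) & mask

def hamming(a, b):
    return bin(a ^ b).count("1")

def bundle(vecs):
    # bitwise majority vote: the result stays close to every input
    out = 0
    for i in range(DIM):
        ones = sum((v >> i) & 1 for v in vecs)
        if 2 * ones > len(vecs):
            out |= 1 << i
    return out

rng = random.Random(42)
x, y, z = rand_hv(rng), rand_hv(rng), rand_hv(rng)

assert bind(bind(x, y), y) == x                  # unbinding recovers x exactly
assert hamming(x, permute(x)) > DIM // 4         # permutation decorrelates
b = bundle([x, y, z])
assert hamming(b, x) < hamming(b, rand_hv(rng))  # closer to its members
```

The point of the toy is the algebra: binding is exact and reversible, while bundling trades exactness for a "close to everything it contains" centroid.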

The interesting property for a database: if you encode a typed edge as
bind(permute(subject), bind(permute(relation), permute(object))), you can query it back with just the subject and relation vectors, because XOR is its own inverse. The object vector emerges from the query. No index traversal. No query planning. Just bitwise arithmetic on fixed-size vectors.
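That round trip can be demonstrated end to end. A Python sketch, assuming one distinct bit rotation per role as the permutation (the real engine's encoding may differ in detail):

```python
import random

DIM = 256
MASK = (1 << DIM) - 1
rng = random.Random(7)

def rot(v, k):
    # cyclic bit rotation used as a role-encoding permutation
    k %= DIM
    return ((v << k) | (v >> (DIM - k))) & MASK

def inv_rot(v, k):
    return rot(v, DIM - (k % DIM))

def encode_edge(subj, rel, obj):
    # role-tag each slot with a different rotation, then XOR-bind all three
    return rot(subj, 1) ^ rot(rel, 2) ^ rot(obj, 3)

def query_object(edge, subj, rel):
    # XOR is self-inverse: peel off subject and relation, undo object's rotation
    return inv_rot(edge ^ rot(subj, 1) ^ rot(rel, 2), 3)

company = rng.getrandbits(DIM)
registers = rng.getrandbits(DIM)
panama = rng.getrandbits(DIM)

edge = encode_edge(company, registers, panama)
assert query_object(edge, company, registers) == panama
```

The object really does emerge from pure bitwise arithmetic - no pointer chasing, no planner.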

I thought: what if you built a graph database where every entity and every typed edge is stored as one of these binary hypervectors? You'd get graph traversal and vector similarity search in the same data structure. One HNSW index. One memory-mapped hash table. One gRPC call for a multi-hop chain.

So I built it.


What it actually took

Longer than I expected. The HNSW implementation took weeks to get right - there's a subtle bug where the entry point offset is stored as a layer-0 byte offset but gets treated as a layer-N offset after restart, which causes out-of-bounds memory access that only manifests on the second search after loading a large graph. Finding that took a while.

The persistence layer was harder still. I initially had the edge lookup table - the structure that maps subjectId XOR relationId -> objectId for O(1) chain traversal - as a ConcurrentHashMap in JVM heap. Which meant every server restart wiped it. Chain queries would return null until you re-ingested all the edges. For a 1.87M entity dataset that takes 12 hours. I fixed it this week with a memory-mapped WAL - each edge gets appended as 16 bytes to a mapped file before the map update, and on startup the log is replayed into a fresh HashMap in about 4 milliseconds. The fix is obvious in retrospect. I'm embarrassed it took this long to ship.
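The WAL mechanics are simple enough to sketch. Here's a Python approximation using an in-memory buffer in place of the memory-mapped file (the real implementation is JVM code over a mapped file; the exact record layout is my assumption, inferred only from the 16-byte figure):

```python
import io
import struct

# 16 bytes per edge: (subjectId XOR relationId, objectId), little-endian longs
REC = struct.Struct("<qq")

def append_edge(log, subj_id, rel_id, obj_id):
    # write-ahead: the record hits the log before the in-memory map is updated
    log.write(REC.pack(subj_id ^ rel_id, obj_id))

def replay(log_bytes):
    # on startup, rebuild the O(1) chain-traversal map from the log
    table = {}
    for key, obj_id in REC.iter_unpack(log_bytes):
        table[key] = obj_id
    return table

log = io.BytesIO()
append_edge(log, 101, 7, 555)
append_edge(log, 202, 7, 777)

table = replay(log.getvalue())
assert table[101 ^ 7] == 555
assert table[202 ^ 7] == 777
```

Because the log is append-only and fixed-width, replay is a single sequential scan - which is why rebuilding the map takes milliseconds rather than re-ingesting for hours.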

The MiniLM encoder integration was surprisingly painless - ONNX Runtime on the JVM, 90MB model, produces 384-dimensional float embeddings that get projected to 10,048-bit binary vectors via a seeded random matrix. The projection is deterministic, so it regenerates on startup from the seed rather than persisting 30MB of projection weights to disk.
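Here's roughly what that projection looks like, sketched in Python with NumPy (the dimensions come from the post; the seed value and the Gaussian matrix are illustrative assumptions, not the library's actual parameters):

```python
import numpy as np

EMB_DIM, HV_BITS, SEED = 384, 10_048, 1234

def projection_matrix():
    # deterministic: regenerated from the seed at startup instead of persisted
    rng = np.random.default_rng(SEED)
    return rng.standard_normal((HV_BITS, EMB_DIM)).astype(np.float32)

def to_binary(embedding, proj):
    # the sign of each random projection becomes one bit of the hypervector
    return (proj @ embedding > 0).astype(np.uint8)

proj = projection_matrix()
emb = np.random.default_rng(0).standard_normal(EMB_DIM).astype(np.float32)
bits = to_binary(emb, proj)

assert bits.shape == (HV_BITS,)
# the same seed always yields the same matrix, hence the same bits
assert np.array_equal(bits, to_binary(emb, projection_matrix()))
```

This is the standard sign-of-random-projection trick: it approximately preserves angular similarity, so nearby float embeddings land at small Hamming distances.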

The Spring Boot starter took about as long as the core engine. Not because it was technically hard but because there are a lot of edge cases in autoconfiguration - what happens when Micrometer isn't on the classpath, how the gRPC channel pool interacts with graceful shutdown, how to wire the Watch API without creating a circular dependency in the bean graph. That kind of thing.


The Panama Papers demo

I loaded the full ICIJ Offshore Leaks dataset - Panama Papers, Paradise Papers, and Pandora Papers combined - 1.87 million entities. It took 12 hours on my machine because each entity name has to go through MiniLM for the initial float embedding, and I was running 4 parallel encoder threads on a consumer CPU.

The resulting database is about 7GB on disk. Vector store, HNSW layers, entity index, edge log. All memory-mapped. Server starts in about 5 seconds and the whole thing sits in about 4.5GB of off-heap memory - no JVM heap pressure because the hot path is entirely Foreign Function and Memory API.

The benchmark numbers I'm willing to stand behind:

  • 4-hop beneficial ownership chain traversal: 3.65ms average on 1.87M entities
  • Fuzzy entity screening (name match across all three leaks): 886ms
  • Shell company risk scoring: 290-1748ms depending on graph depth

The 886ms for screening is slower than I'd like. It's going through HNSW on 1.87M vectors and there's real room to optimize the query path. The 3.65ms for chain traversal is the number I'm most confident in - it's a tight operation and I've measured it many times.

The shell risk score is the thing I find most interesting to think about. During ingestion, every Panama Papers entity vector contributes to a majority-vote tally across all 10,048 bit positions. After ingestion, I threshold the tally at 50% to get a prototype vector - the statistical centroid of roughly half a million real offshore shell companies in binary hypervector space. Any new entity gets a risk score of Hamming(entity, prototype) / 10,048. It's not a classifier. It has no labels. It's just: how far is this entity, in the high-dimensional space of financial crime patterns, from the average Panama Papers company?
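The scoring scheme itself fits in a few lines. A Python toy with random stand-in vectors (not the ICIJ data; function names are mine):

```python
import numpy as np

HV_BITS = 10_048

def build_prototype(entity_bits):
    # per-bit majority vote over all ingested shell-company vectors,
    # thresholded at 50% to get the binary centroid
    tally = entity_bits.sum(axis=0)
    return (2 * tally > entity_bits.shape[0]).astype(np.uint8)

def risk_score(entity, prototype):
    # normalised Hamming distance; 0.0 means identical to the
    # shell-company centroid, i.e. maximally shell-like
    return np.count_nonzero(entity != prototype) / HV_BITS

rng = np.random.default_rng(3)
shells = rng.integers(0, 2, size=(1000, HV_BITS), dtype=np.uint8)
proto = build_prototype(shells)

candidate = rng.integers(0, 2, size=HV_BITS, dtype=np.uint8)
score = risk_score(candidate, proto)
assert 0.0 <= score <= 1.0
```

Note the polarity: a *smaller* distance means a candidate sits closer to the offshore-shell centroid.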

I genuinely don't know if that's useful in production. It's theoretically interesting. Whether a compliance analyst would trust it for an actual screening decision is a different question.


What the architecture gets wrong

Three things bother me.

First, the analogy query - client.analogy("Mossack Fonseca", "registers").isTo("Panama") - has no equivalent in Neo4j or any vector database I know of. Find me entities that have the same structural relationship to their jurisdiction that Mossack Fonseca has to Panama. That's genuinely novel. I've not seen it elsewhere. But I also can't tell you whether a compliance engineer running sanctions screening at 3am would ever ask that question, or whether it's one of those capabilities that's elegant in theory and never quite fits a real workflow.

Second, the benchmarks are on Windows on my laptop. Every time I post numbers someone will ask about Linux, about cloud VMs, about comparative results against Neo4j on the same hardware. Those numbers don't exist yet. That's a gap.

Third, this is a solo project. No enterprise is going to depend on a database maintained by one person. The path from "interesting technical work" to "thing banks will run in production" is long and requires more than good code. It requires SOC2 certification, enterprise support contracts, multi-year stability guarantees, and a team that's not going to disappear. I'm one person. That's not a path I can walk alone.
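For what it's worth, the XOR algebra behind that analogy query in the first point can be shown with a deliberately clean toy. This is my construction, not HammingStore's implementation - real bindings are noisy, so the engine would have to resolve the recovered vector via nearest-neighbor search rather than exact equality:

```python
import random

DIM = 256
rng = random.Random(11)

# Toy assumption: every entity's jurisdiction vector happens to be the entity
# bound (XORed) with one shared "registers" key. Real data is never this clean.
registers = rng.getrandbits(DIM)

def jurisdiction_of(entity):
    return entity ^ registers

mossack = rng.getrandbits(DIM)
acme = rng.getrandbits(DIM)
panama = jurisdiction_of(mossack)

# capture the Mossack Fonseca -> Panama relationship as a mapping vector,
# then transfer that same mapping to a different entity
mapping = mossack ^ panama
guess = acme ^ mapping
assert guess == jurisdiction_of(acme)  # same structural relation recovered
```

Because XOR cancels, the mapping vector isolates exactly the "registers" structure shared by both pairs - which is the property the `client.analogy(...)` call is exploiting.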


Why I'm posting this

I spent a few months on this. I want to find out if the underlying idea - unified binary representation for graph traversal and vector search, no separate vector database, embeddable in a JVM process - is actually useful to people building financial crime detection systems, or whether it's an interesting technical exercise that doesn't map to any real production need.

If you work on AML systems, financial crime technology, graph databases, or JVM infrastructure and you have an opinion about whether any of this is useful, I'd rather hear it now than spend another six months building features nobody wants.

GitHub: https://github.com/Pragadeesh-19/HammingStore

The AML demo: https://github.com/Pragadeesh-19/hammingstore-aml-demo
