I tested 16 legal AI tools on one nasty prompt. Here's what broke.

Stephane Boghossian — Fri, 22 May 2026 17:27:49 +0000

I'm not a lawyer. I'm not really an engineer either — I ship with AI. I run growth at a legal AI company called HAQQ, so treat this as biased and read it anyway. The test was the same for everyone, including us.

The prompt

One real, ugly task: a SAFE (Simple Agreement for Future Equity) for a UAE fintech. Cross-border. Money involved. The kind of thing where a wrong clause isn't a typo, it's a liability.

I gave the exact same prompt to 16 legal AI tools and scored each on 10 dimensions — drafting, jurisdiction-awareness, citations, hallucination rate, handling of ambiguity, and so on.

What I learned (the parts that matter if you build with LLMs)

1. Jurisdiction is where general models fall apart.
A model that drafts a beautiful Delaware-flavored SAFE is worse than useless in the UAE — it's confidently wrong. The tools that scored well weren't smarter, they were grounded: retrieval over actual local law, not vibes. If you're building anything domain-specific, your moat is the grounding layer, not the base model.
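A minimal sketch of what a grounding layer can look like, not how any of the tested tools work. It assumes a tiny in-memory corpus and naive word-overlap ranking. The `Source` type, the jurisdiction labels, the citation handles, and the corpus texts are all hypothetical placeholders, not real law. The part that matters is the filter: retrieval never falls back to another jurisdiction, and an empty result is reported instead of drafted around.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Source:
    jurisdiction: str  # made-up label for this sketch, e.g. "UAE" or "US-DE"
    ref: str           # hypothetical citation handle
    text: str          # placeholder text, NOT real law


# Placeholder corpus, for illustration only.
CORPUS = [
    Source("UAE", "uae-placeholder-1",
           "placeholder text on convertible instruments and investor rights"),
    Source("US-DE", "de-placeholder-1",
           "placeholder text on convertible instruments and valuation caps"),
]


def retrieve(query: str, jurisdiction: str, corpus=CORPUS) -> List[Source]:
    """Return sources from the requested jurisdiction only, ranked by word overlap."""
    words = set(query.lower().split())
    scored = []
    for src in corpus:
        if src.jurisdiction != jurisdiction:
            continue  # never fall back to another jurisdiction's law
        overlap = len(words & set(src.text.lower().split()))
        if overlap:
            scored.append((overlap, src))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [src for _, src in scored]


def grounded_context(query: str, jurisdiction: str) -> Optional[str]:
    """Build the context handed to the model, or None if nothing local was found."""
    hits = retrieve(query, jurisdiction)
    if not hits:
        return None  # caller should report "no local source" rather than draft anyway
    return "\n".join(f"[{s.ref}] {s.text}" for s in hits)
```

In a real system the overlap score would be replaced by a proper search index or embeddings, but the jurisdiction filter and the explicit "nothing found" path stay the same.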

2. Ambiguity is a feature, not a bug — and most tools hide it.
Law is full of "it depends." The tools I trusted least were the ones that gave one slick answer with no hedging. The ones I trusted most surfaced the fork: "under interpretation A… under interpretation B…" If your AI never says "I'm not sure, here's why," it's lying smoothly.
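One way to make the fork hard to hide is to give it a place in the output type. The sketch below is an assumption about how that could be modelled, not a description of any tested tool. `Interpretation` and `HedgedAnswer` are hypothetical names, and the example readings are placeholders rather than legal statements.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Interpretation:
    label: str        # "A", "B", ...
    reading: str      # how the ambiguous text is being read
    consequence: str  # what follows if that reading holds


@dataclass
class HedgedAnswer:
    question: str
    interpretations: List[Interpretation] = field(default_factory=list)
    uncertainty: str = ""  # why the tool cannot pick a single reading

    def render(self) -> str:
        """Render every reading, then the stated reason for uncertainty."""
        lines = [self.question]
        for item in self.interpretations:
            lines.append(
                f"Under interpretation {item.label} ({item.reading}): {item.consequence}"
            )
        if self.uncertainty:
            lines.append(f"I'm not sure which applies, here's why: {self.uncertainty}")
        return "\n".join(lines)


# Placeholder content, for illustration only.
answer = HedgedAnswer(
    question="Does the clause cover this transfer?",
    interpretations=[
        Interpretation("A", "clause read narrowly", "the transfer is outside the clause"),
        Interpretation("B", "clause read broadly", "the transfer is covered"),
    ],
    uncertainty="the clause does not define 'transfer'",
)
print(answer.render())
```

An eval can then check that answers to known-ambiguous questions carry at least two interpretations and a non-empty `uncertainty`.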

3. Arabic RTL is genuinely hard.
Right-to-left rendering, mixed LTR/RTL in one contract, legal terms that don't translate cleanly. Most tools treated Arabic as an afterthought. If your users aren't all in San Francisco, this is real engineering, not a locale flag.
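To give a sense of why this is more than a locale flag, here is a small sketch using only Python's standard `unicodedata` module. It classifies a string by its strong bidirectional characters and wraps an embedded term in Unicode directional isolates, so that a Latin term such as "SAFE" inside an Arabic sentence does not reorder the punctuation and numbers around it. The function names are hypothetical, and this covers only one slice of the problem (it does nothing about fonts, shaping, or translation).

```python
import unicodedata

# Unicode directional isolate controls.
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def direction_of(text: str) -> str:
    """Classify text as 'rtl', 'ltr', 'mixed', or 'neutral' by its strong characters."""
    has_rtl = has_ltr = False
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):  # Hebrew-style and Arabic-style letters
            has_rtl = True
        elif bidi_class == "L":        # Latin and other left-to-right letters
            has_ltr = True
    if has_rtl and has_ltr:
        return "mixed"
    if has_rtl:
        return "rtl"
    if has_ltr:
        return "ltr"
    return "neutral"  # digits, punctuation, whitespace only


def isolate(term: str) -> str:
    """Wrap a term in the isolate matching its own direction."""
    direction = direction_of(term)
    if direction == "rtl":
        return f"{RLI}{term}{PDI}"
    if direction == "ltr":
        return f"{LRI}{term}{PDI}"
    return f"{FSI}{term}{PDI}"  # let the renderer decide from the first strong char


ARABIC_CONTRACT = "\u0639\u0642\u062f"  # Arabic word for "contract"
clause = f"{ARABIC_CONTRACT} {isolate('SAFE')} (2026)"
```

In HTML the same effect usually comes from the `<bdi>` element or `dir="auto"` rather than raw control characters.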

4. Latency changes behavior.
When an answer takes 40 seconds, people stop asking. When it's instant, they ask ten times and actually learn. Speed isn't a nice-to-have in tools people use under pressure — it's the difference between a tool and a toy.

5. Evals beat demos.
Every one of these 16 tools has a gorgeous demo. The demo tells you nothing. A boring, repeatable eval on a hard real task told me everything. If you ship AI and you don't have an eval harness, you don't know if your last "improvement" made it worse.
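A boring, repeatable eval does not need much machinery. The sketch below is a minimal harness under stated assumptions: the "model" is any function from prompt to text, the two stub models and the string checks are toy stand-ins invented for illustration, and this is not the scorecard used in this test. It runs a fixed set of cases and reports which ones a new version broke relative to a baseline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # True if the output passes this case


def run_evals(model: Callable[[str], str], cases: List[EvalCase]) -> Dict[str, bool]:
    """Run every case against the model and record pass/fail by case name."""
    return {case.name: case.check(model(case.prompt)) for case in cases}


def regressions(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> List[str]:
    """Names of cases that passed on the baseline but fail on the candidate."""
    return sorted(
        name for name, passed in baseline.items()
        if passed and not candidate.get(name, False)
    )


# Toy checks, for illustration only. Real checks would be far stricter.
CASES = [
    EvalCase("mentions_uae", "Draft a SAFE for a UAE fintech.",
             lambda out: "UAE" in out),
    EvalCase("no_delaware_default", "Draft a SAFE for a UAE fintech.",
             lambda out: "Delaware" not in out),
]


def old_model(prompt: str) -> str:  # hypothetical stub
    return "This SAFE is governed by the laws of the UAE."


def new_model(prompt: str) -> str:  # hypothetical stub
    return "This SAFE is governed by Delaware law."


baseline = run_evals(old_model, CASES)
candidate = run_evals(new_model, CASES)
broken = regressions(baseline, candidate)
print(broken)
```

Run on every change, a list like `broken` is what tells you whether the last "improvement" made things worse.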

So who won?

We did — HAQQ scored 49. I'd be a bad growth lead if I buried that. But the scorecard is the point, not the winner: the gap between tools was almost entirely grounding + honesty about uncertainty + speed, not raw model quality. Everyone's standing on the same models. The work is everything around them.

The pocket part

The thing that surprised me most: once it's fast and grounded, legal AI stops being a desk tool. We put it in a phone — HAQQ's mobile app — and people started using it before they signed things, not after they got burned. "Know where you stand before you sign" turned out to be a different product than "draft my contract."

If you build AI for a hard domain, steal the lessons, skip the hype. And if you want to see what grounded-and-fast feels like in a vertical, the app is here.

What's your eval setup for domain-specific AI? Genuinely asking — that's the part nobody talks about.

DEV Community: Stephane Boghossian
