<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Stephane Boghossian</title>
    <description>The latest articles on DEV Community by Stephane Boghossian (@stephane_boghossian_70a98).</description>
    <link>https://dev.to/stephane_boghossian_70a98</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3946541%2Fff9221fe-1b88-47ff-b1b8-52455e85091a.jpg</url>
      <title>DEV Community: Stephane Boghossian</title>
      <link>https://dev.to/stephane_boghossian_70a98</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/stephane_boghossian_70a98"/>
    <language>en</language>
    <item>
      <title>I tested 16 legal AI tools on one nasty prompt. Here's what broke.</title>
      <dc:creator>Stephane Boghossian</dc:creator>
      <pubDate>Fri, 22 May 2026 17:27:49 +0000</pubDate>
      <link>https://dev.to/stephane_boghossian_70a98/i-tested-16-legal-ai-tools-on-one-nasty-prompt-heres-what-broke-21ih</link>
      <guid>https://dev.to/stephane_boghossian_70a98/i-tested-16-legal-ai-tools-on-one-nasty-prompt-heres-what-broke-21ih</guid>
      <description>&lt;p&gt;I'm not a lawyer. I'm not really an engineer either — I ship with AI. I run growth at a legal AI company called HAQQ, so treat this as biased and read it anyway. The test was the same for everyone, including us.&lt;/p&gt;

&lt;h2&gt;The prompt&lt;/h2&gt;

&lt;p&gt;One real, ugly task: a SAFE (Simple Agreement for Future Equity) for a UAE fintech. Cross-border. Money involved. The kind of thing where a wrong clause isn't a typo, it's a liability.&lt;/p&gt;

&lt;p&gt;I gave the exact same prompt to 16 legal AI tools and scored each on 10 dimensions — drafting, jurisdiction-awareness, citations, hallucination rate, handling of ambiguity, and so on.&lt;/p&gt;

&lt;h2&gt;What I learned (the parts that matter if you build with LLMs)&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Jurisdiction is where general models fall apart.&lt;/strong&gt;&lt;br&gt;
A model that drafts a beautiful Delaware-flavored SAFE is &lt;em&gt;worse&lt;/em&gt; than useless in the UAE — it's confidently wrong. The tools that scored well weren't smarter, they were &lt;em&gt;grounded&lt;/em&gt;: retrieval over actual local law, not vibes. If you're building anything domain-specific, your moat is the grounding layer, not the base model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Ambiguity is a feature, not a bug — and most tools hide it.&lt;/strong&gt;&lt;br&gt;
Law is full of "it depends." The tools I trusted least were the ones that gave one slick answer with no hedging. The ones I trusted most surfaced the fork: "under interpretation A… under interpretation B…" If your AI never says &lt;em&gt;I'm not sure, here's why&lt;/em&gt;, it's lying smoothly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Arabic RTL is genuinely hard.&lt;/strong&gt;&lt;br&gt;
Right-to-left rendering, mixed LTR/RTL in one contract, legal terms that don't translate cleanly. Most tools treated Arabic as an afterthought. If your users aren't all in San Francisco, this is real engineering, not a locale flag.&lt;/p&gt;
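
&lt;p&gt;To make "not a locale flag" concrete, here's a minimal sketch of the very first step: working out which way a clause reads and whether it mixes directions. It only uses Python's standard &lt;code&gt;unicodedata&lt;/code&gt; module. The function names and the sample clause are made up for illustration, and the first-strong-character rule is a simplification, not the full Unicode bidi algorithm.&lt;/p&gt;

```python
import unicodedata

# Unicode bidi classes for strong right-to-left characters (Hebrew, Arabic).
RTL_CLASSES = {"R", "AL"}


def base_direction(text: str) -> str:
    """Guess paragraph direction from the first strongly directional character."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in RTL_CLASSES:
            return "rtl"
        if bidi == "L":
            return "ltr"
    return "ltr"  # no strong characters (digits, punctuation): default to LTR


def is_mixed(text: str) -> bool:
    """True when the text contains both strong LTR and strong RTL characters."""
    classes = {unicodedata.bidirectional(ch) for ch in text}
    return "L" in classes and bool(classes.intersection(RTL_CLASSES))


# "\u0639\u0642\u062f" is the Arabic word for "contract"; the clause is made up.
clause = "\u0639\u0642\u062f SAFE 2026"
print(base_direction(clause), is_mixed(clause))
```

&lt;p&gt;A clause flagged as mixed is the one that needs care: the embedded Latin term and the numbers have to be isolated from the surrounding Arabic, or they can land in the wrong visual order. That's before you get to legal terms that don't translate.&lt;/p&gt;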

&lt;p&gt;&lt;strong&gt;4. Latency changes behavior.&lt;/strong&gt;&lt;br&gt;
When an answer takes 40 seconds, people stop asking. When it's instant, they ask ten times and actually learn. Speed isn't a nice-to-have in tools people use under pressure — it's the difference between a tool and a toy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Evals beat demos.&lt;/strong&gt;&lt;br&gt;
Every one of these 16 tools has a gorgeous demo. The demo tells you nothing. A boring, repeatable eval on a hard real task told me everything. If you ship AI and you don't have an eval harness, you don't know if your last "improvement" made it worse.&lt;/p&gt;
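
&lt;p&gt;Here's a minimal sketch of what a boring, repeatable eval can look like. It is not the scorecard behind this post: the checkers, the two fake tools and every name in it are hypothetical, and real checkers would be far less naive than a substring match. The shape is the point: fixed cases, a score per dimension, and a diff between two versions.&lt;/p&gt;

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    prompt: str
    # One checker per scored dimension; each maps the tool's output to 0-5.
    checkers: Dict[str, Callable[[str], int]]


def run_eval(tool: Callable[[str], str], cases: List[EvalCase]) -> Dict[str, float]:
    """Run every case through the tool and average the scores per dimension."""
    scores: Dict[str, List[int]] = {}
    for case in cases:
        output = tool(case.prompt)
        for dimension, check in case.checkers.items():
            scores.setdefault(dimension, []).append(check(output))
    return {dimension: mean(values) for dimension, values in scores.items()}


def regressions(before: Dict[str, float], after: Dict[str, float]) -> List[str]:
    """Return the dimensions where the new version scores lower than the old one."""
    return sorted(d for d in before if before[d] > after.get(d, 0.0))


# Toy checkers (hypothetical): real ones would inspect clauses and citations.
def mentions_uae(output: str) -> int:
    return 5 if "UAE" in output else 0


def hedges(output: str) -> int:
    return 5 if "it depends" in output.lower() else 0


CASES = [
    EvalCase(
        prompt="Draft a SAFE for a UAE fintech.",
        checkers={"jurisdiction": mentions_uae, "ambiguity": hedges},
    )
]


# Stand-ins for two versions of the same tool.
def old_tool(prompt: str) -> str:
    return "Under UAE law it depends on the free zone."


def new_tool(prompt: str) -> str:
    return "Here is a standard Delaware SAFE."


baseline = run_eval(old_tool, CASES)
candidate = run_eval(new_tool, CASES)
print(regressions(baseline, candidate))
```

&lt;p&gt;Run it on every change. If the list it prints isn't empty, your "improvement" made something worse, and you know which dimension.&lt;/p&gt;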

&lt;h2&gt;So who won?&lt;/h2&gt;

&lt;p&gt;We did — HAQQ scored 49. I'd be a bad growth lead if I buried that. But the scorecard is the point, not the winner: the gap between tools was almost entirely &lt;em&gt;grounding + honesty about uncertainty + speed&lt;/em&gt;, not raw model quality. Everyone's standing on the same models. The work is everything around them.&lt;/p&gt;

&lt;h2&gt;The pocket part&lt;/h2&gt;

&lt;p&gt;The thing that surprised me most: once it's fast and grounded, legal AI stops being a desk tool. We put it in a phone — &lt;a href="https://haqq.ai/mobile-app" rel="noopener noreferrer"&gt;HAQQ's mobile app&lt;/a&gt; — and people started using it &lt;em&gt;before&lt;/em&gt; they signed things, not after they got burned. "Know where you stand before you sign" turned out to be a different product than "draft my contract."&lt;/p&gt;

&lt;p&gt;If you build AI for a hard domain, steal the lessons, skip the hype. And if you want to see what grounded-and-fast feels like in a vertical, the &lt;a href="https://haqq.ai/mobile-app" rel="noopener noreferrer"&gt;app is here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;What's your eval setup for domain-specific AI? Genuinely asking — that's the part nobody talks about.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>ai</category>
      <category>legaltech</category>
      <category>startup</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
