Forms Legal

Posted on Apr 23

The Multilingual Legal Document Problem: How We Structured 11,000 Templates Across 21 Jurisdictions and 7 Languages

#legal #startup #contracts

TL;DR — Legal content is not translatable. A Spanish NDA is not "the Spanish version of a US NDA" — it's a different legal instrument governed by a different statute. Treating cross-jurisdiction legal equivalents as translations (with hreflang tags) is the single biggest architectural mistake we made, and the one we see repeatedly elsewhere. This post walks through what we tried, what broke, and what finally worked for indexing 11,000+ templates across 21 jurisdictions in 7 languages.

The Problem, Stated Honestly

If you take a naive combinatorial view of our content domain, the shape of the problem looks like this: 11,000 template concepts × 21 jurisdictions × up to 7 languages. That's a potential state space of ~1.6 million pages. The real number is nowhere near that. It is around 11,000, because not every document-jurisdiction-language tuple is meaningful, and many of them, if produced, would be legally wrong.
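
For anyone who wants to check the arithmetic, here is a minimal sketch. The three headline figures are the ones quoted above; the toy catalogue and its slugs (including "deutschland") are hypothetical and only illustrate that real pages are counted by listing the tuples that exist, not by multiplying the axes.

```python
# Headline figures quoted in the post.
TEMPLATE_CONCEPTS = 11_000
JURISDICTIONS = 21
LANGUAGES = 7

# Naive view: every concept exists in every jurisdiction and every language.
naive_pages = TEMPLATE_CONCEPTS * JURISDICTIONS * LANGUAGES
print(f"naive state space: {naive_pages:,} pages")

# Hypothetical toy catalogue: only (concept, jurisdiction, language) tuples
# that correspond to a real, natively drafted document are listed.
catalogue = {
    ("confidentiality", "usa", "en"),
    ("confidentiality", "deutschland", "de"),
    ("confidentiality", "espana", "en"),
    ("confidentiality", "espana", "es"),
}
real_pages = len(catalogue)
print(f"pages that actually exist in the toy catalogue: {real_pages}")
```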

Here's the distinction that matters and that took us longer than we'd like to admit to internalise:

"Non-Disclosure Agreement (USA)" in English is one document.
"Vertraulichkeitsvereinbarung (Deutschland)" in German is a different document. It references the Geschäftsgeheimnisgesetz (the German Trade Secrets Act implementing EU Directive 2016/943), it uses German civil-law framing, and it's enforceable in German courts. It is not a translation of the US NDA. The US NDA does not exist as a German document at all — because US trade-secret doctrine (UTSA + DTSA 2016) doesn't map onto German law.
"Acuerdo de Confidencialidad para Due Diligence (España)" in Spanish is a third document, not a translation of either. It references Ley 1/2019 de Secretos Empresariales, follows Spanish civil-procedure norms for M&A due diligence, and is enforceable in Spanish courts.

These three documents sit under the same conceptual umbrella — "confidentiality agreement" — but they are legally, linguistically, and structurally distinct. A user searching Vertraulichkeitsvereinbarung in Germany and a user searching NDA template in the US are looking for different artifacts, not different languages of the same artifact.

This is the core modeling insight, and almost every multilingual content architecture we've seen in the legal, medical, and financial verticals gets it wrong the same way.

Three Architectural Mistakes We Made First

We'll cover each one briefly, because each one shaped the architecture we ended up with. If you are about to build anything similar, these are the traps.

Mistake 1: Language-First URL Structure

Our first instinct was the obvious one: /en/, /es/, /pt/, /fr/, /de/. This is what most multilingual sites do. It is also what most multilingual sites use hreflang to glue together.

This broke almost immediately. Under /en/nda-germany, we had an English-language explanation of a German NDA. But that conflicted with /en/nda-usa, which was a US NDA in English. Google's ranking signals collapsed both into generic "NDA" content, and intent-specific German-jurisdiction searches went to the US page.

The problem: we were signalling "language" when we needed to signal "jurisdiction + language."

Mistake 2: Machine-Translated Content From a Single Source of Truth

Our second attempt — the one that cost us the most time — was to build every non-English document as a machine translation of a canonical US English template, with a post-processing pipeline to swap jurisdiction-specific references.

This was a legal error, and it took a lawyer flagging it for us to understand how bad it was.

A US NDA has clauses that are load-bearing under US law and meaningless under German law. The DTSA 2016 immunity notice, for example, is required for a US employer to access federal trade-secret exemplary damages. Translate that notice into German and drop it into a Vertraulichkeitsvereinbarung, and you've produced a document that: (a) is less enforceable than a native German NDA would be, because it's missing the statutory framing a German court expects, and (b) contains a clause that, from the perspective of a German reviewer, is nonsensical. A worst-of-both-worlds outcome.

We scrapped the pipeline. Every jurisdiction-specific document is now drafted from native precedent, reviewed by counsel in that jurisdiction, and only then surfaced in the library.

Mistake 3: A Single Monolithic sitemap.xml

Our first sitemap was 11,000 URLs in one file. That is within the sitemap protocol's limit of 50,000 URLs per file, but in our experience Google's crawling of a single file that large was, charitably, erratic. Indexation, as reported in Search Console, was partial and unpredictable. Country-specific templates weren't getting surfaced.

We moved to a sitemap index with one sub-sitemap per jurisdiction: sitemap-us.xml, sitemap-uk.xml, sitemap-ca.xml, and so on through the 21 country codes. The root is a sitemap-index.xml pointing to all of them. This is a well-known pattern for large multi-regional sites, and indexing latency dropped noticeably once we moved to it.

The Architecture That Works

The structure we settled on is jurisdiction-first, with language as a qualifier only when we have a genuine native-language variant:

URL Pattern

/<country-slug>/<category>/<subcategory>/<document-slug>

Examples:

/usa/business/contracts/non-disclosure-agreement-france — US-jurisdiction cross-border NDA, in English
/uk/business/contracts/mutual-confidentiality-agreement-uk — UK-jurisdiction mutual NDA, in English
/canada/business/contracts/mutual-nda-canada — Canadian-jurisdiction mutual NDA, in English
/australia/business/contracts/mutual-non-disclosure-agreement-australia — Australian-jurisdiction mutual NDA, in English

When the jurisdiction's primary language is non-English, we add a language prefix for the native-language version:

/espana/business/contracts/due-diligence-nda-spain — Spanish-jurisdiction NDA, English surface
/es/espana/business/contracts/acuerdo-confidencialidad-due-diligence-espana — same Spanish-jurisdiction NDA, Spanish native
/pt/brasil/business/contracts/acordo-confidencialidade-comercial — Brazilian-jurisdiction NDA, Portuguese native

The country slug is always in the URL before the category. Language, when applicable, is a prefix, not a replacement. This lets us serve the English-surface page under /espana/ by default and the native Spanish version under /es/espana/, without treating the two as the same document for SEO purposes.
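
The pattern can be sketched as a small path builder. This is an illustration under assumptions, not our routing code: the function name `template_path` and its parameter names are hypothetical, and only the segment order comes from the pattern above.

```python
from typing import Optional


def template_path(
    country_slug: str,
    category: str,
    subcategory: str,
    document_slug: str,
    language_prefix: Optional[str] = None,
) -> str:
    """Build a jurisdiction-first path; language is an optional prefix."""
    segments = [country_slug, category, subcategory, document_slug]
    if language_prefix:
        # The language never replaces the country slug; it goes in front of it.
        segments.insert(0, language_prefix)
    return "/" + "/".join(segments)


english_surface = template_path(
    "espana", "business", "contracts", "due-diligence-nda-spain"
)
spanish_native = template_path(
    "espana",
    "business",
    "contracts",
    "acuerdo-confidencialidad-due-diligence-espana",
    language_prefix="es",
)
print(english_surface)
print(spanish_native)
```

Keeping the language as an optional leading segment means the country slug stays in a predictable position relative to the category for every page.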

hreflang — Used Narrowly

We use hreflang only between genuine translations of the same document. The English-surface and Spanish-surface versions of the Spain NDA template are translations of each other — same jurisdictional framing, same statute, same substantive clauses, different language. For these, we include hreflang="en" ↔ hreflang="es" reciprocal tags.

We do not use hreflang between /usa/business/contracts/... and /espana/business/contracts/.... These are different documents, not translations. Presenting them to Google as hreflang pairs would be misleading — and worse, would risk Google serving the Spanish-jurisdiction page to a US searcher who explicitly wants US-jurisdiction content.

This is the single most frequently misunderstood detail in legal-content SEO. hreflang is for language alternates of the same content, not for conceptually equivalent content across legal systems.
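
A minimal sketch of the narrow rule, assuming the caller passes only the language variants of one jurisdictional document. The helper name `hreflang_links` and the example.com URLs are placeholders. Each page in the group would carry the full set of tags, including the one pointing at itself, so the annotations stay reciprocal.

```python
from html import escape
from typing import Dict, List


def hreflang_links(variants: Dict[str, str]) -> List[str]:
    """Return <link> tags for the language variants of ONE document.

    `variants` maps a language code to the absolute URL of that variant.
    Cross-jurisdiction equivalents must never be passed in here.
    """
    if len(variants) < 2:
        # A document with a single language has no alternates to announce.
        return []
    return [
        f'<link rel="alternate" hreflang="{escape(lang)}" href="{escape(url)}" />'
        for lang, url in sorted(variants.items())
    ]


spain_nda = {
    "en": "https://example.com/espana/business/contracts/due-diligence-nda-spain",
    "es": "https://example.com/es/espana/business/contracts/acuerdo-confidencialidad-due-diligence-espana",
}
for tag in hreflang_links(spain_nda):
    print(tag)
```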

Schema.org LegalDocument Markup

Every template page ships with schema.org/LegalDocument structured data. The critical properties, and why they matter:

jurisdiction — an explicit jurisdictional statement (e.g., "United Kingdom", "Spain", "São Paulo, Brazil"). This is what tells a jurisdictionally-aware search engine that this page is specifically for that legal system.
inLanguage — BCP-47 language code. "en-GB" vs "en-US" matters for English jurisdictions. "pt-BR" vs "pt-PT" matters for Portuguese.
legislationType, legislationJurisdiction — for templates that reference specific legislation (e.g., the Brazilian LGPD NDA template declares legislationJurisdiction: "Brazil" and the relevant statute).
isPartOf — pointing to the jurisdiction hub page (e.g., /usa) to establish the content hierarchy.
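
As a rough sketch, the properties listed above could be assembled into a JSON-LD payload like this. The type and property names are copied from the list, not re-verified here against the current schema.org vocabulary, so validate them before shipping. The function name, the example.com hub URL, and the choice of a WebPage node for isPartOf are assumptions made for illustration.

```python
import json
from typing import Optional


def legal_document_jsonld(
    name: str,
    jurisdiction: str,
    in_language: str,
    hub_url: str,
    legislation_jurisdiction: Optional[str] = None,
) -> str:
    """Serialise the structured data for one template page as JSON-LD."""
    data = {
        "@context": "https://schema.org",
        "@type": "LegalDocument",  # type name as used in this post
        "name": name,
        "jurisdiction": jurisdiction,
        "inLanguage": in_language,  # BCP-47 code, e.g. "es-ES" or "pt-BR"
        "isPartOf": {"@type": "WebPage", "@id": hub_url},  # assumed node shape
    }
    if legislation_jurisdiction:
        # Only emitted for templates that reference specific legislation.
        data["legislationJurisdiction"] = legislation_jurisdiction
    return json.dumps(data, ensure_ascii=False, indent=2)


payload = legal_document_jsonld(
    name="Acuerdo de Confidencialidad para Due Diligence (España)",
    jurisdiction="Spain",
    in_language="es-ES",
    hub_url="https://example.com/espana",
)
print(payload)
```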

Structured data is not cosmetic. Legal vertical search (including tools like Google's legal-specific scholar features and some of the emerging AI-native legal-search products) relies on it.

Sub-Sitemaps Per Jurisdiction

As mentioned in Mistake 3, the sitemap architecture is:

/sitemap-index.xml
  ├── /sitemaps/sitemap-static.xml
  ├── /sitemaps/sitemap-us.xml
  ├── /sitemaps/sitemap-uk.xml
  ├── /sitemaps/sitemap-ca.xml
  ├── /sitemaps/sitemap-au.xml
  ├── ... (21 country sub-sitemaps total)

Each sub-sitemap is scoped to its jurisdiction. Googlebot's crawling is far more predictable at this granularity, and when a new country goes live, we add a single sub-sitemap entry rather than mutating a monolith.
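
A sketch of how an index like the one above could be generated with the Python standard library. The base URL is a placeholder and `build_sitemap_index` is a hypothetical helper. The namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET
from typing import Iterable

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(base_url: str, country_codes: Iterable[str]) -> str:
    """Return a sitemap index listing one static and N country sub-sitemaps."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for name in ["static", *country_codes]:
        entry = ET.SubElement(root, "sitemap")
        loc = ET.SubElement(entry, "loc")
        loc.text = f"{base_url}/sitemaps/sitemap-{name}.xml"
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


index_xml = build_sitemap_index("https://example.com", ["us", "uk", "ca", "au"])
print(index_xml)
```

Adding a jurisdiction then amounts to appending one country code to the list and publishing the matching sub-sitemap file.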

Content Modeling: The Two Relationships

Once we had the URL and SEO layer sorted, the remaining modeling problem was internal: how do we represent the relationships between documents in our own data layer?

There are exactly two relationships that matter:

Language-variant — document A and document B are translations of the same jurisdictional instrument. Example: the Spanish NDA in English and the Spanish NDA in Spanish. These share jurisdiction, share clauses, differ only in language. We mark them with a shared canonical_group_id and a language field per row.
Conceptual-equivalent — document A (US NDA) and document B (Spanish NDA) serve similar business purposes across different legal systems. These are not translations; they are peers. We mark them with a concept_id that indexes into a taxonomy of legal concepts, not a translation map.

These two relationships look the same to a casual observer and are radically different from a legal and SEO standpoint. Keeping them separate in the data layer means:

We can render a language switcher (within a jurisdiction) only when a true language variant exists.
We can render a "same concept in another country" picker (cross-jurisdiction) with a clear "Note: different legal system" label rather than pretending it's a translation.
We can emit hreflang only for language variants, never for conceptual equivalents, which preserves SEO integrity.
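
The two relationships can be sketched as two separate lookups over the same rows. Only the field names canonical_group_id, concept_id, and language come from the description above. The dataclass, the helper names, and the sample ids are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Template:
    slug: str
    jurisdiction: str
    language: str
    canonical_group_id: str  # shared by translations of one instrument
    concept_id: str          # shared by peers across legal systems


def language_variants(doc: Template, library: List[Template]) -> List[Template]:
    """Same instrument, different language: drives the language switcher."""
    return [
        t for t in library
        if t.canonical_group_id == doc.canonical_group_id
        and t.language != doc.language
    ]


def conceptual_equivalents(doc: Template, library: List[Template]) -> List[Template]:
    """Same concept, different legal system: drives the country picker."""
    return [
        t for t in library
        if t.concept_id == doc.concept_id and t.jurisdiction != doc.jurisdiction
    ]


us_nda = Template("non-disclosure-agreement", "usa", "en", "usa-nda", "confidentiality")
es_en = Template("due-diligence-nda-spain", "espana", "en", "espana-dd-nda", "confidentiality")
es_es = Template(
    "acuerdo-confidencialidad-due-diligence-espana",
    "espana", "es", "espana-dd-nda", "confidentiality",
)
library = [us_nda, es_en, es_es]

print([t.slug for t in language_variants(es_en, library)])
print([t.slug for t in conceptual_equivalents(us_nda, library)])
```

Because the two lookups read different fields, a cross-jurisdiction peer can never leak into the language switcher by accident. A real picker would probably collapse the equivalents to one entry per canonical_group_id before rendering.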

What the SEO Payoff Looks Like

The practical result of getting this architecture right, as opposed to the language-first mistakes we made initially, is that we now rank for long-tail, native-language, country-specific legal queries in markets where US-centric competitors effectively don't compete.

A Spanish solo founder searching acuerdo confidencialidad due diligence España lands on a Spanish-language, Spain-jurisdiction document drafted against Ley 1/2019. A Brazilian startup searching acordo de confidencialidade comercial lands on a Portuguese-language, Brazil-jurisdiction document drafted against Lei 9.279/1996 and the LGPD. These searches, measured in aggregate, are the bulk of the organic traffic that a multilingual legal library is built to capture.

English-language searches from the US, UK, Canada, and Australia are more competitive — plenty of US-centric platforms contest that space. But those platforms do not exist, in any meaningful sense, in the Spanish, Portuguese, French, German, Italian, or Dutch long tail. The moat is the multilingual native-jurisdictional coverage, not the English-language core.

Takeaways for Anyone Building Something Similar

If you are building any kind of jurisdictionally-sensitive content library — legal templates, tax forms, medical documentation, financial compliance materials — here's the short version:

Jurisdiction is not language. Do not conflate them at the URL level, at the content level, or at the SEO level. An NDA for Germany is not "the German version of an American NDA."
hreflang is for translations. Use it only when two pages are the same document in different languages. Do not use it for cross-jurisdiction equivalents. Abusing hreflang semantics is worse than not using it.
Machine translation is a starting point, not a finished product. For regulated domains (legal, medical, financial), a translation pipeline without native-jurisdiction expert review will produce documents that are technically grammatical and legally wrong.
Use structured data aggressively. schema.org/LegalDocument (or its analogue in your vertical) gives search engines and AI-powered search tools the signals they need to route users to jurisdiction-appropriate content.
Sub-sitemaps per jurisdiction scale better. In our experience, a monolithic sitemap became unreliable at around 11,000 URLs, even though the protocol allows up to 50,000 per file. Sub-sitemaps are a well-supported standard; use them.
Model the two relationships separately. "Same document, different language" and "same concept, different legal system" are different relationships. Give them different fields in your schema. Render them differently in your UI. Signal them differently in SEO.

The underlying lesson, if there is one: multilingual ≠ multi-jurisdictional. For most consumer content, multilingual SEO is the harder problem. For regulated-vertical content, multi-jurisdictional modeling is the harder problem, and it subsumes multilingual as a special case.

This post was written by the engineering team at Forms Legal. We maintain 11,000+ free legal document templates across 21 jurisdictions in English, Spanish, Portuguese, French, German, Italian, and Dutch. Our sitemap index is publicly browsable at /sitemap-index.xml for anyone wanting to see the architecture described above in production.

DEV Community