DEV Community: overthelex

340 Million Records and 64 Tools: The Complete Data Map of LEX AI

overthelex — Fri, 03 Jul 2026 21:52:42 +0000

The LEX AI platform is built on a simple idea: lawyers shouldn't waste time manually searching across dozens of websites. Instead — one question in chat, and the AI finds the right data from every available source.

Today in production we serve 340+ million records from 30+ sources, unified through 64 MCP tools (Model Context Protocol). This article is the complete overview: what we have, where it comes from, and how it works.

The Big Picture

| Category | Records | Tools |
|---|---|---|
| EDRSR (court decisions) | ~208M | 6 |
| Court system | 30.5M+ | 7 |
| OpenReyestr + NAIS | 41.8M | 24 |
| Sanctions & anti-corruption | 1.7M | 4 |
| ARMA + Due Diligence | 2M+ | 5 |
| Intellectual property | 295K | 3 |
| Public finance | 1M+ | 4 |
| Verkhovna Rada | 85K | 4 |
| Legislation | 318K | 3 |
| Attorneys & judges | 73K+ | 3 |
| Total | ~340M+ | 64 |

1. EDRSR — The Heart of the Platform (208M Records)

The Unified State Register of Court Decisions is the largest data source on the platform. Two datasets:

edrsr_documents — 93M metadata records (court, judge, date, category, parties)
edrsr_fulltext — 115M full decision texts (~1 TB)

What You Can Do

"Find Supreme Court decisions on moral damages compensation for 2024-2025"

The AI selects one of 6 tools:

Semantic search means you describe a situation in your own words, and the system finds decisions with similar circumstances — even when not a single keyword matches.

2. Court System (30.5M+ Records)

Beyond the decisions themselves, the platform holds data on the entire judicial process:

Procedural Tools

A separate group of tools assists with procedural work:

calculate_procedural_deadlines: calculate appeal deadlines by procedural code (CPC, CC, CAS, CrPC)
search_procedural_norms: find relevant articles of procedural codes
build_procedural_checklist: generate a checklist for a specific case stage

"What is the deadline for appealing a commercial court decision?" → Article 256 CC: 20 days from the date of the full text

3. OpenReyestr + NAIS (41.8M Records)

11 state registries from data.gov.ua plus EDR data — the most comprehensive database for due diligence:

| Registry | Records |
|---|---|
| Enforcement proceedings (ASVP) | 29M |
| Debtors registry | 10.4M |
| Individual entrepreneurs (FOP) | 6.9M |
| Company founders | 3M |
| Authorized signatories | 2.8M |
| Legal entities | 2M |
| Notarial special forms | 1.8M |
| Streets (address registry) | 1.5M |
| Administrative-territorial units | 924K |
| Tax debt | 861K |
| Social contribution (SSC) debt | 669K |
| VAT payers | 264K |
| Simplified taxation | 153K |
| Bankruptcy | 36K |
| Notaries | 5.8K |
| Arbitration managers | 3.4K |
| Forensic examination methods | 1.5K |

24 OpenReyestr tools cover: company search, beneficial owners, debtors, enforcement proceedings, bankruptcy, notaries, experts, VAT, SSC, and address data.

Example: Due Diligence in 30 Seconds

"Check counterparty by EDRPOU 12345678"

The AI automatically checks:

Registration in EDR (legal entity / individual entrepreneur)
Enforcement proceedings (ASVP)
Debtors registry
Bankruptcy
Sanctions lists
Court decisions (EDRSR)
Tax debt

The result is a structured report from all sources in a single window.

4. Sanctions & Anti-Corruption (1.7M Records)

"Is Ivanov Petro Serhiyovych on any sanctions lists?" → Search across 1.25M records: NSDC, OFAC, EU, UN, UK, and 340+ other programs → Fuzzy matching by name, TIN, passport, EDRPOU
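Fuzzy matching by name can be sketched with the standard library. This is a minimal illustration, not the platform's matcher: the `normalize` and `fuzzy_match` helpers, the 0.85 threshold, and the sample entries are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and sort tokens so word order is ignored."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(sorted(cleaned.split()))


def fuzzy_match(query: str, candidates: list[str], threshold: float = 0.85) -> list[tuple[str, float]]:
    """Return (candidate, score) pairs at or above the threshold, best match first."""
    q = normalize(query)
    scored = [(c, SequenceMatcher(None, q, normalize(c)).ratio()) for c in candidates]
    hits = [pair for pair in scored if pair[1] >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical list entries, for illustration only.
entries = ["Petro Serhiyovych Ivanov", "Ivanov Petro Sergiyovych", "Olena Kovalenko"]
matches = fuzzy_match("Ivanov Petro Serhiyovych", entries)
```

Sorting the tokens makes the comparison insensitive to name order, and the similarity ratio tolerates transliteration variants such as "Serhiyovych" and "Sergiyovych". Identifier fields (TIN, passport, EDRPOU) would more naturally be compared exactly.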

5. Intellectual Property (295K Records)

| Source | Records |
|---|---|
| Patents (Ukrpatent) | 118K |
| Trademarks | 176K |
| Shareholders (NSSMC) | 1.3K |

Search by name, owner, NICE class (for trademarks) or IPC (for patents), application number.

"Find trademarks containing 'Legal' in class 42" → 3 results: LEX AI (certificate No. 345678), LegalTech Pro...

6. Public Finance (1M+ Records)

| Source | Records |
|---|---|
| Prozorro tenders | 1M |
| Spending.gov.ua contracts | 2.8K |
| SSSU financial data | 8.4K |
| Inspection plans | 32K |

7. Verkhovna Rada (85K Records)

4 tools for monitoring parliamentary activity:

| Data | Records |
|---|---|
| Bills | 14.8K |
| Votes | 21.9K |
| Deputies | 463 |
| Deputies' assistants | 4.4K |
| Full legislative texts | 44K |

"Which deputies voted for bill 1234?" → Full list broken down by faction

8. Legislation (318K Records)

| Source | Records |
|---|---|
| EDRNPA (cards) | 141K |
| EDRNPA (texts) | 141K |
| Law sections (chunks) | 25K |
| Articles (structured) | 12K |

3 tools for working with legislation:

search_legislation: semantic search across legislative texts
get_legislation_article: specific article ("Art. 625 CC")
get_legislation_history: amendment and revision history

The system understands aliases: "Constitution", "CC" (Civil Code), "CrPC" (Criminal Procedure Code), "CommC" (Commercial Code), etc.
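Alias handling of this kind can be sketched as a lookup table plus a small parser for article references. The table below contains only the aliases named above; the regex and the `resolve_reference` helper are assumptions for illustration, not the platform's actual parser.

```python
import re
from typing import Optional

# Hypothetical alias table; only the aliases mentioned in the text are listed.
ALIASES = {
    "constitution": "Constitution of Ukraine",
    "cc": "Civil Code",
    "crpc": "Criminal Procedure Code",
    "commc": "Commercial Code",
}

# Matches "Art. 625 CC", "Article 124 of the Constitution", and similar forms.
ARTICLE_REF = re.compile(
    r"^\s*art(?:icle|\.)?\s*(\d+)\s+(?:of\s+the\s+)?(.+?)\s*$", re.IGNORECASE
)


def resolve_reference(text: str) -> Optional[tuple[int, str]]:
    """Return (article number, full act name), or None if the reference is not recognized."""
    match = ARTICLE_REF.match(text)
    if match is None:
        return None
    act = ALIASES.get(match.group(2).lower())
    if act is None:
        return None
    return int(match.group(1)), act
```

For example, `resolve_reference("Art. 625 CC")` yields article 625 of the Civil Code, while an unknown alias returns `None` so the caller can fall back to semantic search.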

9. Analytical Tools

Beyond search, the platform includes tools for legal analysis:

Architecture: How It Works

Lawyer → Chat → AI Model → Intent Classifier
    ↓
Tool Selection (1-5 out of 64)
    ↓
PostgreSQL / Qdrant / Redis
    ↓
Structured Response

Each tool is an MCP tool (Model Context Protocol). The AI model autonomously selects which tools to call based on the query context.
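To make the tool-calling flow concrete, here is a minimal sketch of a name-to-handler registry and a dispatcher for model-issued calls. It is plain Python and does not use an MCP SDK; the `tool` decorator, the `dispatch` function, the call shape, and the stub handler body are assumptions for illustration.

```python
from typing import Any, Callable

# Registry of tool name -> handler function.
TOOLS: dict[str, Callable[..., Any]] = {}


def tool(name: str):
    """Decorator that registers a handler under a tool name."""
    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        TOOLS[name] = fn
        return fn
    return decorator


@tool("get_legislation_article")
def get_legislation_article(code: str, article: int) -> dict:
    # Stub: a real handler would query the legislation store.
    return {"code": code, "article": article, "text": "<article text>"}


def dispatch(call: dict) -> dict:
    """Execute one tool call shaped like {"name": ..., "arguments": {...}}."""
    handler = TOOLS.get(call["name"])
    if handler is None:
        return {"error": f"unknown tool: {call['name']}"}
    return {"result": handler(**call["arguments"])}
```

The model only ever sees tool names and argument schemas; the dispatcher is the single place where a call is mapped to a data source, which keeps unknown or misspelled tool names from reaching the database layer.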

Three transports:

MCP stdio — for Claude Desktop
HTTP API — for web applications
SSE — for streaming results

What's Next

Coming up:

Completing UIPV import — trademarks (46% loaded), utility models (162K), industrial designs (48K)
DRRP (real estate registry) — agreement with NAIS
DRORM (movable property encumbrances) — agreement with NAIS
SLC (State Land Cadastre) — agreement with the State Geocadastre
Spending.gov.ua — acts, supplementary agreements, penalties (API ready)
Bulk download RTF — full texts of EDRSR decisions

Summary

LEX AI is more than search. It's a single access point to all of Ukraine's open legal data:

340M+ records from 30+ sources
64 MCP tools for search, analysis, and verification
Semantic search — describe the situation, find the decisions
Due diligence — counterparty check in 30 seconds
Procedural calculators — deadlines, checklists, norms

All of this is live right now at legal.org.ua.

Originally published on legal.org.ua.

Distributed Monolith: When Microservices Are Just a Monolith with Network Latency

overthelex — Fri, 03 Jul 2026 21:47:30 +0000

You split your code into services. You have separate containers. You even have a gateway. So why does deploying one service still break the other?

What is a distributed monolith

A distributed monolith is an architecture that looks like microservices but behaves like a monolith. Services are separated at the code level but remain coupled at the infrastructure, data, or deployment level.

Classic symptoms:

Shared database – different services read/write to the same PostgreSQL instance
Shared library without versioning – a change in a common package breaks everyone simultaneously
One docker-compose – all services are deployed together, even if only one changed
Synchronous HTTP calls – service A cannot function if service B is unresponsive
Shared cache – one Redis for everyone, LRU eviction from one service kills another's cache

Sound familiar? That's our architecture. And we believe that right now – it's the right choice.

When a distributed monolith is the right choice

Here's an unpopular opinion: a distributed monolith isn't always a problem. At a certain scale, it's the optimal architecture.

Benefits we get

1. Operational simplicity – One docker compose up brings everything up.

2. Development speed – A shared package means DRY.

3. Transactional integrity – One PostgreSQL = the ability to JOIN across schemas.

4. Debuggability – One docker compose logs shows the entire request flow.

5. Cost – One server instead of three.

The formula: when a distributed monolith is enough

Team < 5 developers, load < 1000 RPS, deploys < 5/day, one server handles it, no requirements for independent scaling.
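The formula reads naturally as a predicate. The sketch below encodes the thresholds exactly as stated; the `SystemProfile` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SystemProfile:
    developers: int
    peak_rps: float
    deploys_per_day: float
    fits_on_one_server: bool
    needs_independent_scaling: bool


def distributed_monolith_is_enough(p: SystemProfile) -> bool:
    """All five conditions from the formula must hold at once."""
    return (
        p.developers < 5
        and p.peak_rps < 1000
        and p.deploys_per_day < 5
        and p.fits_on_one_server
        and not p.needs_independent_scaling
    )
```

Because the conditions are joined with `and`, breaking any single one (for example, a component that must scale independently) is enough to start the evolution plan.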

Step-by-step evolution plan

Phase 1: Hardening (effort: low, impact: 80%)

Split Redis into separate instances per service
Version the shared package with semver
Add circuit breaker in RemoteServiceClient
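The circuit breaker step can be sketched in a few lines. This is a generic illustration of the pattern, not the code inside RemoteServiceClient: the class, the thresholds, and the injectable clock are assumptions.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open, failing fast")
            # Timeout elapsed: half-open state, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure counter.
        self.failures = 0
        self.opened_at = None
        return result
```

With this in place, service A fails fast while service B is down instead of piling up blocked HTTP calls, which addresses the synchronous-call symptom without changing the deployment model.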

Phase 2: Infrastructure independence

Separate PostgreSQL instances
Split docker-compose per service
API contracts between services

Phase 3: True microservices (team > 5)

Service discovery instead of env vars
Message queue for async operations
Independent CI/CD pipelines

Conclusion

A distributed monolith is not a diagnosis. It's a stage in architectural evolution. 80% of microservice benefits can be achieved with 20% of the effort – by splitting Redis, adding a circuit breaker, and versioning your shared package.

Originally published on legal.org.ua.

Opus + RAG vs Fine-tuned LLM + RAG: Two Approaches to Legal AI — LEX vs Harvey

overthelex — Fri, 03 Jul 2026 21:42:18 +0000

Harvey spent $100M+ and trained a custom model on the entire US case law corpus. We connected Claude Opus to 100M+ court decisions from EDRSR via RAG. Both work. But these are fundamentally different engineering and business decisions.

When an ordinary AI startup from Ukraine applies to Google for Startups Cloud Program and receives a five-figure dollar grant — that's not luck. It's validation of the approach. Google saw the same thing we see: 100M+ court decisions, an open data corpus unmatched in scale anywhere in Europe, and a team that has already built a production RAG system on top of it. Google Cloud resources — TPU pods, compute credits, engineering support — are not charity. It's an investment in Ukraine's jurisdiction becoming the first proving ground for open-weight legal AI based on DeepSeek v3, trained on real data from a real legal system. Harvey spent $100M on a partnership with OpenAI for US case law. We're doing the same for Ukraine — with a grant from Google, an open model, and a corpus assembled from public registries.

Context: Why This Comparison Matters

Harvey AI is the most prominent legal AI company in the world. $5B+ valuation, 42% of the US top-100 law firms as clients, a partnership with OpenAI at the level of custom model training. Their approach is the industry benchmark.

LEX AI is a Ukrainian legal AI platform built on a fundamentally different architecture: a foundation model (Claude Opus) + RAG over the complete corpus of the Unified State Register of Court Decisions (EDRSR) — 100+ million documents.

Both systems solve the same problem: help a lawyer find relevant case law, analyze it, and apply it. But their architectural approaches are diametrically opposed.

Harvey's Approach: Fine-tuned LLM + RAG

Architecture

Harvey built a three-tier system:

1. Foundation Layer — GPT-4/GPT-5 as the base model, deployed on Azure

2. Domain Fine-tuning Layer — pre-training and post-training on 10 billion tokens of legal data:

The complete US case law corpus (starting with Delaware, then expanding nationwide)
Legal reasoning patterns
Specialized terminology and citation formats

3. Client Customization Layer — adaptation for specific firms:

Firm document templates
Style guides
Internal precedents

Search System

Separately from the model, Harvey built a custom retrieval system:

Voyage AI embeddings (voyage-law-2-harvey) — trained on 20B+ tokens of case law
Custom legal embeddings achieved 25% reduction in irrelevant results compared to generic embeddings
Hybrid search (vector + keyword)
Legal-specific preprocessing and postprocessing
Integration with LexisNexis for Shepardization (checking whether a precedent is still good law)

Results

97% — the rate at which lawyers in blind testing chose the fine-tuned model's response over GPT-4
0.2% hallucination rate (vs. 17-33% for generic models)
Every sentence backed by a citation to an actual case
Multi-model orchestration: different models for drafting, research, and jurisdiction-specific queries

Cost of This Approach

$100M+ in investment (Series C from Sequoia, Google Ventures, et al.)
Partnership with OpenAI at the level of custom model training
Team of 200+ engineers
Months of training and verification per iteration
Lock-in to a single jurisdiction (US case law) with enormous effort required to scale

LEX's Approach: Opus + RAG

Architecture

Our approach is fundamentally different — we don't train the model, we build infrastructure around it:

1. Foundation Model — Claude Opus (as-is, no fine-tuning)

1M context window
Strongest reasoning among publicly available models
Native understanding of Ukrainian language

2. RAG over the complete EDRSR corpus:

100+ million court decisions
Full-text search (PostgreSQL GIN indexes with 'simple' language for Cyrillic)
Semantic search (Qdrant + OpenAI embeddings)
Semantic Sectionizer — splits documents into logical sections (articles, parts, clauses)

3. MCP (Model Context Protocol) — structured interface between model and data:

QueryPlanner classifies intent and selects search strategy
DocumentService retrieves and caches documents
LegislationService handles legislation (understands "Article 124 of the Constitution")
EdsrFtsService — full-text search across the entire EDRSR

Search System

Lawyer's query
    │
    ▼
QueryPlanner (intent classification)
    │
    ├── Semantic Search (Qdrant)
    │   └── embeddings: text-embedding-ada-002
    │
    ├── Full-text Search (PostgreSQL)
    │   └── GIN indexes, 'simple' language config
    │
    └── Legislation Lookup (RADA API)
        └── intelligent sectioning
    │
    ▼
Context Assembly (relevant chunks)
    │
    ▼
Claude Opus (reasoning + generation)
    │
    ▼
Response with source citations

Results

Full coverage of Ukrainian jurisdiction (100M+ decisions — the entire EDRSR)
Citations with references to specific cases
Understanding of martial law context, mobilization, new legislation
Real-time corpus updates (new decisions enter the system automatically)
Legislation, registries, and parliamentary data in a single interface

Cost of This Approach

Team: 1 developer + Claude Code (735 commits in 25 days)
Zero model training costs
API costs: pay-per-use (Opus + embeddings)
Infrastructure: 1 prod server, Docker Compose, PostgreSQL + Qdrant
Time to production: weeks, not months

Comparison: What Actually Differs

1. Where Legal Knowledge Lives

A fine-tuned model "knows" jurisprudence at an intuitive level. It has seen millions of cases during training and developed patterns of legal reasoning. When a lawyer asks about piercing the corporate veil, the model doesn't just search — it "remembers" the key precedents.

Opus + RAG "knows" jurisprudence through context. The model receives relevant case fragments via RAG and applies its generic reasoning to analyze them. Opus doesn't "remember" case law — but it can read and analyze it better than any specialized model of smaller scale.

2. Hallucinations and Reliability

Harvey achieved a 0.2% hallucination rate through:

Fine-tuning on real cases (the model has "seen" them)
Post-processing with citation verification
Shepardization via LexisNexis

LEX minimizes hallucinations through:

Grounding — the model responds only based on provided context
Explicit instructions — the system prompt requires source citations
Verification — QueryPlanner checks that real documents were found
Constitutional constraints — the model is explicitly instructed not to draw conclusions beyond the provided data
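The grounding and verification steps can be illustrated with a small prompt builder. This is a sketch under assumptions: the chunk shape (`case_id`, `text`), the instruction wording, and the `build_grounded_prompt` name are hypothetical and do not reproduce the production system prompt.

```python
def build_grounded_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a prompt that restricts the model to numbered, retrieved sources."""
    if not chunks:
        # Verification step: refuse to build a prompt when retrieval found nothing.
        raise ValueError("no documents retrieved; refusing to answer without sources")
    sources = "\n\n".join(
        f"[{i}] (case {c['case_id']})\n{c['text']}" for i, c in enumerate(chunks, 1)
    )
    return (
        "Answer using only the sources below. Cite sources as [n]. "
        "If the sources do not answer the question, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )
```

Numbering the sources gives the model a stable way to cite, and the empty-retrieval check ensures an answer is never generated from the model's memory alone.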

3. Updatability

This is the biggest advantage of the RAG approach.

A fine-tuned model is a snapshot of the corpus at the time of training. A new Supreme Court decision handed down yesterday doesn't exist for the model until the next fine-tuning cycle (weeks to months).

A RAG system updates in real time. A decision entered into EDRSR this morning is available for search by tonight. For a jurisdiction under martial law, where new legislation appears every week, this is critical.

4. Scaling to New Jurisdictions

Harvey scales with difficulty: each new jurisdiction means a new cycle of data collection, training, and verification. US case law ≠ EU case law ≠ Ukrainian judicial practice. Reasoning patterns differ. Legal terminology differs. The hierarchy of sources differs.

RAG scales easily: connect a new document corpus, configure embeddings, update the search pipeline. We've already connected:

EDRSR (100M+ decisions)
Legislation via RADA API
OpenReyestr (business entity registry)
Parliamentary data (deputies, bills, votes)

5. Reasoning Customization

Fine-tuning lets you embed legal reasoning into the model:

The model "understands" legal argumentation
It can independently build chains of precedents
Less dependent on search quality

Prompt engineering + RAG lets you control reasoning:

Transparent logic (you can read the prompt)
Easy to change strategy (update the prompt, not retrain the model)
Constitutional constraints via RLHF principles in the prompt

Why We Chose RAG Over Fine-tuning

1. Economic Reality

Fine-tuning a legal model is a $10M+ project even for a minimum viable product. Harvey raised $100M+ and has a team of 200+ people. For the Ukrainian market, where the entire legal tech TAM is a fraction of what a single Am Law 100 firm earns, such investment makes no economic sense.

The RAG approach let us ship to production with a one-person team and a budget for API calls.

2. Iteration Speed

Fine-tuning cycle: collect data → clean → train → evaluate → deploy. Weeks to months.

RAG cycle: update the prompt → deploy. Minutes.

When the Grand Chamber of the Supreme Court adopts a new legal position that changes interpretation across an entire field — a RAG system adapts in hours, not months.

3. Foundation Model Quality

In 2023, when Harvey started fine-tuning, GPT-4 was the best model available, and its reasoning on legal tasks was "good but not sufficient." Fine-tuning made sense.

In 2026, Claude Opus has a 1M context window and reasoning that surpasses specialized models. The gap between "generic Opus + the right context" and "fine-tuned GPT + retrieval" has narrowed significantly. Foundation models have caught up with fine-tuned specialized models on reasoning quality — and continue improving with every release.

4. Ukrainian Jurisdiction

Ukrainian law is not common law. There is no stare decisis (binding precedent). Case law is advisory in nature. This means:

Precise precedent citation is less critical than in US law
Knowing current legislation + Supreme Court legal positions matters more
The corpus changes constantly (martial law, new statutes every week)
RAG with real-time updates is a perfect fit for this context

5. Transparency and Control

A fine-tuned model is a black box. You don't know why it generated a particular response. Which weights fired? Which cases did it "recall"?

RAG is transparent. You can see:

Which documents were found (search results)
What entered the context (retrieved chunks)
What the model received as input (prompt)
How it arrived at the answer (reasoning in output)

For a legal system where every response can affect a person's fate, transparency is not a nice-to-have — it's a requirement.

Where Fine-tuning Still Wins

Honesty demands acknowledgment: there are tasks where Harvey's fine-tuned model is objectively better:

1. Legal reasoning without context — when a lawyer asks a general legal question without a specific case, a fine-tuned model gives a better answer because it "knows" jurisprudence. RAG depends on search quality.

2. Chains of precedent — a fine-tuned model can independently build an argument through a series of related precedents because it "saw" those connections during training. RAG may miss a precedent if the search didn't find it.

3. Legal document stylistics — a model trained on millions of legal texts better mimics the style of legal writing. A generic model requires more prompt engineering.

4. Scale — when processing hundreds of contracts at once (due diligence), a fine-tuned model is more efficient because it doesn't need retrieval at every step.

The Future: Convergence of Approaches

The boundary between RAG and fine-tuning is blurring:

Harvey is building RAG on top of its fine-tuned model (their case law search is RAG)
We are exploring domain-specific embeddings (an analogue of voyage-law, but for Ukrainian jurisprudence)
Both are moving toward agentic workflows — multi-step systems where the model decides what to search for

The truth is that "fine-tuning vs RAG" is a false dichotomy. Harvey uses both fine-tuning and RAG. We use RAG and will be adding elements of domain adaptation (custom embeddings, constitutional RLHF).

The ultimate architecture for legal AI is a spectrum:

Pure RAG ←──────────────────────────────────→ Pure Fine-tuning
  │                                                    │
  LEX (Opus + EDRSR)            Harvey (custom GPT + RAG)
  │                                                    │
  Cheap, fast,                          Expensive, slow,
  transparent, updatable                deep, precise

The optimum for each jurisdiction, team, and budget lies somewhere between these poles.

LEX + Google + DeepSeek v3: Fine-tuning for Ukrainian Jurisdiction

We're not just comparing approaches — we're moving toward fine-tuning ourselves. LEX AI is working with Google on a task analogous to Harvey + OpenAI, but for Ukrainian law.

Why DeepSeek v3

DeepSeek v3 is an open-weight model with a Mixture-of-Experts architecture (671B parameters, 37B active per query). For fine-tuning on Ukrainian jurisdiction, it's the ideal foundation:

Open weights — full control over training, no API provider lock-in
MoE efficiency — inference cost is several times lower than dense models of comparable scale
Strong multilingual capabilities — quality Cyrillic and Ukrainian language support out of the box
Legal reasoning — baseline reasoning on par with GPT-4o, providing a high starting point for domain adaptation

What We're Training

The fine-tuning corpus: 100M+ court decisions from EDRSR, Ukrainian legislation, Supreme Court legal positions. This is the same dataset that currently lives in our RAG system — but instead of feeding it into context every time, we're embedding legal knowledge directly into the model weights.

Key directions:

Pre-training on the full EDRSR corpus — the model will "see" all of Ukraine's case law
Post-training on "lawyer query → quality response" pairs with legal annotators
Constitutional RLHF — reward signal based on the Constitution of Ukraine (described in our previous article)
Custom embeddings for Ukrainian legal text (analogous to Harvey's voyage-law-2-harvey)

Google's Role

Google Cloud provides training infrastructure: TPU pods for pre-training on hundreds of millions of documents, distributed training tools, and expertise in optimizing MoE models. The partnership enables us to do work that previously required a team of 200+ engineers.

How This Changes LEX

The final LEX architecture will be hybrid:

Lawyer's query
    │
    ▼
Fine-tuned DeepSeek v3 (legal reasoning in weights)
    +
RAG (current decisions, new legislation)
    +
Constitutional RLHF (ethical constraints)
    │
    ▼
Response with deep legal reasoning
+ current sources
+ constitutional guarantees

This is what Harvey built for US common law at $100M+ with OpenAI. We're building the same for Ukrainian jurisdiction with Google and DeepSeek — on open data, with an open model, for a market where access to justice is not a business metric but a matter of survival.

Conclusions

For Ukrainian legal tech in 2026, RAG + Opus is the right choice. Not because fine-tuning is bad. But because:

Foundation models have become smart enough for RAG to perform on par with fine-tuned specialized models
Ukrainian jurisdiction demands real-time updates that fine-tuning cannot provide
The economics of the Ukrainian market don't allow spending $100M on model training
RAG transparency is critical for a legal system where an error is not a bug but a human rights violation

Harvey took the right path for their context: US common law, a $500B market, $100M in investment. We're taking the right path for ours: Ukrainian law, martial law, a team of one person and an AI partner.

Different realities — different architectures. But the goal is one: to make justice more accessible.

Registration: legal.org.ua

Originally published on legal.org.ua.

How We Vectorize 33.7M Ukrainian Court Decisions via Voyage AI

overthelex — Fri, 03 Jul 2026 21:35:48 +0000

EDRSR, the Unified State Register of Court Decisions, is effectively all of Ukraine's judicial practice in open access. Today Qdrant holds 44M+ vectors: criminal (19M), civil (14.3M), commercial (5.1M), misdemeanors (5.6M). Vectorization of civil cases (CPC, justice_kind=1), the largest cohort at 33.7M documents, runs on a dedicated EC2 instance (r6a.xlarge, 32 GB RAM, 2 TB gp3). Here's what's under the hood: models, pipeline, cost, pitfalls, and current status.

Why Vectorize Courts

When a lawyer searches "is there case law on recovering bank prepayment fees" — they don't want to open 40 decisions and read them through. They want the system to surface the top 5 most relevant ones, pull out key paragraphs, and show how courts reasoned. Full-text search (FTS) over keywords doesn't give that — it returns every document containing the word "fee", and there are thousands.

For this semantic task you need vector representations of text. The model turns a paragraph from a decision into a point in a 1024-dimensional space; semantically similar paragraphs sit near each other. A kNN search in Qdrant returns the top K nearest, and an LLM composes the answer from exactly those relevant fragments.
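As a toy illustration of the kNN idea, the sketch below ranks points by cosine similarity with a brute-force scan. Qdrant itself uses an approximate HNSW index rather than a full scan, and the 3-dimensional vectors and point names here are made up; real embeddings would have 1024 dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, points, k=5):
    """Return the k (point_id, score) pairs most similar to the query, best first."""
    scored = [(pid, cosine(query, vec)) for pid, vec in points.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical 3-dimensional vectors standing in for paragraph embeddings.
points = {"a": [0.9, 0.1, 0.0], "b": [0.0, 1.0, 0.0], "c": [1.0, 0.0, 0.0]}
nearest = top_k([1.0, 0.0, 0.0], points, k=2)
```

A full scan like this is linear in the number of points, which is exactly why an index such as HNSW becomes necessary at tens of millions of vectors.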

The only problem: the register is big. Very big.

Scale

Our prod database holds full texts of decisions starting from 2006. Breakdown by procedural type:

Civil (CPC) — 33.7M documents. The largest category. Consumer, housing, labor, family.
Criminal (CrPC) — 12M+
Administrative (CAS) — 14M+
Commercial (CC) — 6M+
Misdemeanors (CUaP) — 6M+

The Qdrant collection edrsr_decisions on a dedicated EC2 currently holds 44M+ vectors (122 segments, on_disk=true):

| Proceeding type | justice_kind | Vectors |
|---|---|---|
| Criminal (CrPC) | 2 | 19,036,347 |
| Civil (CPC) | 1 | 14,328,427 |
| Misdemeanors (CUaP) | 5 | 5,579,432 |
| Commercial (CC) | 3 | 5,098,662 |
| Total | | 44,042,868 |

Civil cases processed: 14.3M out of 33.7M — that's 42%. After CPC completes there will be roughly 63M+ vectors in a single collection.

For scale: a typical RAG project holds 100K to 1M vectors. Ours is roughly two orders of magnitude bigger.

Stack

Embedding model. voyage-3.5 from Voyage AI. 1024-dimensional output, 6 cents per million tokens. We tested Voyage 3 Large and OpenAI text-embedding-3-large, but the quality gain on legal text didn't justify the cost difference (Voyage 3 Large is 3x more expensive). We already had an index on 3.5 for prior jurisdictions, so we stay on it for compatibility.

Vector DB. Qdrant v1.17, self-hosted in Docker on a dedicated EC2 (r6a.xlarge — 4 CPU, 32 GB RAM, 2 TB gp3). Collection edrsr_decisions with HNSW index, on_disk=true for both vectors and payload. Payload carries doc_id, court_code, judge, justice_kind, adjudication_date, plus chunk_index/total_chunks and chunk text. Dedicated instance because 44M+ points with HNSW were killing RAM on prod and blocking the chat service (OOM kills during segment optimization).

Source-of-truth. PostgreSQL 15, partitioned tables: RANGE by adjudication_date, LIST by adj_year. Full texts live in edrsr_fulltext, metadata in edrsr_documents. A JOIN across all partitions is 30M+ rows, so the pipeline walks year by year.

Runtime. Python 3.11, asyncio, aiohttp. No frameworks — direct HTTP to Voyage and Qdrant. 440 lines of code, one file.

Chunking

Court decisions are long. The average CPC ruling runs 8–12K characters, and the longest reach 200K. Voyage accepts up to 32K tokens per input, but quality falls off on long contexts, and a single vector for a long document is poor for retrieval: the LLM can't tell which paragraph is relevant.

So we chunk: up to 2048 characters per chunk, 50-word overlap between neighbors. We split on paragraph boundaries to keep semantic coherence. On average one decision yields 2.7 chunks.
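A minimal sketch of paragraph-aware chunking with word overlap, using the 2048-character and 50-word parameters stated above. Treating each newline as a paragraph boundary and hard-splitting oversized paragraphs are assumptions of this example, not confirmed details of the pipeline.

```python
def chunk_text(text: str, max_chars: int = 2048, overlap_words: int = 50) -> list[str]:
    """Pack paragraphs into chunks of at most max_chars, carrying a word overlap forward."""
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
            # Start the next chunk with the last words of the previous one.
            tail = " ".join(current.split()[-overlap_words:])
            current = f"{tail}\n{para}"
        else:
            current = para
        # Assumption: anything still too long is hard-split at the character limit.
        while len(current) > max_chars:
            chunks.append(current[:max_chars])
            current = current[max_chars:]
    if current:
        chunks.append(current)
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears with some of its context in the following chunk, at the cost of embedding a few words twice.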

Each chunk in Qdrant gets a composite ID (doc_id × 1000 + chunk_index) — no collisions, and a single payload filter query pulls all chunks of a specific decision.
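The composite ID scheme is simple enough to show directly. The helper names are hypothetical; the multiplier of 1000 comes from the text, and the range check makes explicit the assumption that no decision produces 1000 or more chunks.

```python
CHUNKS_PER_DOC = 1000  # multiplier from the text: doc_id * 1000 + chunk_index


def point_id(doc_id: int, chunk_index: int) -> int:
    """Build the composite Qdrant point ID for one chunk of one decision."""
    if not 0 <= chunk_index < CHUNKS_PER_DOC:
        raise ValueError(f"chunk_index {chunk_index} would collide with another document")
    return doc_id * CHUNKS_PER_DOC + chunk_index


def split_point_id(pid: int) -> tuple[int, int]:
    """Recover (doc_id, chunk_index) from a composite point ID."""
    return divmod(pid, CHUNKS_PER_DOC)
```

With 2048-character chunks, even a 200K-character decision stays around a hundred chunks, comfortably below the limit. The IDs are also deterministic, so re-running a batch after a failure overwrites the same points instead of creating duplicates.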

Concurrency and Throttling

Voyage has a rate limit — 2000 RPM per key for voyage-3.5. We have two keys and round-robin between them, giving a theoretical 4000 RPM ceiling. In practice we hold concurrency 50 and get a steady 63 documents per second. That's ~170 requests per minute per key — comfortably under the rate limit.

We tried concurrency 70 — first two million were fine, then the process stalled on the GIL (13% CPU, no progress, no errors — just stuck on a thread lock). Dropped to 50 — ran smooth, no deadlocks, no 429s.

Every 100 documents, the pipeline sends a batch to Voyage (batch_size=500 chunks/request), receives the embeddings, composes Qdrant points, and does one upsert. On a Voyage error (429, network), it applies exponential backoff with jitter, max 5 retries. On a Qdrant error, it retries the same batch.
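Key rotation and backoff with jitter can be sketched together. This is a synchronous illustration under assumptions (the real pipeline is asyncio-based): `TransientError`, the base delay, the cap, and the choice to rotate keys on every attempt are hypothetical; only the retry limit of 5 comes from the text.

```python
import itertools
import random
import time


class TransientError(Exception):
    """Stand-in for a 429 response or a network failure."""


def backoff_delays(max_retries=5, base=1.0, cap=60.0, rng=random.random):
    """Yield one jittered delay per retry: uniform in [0, min(cap, base * 2**attempt))."""
    for attempt in range(max_retries):
        yield rng() * min(cap, base * 2 ** attempt)


def call_with_retry(request, keys, max_retries=5, sleep=time.sleep):
    """Call request(key), rotating keys round-robin and backing off on transient errors."""
    key_cycle = itertools.cycle(keys)
    delays = backoff_delays(max_retries)
    while True:
        try:
            return request(next(key_cycle))
        except TransientError:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted: surface the last error
            sleep(delay)
```

The jitter spreads retries out in time so that 50 concurrent workers hitting a rate limit do not all retry at the same instant.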

Checkpoint and Resume

At 33.7M documents any failure — network, OOM, container crash — means hours of lost work. So:

Every 1000 processed documents the pipeline writes a checkpoint JSON: {last_doc_id, processed_docs, total_chunks, total_tokens, timestamp}
On startup — reads checkpoint, resumes with WHERE doc_id > last_doc_id
All metrics (docs, chunks, tokens, cost) accumulate across checkpoints

This has saved us twice. First time — when postgres-prod ran out of memory (more on that below). Second time — when Qdrant restarted and lost its API key from env. Both times we just restarted from the same checkpoint with no duplicated work.
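The checkpoint mechanism can be sketched as follows. The field names match the checkpoint JSON described above; the atomic write via a temporary file, the helper names, and the `text` column in the query are assumptions of this example.

```python
import json
import os
import tempfile
import time


def save_checkpoint(path, state):
    """Write the checkpoint atomically: temp file in the same directory, then rename."""
    state = {**state, "timestamp": time.time()}
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # a crash mid-write never leaves a truncated checkpoint


def load_checkpoint(path):
    """Return the saved state, or a zeroed state on first run."""
    if not os.path.exists(path):
        return {"last_doc_id": 0, "processed_docs": 0, "total_chunks": 0, "total_tokens": 0}
    with open(path) as f:
        return json.load(f)


def resume_query(state):
    """Parameterized query that continues after the last processed document."""
    sql = "SELECT doc_id, text FROM edrsr_fulltext WHERE doc_id > %s ORDER BY doc_id"
    return sql, (state["last_doc_id"],)
```

Resuming with `doc_id > last_doc_id` can reprocess up to one checkpoint interval of documents after a crash, which is harmless when point IDs are deterministic and upserts overwrite existing points.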

Prod Incident: Postgres OOM

At 2.86M documents postgres-prod fell into recovery mode. Root cause: config mismatch — shared_buffers=16GB, container memory limit 12G. PG tried to allocate more than it had; OOM killer killed the process.

Fix in PR #1453: mem_limit: 24G, shm_size: 16g. After restarting the container with the new limits PG came up in 4 seconds and stopped falling over. The episode highlighted an infra pattern: postgresql.conf parameters (shared_buffers, work_mem, maintenance_work_mem) must align with container limits. Otherwise the system runs fine until the first load spike, then falls into recovery.
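A pre-deploy sanity check for this class of mismatch might look like the sketch below. The helper names and the 25% headroom are assumptions; the values in the usage note are the ones from the incident.

```python
UNITS = {"kb": 1024, "mb": 1024**2, "gb": 1024**3, "k": 1024, "m": 1024**2, "g": 1024**3}


def parse_size(value: str) -> int:
    """Parse sizes such as '16GB', '12G', or '24g' into bytes."""
    v = value.strip().lower()
    # Check two-letter suffixes before one-letter ones.
    for suffix in sorted(UNITS, key=len, reverse=True):
        if v.endswith(suffix):
            return int(float(v[: -len(suffix)]) * UNITS[suffix])
    return int(v)


def check_pg_memory(shared_buffers: str, container_limit: str, headroom: float = 0.25) -> bool:
    """True if shared_buffers leaves at least `headroom` of the container limit free."""
    return parse_size(shared_buffers) <= parse_size(container_limit) * (1 - headroom)
```

Here `check_pg_memory("16GB", "12G")` fails, matching the original misconfiguration, while `check_pg_memory("16GB", "24G")` passes with the limits from the fix. A fuller check would also account for work_mem and maintenance_work_mem multiplied by the expected number of connections.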

We also bumped swap on the local dev machine from 8GB to 24GB — heavy Voyage API traffic generates a lot of temporary objects in the Python process memory, especially while Qdrant is rebuilding its index in the background.

Cost

One civil document averages 2.7 chunks × 850 tokens = 2300 tokens. At voyage-3.5 pricing of 6 cents per million tokens, one document costs 0.014 cents — roughly 138 microdollars.

As of today, 14.3M documents out of 33.7M are processed — that's 42% of the cohort. We've spent approximately 1,980 dollars on the Voyage API and about 63 hours of pipeline runtime. Remaining 19.4M documents cost roughly 2,680 dollars and 85 hours (3.5 days of continuous processing). Total cost of the full CPC cohort vectorization — around 4,660 dollars.
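The arithmetic behind these figures can be reproduced in a few lines. The constants are the ones quoted above; the function names are illustrative. Because the article rounds tokens per document up to 2300, the unrounded results land slightly below the quoted dollar amounts.

```python
CHUNKS_PER_DOC = 2.7
TOKENS_PER_CHUNK = 850
PRICE_PER_MILLION_TOKENS = 0.06  # USD, voyage-3.5 price quoted in the text
DOCS_PER_SECOND = 63


def embedding_cost_usd(docs: float) -> float:
    """Embedding cost for a number of documents at the average chunk and token counts."""
    tokens = docs * CHUNKS_PER_DOC * TOKENS_PER_CHUNK
    return tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS


def runtime_hours(docs: float) -> float:
    """Pipeline runtime in hours at the steady throughput."""
    return docs / DOCS_PER_SECOND / 3600


spent = embedding_cost_usd(14.3e6)      # processed so far
remaining = embedding_cost_usd(19.4e6)  # still to process
hours_left = runtime_hours(19.4e6)
```

Cost scales linearly with token count, so the average chunks per document and tokens per chunk are the two numbers worth re-measuring before budgeting the next cohort.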

Plus the EC2 r6a.xlarge for Qdrant: ~$0.20/hr (on-demand), roughly $145/month. Cheaper than OOM incidents on prod.

For scale: the same budget on OpenAI text-embedding-3-large would get us only a quarter of the volume. Voyage wins specifically at this scale.

What It Gives Users

Semantic search already works across 44M+ vectors today. Once the civil cohort is fully indexed, the collection will hold 63M+ chunks. A lawyer types a natural-language query — "case law on voiding a sale contract due to seller incapacity" — and the system returns the most relevant decisions from the right jurisdiction, with key paragraph extracts and EDRSR links.

That's a different class of product compared to FTS. FTS finds documents where a phrase appears. Semantic search finds documents where your situation is being discussed — even when the court used entirely different words.

TL;DR

33.7M civil EDRSR cases → Voyage voyage-3.5 → Qdrant (14.3M / 33.7M = 42% done)
44M+ vectors in Qdrant on a dedicated EC2 (r6a.xlarge, 32 GB RAM)
63 docs/sec, concurrency 50, two API keys round-robin
~4,660 dollars total cost for full CPC vectorization + ~$145/mo EC2
Checkpoint/resume JSON, survived two incidents already
After completion — 63M+ vectors in one collection, unified semantic search over all Ukrainian judicial practice

Runs in tmux on a dedicated EC2, checkpoint fires every 1000 docs. Snapshot sync to prod Qdrant every 6 hours via cron. Boring reliable engineering, not heroics.
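The checkpoint/resume pattern is simple enough to sketch. This is an illustration under assumptions, not the production pipeline: the JSON layout (a single `processed` counter), the function names, and the `embed` callback are hypothetical. Only the every-1000-docs cadence comes from the text.

```python
import json
import os
import tempfile

CHECKPOINT_EVERY = 1000  # cadence described above


def load_checkpoint(path: str) -> int:
    """Return how many documents are already processed (0 if no checkpoint)."""
    if not os.path.exists(path):
        return 0
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)["processed"]


def save_checkpoint(path: str, processed: int) -> None:
    """Write the checkpoint via a temp file so a crash cannot leave it half-written."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump({"processed": processed}, fh)
    os.replace(tmp_path, path)  # atomic rename


def run(doc_ids, path, embed) -> None:
    """Process doc_ids in order, skipping whatever the checkpoint already covers."""
    start = load_checkpoint(path)
    for index in range(start, len(doc_ids)):
        embed(doc_ids[index])
        done = index + 1
        if done % CHECKPOINT_EVERY == 0 or done == len(doc_ids):
            save_checkpoint(path, done)
```

After a crash, at most the last 999 documents are re-embedded on restart, so the `embed` step should be idempotent (for example, upserting vectors by document ID).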

Originally published on legal.org.ua.

2 TB of Ukrainian Law + DeepSeek V3 860B on GCP: What We'd Get

overthelex — Fri, 03 Jul 2026 21:35:47 +0000

In production we have ~1.5 TB of full-text court decisions and their vector embeddings, plus another ~550 GB of other legal data: registries, legislation, business entities, a Spanish case law corpus, EU-Lex. If we take this corpus and train an MoE model the size of DeepSeek V3, scaled to 860B parameters, on GCP — what comes out? We break down the dataset, architecture, compute cost, and the properties such a model would have on Ukrainian law.

What's in the Dataset

The entire corpus is what's already running in SecondLayer's production. No extra scrapes, no Common Crawl, no noise.

EDRSR — the dataset core, ~1.5 TB. The Unified State Register of Court Decisions of Ukraine. 96.2 million full-text decisions (1,079 GB in PostgreSQL TOAST), 471 GB of vectors in Qdrant (voyage-3.5, 1024-dim), 28 GB of metadata (court, judge, date, case category, proceeding type, statute code). Breakdown by jurisdiction: civil 33.7M, administrative 14M+, criminal 12M+, commercial 6M+, misdemeanors 6M+. Largest annual cohort — 2024 (115 GB of TOAST text).

OpenReyestr — 43 GB. Ukrainian public registries: 16.7M legal entities (EDR), ownership structures (beneficiaries, shareholders), debtors (State Enforcement Service), NAIS registries. This is the foundation for SneakyPiper — our due-diligence platform — but here it serves as raw corpus for the model.

Legislation — ~40 GB. The Constitution, major codes (Civil, Criminal, Criminal Procedure, Civil Procedure, Commercial Procedure, Administrative Procedure, Labor, Tax, Customs), laws, and secondary legislation. All structurally annotated: articles, parts, clauses, revision dates with effective-date tracking. This isn't flat text: we know that Article 124 of the Constitution took effect on a specific date, carries particular references, and is cited in a precise number of decisions.

Supreme Court review practices + lu_court_decisions — ~25 GB. SC plenary decisions, practice overviews, Grand Chamber rulings. This is the most valuable slice — the legal positions that lower courts follow.

Spanish open data — ~50 GB. BOE (official gazette), AEAT (tax rulings), Tribunal Constitucional (Constitutional Court of Spain), BORME (companies register, section C), CENDOJ (criminal law), Fiscalia, Consejo de Estado, EU-Lex ES. A multilingual bonus: the model gets European legal context in its second working language.

SecondLayer opendata shards — ~30 GB. NIPO (patents/trademarks), DPA data, spending.gov.ua, parliamentary open data (Rada: deputies, bills, votes, legislation texts from zakon.rada.gov.ua), CourtSchedule, CourtExperts.

Total — roughly 2 TB of raw text. After deduplication, boilerplate filtering (standard decision headers, "enters into force upon" clauses, signatures), OCR fixes, and normalization, we expect ~800–1,000 GB of clean tokenized corpus.

In tokens (SentencePiece BPE trained on Ukrainian): approximately 280–330 billion tokens. For comparison, the original DeepSeek V3 was trained on 14.8T tokens, mostly English. Our corpus is 50x smaller, but it's focused, domain-specific, structured, and nearly unique: Common Crawl contains orders of magnitude less Ukrainian legal text.
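As a sanity check on these figures, the ratio to DeepSeek V3's training set and the implied bytes per token can be computed directly. The sketch uses only numbers from this section; pairing the low byte estimate with the low token estimate (and high with high) is an assumption.

```python
DEEPSEEK_V3_TOKENS = 14.8e12          # tokens, as quoted above
CORPUS_TOKENS = (280e9, 330e9)        # low and high token estimates
CLEAN_CORPUS_BYTES = (800e9, 1000e9)  # low and high clean-corpus size


def size_ratio(tokens: float) -> float:
    """How many times smaller the corpus is than DeepSeek V3's training set."""
    return DEEPSEEK_V3_TOKENS / tokens


def bytes_per_token(corpus_bytes: float, tokens: float) -> float:
    """Average bytes of clean text represented by one token."""
    return corpus_bytes / tokens


for corpus_bytes, tokens in zip(CLEAN_CORPUS_BYTES, CORPUS_TOKENS):
    print(f"{tokens / 1e9:.0f}B tokens: {size_ratio(tokens):.1f}x smaller, "
          f"{bytes_per_token(corpus_bytes, tokens):.2f} bytes/token")
```

Both ends of the range come out near the "50x smaller" figure, at roughly three bytes of text per token.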

Why DeepSeek V3 and What 860B Means

DeepSeek V3 is a Mixture-of-Experts (MoE) architecture from DeepSeek: 671B total parameters, 37B active per token. Inference is cheaper than for dense models of the same scale because only a fraction of the experts activates on each forward pass. For our use case (tens of millions of inference calls per month in production) that's critical.

860B is a hypothetical scale: we take the V3 topology and expand it by roughly 1.28x. Specifically: keep 61 layers, increase routed experts from 256 to ~330, and retain top-8 routing + 1 shared expert, the sigmoid router gate, and auxiliary-loss-free load balancing (as in V3 and R1). Total parameters ~860B, active per token ~47B. Still inference-friendly.
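The scaled figures can be reproduced with a plain linear extrapolation from V3's published numbers. This is a back-of-the-envelope sketch, not an architecture calculation: it assumes every quantity scales with total parameter count, whereas real active-parameter counts depend on which dimensions are actually widened.

```python
V3_TOTAL_PARAMS_B = 671
V3_ACTIVE_PARAMS_B = 37
V3_ROUTED_EXPERTS = 256
TARGET_TOTAL_PARAMS_B = 860

# Linear scale factor from V3 to the hypothetical 860B model.
scale = TARGET_TOTAL_PARAMS_B / V3_TOTAL_PARAMS_B

print(f"scale factor:      {scale:.2f}")                       # 1.28
print(f"routed experts:    {V3_ROUTED_EXPERTS * scale:.0f}")   # 328
print(f"active params (B): {V3_ACTIVE_PARAMS_B * scale:.1f}")  # 47.4
```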

Why this particular expansion? First, for a narrow-domain corpus more experts mean better specialized routing: one expert for "filing a claim under CPC," another for "tax rulings," a third for "Supreme Court reasoning in cassation orders." Second, 860B leaves headroom for multilingual coverage (Ukrainian + Spanish + Russian + English) without domain degradation. Third, MoE on TPU v5p scales very cleanly, unlike dense models of the same parameter count.

We'd use the architectural features from the original V3: Multi-Head Latent Attention (MLA) instead of GQA — this reduces KV-cache by roughly 9x, enabling long context (256K tokens) without petabytes of RAM. Multi-Token Prediction (MTP) head as an auxiliary loss during training — improves sampling and unlocks speculative decoding at inference.

Training on GCP: Config and Cost

GCP offers TPU v5p pods, which we consider the best platform for MoE training: they beat H100 clusters on per-chip memory (95 GB of HBM vs 80 GB) and on inter-chip interconnect (ICI) bandwidth. For an 860B MoE trained on 280B tokens, here's the estimate.

Minimum production config: v5p-2048 (2,048 chips, 512 hosts). On this pod, one epoch over 280B tokens completes in roughly 3–4 days. Full pre-training at 3 epochs — 9–12 days of compute time. Hyperparameter search on smaller models (70B/200B variants) — another 5–7 days on v5p-512.

v5p pricing is approximately $4.20 per chip-hour on-demand, $2.50 on a 3-year commitment. At 12 days on v5p-2048, the pre-training run alone comes to $2.5–4.2M. Add another $200–500K for experiments + supervised fine-tuning + DPO/RLHF on a separate judicial instruction dataset. Checkpoint storage in GCS runs ~100–200 GB per checkpoint; over a week you'll accumulate several TB.
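The chip-hour arithmetic behind a run like this can be sketched as follows. The chip count, run lengths, and hourly rates are the figures quoted above; the helper names are hypothetical. The sketch covers chip rental only (no storage, networking, failed runs, or restarts), and at on-demand rates the 12-day run lands near the low end of the quoted range.

```python
CHIPS = 2048
ON_DEMAND_RATE = 4.20  # dollars per chip-hour, as quoted above
COMMITTED_RATE = 2.50  # dollars per chip-hour on a 3-year commitment


def chip_hours(chips: int, days: float) -> float:
    """Total chip-hours for a run of `days` on `chips` chips."""
    return chips * days * 24


def run_cost(chips: int, days: float, rate: float) -> float:
    """Rental cost in dollars for the run at the given hourly rate."""
    return chip_hours(chips, days) * rate


for days in (9, 12):
    hours = chip_hours(CHIPS, days)
    print(f"{days} days: {hours:,.0f} chip-hours, "
          f"on-demand ${run_cost(CHIPS, days, ON_DEMAND_RATE) / 1e6:.2f}M, "
          f"committed ${run_cost(CHIPS, days, COMMITTED_RATE) / 1e6:.2f}M")
```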

Alternative: A3 Mega (H100) on GCP. 768 H100s (96 a3-megagpu-8g instances) are roughly equivalent to v5p-1024 in throughput, but less efficient for MoE because of the NVLink vs ICI interconnect difference. Price is comparable but slightly higher. So: v5p.

Data: the source corpus lives in GCS as multi-stream TFRecord chunks (256 MB each); tokenization happens on-the-fly in the data loader via the JAX/Flax/Paxml stack. This is standard for TPU training, unlike PyTorch/FSDP on H100. Pipeline: TPU chip -> HBM -> TensorCore, no round-trip to host DRAM on the hot path.

Expected Model Properties

What do we get by running this corpus through this much compute?

First: native Ukrainian legal reasoning. As of today, no frontier model truly knows Ukrainian law: not GPT-4o, not Claude Opus 4.7, not Gemini 2.5. They hallucinate Civil Code articles, confuse pre- and post-2022 code revisions, and can't distinguish administrative from civil proceedings. Our model would ingest 280B tokens of Ukrainian legal text, hundreds of times more of such text than any frontier model's pre-training dataset contains.

Second: fine-grained citation. Because the corpus is structured (each chunk carries its doc_id, category, date, article reference), the model learns not just "there's an article in the code somewhere..." but rather "pursuant to Article 611 of the Civil Code of Ukraine (revision of 17.06.2020), in cases concerning recovery of penalties..." This isn't retrieval-augmented; it's a property the model develops in its activations from the pre-training signal itself.

Third: reasoning over precedents. With 96M decisions carrying full metadata (cassation/appellate/first instance, judicial district, reporting judge, date), the model learns how lower courts apply Supreme Court legal positions, how practice evolves over time, and where splits exist between chambers. This is no longer just "information synthesis" — it's legal reasoning trained on real decisions.

Fourth: graph logic for beneficiaries and connections. 16.7M entities in OpenReyestr + SneakyPiper relationship graphs provide raw material for the model to internally build a knowledge graph of the Ukrainian business world. With proper formatting of training samples (triples like "company–beneficiary–ownership %" as text), the model learns to generate hypotheses such as "if person X is the ultimate beneficiary of 3 companies sharing the same attorney, it's worth checking connections with the offshore registry."
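To make the "triples as text" idea concrete, here is a minimal sketch of how ownership records might be rendered into training sentences, plus a helper that surfaces the kind of pattern mentioned above (one person behind several companies). The record fields, sentence template, and function names are all assumptions; the real training format is not described in this article.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class OwnershipTriple:
    company: str
    beneficiary: str
    ownership_pct: float


def triple_to_text(triple: OwnershipTriple) -> str:
    """Render one company-beneficiary-ownership triple as a plain sentence."""
    return (f"{triple.beneficiary} is a beneficial owner of {triple.company} "
            f"with a {triple.ownership_pct:g}% stake.")


def shared_beneficiaries(triples, min_companies: int = 3) -> dict:
    """Map each beneficiary linked to at least `min_companies` companies
    to the sorted list of those companies."""
    companies = defaultdict(set)
    for triple in triples:
        companies[triple.beneficiary].add(triple.company)
    return {person: sorted(names) for person, names in companies.items()
            if len(names) >= min_companies}


# Placeholder data, not real registry entries.
sample = OwnershipTriple("Company A", "Person X", 25.0)
print(triple_to_text(sample))
```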

Fifth: multilingual bridge function. The Spanish corpus (~50 GB) + EU-Lex ES + Ukrainian legislative texts creates a mapping between EU and Ukrainian criminal-law concepts — useful for extradition matters, MLAT requests, and cases with a foreign element. This isn't professional translation; it's a shared reasoning space.

Sixth: radically lower hallucination on domain queries. We expect that on a test set measuring "correct answer with article/precedent citation" we'd achieve 85–92% accuracy — compared to 40–55% for general-purpose frontier models. This is an experimental estimate, but on small variants (7B/70B fine-tuned on a corpus subset) we already see these numbers.

What the model would NOT do better than frontier models: general reasoning outside jurisprudence, math, code, creative writing in non-legal genres, niche English-language context. For those, production retains multi-model orchestration: lightweight queries go to a quick model, complex legal queries to our own, general queries to Claude/GPT.
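The routing split described above can be sketched as a small dispatcher. This is a toy stand-in for a real intent classifier: the word-count threshold, the keyword list, and the route names are all hypothetical.

```python
import re
from enum import Enum


class Route(Enum):
    QUICK = "quick-model"
    LEGAL = "in-house-legal-model"
    GENERAL = "general-frontier-model"


# Hypothetical heuristics; production would use a trained intent classifier.
LIGHTWEIGHT_MAX_WORDS = 6
LEGAL_MARKERS = {"court", "article", "code", "claim", "appeal", "cassation"}


def route(query: str) -> Route:
    """Pick a model tier: short queries go to the quick model, longer
    legal queries to the in-house model, everything else to a general one."""
    tokens = re.findall(r"\w+", query.lower())
    if len(tokens) <= LIGHTWEIGHT_MAX_WORDS:
        return Route.QUICK
    if any(token in LEGAL_MARKERS for token in tokens):
        return Route.LEGAL
    return Route.GENERAL


print(route("Find Supreme Court decisions on moral damages compensation for 2024-2025"))
```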

What This Means for SecondLayer in Production

Right now we run multi-agent orchestration: intent classifier, retrieval planner, embedding via Voyage, Qdrant search, context building, query to GPT-4o/Claude, post-processing. This is expensive ($0.01–0.05 per query), slow (3–8 seconds per response), and dependent on OpenAI/Anthropic not cutting off Ukraine tomorrow.

With our own model:

Inference at half the cost of OpenAI at comparable domain quality, because we don't pay for tokens that went into general pre-training
1–2 second latency instead of 3–8, because the query no longer makes a transatlantic round trip on top of a multi-step retrieval pipeline
Self-hosted on EU servers, GDPR-compliant, with no dependency on an external provider
Ability to fine-tune for new task types (tax, labor, attorney ethics) without paying for retraining frontier models

The key insight: what we currently have on disk isn't just "data." It's the world's largest domain corpus for training a Ukrainian legal AI model. No foreign player has this corpus, and none will for years. No open dataset (Pile, RedPajama, Dolma, FineWeb) comes close to containing this much judicial practice from any jurisdiction.

The question isn't whether it's worth doing. The question is when and with whom. $3–5M for pre-training is seed-to-Series-A territory: this is done with a single strategic investor who sees the Ukr-legal-AI market as a distinct category. We already have the pipeline, the corpus, and the team that keeps prod running on 96M decisions without downtime.

Next — compute.

Author: Volodymyr Ovcharov. legal.org.ua

Originally published on legal.org.ua.