Ashish Raj

NyayAI: Building an AI Legal Assistant for 1.4 Billion People — A Technical Deep Dive

I'm building a startup to make Indian law accessible to every lawyer, law student, and citizen in the country. Here's the technical story of how I went from zero to a working prototype — training a foundation model from scratch, fine-tuning on 4,000 instruction pairs, building a production-ready RAG pipeline, and shipping a premium SaaS product — all as a solo founder.


The Problem

India has 1.4 billion people and roughly 50 million active legal cases pending in its courts. Lawyers spend hours — sometimes days — digging through bare acts, constitutional articles, and decades of Supreme Court judgments just to find relevant precedents for a single case. The Indian legal system operates across 25+ High Courts, hundreds of tribunals, and a Supreme Court that has delivered judgments since 1950. The sheer volume is staggering.

And yet, the tooling available to lawyers is stuck in 2005. Paid databases like SCC Online and Manupatra charge thousands per month and still require manual keyword searches. Free resources like Indian Kanoon are search-only — no summaries, no analysis, no drafting. Generic AI tools like ChatGPT hallucinate case names, invent sections that don't exist, and have no depth in Indian law.

I wanted to change that.

NyayAI (न्याय = justice in Sanskrit) is an AI-powered legal assistant that understands Indian law — not superficially, but deeply. It can look up any section of any central act, summarize Supreme Court judgments, answer complex legal questions with grounded citations, and eventually draft legal documents. Think of it as ChatGPT, but one that actually passed the bar exam for Indian law.

Why Not Just Use ChatGPT?

This is the question I get asked most often. The answer is simple: a general-purpose model is broad intelligence; NyayAI is domain infrastructure for Indian law.

A general-purpose model like ChatGPT:

  • Does not maintain a live, structured legal retrieval index internally
  • Cannot guarantee exact citations from 43,000+ judgments
  • May compress or approximate precedent chains
  • May hallucinate paragraph numbers or holdings occasionally
  • Is optimized broadly across all domains, not specifically for Indian jurisprudence

NyayAI is specifically engineered for:

  • Indian legal retrieval — semantic search over the full corpus of Supreme Court judgments
  • Citation-grounded answers — every response is backed by actual legal text, not model memory
  • Statute + precedent linking — connecting Constitutional articles, Central Acts, and case law
  • Structured metadata retrieval — case title, bench, citation number, year, disposal type
  • Legal-domain-specific RAG — retrieval-augmented generation tuned for jurisprudence

The analogy is precise: GitHub Copilot is better than raw autocomplete for coding. Bloomberg exists despite Google. Westlaw exists despite search engines. NyayAI exists because Indian law deserves its own intelligence layer.

This blog post is a technical deep dive into everything I've built — the data pipelines, the model architecture decisions, the training infrastructure, the RAG pipeline, the production frontend, and the results. Every number, every decision, every failed experiment is documented here.


Phase 0: The 103M Parameter Experiment (The Learning Phase)

Before touching any pretrained model, I wanted to understand transformers at the deepest level. Not "import transformers and call .fit()" — I mean implementing a GPT-style transformer from scratch in PyTorch.

Architecture

I built a decoder-only transformer with the following specifications:

| Parameter | Value |
|---|---|
| Total Parameters | 103,457,280 (~103M) |
| Layers | 9 |
| Attention Heads | 12 |
| Embedding Dimension | 768 |
| Context Window | 512 tokens |
| Vocabulary Size | 50,257 (GPT-2 tokenizer) |
| Output Head | Weight-tied with embedding layer |

The model was trained on 269 million tokens (1.25 GB) of Indian legal text — the same corpus I'd later use for the production pipeline. Training ran on NVIDIA A100 GPUs via Modal for 2 epochs across 59,000 gradient steps.

Results

| Metric | Value |
|---|---|
| Final Validation Loss | 2.46 |
| Perplexity | 11.7 |
| Training Time | ~8 hours |

A perplexity of 11.7 on legal text means the model learned the structure and vocabulary of Indian legal language reasonably well. It could generate coherent legal-sounding text, but it was not a useful model — it had no instruction-following capability and no factual grounding. It was a learning exercise, and it served its purpose brilliantly.
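The two numbers in the results table are linked: perplexity is the exponential of the mean per-token cross-entropy loss. A quick check, assuming the reported loss is measured in nats (the PyTorch default):

```python
import math


def perplexity(mean_cross_entropy: float) -> float:
    """Perplexity is the exponential of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)


final_val_loss = 2.46  # validation loss reported for the 103M model
print(f"perplexity = {perplexity(final_val_loss):.1f}")
```

Intuitively, the model is about as uncertain as if it were choosing uniformly among roughly 12 candidate tokens at each step.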

Key Takeaway: Building a transformer from scratch taught me more about attention mechanisms, positional encoding, loss landscapes, and gradient dynamics than any course or paper ever could. If you're serious about ML, I strongly recommend doing this at least once.


Phase 1: Data Acquisition — The Foundation of Everything

A model is only as good as its data. For NyayAI, I needed three categories of legal text:

  1. The Constitution of India — the supreme law, 395+ articles
  2. Central Acts (Bare Acts) — the 858 laws passed by Parliament
  3. Supreme Court Judgments — 75 years of case law (1950–2025)

1A. The Constitution of India

Source: A structured JSON file containing all articles with metadata (article number, title, description).

Pipeline: A straightforward JSON-to-text converter that:

  • Parses each article from the JSON
  • Cleans escaped newlines and normalizes whitespace
  • Preserves repealed articles with notation
  • Formats as structured text with Article N — Title headers
  • Separates each article with <|endoftext|> tokens for clean document boundaries
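A minimal sketch of such a converter is below. The JSON field names (`article`, `title`, `description`, `repealed`) are assumptions made for illustration; the actual schema of the source file may differ.

```python
import re

EOT = "<|endoftext|>"


def clean(text: str) -> str:
    """Turn escaped newlines into real ones and collapse runs of spaces."""
    text = text.replace("\\n", "\n")
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    return "\n".join(line for line in lines if line)


def format_article(article: dict) -> str:
    """Render one article with an 'Article N, em dash, Title' header line."""
    header = f"Article {article['article']} \u2014 {clean(article['title'])}"
    body = clean(article.get("description", ""))
    if article.get("repealed"):  # hypothetical flag for repealed articles
        body = f"[Repealed] {body}".strip()
    return f"{header}\n{body}"


def build_corpus(articles: list[dict]) -> str:
    """Join all articles, closing each one with a document-boundary token."""
    return "".join(f"{format_article(a)}\n{EOT}\n" for a in articles)
```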

Output:

| Metric | Value |
|---|---|
| Articles Processed | 395+ (including amendments) |
| File Size | 502 KB |
| Estimated Tokens | ~106,000 |

The Constitution is small but dense — every article matters. The Preamble alone is one of the most frequently cited legal texts in Indian jurisprudence.

1B. Central Acts (858 Bare Acts)

This was significantly more complex. India has 858 central acts in force, ranging from the Indian Penal Code (1860) to the Digital Personal Data Protection Act (2023). These were stored as deeply nested JSON files with a schema that included:

Act Title, Act ID, Enactment Date, Act Definition
├── Chapters/Parts
│   ├── Sections
│   │   └── Paragraphs (strings or nested dicts with text/contains)
│   └── Subheadings
│       └── Sections
├── Schedules, Annexures, Appendix, Forms
└── Footnotes

Pipeline: A recursive JSON traversal engine that:

  1. Handles BOM encoding — many Indian government JSON files contain a byte-order mark
  2. Recursively extracts paragraphs — handles arbitrarily nested text/contains structures with proper indentation
  3. Cleans legislative artifacts — removes footnote reference numbers, strips decorative markers
  4. Sorts sections numerically — a custom sort function ensures Section 2 comes before Section 10
  5. Processes chapters, subheadings, schedules, annexures, and footnotes — preserving the full hierarchical structure
  6. Outputs with <|endoftext|> boundaries between each act
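The three trickiest steps (BOM handling, recursive paragraph extraction, and numeric section ordering) can be sketched as follows. This is an illustration under the assumption that nested paragraphs use `text` and `contains` keys as in the schema above; the function names are hypothetical.

```python
import json
import re


def load_act(path: str) -> dict:
    """The utf-8-sig codec strips a leading byte-order mark if one is present."""
    with open(path, encoding="utf-8-sig") as handle:
        return json.load(handle)


def extract_paragraphs(node, depth: int = 0) -> list[str]:
    """Flatten strings, lists, and nested text/contains dicts into indented lines."""
    indent = "  " * depth
    if isinstance(node, str):
        return [indent + node.strip()] if node.strip() else []
    if isinstance(node, list):
        return [line for child in node for line in extract_paragraphs(child, depth)]
    if isinstance(node, dict):
        lines = extract_paragraphs(node.get("text", ""), depth)
        lines += extract_paragraphs(node.get("contains", []), depth + 1)
        return lines
    return []


def section_sort_key(section_id: str):
    """Order '2' before '10', and '10' before '10A'."""
    match = re.match(r"(\d+)(.*)", section_id.strip())
    if match is None:
        return (float("inf"), section_id)  # non-numeric ids go last
    return (int(match.group(1)), match.group(2))
```

A plain string sort would place "10" before "2"; splitting the identifier into an integer prefix and a letter suffix keeps insertions such as "10A" directly after "10".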

Output:

| Metric | Value |
|---|---|
| Acts Processed | 858 |
| File Size | 29.9 MB |
| Total Words | ~5,076,000 |
| Estimated Tokens | ~6,600,000 |

1C. Supreme Court Judgments (1950–2025)

This was the heavy lift — and the most valuable data. The Supreme Court of India has delivered tens of thousands of judgments over 75 years. I sourced these from the AWS Open Data Registry (s3://indian-supreme-court-judgments), a public bucket containing judgment PDFs and metadata JSONs organized by year.

Step 1: Download

  • Uses boto3 with unsigned requests (public bucket, no auth needed)
  • Downloads English judgment tar files and metadata tar files for each year (1950–2026)
  • Implements resume support — skips files that already exist with correct size
  • Progress logging with download speed tracking
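A sketch of the anonymous download with resume support is below. The object key layout inside the bucket is not shown here, so `key` is a caller-supplied placeholder, and depending on your environment you may also need to pass a region to the client. The boto3 import sits inside the function so the size check can be used and tested without it.

```python
import os

BUCKET = "indian-supreme-court-judgments"  # public bucket named above


def needs_download(local_path: str, remote_size: int) -> bool:
    """Resume support: skip files that already exist with the expected size."""
    return not (os.path.exists(local_path) and os.path.getsize(local_path) == remote_size)


def download_if_missing(key: str, local_path: str) -> bool:
    """Fetch one object without credentials; returns True if a download happened."""
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    # Unsigned requests: no AWS account or credentials are required.
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    remote_size = s3.head_object(Bucket=BUCKET, Key=key)["ContentLength"]
    if not needs_download(local_path, remote_size):
        return False
    os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
    s3.download_file(BUCKET, key, local_path)
    return True
```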

Step 2: Extract & Process

This is the most complex pipeline in the entire project. It:

  1. Extracts metadata tars — unpacks year-by-year JSON metadata files
  2. Parses metadata HTML — each judgment's metadata is stored as raw HTML. A dedicated parser extracts:
    • Case title (petitioner vs respondent)
    • Judges/Coram
    • Decision date
    • Case number
    • Bench size
    • Citation
    • Disposal nature
  3. Extracts text from PDFs — uses PyMuPDF (fitz) to extract text from judgment PDFs, then cleans:
    • Page headers/footers ("SUPREME COURT REPORTS", standalone page numbers)
    • Excessive whitespace
    • Year-only lines (standalone "1950", "2023", etc.)
  4. Matches PDFs to metadata — correlates each PDF with its extracted case metadata by path key
  5. Formats each judgment as a structured document with a header block (title, citation, case number, date, bench, disposal) followed by the full judgment text
  6. Processes year-by-year — streams output to avoid loading 1.5 GB of text into memory at once
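The cleaning step (item 3) can be approximated with a few regular expressions. This sketch operates on text that has already been extracted from a PDF page; the exact patterns used in the real pipeline are not shown here, so these are illustrative guesses.

```python
import re

# Running header printed on many pages of the law reports.
HEADER_RE = re.compile(r"^\s*SUPREME COURT REPORTS\b.*$", re.IGNORECASE)
# A line holding only 1-4 digits: standalone page numbers and year-only lines.
NUMBER_ONLY_RE = re.compile(r"^\s*\d{1,4}\s*$")


def clean_judgment_text(raw: str) -> str:
    """Drop headers and number-only lines, then normalize whitespace."""
    kept = []
    for line in raw.splitlines():
        if HEADER_RE.match(line) or NUMBER_ONLY_RE.match(line):
            continue
        kept.append(re.sub(r"[ \t]+", " ", line).strip())
    text = "\n".join(kept)
    # Collapse runs of blank lines into a single paragraph break.
    return re.sub(r"\n{3,}", "\n\n", text).strip()
```

One trade-off of the number-only rule: a genuine one-line numeric paragraph marker would also be dropped, which is usually acceptable for pretraining text.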

Output:

| Metric | Value |
|---|---|
| Judgments Processed | 43,324 |
| File Size | 1.49 GB (1,588,861,395 bytes) |
| Total Words | ~261,000,000 |
| Estimated Tokens | ~339,300,000 |
| Time Span | 1950–2025 (75 years) |

Total Corpus Summary

| Source | File Size | Tokens (est.) |
|---|---|---|
| Constitution of India | 502 KB | ~106K |
| Central Acts (858 acts) | 29.9 MB | ~6.6M |
| SC Judgments (43,324 cases) | 1.49 GB | ~339.3M |
| Total | ~1.52 GB | ~346 Million |

This is a genuinely massive legal corpus — 346 million tokens of structured, cleaned Indian legal text spanning 75 years of Supreme Court jurisprudence, the entire Constitution, and every central act in force.
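The token estimates are consistent with a simple words-to-tokens ratio. The 1.3 factor below is an assumption inferred from the reported word and token counts, not a measured tokenizer statistic:

```python
TOKENS_PER_WORD = 1.3  # assumed ratio; it reproduces the estimates reported above


def estimate_tokens(word_count: int) -> int:
    """Rough token estimate from a whitespace word count."""
    return round(word_count * TOKENS_PER_WORD)


acts_tokens = estimate_tokens(5_076_000)
judgment_tokens = estimate_tokens(261_000_000)
total_tokens = acts_tokens + judgment_tokens + 106_000  # plus the Constitution
print(f"{acts_tokens:,} + {judgment_tokens:,} + 106,000 = {total_tokens:,}")
```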


Phase 1.5: Synthetic Instruction Dataset Generation

A language model that can continue legal text is interesting but not useful. To make it follow instructions — answer questions, summarize cases, compare sections — I needed an instruction-response dataset.

Creating thousands of high-quality legal Q&A pairs by hand was not feasible. Instead, I built a synthetic data generation pipeline using Google's Gemini API.

The Approach

  1. Random chunk sampling — for each batch, randomly select a ~40,000 character chunk from one of the three source files, with a weighted distribution:

    • 60% Supreme Court judgments (largest, most diverse)
    • 30% Central Acts (statute-heavy, structured)
    • 10% Constitution (fundamental, frequently referenced)
  2. Structured prompting — each chunk is sent to gemini-3.1-flash-lite with a carefully crafted prompt that enforces:

    • No hallucination — responses must be based strictly on the provided text excerpt
    • Diversity in length and complexity — each batch of 5 pairs follows a prescribed format:
      • Task 1: Very Long (3-4 paragraph comprehensive summary/brief)
      • Task 2: Medium (legal argument/analysis)
      • Task 3: Medium (comparison of concepts)
      • Task 4: Short (direct factual question)
      • Task 5: Short (yes/no client question with explanation)
    • Structured output — uses Pydantic models with response_mime_type: application/json for reliable parsing
  3. Incremental saving — pairs are appended to a JSONL file as they're generated, with a running count. Supports resume (checks existing pair count on startup).

  4. Rate limiting — 4-second sleep between requests to respect the free tier (15 RPM).
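The sampling and pacing logic can be sketched as follows. `call_model` is a hypothetical stand-in for the Gemini request (the real prompt and Pydantic schema are omitted), and `sleep` is injectable so the loop can be exercised without waiting.

```python
import random
import time

SOURCE_WEIGHTS = {"judgments": 0.6, "acts": 0.3, "constitution": 0.1}
CHUNK_CHARS = 40_000
REQUEST_INTERVAL_S = 4.0  # 60 s / 4 s = 15 requests per minute


def sample_chunk(corpora: dict[str, str], rng: random.Random) -> tuple[str, str]:
    """Pick a source by weight, then a random window of up to CHUNK_CHARS characters."""
    source = rng.choices(list(SOURCE_WEIGHTS), weights=list(SOURCE_WEIGHTS.values()), k=1)[0]
    text = corpora[source]
    start = rng.randrange(max(1, len(text) - CHUNK_CHARS + 1))
    return source, text[start:start + CHUNK_CHARS]


def generate_batches(corpora, n_batches, call_model, sleep=time.sleep, seed=0):
    """Yield (source, model_output) pairs, pausing between requests."""
    rng = random.Random(seed)
    for i in range(n_batches):
        source, chunk = sample_chunk(corpora, rng)
        yield source, call_model(chunk)
        if i < n_batches - 1:
            sleep(REQUEST_INTERVAL_S)
```

Because the function is a generator, the caller can append each result to the JSONL file as soon as it arrives, which is what makes resume-after-crash cheap.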

Output

| Metric | Value |
|---|---|
| Generated Pairs | ~4,000 |
| File Size | 2.09 MB |
| Source Distribution | 60% judgments, 30% acts, 10% constitution |
| Generation Model | Gemini 3.1 Flash Lite |
| Cost | $0 (free tier API) |

Training Data Distribution Analysis

After generation, I analyzed the response length distribution — this turned out to be a critical insight for understanding model behavior later:

| Stat | Value |
|---|---|
| Median response | 110 words (~150 tokens) |
| 25th percentile | 45 words |
| 75th percentile | 147 words (~200 tokens) |
| Max response | 367 words (~490 tokens) |
| Under 100 words | 45% of all training data |
| 100-200 words | 37% |
| 200-400 words | 18% |
| 400+ words | 0% |

This distribution matters enormously: the model will learn to produce responses at the length distribution it was trained on. More on this in Phase 4B.

The critical insight here: the quality of your instruction data matters far more than quantity. The original Stanford Alpaca project used only 52K pairs to teach instruction-following to LLaMA. For a domain-specific model, 2,000-4,000 high-quality, grounded pairs are more than enough, as long as they're diverse in task type and faithful to the source material.


Phase 2: Fine-Tuning — Teaching the Model Indian Law

With data in hand, it was time to take a state-of-the-art pretrained model and teach it to be an Indian legal expert.

Model Selection: Qwen-3 4B Instruct

After evaluating several sub-6B parameter models (Phi-4-mini, SmolLM3-3B, Gemma-3n-E2B), I chose Qwen-3 4B Instruct (2507 variant) for several reasons:

| Factor | Why Qwen-3 4B |
|---|---|
| Reasoning | Exceptional chain-of-thought and instruction following |
| Multilingual | Strong Hindi support (critical for Indian legal market) |
| Architecture | Modern optimizations, efficient attention |
| Ecosystem | Massive HuggingFace community, well-documented |
| License | Apache 2.0, fully commercial use |
| Size | 4B parameters; fits on a single L4 GPU (24GB) in bfloat16 |

Training Infrastructure

Everything runs on Modal — a serverless GPU cloud that lets you define your entire training pipeline in a single Python file and run it with one command. The entire training pipeline — from data loading to checkpoint saving — executes remotely on Modal. Checkpoints are saved to a Modal Volume and automatically downloaded to my local machine after each epoch.

LoRA: Training Smart, Not Expensive

Fine-tuning all 4 billion parameters would require multiple GPUs and cost hundreds of dollars. Instead, I implemented LoRA (Low-Rank Adaptation) from scratch — no HuggingFace PEFT library, no Unsloth, no shortcuts.

How LoRA Works

Instead of updating the full weight matrix W (size d × d), LoRA decomposes the update into two small matrices:

W' = W + (α / r)(A × B)
where A is (d × r) and B is (r × d), and r << d

For rank r=16 and dimension d=768, a full update touches 768 × 768 = 589,824 parameters per weight matrix, while LoRA trains only 16×768 + 16×768 = 24,576 parameters: a 24x reduction.
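The arithmetic generalizes to any square projection. A small helper, written purely for illustration, makes the ratio explicit:

```python
def lora_param_counts(d: int, r: int) -> tuple[int, int]:
    """Parameters in a full d x d update versus a rank-r LoRA update."""
    full = d * d
    lora = d * r + r * d  # A is (d x r), B is (r x d)
    return full, lora


full, lora = lora_param_counts(d=768, r=16)
print(f"full={full:,} lora={lora:,} reduction={full // lora}x")
```

The reduction factor is d / (2r), so it grows with the width of the layer: wider models benefit more from the same rank.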

Implementation

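import torch


class LORALayer(torch.nn.Module):
    """Low-rank update: x -> (alpha / rank) * (A(x) @ B)."""

    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        # A projects down to the low-rank space (default nn.Linear init).
        self.A = torch.nn.Linear(in_dim, rank, bias=False)
        # B projects back up; zero init makes the initial update exactly zero.
        self.B = torch.nn.Parameter(torch.zeros(rank, out_dim))
        # Standard LoRA scaling: alpha / rank = 2.0 for rank 16, alpha 32.
        self.scaling = alpha / rank

    def forward(self, x):
        return self.scaling * (self.A(x) @ self.B)


class LinearWithLoRA(torch.nn.Module):
    """Wraps a frozen pretrained linear layer and adds the LoRA branch."""

    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear  # pretrained projection, kept frozen
        self.lora = LORALayer(linear.in_features, linear.out_features, rank, alpha)

    def forward(self, x):
        # Base output plus the learned low-rank correction.
        return self.linear(x) + self.lora(x)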

The B matrix is initialized to zeros, so at the start of training the LoRA branch outputs A(x) @ 0 = 0 whatever the scaling factor. The model starts exactly where the pretrained model left off, with no disruption. As training progresses, the LoRA layers learn domain-specific adaptations while the base model stays frozen.

Target Modules

LoRA adapters were injected into the attention layers only:

lora_target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]

Hyperparameters

| Parameter | Value | Rationale |
|---|---|---|
| LoRA Rank | 16 | Sweet spot: enough capacity for domain adaptation without overfitting on ~4K pairs |
| LoRA Alpha | 32 | α/r = 2.0 scaling factor, a standard choice |
| Peak Learning Rate | 2e-5 | Conservative, to avoid catastrophic forgetting of base model knowledge |
| Minimum Learning Rate | 2e-6 | 10x decay from peak |
| Warmup Steps | 50 | Quick ramp to prevent early instability |
| Batch Size | 4 | Fits in L4 VRAM with gradient checkpointing |
| Max Sequence Length | 8,192 | Training-time cap, well within Qwen-3's native context window |
| Weight Decay | 0.1 | Standard regularization |
| Gradient Clipping | 1.0 (max norm) | Prevents exploding gradients on long legal sequences |
| Optimizer | AdamW | Only over LoRA parameters |
| Precision | bfloat16 | Native on L4, no precision loss for this scale |
| Epochs | 2 | Sufficient for convergence on this dataset size |

Parameter Efficiency

| Category | Count |
|---|---|
| Total Model Parameters | ~4,000,000,000 |
| Frozen (Base Model) | ~3,988,200,000 |
| Trainable (LoRA) | ~11,800,000 |
| Parameter Ratio | ~0.30% |

We're training less than 0.3% of the model's parameters. The LoRA adapter checkpoint is ~135 MB — compared to the full model's ~8 GB in bfloat16.
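The ~11.8M figure can be reproduced from the layer shapes. The dimensions below (36 decoder layers, hidden size 2560, 4096-wide query/output projections, 1024-wide key/value projections) are assumptions based on the public Qwen-3 4B configuration and are not stated in this post; treat the result as a plausibility check.

```python
RANK = 16
NUM_LAYERS = 36  # assumed number of decoder layers

# (in_features, out_features) per attention projection; assumed shapes.
PROJECTIONS = {
    "q_proj": (2560, 4096),
    "k_proj": (2560, 1024),
    "v_proj": (2560, 1024),
    "o_proj": (4096, 2560),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """A is (in x rank) and B is (rank x out)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
total = per_layer * NUM_LAYERS
modules = len(PROJECTIONS) * NUM_LAYERS
print(f"{modules} adapted projections, {total:,} trainable parameters")
```

Under these assumptions the count lands at about 11.8M trainable parameters across 144 adapted projections, in line with the table above.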

Data Formatting: ChatML

Every instruction-response pair is formatted in ChatML (the template Qwen expects):

<|im_start|>system
You are an expert Indian Legal Assistant.<|im_end|>
<|im_start|>user
What are the key provisions of Section 14 of the Hindu Succession Act?<|im_end|>
<|im_start|>assistant
Section 14 of the Hindu Succession Act, 1956, is a landmark provision...<|im_end|>

Custom Collation: Dynamic Batch Padding

Rather than padding all sequences to the maximum model length (8,192 tokens), I implemented dynamic batch padding — each batch is padded only to the length of its longest sequence. This saves enormous amounts of compute. If a batch's longest sequence is 1,200 tokens, we're processing 1,200 × 4 = 4,800 tokens instead of 8,192 × 4 = 32,768 tokens. On average, this reduces compute by ~70-80%.
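A framework-free sketch of the idea, using plain lists instead of tensors so it runs anywhere. In a PyTorch pipeline the same function would be passed as a DataLoader's `collate_fn` and the lists converted to tensors at the end; the `pad_id` value is up to the tokenizer.

```python
def collate_dynamic(batch: list[list[int]], pad_id: int, max_len: int = 8192) -> dict:
    """Pad each batch only to its own longest sequence (capped at max_len)."""
    longest = min(max(len(seq) for seq in batch), max_len)
    input_ids, attention_mask = [], []
    for seq in batch:
        seq = seq[:longest]  # truncate anything beyond the cap
        pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * pad)
        attention_mask.append([1] * len(seq) + [0] * pad)  # 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [[5, 6, 7], [1, 2, 3, 4, 8]]
print(collate_dynamic(batch, pad_id=0)["input_ids"])
```

For loss computation, padded positions would also need to be excluded from the labels, which is omitted here.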

Learning Rate Schedule: Cosine with Linear Warmup

  1. Linear warmup (0 → 2e-5 over 50 steps) — prevents early training instability
  2. Cosine decay (2e-5 → 2e-6 over remaining steps) — smooth convergence without sharp drops
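The schedule above fits in a few lines. This is a sketch using the hyperparameters from the table; `total_steps` is whatever the run length turns out to be.

```python
import math


def lr_at(step: int, total_steps: int, peak: float = 2e-5,
          floor: float = 2e-6, warmup: int = 50) -> float:
    """Linear warmup from 0 to peak, then cosine decay from peak to floor."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))


for step in (0, 25, 50, 1700):
    print(step, f"{lr_at(step, total_steps=1700):.2e}")
```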

Memory Optimization: Gradient Checkpointing

With 4B parameters in bfloat16, the model alone takes ~8GB of VRAM. Add optimizer states, gradients, and activations for 8,192-token sequences, and you blow past 24GB easily. Gradient checkpointing trades ~30% more compute time for ~40% VRAM savings — the difference between fitting and OOM.

Fault-Tolerant Training: The Generator Pattern

Training on cloud GPUs can fail for many reasons — preemption, network issues, timeouts. The training loop uses Python's generator pattern (yield) to stream results back to the local machine after each epoch. This means even if training crashes after epoch 1, I already have the checkpoint downloaded locally.
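The pattern itself is independent of Modal. In this sketch `train` stands in for the remote training function and the checkpoint is placeholder bytes; the `crash_after` argument exists only to simulate a preemption.

```python
from pathlib import Path


def train(num_epochs: int, crash_after=None):
    """Stand-in for the remote training loop: yields after every epoch."""
    for epoch in range(1, num_epochs + 1):
        if crash_after is not None and epoch > crash_after:
            raise RuntimeError("simulated preemption")
        # ... one epoch of training would run here ...
        checkpoint = f"lora-weights-epoch-{epoch}".encode()  # placeholder bytes
        yield epoch, checkpoint


def run_and_save(num_epochs: int, out_dir: str, crash_after=None) -> list[Path]:
    """Consume the generator, persisting each checkpoint as soon as it arrives."""
    saved = []
    for epoch, checkpoint in train(num_epochs, crash_after):
        path = Path(out_dir) / f"epoch_{epoch}.bin"
        path.write_bytes(checkpoint)
        saved.append(path)
    return saved
```

Because each checkpoint is written before the next epoch starts, a failure in epoch 2 still leaves `epoch_1.bin` on disk.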

Training Results

Training ran for 2 full epochs on an NVIDIA L4 GPU (24GB VRAM) via Modal.

| Metric | Epoch 1 End | Epoch 2 End (Final) |
|---|---|---|
| Training Loss | ~1.05 | ~0.69 |
| Validation Loss | ~1.00 | ~0.92 |
| Learning Rate | ~1.2e-5 (mid-decay) | ~2e-6 (minimum) |
| Tokens Processed | ~4.5M | ~9.0M |
| Global Steps | ~850 | ~1,700 |

Key Observations

  1. Smooth convergence — no loss spikes, no instability. The warmup + cosine schedule + gradient clipping combination worked perfectly.
  2. No runaway overfitting: validation loss kept falling through both epochs (~1.00 to ~0.92). The train/validation gap did widen in epoch 2 (0.69 vs 0.92), which is expected after a second pass over ~4K pairs and worth watching if training continues.
  3. Rapid initial learning — the steepest loss drop happened in the first 200 steps of epoch 1, as the model quickly adapted to the legal domain's vocabulary and style.
  4. Diminishing returns in epoch 2 — most of the learning happened in epoch 1. Epoch 2 provided refinement but the marginal improvement was smaller.

Phase 3: The Production RAG Pipeline — Architecture, Sharding, & Serving

A fine-tuned model knows how to talk like a legal expert, but it doesn't remember specific facts. When a lawyer asks "What does Section 34 of the Indian Trusts Act say?", a model might generate something that sounds legally plausible but is entirely fabricated.

To solve this, I designed and built a production-grade, highly optimized RAG (Retrieval-Augmented Generation) pipeline. This lookup mechanism allows our fine-tuned Qwen model to query a massive vector database of Indian law, extract the exact legal provisions, and generate answers strictly grounded in the source material with pinpoint citations.


3A. LoRA Adapter Merging

Running a model with active LoRA weights in production adds computational overhead and complicates serving. To achieve maximum inference speed and simplify deployment, I mathematically blend the LoRA weights directly into the base Qwen-3 4B parameters:

$$W_{\text{merged}} = W_{\text{base}} + \frac{\alpha}{r} (A \times B)$$

  • Result: Fused 144 adapter projection layers in exactly 20.4 seconds. The final standalone model (~7.5 GB in bfloat16 precision) was saved directly to the persistent Modal Volume.
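A tiny numeric example shows why merging is lossless: folding the scaled low-rank product into the base weight gives the same output as running the base layer and the adapter side by side. Pure-Python matrices are used here for illustration only; the real merge operates on bfloat16 tensors.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def merge_lora(W, A, B, alpha, rank):
    """W_merged = W + (alpha / rank) * (A @ B)."""
    scale = alpha / rank
    delta = matmul(A, B)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


W = [[1, 0], [0, 1]]   # base weight (d x d)
A = [[1], [2]]         # d x r with r = 1
B = [[0.5, -1.0]]      # r x d
merged = merge_lora(W, A, B, alpha=2, rank=1)

x = [[3, 4]]
separate = [[base + 2.0 * lo for base, lo in zip(matmul(x, W)[0], matmul(matmul(x, A), B)[0])]]
print(merged, matmul(x, merged), separate)
```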

3B. Structure-Aware Legal Chunking

Legal documents have natural, highly structured segmentations (articles, sections, subsections). Naive chunking (e.g., splitting every 500 characters blindly) splits legal clauses in half, completely ruining retrieval precision.

I built a structure-aware chunking pipeline that parses the three source document types into structured chunks while preserving critical legal metadata mappings:

  1. Constitution of India: Split by Article bounds → 468 chunks (average 1,025 characters).
  2. Central Acts: Split recursively by Section bounds → 23,152 chunks (average 1,364 characters).
  3. Supreme Court Judgments: Split by structured paragraphs, with metadata headers (case title, citation, bench, year) prepended to each chunk → 330,673 chunks (average 4,756 characters).
  • Output: 354,293 chunks compiled into a single 1.6 GB file. Each chunk contains its text, chunk_id, and a metadata dictionary mapping its original source attributes (e.g., article_number, act_title, section, case_title, citation, bench, year).
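For the Constitution, splitting on article headers is enough to get one chunk per article. The sketch below assumes headers of the form "Article N", an em dash, then the title (the format produced in Phase 1A); the `chunk_id` scheme and metadata keys are hypothetical.

```python
import re

# Matches header lines such as "Article 21A <em dash> Right to education".
ARTICLE_HEADER = re.compile(r"^Article (\d+[A-Z]*) \u2014 (.+)$", re.MULTILINE)


def chunk_constitution(text: str) -> list[dict]:
    """Return one chunk per article, each carrying its own metadata."""
    matches = list(ARTICLE_HEADER.finditer(text))
    chunks = []
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.start():end].replace("<|endoftext|>", "").strip()
        chunks.append({
            "chunk_id": f"constitution-{match.group(1)}",
            "text": body,
            "metadata": {
                "source": "constitution",
                "article_number": match.group(1),
                "title": match.group(2).strip(),
            },
        })
    return chunks
```

Because boundaries follow the document's own structure, a clause is never cut in half, at the cost of uneven chunk lengths.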

3C. Massively Parallel GPU Map-Reduce Embedding

Generating vector embeddings for 354,293 documents using a state-of-the-art multi-lingual model (BGE-M3) would take days on a single machine. To solve this, I built a highly distributed Map-Reduce pipeline using Modal.

graph TD
    A[354,293 chunks] --> B[Coordinator Function]
    B -->|Split into 32 shards| C[Shard Inputs]

    C -->|Shard 0| D1[L4 GPU Worker 1]
    C -->|Shard 1| D2[L4 GPU Worker 2]
    C -->|...| D3[L4 GPU Worker ...]
    C -->|Shard 31| D4[L4 GPU Worker 32]

    D1 -->|Embed FP16| E1[11,000 vectors]
    D2 -->|Embed FP16| E2[11,000 vectors]
    D3 -->|Embed FP16| E3[... vectors]
    D4 -->|Embed FP16| E4[11,000 vectors]

    E1 --> F[Reduce / Concatenate]
    E2 --> F
    E3 --> F
    E4 --> F

    F --> G[(FAISS Index FlatIP <br> 354,293 x 1024)]
    F --> H[(SQLite chunk_lookup.db)]
  1. The Map Phase: The coordinator divides the 354K chunks into 32 shards (~11,000 chunks per shard). Modal automatically spins up 32 parallel L4 GPU containers in the cloud simultaneously.
  2. Pre-Caching & Instant Boot: The BGE-M3 model weights are baked directly into the Docker image layer, bypassing HuggingFace downloads and enabling the GPU servers to boot instantly.
  3. FP16 Inference: Each worker runs native PyTorch float16 inference over its 11,000 texts, generating normalized dense embeddings in a fraction of the time.
  4. The Reduce Phase: The coordinator gathers the 32 output matrices and concatenates them in shard order, so row i of the final (354293, 1024) matrix still corresponds to chunk i.
  5. FAISS Index Compilation: The combined embeddings are added to a FAISS IndexFlatIP (inner product) index and saved. Because the embeddings are L2-normalized, inner product equals cosine similarity. Simultaneously, a SQLite lookup database is generated on the volume.
  6. Compute Time: The entire sharded run finished in roughly 20-30 minutes of wall time.
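The shard boundaries and the order-preserving reduce can be sketched without any GPU code. The helper names are illustrative; the important property is that results are concatenated by shard index so vector rows stay aligned with chunk rows.

```python
def split_into_shards(n_items: int, n_shards: int) -> list[tuple[int, int]]:
    """Return (start, end) index ranges whose sizes differ by at most one."""
    base, extra = divmod(n_items, n_shards)
    bounds, start = [], 0
    for i in range(n_shards):
        size = base + (1 if i < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def reduce_in_shard_order(results: dict[int, list]) -> list:
    """Concatenate per-shard outputs by shard index, whatever order they finished in."""
    merged = []
    for shard_id in sorted(results):
        merged.extend(results[shard_id])
    return merged


bounds = split_into_shards(354_293, 32)
sizes = sorted({end - start for start, end in bounds})
print(len(bounds), sizes)
```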

3D. Production FastAPI Serving & Optimizations

To serve the RAG assistant, I built an extremely optimized FastAPI server hosted on Modal. It loads the merged Qwen model and BGE-M3 on a single cost-effective L4 GPU.

1. Zero-RAM SQLite Lookup Database (Startup Optimization)

  • The Problem: Reading the 1.6 GB chunk lookup JSON into container memory on boot takes almost 2 minutes and consumes 1.6 GB of RAM.
  • The Solution: On first startup, the server streams the JSON file line-by-line and compiles a local SQLite database directly on the persistent volume (took 92.6s). On subsequent boots, the JSON is completely bypassed.
  • The Result: The server opens a thread-safe SQLite connection almost instantly on boot (0.001 seconds) and adds negligible startup RAM, because chunks are read from disk on demand.
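A minimal version of this build-once, query-many lookup, assuming the chunk file holds one JSON object per line and that the row number doubles as the FAISS vector id (both are assumptions for illustration; the table layout is hypothetical):

```python
import json
import sqlite3


def build_lookup(jsonl_path: str, db_path: str) -> sqlite3.Connection:
    """Stream the chunk file into SQLite without loading it all into memory."""
    # check_same_thread=False lets worker threads reuse this connection.
    conn = sqlite3.connect(db_path, check_same_thread=False)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chunks (row_id INTEGER PRIMARY KEY, payload TEXT NOT NULL)"
    )
    with open(jsonl_path, encoding="utf-8") as handle:
        rows = enumerate(line for line in handle if line.strip())
        conn.executemany("INSERT OR REPLACE INTO chunks VALUES (?, ?)", rows)
    conn.commit()
    return conn


def fetch_chunk(conn: sqlite3.Connection, row_id: int):
    """Look up one chunk by its row id; returns None when the id is unknown."""
    row = conn.execute("SELECT payload FROM chunks WHERE row_id = ?", (row_id,)).fetchone()
    return json.loads(row[0]) if row else None
```

On later boots only `sqlite3.connect` is needed, and each retrieval reads just the handful of rows returned by the vector search.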

2. VRAM Autocasting & Thread-Safe Real-time Streaming

  • Autocasting: Inside the generation thread, both token lookup and model generation are wrapped in torch.inference_mode() and torch.autocast(device_type="cuda", dtype=torch.bfloat16), which disables autograd tracking and keeps activations in bfloat16 to avoid memory spikes.
  • ASGI Protection: Real-time token streaming is exposed via Server-Sent Events (SSE) at /api/ask/stream. Because LLM token generation is a long, blocking CPU/GPU-bound call, running it directly inside an async FastAPI handler would freeze the event loop. Generation therefore runs in a separate native OS thread with a TextIteratorStreamer, and a synchronous streaming generator relays tokens to the client as they arrive.
  • Strict EOS Enforcement: The system looks up the <|im_end|> and <|endoftext|> token IDs when the tokenizer loads and uses both to enforce stopping, so generation ends cleanly at the end of the answer instead of running on.
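The threading pattern can be shown with the standard library alone. Here `fake_generate` is a stand-in for the blocking generation call and a `queue.Queue` plays the role of the streamer; the real server uses the model and TextIteratorStreamer instead.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of generation


def fake_generate(prompt: str, on_token) -> None:
    """Stand-in for a blocking generate call that emits tokens one at a time."""
    for token in prompt.split():
        on_token(token + " ")


def stream_tokens(prompt: str, generate=fake_generate):
    """Run generation in a worker thread and yield SSE frames as tokens arrive."""
    tokens: queue.Queue = queue.Queue()

    def worker():
        try:
            generate(prompt, tokens.put)
        finally:
            tokens.put(_DONE)  # always unblock the consumer, even on errors

    threading.Thread(target=worker, daemon=True).start()
    while (item := tokens.get()) is not _DONE:
        yield f"data: {item}\n\n"  # one Server-Sent Events frame per token


for frame in stream_tokens("right to life"):
    print(repr(frame))
```

A synchronous generator like this can be handed to a streaming HTTP response; the framework iterates it off the event loop, so other requests keep being served while tokens are produced.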

3. Absolute Cost Safety

  • The server uses min_containers=0. When idle, it scales down to zero GPU containers, costing exactly $0.00 in hosting fees. Cold start boots in ~10 seconds.

3E. Verification & End-to-End Test Results

Both endpoints were verified against the live server. Sample requests and responses:

1. Blocking API Endpoint (/api/ask)

  • Query: "What does Article 21 of the Indian Constitution guarantee?"
  • Status: 200 OK
  • Total Latency: 5.34 seconds
  • Generated Answer: > Article 21 guarantees the right to life and personal liberty. The Supreme Court has interpreted this right expansively, noting that it is not limited to mere survival but encompasses the right to live with dignity. This includes the right to privacy, which is viewed as an inalienable component of personal liberty.
  • Sources Used:
    1. [SC_JUDGMENTS] Supreme Court: K.S. Puttaswamy v. Union of India (2017)
    2. [SC_JUDGMENTS] Supreme Court: Common Cause v. Union of India (2017)
    3. [SC_JUDGMENTS] Supreme Court: X v. Union of India (2023)

2. Streaming API Endpoint (/api/ask/stream)

  • Query: "What are the grounds for divorce under the Hindu Marriage Act?"
  • Status: 200 OK
  • Stream Event 1 (Metadata Block): Source citations with full case metadata
  • Stream Event 2+ (Word-by-Word Tokens): Real-time legal analysis streaming

Phase 4A: The Full-Stack SaaS Product

With a working RAG backend, I built a complete production-grade web application — not a demo, not a Gradio wrapper, but a real SaaS product with authentication, streaming, and a premium UI.

Technology Stack

| Layer | Technology |
|---|---|
| Frontend Framework | Next.js 16 (App Router) |
| Deployment | Vercel (via CLI, no GitHub push required) |
| Styling | CSS Modules with custom design system |
| Fonts | Playfair Display (headings) + Inter (body) via Google Fonts |
| Icons | Lucide React |
| Backend Proxy | Next.js API Routes → Modal FastAPI |
| Authentication | Access-code gate with server-side cookie validation |

Design System

The UI uses a dark professional theme with deep navy (#0F1B2D) + gold (#C89D4A) branding — deliberately chosen for extended legal research sessions. No glassmorphism. Minimal, authoritative, and clean.

| Token | Value | Usage |
|---|---|---|
| --navy | #0F1B2D | Primary background |
| --navy-light | #162337 | Card/surface backgrounds |
| --gold | #C89D4A | Accent, branding, active states |
| --white | #EAEAEA | Primary text |
| --gray-300 | #B0B8C1 | Secondary text |

Architecture Flow

graph LR
    A[User Browser] -->|HTTPS| B[Vercel CDN]
    B -->|Auth Cookie| C[Next.js API Route]
    C -->|SSE Stream| D[Modal FastAPI]
    D -->|FAISS Query| E[(Vector Index)]
    D -->|SQLite Lookup| F[(Chunk DB)]
    D -->|Qwen-3 4B| G[L4 GPU]
    G -->|Token Stream| D
    D -->|SSE Response| C
    C -->|Stream to Client| A

Key Components

  1. Access Gate — access-code authentication with server-side cookie validation. Protects the chat interface from unauthorized access.

  2. Chat Interface — real-time streaming chat with auto-scroll, message bubbles, and a loading state that progresses through multiple stages ("Searching 354,293 legal documents..." → "Analyzing relevant precedents..." → "Constructing legal context..." → "Warming up GPU inference engine...").

  3. Citation Cards — each source citation is rendered as an expandable card showing:

    • Source type badge (SC Judgment / Central Act / Constitution)
    • Case title (with intelligent fallback extraction from chunk text)
    • Year, citation number, bench composition
    • Full metadata grid when expanded
    • Actual chunk text (first 300 characters)
  4. Collapsible Citations — citations are grouped by source type with a summary bar: "6 SC Judgments · 1 Constitution · 1 Central Act". Collapsed by default to keep the focus on the answer.

  5. Confidence Bar — displays: "✓ 8 sources retrieved · Avg relevance: 87%"

  6. Sample Prompts — curated legal questions on the empty state, tuned for strong demo performance.

  7. Branded Favicon — custom SVG: gold "N" monogram with balanced scales of justice on deep navy background.


Phase 4B: The Refinement Layer — Production-Grade Tuning

After the initial deployment, I systematically addressed every production issue. This phase was the difference between "it works" and "it works well."

System Prompt Engineering

The original system prompt was 5 generic lines. I rewrote it into a 20-line structured instruction set that forces:

  • Exact case name citations (no "Supreme Court Judgment" placeholders)
  • Chronological ordering for historical/evolution queries
  • Bullet points for distinct legal holdings
  • No repetition across paragraphs
  • Senior legal researcher tone

Hierarchical Response Modes

A critical learning: the model was fine-tuned on responses with a median length of 110 words. It learned to emit EOS (the end-of-sequence token) at ~150-180 tokens regardless of max_new_tokens. The system prompt alone couldn't override this trained behavior.

The solution: hierarchical prompting with three modes.

| Mode | min_new_tokens | System Prompt Instruction | Use Case |
|---|---|---|---|
| Concise | 0 (no floor) | "Brief, direct answer in 2-4 sentences" | Quick factual lookups |
| 📖 Detailed | 150 | "Detailed analysis with case references, chronological ordering" | Standard legal questions |
| 🎓 Research | 350 | "Full legal research memo: case-by-case breakdown, reasoning, holdings, evolution, current position" | Deep analysis, investor demos |

Each mode dynamically adjusts both the system prompt AND the min_new_tokens parameter in model.generate(). The user sees three pill buttons above the input field.

The insight: max_new_tokens is a ceiling, not a target. It says "generate at most this many tokens," but the model stops as soon as it emits an EOS token. min_new_tokens sets a floor: the EOS token is suppressed until at least N new tokens have been produced. Combined with a structured prompt that asks for detailed analysis, the model fills those extra tokens with actual substance.
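A toy greedy decoder illustrates the mechanism. This is not the library's implementation, just a sketch of the idea: while fewer than the minimum number of tokens have been produced, the EOS logit is masked so it can never win.

```python
EOS = 0  # token id used as end-of-sequence in this toy vocabulary


def suppress_eos(logits: list[float], generated_len: int, min_new_tokens: int) -> list[float]:
    """Mask the EOS logit until the floor is reached."""
    if generated_len < min_new_tokens:
        logits = list(logits)
        logits[EOS] = float("-inf")
    return logits


def greedy_decode(step_logits, min_new_tokens: int = 0, max_new_tokens: int = 10) -> list[int]:
    """Pick the highest-scoring token at each step; stop at EOS or the ceiling."""
    out = []
    for logits in step_logits[:max_new_tokens]:
        logits = suppress_eos(logits, len(out), min_new_tokens)
        token = max(range(len(logits)), key=logits.__getitem__)
        if token == EOS:
            break
        out.append(token)
    return out


# The model "wants" to stop at step 2 (EOS has the highest logit there).
steps = [[0.1, 2.0, 0.3], [3.0, 1.0, 0.5], [0.2, 0.1, 1.5], [5.0, 0.0, 0.0]]
print(greedy_decode(steps), greedy_decode(steps, min_new_tokens=3))
```

Without a floor the sequence ends after one token; with a floor of three, the decoder is forced to pick the best non-EOS token until the floor is met.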

Source-Aware Retrieval Routing

The original RAG pipeline returned the top-k nearest vectors from FAISS regardless of query intent. If you asked about Article 21 (Constitution), you might get 8 SC Judgment chunks and zero Constitution chunks — because judgment text is more verbose and often embeds better.

The fix: _enforce_source_diversity()

  1. Query intent detection — regex-based analysis detects if the query targets Constitution articles, Central Acts, or SC Judgments
  2. Over-retrieval — FAISS retrieves top_k * 2 candidates (16 instead of 8)
  3. Intelligent reranking — if the query targets Constitution but results are all judgments, Constitution chunks are boosted in the reranking
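One way to implement these three steps is sketched below. The regex patterns, the additive `boost`, and the candidate format are all assumptions made for illustration; the actual `_enforce_source_diversity()` logic may differ.

```python
import re

# Hypothetical intent patterns; checked in order, first match wins.
INTENT_PATTERNS = {
    "constitution": re.compile(r"\b(article\s+\d+|constitution)\b", re.IGNORECASE),
    "central_acts": re.compile(r"\b(section\s+\d+|\w+\s+act)\b", re.IGNORECASE),
}


def detect_intent(query: str):
    """Return the source type the query seems to target, or None."""
    for source, pattern in INTENT_PATTERNS.items():
        if pattern.search(query):
            return source
    return None


def enforce_source_diversity(query: str, candidates: list[dict], top_k: int = 8,
                             boost: float = 0.15) -> list[dict]:
    """Rerank over-retrieved candidates, boosting chunks from the targeted source.

    Each candidate is a dict with 'source' and 'score' (higher is better);
    the caller is expected to pass roughly 2 * top_k candidates.
    """
    target = detect_intent(query)

    def adjusted(candidate: dict) -> float:
        return candidate["score"] + (boost if candidate["source"] == target else 0.0)

    return sorted(candidates, key=adjusted, reverse=True)[:top_k]
```

A boost keeps strong judgment chunks in the mix while letting a slightly lower-scoring Constitution chunk surface when the query explicitly names an article.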

Metadata-Grounded Citation Cards

A persistent bug: citation cards showed "Supreme Court Judgment" instead of the actual case title (e.g., "Pritam Singh v. The State"). The case_title metadata was sometimes missing from older chunks.

The fix: Two-layer fallback:

  1. Backend now sends text_chunk (first 300 characters of chunk text) in the streaming sources payload
  2. Frontend extracts the case title from the chunk text using a regex: "Supreme Court of India — CASE TITLE" → "CASE TITLE"
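The fallback logic can be sketched like this. The real extraction runs in the frontend; it is written in Python here only to stay consistent with the backend examples. The function name and the assumption that the header sits on its own line are mine, not confirmed details of the implementation.

```python
import re

# Matches "Supreme Court of India <dash> CASE TITLE" up to the end of the line.
# \u2014 and \u2013 are the em and en dash; a plain hyphen is accepted too.
TITLE_PATTERN = re.compile(r"Supreme Court of India\s*[\u2014\u2013-]\s*(.+)")


def resolve_case_title(metadata, text_chunk, default="Supreme Court Judgment"):
    """Prefer case_title metadata, then the chunk header, then a generic label."""
    title = (metadata.get("case_title") or "").strip()
    if title:
        return title
    # Layer 2: only the first 300 characters are sent in the sources payload.
    match = TITLE_PATTERN.search(text_chunk[:300])
    if match:
        return match.group(1).strip()
    return default
```

Older chunks with no `case_title` and no recognisable header still fall through to the generic label, so the card never renders empty.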

Comprehensive Modal Logging

Every inference request now logs:

  • User query and parameters
  • Response mode and min_tokens configuration
  • FAISS retrieval distances and case titles for each chunk
  • Source routing decisions
  • Prompt token count
  • Full model output text
  • Per-stage latency (retrieval, generation, total)
  • Generated token count

All visible in the Modal dashboard for live monitoring.
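One way to capture these fields is a single structured log line per request. The sketch below uses only the standard library; the logger name, field names, and the `timed` helper are hypothetical, and it assumes the hosting platform surfaces standard Python logging output in its dashboard.

```python
import json
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("nyayai.inference")  # hypothetical logger name


@contextmanager
def timed(stage, latencies):
    """Record wall-clock seconds for one pipeline stage into `latencies`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        latencies[stage] = time.perf_counter() - start


def log_inference(query, mode, min_new_tokens, chunks, routing,
                  prompt_tokens, output_text, generated_tokens, latencies):
    """Emit one JSON log line describing a full inference request."""
    record = {
        "query": query,
        "mode": mode,
        "min_new_tokens": min_new_tokens,
        "retrieval": [
            {"case_title": c.get("case_title"), "distance": c.get("distance")}
            for c in chunks
        ],
        "source_routing": routing,
        "prompt_tokens": prompt_tokens,
        "generated_tokens": generated_tokens,
        "output_text": output_text,
        # Total is the sum of the measured stages.
        "latency_s": {**latencies, "total": sum(latencies.values())},
    }
    logger.info(json.dumps(record, ensure_ascii=False))
    return record
```

Each stage is wrapped as `with timed("retrieval", latencies): ...`, and the record is logged once generation finishes.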

Mobile Responsive Design

Citation cards that were visually dominant on mobile screens were redesigned:

  • Compact padding and smaller text
  • Single-column metadata grid (instead of 2-column)
  • Scrollable chunk text with max-height: 200px
  • Source badges at reduced size

Infrastructure Costs

| Phase | GPU | Time | Cost |
| --- | --- | --- | --- |
| Data Processing | CPU only | ~2 hours | $0 (local) |
| Synthetic QA Generation | None (API) | ~6 hours | $0 (Gemini free tier) |
| Fine-Tuning (2 epochs) | L4 (24GB) | ~2 hours | ~$3-5 (Modal) |
| Embedding (32x sharded) | L4 × 32 | ~30 min | ~$3-4 (Modal) |
| Frontend Hosting | | | $0 (Vercel free tier) |
| Backend Hosting (idle) | | | $0 (Modal scales to zero) |
| Total | | | < $10 |

Read that again. The entire pipeline — from raw legal text to a fine-tuned 4B parameter model with RAG, streaming, and a production SaaS frontend — cost less than a meal.


Technical Specs Summary

| Component | Specification |
| --- | --- |
| Base Model | Qwen-3 4B Instruct (2507) |
| Merged Model | Standalone bfloat16 weights (~7.5 GB) |
| Embedding Model | BAAI/bge-m3 (dense vector, FP16 precision) |
| FAISS Vector Index | IndexFlatIP (Cosine Similarity, 1024 dimensions) |
| Total Database Chunks | 354,293 chunks (1.6 GB corpus) |
| Lookup Engine | Thread-safe local SQLite database |
| Server Framework | FastAPI (with SSE token streaming) |
| Concurrency Model | Native multi-threaded worker with TextIteratorStreamer |
| API Endpoints | /api/ask (Blocking), /api/ask/stream (SSE Streaming) |
| Response Modes | Concise / Detailed / Research (hierarchical prompting) |
| Frontend | Next.js 16 (App Router) on Vercel |
| Authentication | Access-code gate with server-side cookies |
| Hosting Platform | Modal (Backend) + Vercel (Frontend) |
| GPU Target | NVIDIA L4 (24GB VRAM) |
| Production Scale | min_containers=0 (Scales to zero when idle for $0.00/hr) |
| E2E Average Latency | ~5 seconds for full answer / real-time for streaming |
| Total Build Cost | < $10 |
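A note on the "IndexFlatIP (Cosine Similarity)" entry: an inner-product index only yields cosine similarity when both the stored vectors and the query are L2-normalized first. The pure-Python sketch below illustrates that relationship on toy 2-dimensional vectors; it stands in for the FAISS index and is not the production code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def search(index_vectors, query, top_k=8):
    """Brute-force inner-product search over pre-normalized vectors.

    Because every vector has unit length, the inner product equals the
    cosine of the angle between query and chunk.
    Returns (score, row_index) pairs, best first.
    """
    q = l2_normalize(query)
    scored = [(inner_product(q, v), i) for i, v in enumerate(index_vectors)]
    return sorted(scored, reverse=True)[:top_k]
```

Without normalization, a long vector such as `[10, 10]` would outrank a better-aligned short one purely because of its magnitude.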

Lessons Learned

1. Data Quality > Data Quantity

4,000 carefully structured instruction pairs, generated from real legal text with strict anti-hallucination prompting, taught the model more than 50,000 sloppy pairs would have. The key was enforcing diversity in both task type (summaries, comparisons, Q&A, yes/no) and length (1 sentence to 4 paragraphs).

2. Training Data Distribution Dictates Model Behavior

The model's output length is not controlled by max_new_tokens — it's dictated by the distribution of response lengths in the training data. With a median training response of 110 words, the model consistently hits EOS at ~150-180 tokens. The fix isn't bigger max_new_tokens — it's either retraining with longer responses or using min_new_tokens with structured prompts.
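A quick check like the one below, run before training, would surface this kind of length skew early. It is a sketch that assumes each training pair is a dict with a `response` field, which is a hypothetical schema.

```python
import statistics


def response_length_stats(pairs):
    """Summarize response lengths (in whitespace-separated words)."""
    lengths = sorted(len(p["response"].split()) for p in pairs)
    return {
        "count": len(lengths),
        "median_words": statistics.median(lengths),
        "min_words": lengths[0],
        "max_words": lengths[-1],
    }
```

If the median sits far below the length you want at inference time, the model will most likely learn to stop early no matter how high the generation ceiling is set.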

3. Hierarchical Prompting is High ROI

Instead of a one-size-fits-all prompt, implementing response modes (Concise/Detailed/Research) with mode-specific system prompts and min_new_tokens floors gives users control over response depth. This was suggested during product critique and turned out to be the single highest-ROI improvement for user experience.

4. Source Diversity Matters More Than Raw Similarity

FAISS returns the most semantically similar chunks, but similarity ≠ utility. A Constitution query returning 8 judgment chunks (because judgments embed better) is technically correct but practically useless. Source-aware reranking that considers query intent dramatically improves answer quality.

5. Standalone Merged Models are Faster and Cleaner

Merging the LoRA weights directly into the base parameters completely eliminated inference-time adapter overhead, trimmed memory footprints, and allowed the base model to load at peak native speeds.
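Why merging removes the adapter overhead: under the standard LoRA formulation, the adapter adds `(alpha / r) * B @ A` to a frozen weight `W`, so the sum can be computed once and stored. The toy sketch below checks that equivalence on tiny matrices; it illustrates the arithmetic only and is not the merge script used here. In practice, libraries such as PEFT expose this step as `merge_and_unload()`.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(w, delta, scale):
    """Return w + scale * delta, element-wise."""
    return [[wv + scale * dv for wv, dv in zip(wr, dr)] for wr, dr in zip(w, delta)]


def merge_lora(w, lora_a, lora_b, alpha, rank):
    """Fold a LoRA update into the base weight: W' = W + (alpha / rank) * B @ A.

    Shapes: w is (d_out, d_in), lora_b is (d_out, rank), lora_a is (rank, d_in).
    """
    return add_scaled(w, matmul(lora_b, lora_a), alpha / rank)


def adapter_forward(w, lora_a, lora_b, alpha, rank, x):
    """Unmerged path: base matmul plus the two extra low-rank matmuls."""
    base = matmul(w, x)
    low_rank = matmul(lora_b, matmul(lora_a, x))
    return add_scaled(base, low_rank, alpha / rank)
```

The unmerged path costs two extra matrix multiplications per adapted layer on every forward pass, while the merged weight costs none.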

6. Bypass JSON in Production with SQLite

Loading large JSON files (1.6GB+) is a silent killer for cloud instances. SQLite dropped boot overhead from 2 minutes to roughly 0.001 seconds with near-zero startup RAM, because rows are read from disk on demand instead of being parsed into memory up front.
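A minimal sketch of the pattern using the standard-library `sqlite3` module. The table name, column names, and helper functions are hypothetical. Thread safety is handled here by giving each thread its own connection, which is one common approach and not necessarily the one NyayAI uses.

```python
import sqlite3
import threading

_local = threading.local()  # each thread keeps its own connections


def get_connection(db_path):
    """Return this thread's connection to db_path, opening it on first use."""
    conns = getattr(_local, "conns", None)
    if conns is None:
        conns = _local.conns = {}
    if db_path not in conns:
        conns[db_path] = sqlite3.connect(db_path)
    return conns[db_path]


def build_db(db_path, chunks):
    """One-time offline step: write (chunk_id, source, text) rows to disk."""
    conn = get_connection(db_path)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS chunks "
            "(chunk_id INTEGER PRIMARY KEY, source TEXT, text TEXT)"
        )
        conn.executemany("INSERT OR REPLACE INTO chunks VALUES (?, ?, ?)", chunks)


def fetch_chunk(db_path, chunk_id):
    """Read one chunk by id; only the requested row is loaded into memory."""
    row = get_connection(db_path).execute(
        "SELECT source, text FROM chunks WHERE chunk_id = ?", (chunk_id,)
    ).fetchone()
    return None if row is None else {"source": row[0], "text": row[1]}
```

At startup the server only opens the file; nothing is parsed until a retrieval hit asks for a specific `chunk_id`.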

7. GPU Sharding for Rapid Large-Scale Embeddings

Attempting to embed 354,000+ texts sequentially is a nightmare. Running 32 L4 GPUs in parallel via Modal let me embed the entire dataset in roughly 20-30 minutes for about $3-4.
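The sharding step itself is simple to sketch. The helpers below split the corpus into contiguous, near-equal slices and stitch the per-shard outputs back together in order. The function names are illustrative, and the dispatch of each shard to a GPU worker is left out.

```python
def make_shards(items, num_shards=32):
    """Split items into num_shards contiguous slices whose sizes differ by at most 1."""
    base, extra = divmod(len(items), num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        size = base + (1 if i < extra else 0)  # first `extra` shards get one more
        shards.append(items[start:start + size])
        start += size
    return shards


def merge_shard_outputs(shard_outputs):
    """Concatenate per-shard results in shard order.

    Keeping the order means row i of the merged embeddings still
    corresponds to chunk i of the original corpus.
    """
    return [row for shard in shard_outputs for row in shard]
```

Contiguous slices matter here: if shards were interleaved or returned out of order, the vector index rows would no longer line up with the chunk ids used for lookup.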

8. Always Scale to Zero when Idle

For bootstrapped startups, min_containers=0 on serverless providers like Modal allows hosting a fully functional RAG prototype completely free of charge when idle.

9. Domain Infrastructure Beats General Intelligence

A general-purpose LLM is broad intelligence. NyayAI is domain infrastructure for Indian law. That's similar to how Bloomberg exists despite Google, or how Westlaw exists despite search engines. The value comes from Indian legal corpus specialization, retrieval grounding, citation accuracy, jurisprudence-focused indexing, and workflow optimization for lawyers.


The Stack

PyTorch · Modal · Qwen-3 4B · FAISS · BGE-M3 · SQLite · FastAPI · Next.js 16 · Vercel · Server-Sent Events · LoRA · Cosine Similarity


What's Already Built (75/100)

✅ Data acquisition · ✅ Cleaning · ✅ Chunking · ✅ Embeddings · ✅ Retrieval · ✅ Serving · ✅ Deployment · ✅ Fine-tuning · ✅ Streaming · ✅ Grounding · ✅ Systems optimization · ✅ UX · ✅ Source routing · ✅ Hierarchical prompting · ✅ Citation metadata · ✅ Frontend SaaS

What's Next (25/100)

🔲 Trust · 🔲 Distribution · 🔲 Onboarding · 🔲 User retention · 🔲 Legal partnerships · 🔲 Monetization · 🔲 Sales · 🔲 Adoption loops · 🔲 Reliability · 🔲 Consistency · 🔲 Multilingual access · 🔲 High Court coverage


Built with obsession by a solo founder who believes every Indian deserves access to justice — and that the right AI can make that happen.

© 2026 Ashish Raj. All rights reserved.
