DEV Community: Ashish Raj

NyayAI: AI-Powered Legal Intelligence for India

Ashish Raj — Mon, 25 May 2026 07:26:49 +0000

Making Indian law accessible, accurate, and affordable for 1.4 billion people.

By Ashish Raj — Founder, NyayAI

May 2026

The Problem We're Solving

India is the world's largest democracy. It is home to 1.4 billion people, one of the oldest continuous legal traditions on the planet, and a Constitution that is widely regarded as one of the most comprehensive ever drafted. And yet, for most Indians, the law remains a black box — expensive to access, slow to navigate, and almost impossible to understand without professional help.

Consider this: India currently has over 50 million pending court cases. Fifty million. That number is not a typo. It is a crisis — a slow-moving, systemic failure that affects every citizen, every business, and every institution in the country. Cases languish for years, sometimes decades. Litigants exhaust their savings. Justice, in too many cases, is not denied outright — it is simply delayed until it becomes meaningless.

Behind those 50 million cases are lawyers — hundreds of thousands of them — who spend hours, sometimes days, manually searching through case law. They sift through volumes of Supreme Court Reports, flip through annotated statutes, and cross-reference precedents by memory or by keyword. The process is slow, error-prone, and exhausting. A single legal research task that should take minutes can consume an entire afternoon.

The tools that exist today are either expensive or inadequate. Platforms like SCC Online and Manupatra are the industry standard, but they come with steep subscription fees that put them out of reach for solo practitioners, junior advocates, and law students. More importantly, they are fundamentally keyword-based search tools — you type in a phrase, and you get back a list of documents that contain that phrase. There is no intelligence. No understanding. No synthesis.

Free alternatives like Indian Kanoon have done admirable work in making legal text available online, but they remain search-only platforms — no analysis, no summarization, no contextual understanding, no citation linking, no structured output. You search, you read, you figure it out yourself.

And then there are the general-purpose AI tools — ChatGPT, Claude, Gemini, and others. They are extraordinary pieces of technology. I use them every day. But when it comes to Indian law, they are dangerously unreliable. They hallucinate case names. They invent statutory sections that do not exist. They cite judgments with confident authority — judgments that were never delivered. They lack depth in Indian jurisprudence, and they have no mechanism to verify or ground their answers in actual legal text.

The result is a painful paradox: India has one of the richest legal traditions in the world, and yet most of its citizens, lawyers, and courts operate without any AI-assisted research, retrieval, summarization, vernacular access, or affordable tooling.

That gap is not hypothetical. It is real, it is massive, and it affects millions of people every single day.

NyayAI exists to close that gap.

What is NyayAI?

NyayAI — the name comes from न्याय (Nyāya), the Sanskrit word for justice — is an AI-powered legal assistant specifically engineered for Indian jurisprudence.

Let me be precise about what that means, because the distinction matters.

NyayAI is not a chatbot wrapper. It is not a thin interface on top of a general-purpose language model. It is not a weekend hackathon project with a legal skin. It is domain infrastructure for Indian law — purpose-built from the ground up to understand, retrieve, and reason over Indian legal text with a level of precision that generic tools simply cannot match.

Think of it this way: Bloomberg exists for finance. Westlaw exists for American law. NyayAI is being built to serve that same function for Indian law.

At its core, NyayAI is grounded in a curated corpus indexed as 354,293 document chunks, drawn from 75 years of Supreme Court judgments (1950 to 2025, more than 43,000 in all), 858 Central Acts of the Indian Parliament, and the complete Constitution of India, including all amendments, schedules, and articles. Every answer NyayAI produces is traceable back to an actual legal source. Every citation is real. Every reference can be verified.

This is not artificial intelligence that sounds legal. This is artificial intelligence that is legal — grounded, sourced, and verifiable.

Why Not Just Use ChatGPT?

This is the question I hear most often, and it deserves a thorough answer — because the differentiation is critical.

General-purpose large language models like ChatGPT, Claude, or Gemini are remarkable. They represent some of the most significant technological achievements of our generation. They can write poetry, debug code, summarize research papers, and hold conversations that feel genuinely human. I have enormous respect for the teams that built them.

But they are not optimized for Indian legal workflows. And in law, "not optimized" is not a minor inconvenience — it is a serious liability.

Here is why:

A general-purpose model like ChatGPT does not maintain a live, structured legal retrieval index internally. It does not have a database of 43,000+ Supreme Court judgments that it can search through in real time. When you ask it a legal question, it generates an answer from its training data — which means it is reconstructing legal knowledge from memory, not retrieving it from verified sources.

This leads to several critical problems:

It cannot guarantee exact citations. When ChatGPT cites a case, there is no mechanism to verify that the citation is accurate, that the case exists, or that the holding it describes is correct. It may be right. It may be wrong. You have no way to know without doing the research yourself — which defeats the entire purpose.
It may compress or approximate precedent chains. Legal reasoning depends on the precise chain of precedents — which case cited which, what principle was established, how it was distinguished or overruled. A general-purpose model may summarize this chain in a way that loses critical nuance.
It may hallucinate paragraph numbers, holdings, or even entire judgments. This is not a theoretical risk. It happens regularly. I have personally tested dozens of legal queries on leading AI platforms and found fabricated case names, invented statutory sections, and confidently stated holdings that bear no resemblance to reality.
It is optimized broadly across all domains, not specifically for Indian jurisprudence. The same model that answers your legal question also writes marketing copy, generates recipes, and helps with algebra homework. That breadth is its strength in general use — and its weakness in specialized domains.

NyayAI takes a fundamentally different approach. It is specifically engineered for:

Indian legal retrieval — semantic search over a curated, structured corpus of Indian legal documents
Citation-grounded answers — every claim is backed by a specific, verifiable source
Semantic search over Supreme Court judgments — not keyword matching, but meaning-based retrieval that understands legal concepts
Statute and precedent linking — connecting statutory provisions to the case law that interprets them
Structured metadata retrieval — bench composition, citation numbers, judgment dates, and more
Legal-domain-specific Retrieval-Augmented Generation (RAG) — a pipeline that ensures the AI's responses are anchored in real legal text, not generated from memory

The analogy I use most often is this: GitHub Copilot is better than raw autocomplete for coding. NyayAI is better than raw ChatGPT for Indian law.

Both are built on top of powerful AI. But one is generic, and the other is purpose-built. A generic LLM is broad intelligence. NyayAI is domain infrastructure for Indian law.

That is similar to how Bloomberg exists despite Google, or how Westlaw exists despite search engines. The general tool is powerful. The specialized tool is indispensable.

NyayAI's Core Features

NyayAI is not a concept or a pitch deck. It is a working product — live, deployed, and functional. Here is what it does today:

1. Grounded Legal Answers

Every response NyayAI produces is backed by actual legal sources — not generated from memory, not reconstructed from training data, not hallucinated from statistical patterns. When NyayAI cites a case, that case exists. When it quotes a statutory provision, that provision is real. Citations are traceable, verifiable, and linked directly to the source text.

This is the single most important feature of the platform. In law, an unverifiable claim is worse than no claim at all. NyayAI ensures that every answer has a paper trail.

2. 354,293 Legal Document Chunks Indexed

NyayAI's knowledge base is not a small sample or a partial subset. It encompasses:

Supreme Court Judgments (1950–2025) — 75 years of the highest court's jurisprudence, covering constitutional law, criminal law, civil law, tax law, environmental law, labor law, and every other domain the Court has adjudicated
858 Central Acts — the complete body of parliamentary legislation currently in force
The Constitution of India — all articles, amendments, schedules, and provisions

This is a corpus of 1.52 GB of structured legal text — cleaned, chunked, embedded, and indexed for semantic retrieval.

3. Real-Time Streaming Responses

NyayAI does not make you wait for a complete response before displaying it. Answers stream word by word, in real time, just like the experience you are accustomed to with ChatGPT or other modern AI interfaces. This makes the interaction feel natural, responsive, and fast — even when the underlying analysis is complex.

4. Citation Cards with Metadata

Each source citation in a NyayAI response is presented as a rich citation card that includes:

Case title (e.g., Kesavananda Bharati v. State of Kerala)
Year of judgment
Bench composition
Citation number
The actual chunk of legal text that was used to generate the answer

This is not a footnote with a case name. It is a complete, contextual reference that allows you to evaluate the source yourself.

5. Three Response Modes

Different legal questions require different levels of depth. NyayAI offers three distinct response modes:

Concise — Quick, 2–4 sentence answers for straightforward queries. Ideal when you need a fast answer and already have context.
Detailed — Structured legal analysis with organized sections, relevant precedents, and statutory references. Suitable for most professional research tasks.
Research — A full legal research memo with case-by-case breakdown, comprehensive precedent analysis, and detailed statutory interpretation. Designed for complex legal questions that require thorough treatment.

6. Collapsible Citation Interface

When NyayAI retrieves sources, they are organized into a collapsible interface grouped by source type — Supreme Court Judgments, Central Acts, and Constitution. Each group shows a summary count of the number of sources retrieved, and you can expand or collapse each group to manage the information density. This keeps the interface clean while making every source accessible.

7. Confidence Scoring

Each response includes relevance scores for the retrieved sources. This allows you to assess how closely the source material matches your query. A closely matching citation is more useful than a tangentially related one, and NyayAI makes that distinction visible.

8. Source-Aware Routing

Not every legal question requires the same type of source material. A question about fundamental rights needs the Constitution. A question about criminal procedure needs the relevant statute. A question about judicial interpretation needs case law. NyayAI's retrieval system intelligently routes queries to the right type of legal document, ensuring that the sources it retrieves are appropriate for the question being asked.

9. Dark Professional UI

NyayAI features a deep navy and gold themed interface designed specifically for extended legal research sessions. The dark theme reduces eye strain during long working hours, while the gold accents convey professionalism and authority. Every element of the interface — from typography to spacing to the citation cards — has been designed with legal professionals in mind.

10. Mobile Responsive

Legal research does not always happen at a desk. NyayAI is fully responsive and works seamlessly on phones, tablets, and desktops. Whether you are in a courtroom, in a meeting, or on a train, the full power of the platform is available in your pocket.

11. Secure Access Gate

Access to NyayAI is protected by an access-code authentication system. This ensures that the platform remains secure and that usage can be managed and monitored during the current phase of development and rollout.

What We Built — The Engineering Summary

I want to give you a sense of what went into building NyayAI without diving into technical jargon. The engineering behind this platform is significant, and it is worth understanding at a high level.

We built a custom legal corpus from scratch. This meant acquiring, ingesting, cleaning, and structuring 1.52 GB of Indian legal text. Raw legal documents are messy — inconsistent formatting, OCR artifacts, encoding issues, missing metadata. Every document in our corpus has been processed, normalized, and structured for machine consumption.

We fine-tuned an AI model specifically on Indian legal instruction pairs. This is not a generic model that happens to answer legal questions. It is a model that has been trained on thousands of examples of Indian legal reasoning — questions and answers, case analysis, statutory interpretation, and constitutional commentary. The model understands legal language, legal structure, and legal reasoning in a way that generic models do not.

We built a semantic search engine over 354,293 indexed chunks of legal text. This is not keyword search. When you ask NyayAI a question, it does not look for documents that contain the exact words you used. It understands the meaning of your question and retrieves passages that are semantically relevant, even if they use different terminology.

We designed a retrieval-augmented generation pipeline that grounds every answer in actual source text. The AI does not answer from memory. It retrieves relevant documents first, then generates its response based on those documents. This is what makes the answers verifiable and trustworthy.
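
The retrieve-then-generate flow described above can be sketched in a few lines. This is a toy illustration, not NyayAI's pipeline: a simple word-overlap score stands in for the embedding search, the three-entry corpus is made up for the example (the Article 14 text is abbreviated), and the names `Chunk`, `overlap_score`, `retrieve`, and `build_prompt` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str  # label shown to the reader, e.g. a provision or case title
    text: str    # the passage handed to the model


def overlap_score(query: str, text: str) -> float:
    """Toy relevance score: fraction of query words that appear in the text.
    A real system would compare embedding vectors instead."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def retrieve(query: str, corpus: list[Chunk], k: int = 2) -> list[Chunk]:
    """Step 1: rank the corpus against the query and keep the top k chunks."""
    ranked = sorted(corpus, key=lambda c: overlap_score(query, c.text), reverse=True)
    return ranked[:k]


def build_prompt(query: str, chunks: list[Chunk]) -> str:
    """Step 2: put the retrieved text in front of the model, numbered for citation."""
    context = "\n".join(
        f"[{i}] ({c.source}) {c.text}" for i, c in enumerate(chunks, 1)
    )
    return (
        "Answer using only the numbered sources below and cite them by number.\n"
        f"{context}\n\nQuestion: {query}"
    )


# Tiny made-up corpus for illustration only.
corpus = [
    Chunk("Constitution, Article 21",
          "no person shall be deprived of his life or personal liberty "
          "except according to procedure established by law"),
    Chunk("Constitution, Article 14",
          "the state shall not deny to any person equality before the law"),
    Chunk("Example Act, Section 1", "this act may be called the example act"),
]

query = "personal liberty procedure established by law"
top = retrieve(query, corpus, k=2)
prompt = build_prompt(query, top)
print(prompt)
```

The point of the structure is that the model only ever sees the question together with retrieved text, so each numbered source in the prompt can be surfaced to the reader as a citation.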

We built a streaming inference server that scales to zero when idle. This means we are not paying for expensive GPU compute when no one is using the platform. When a user sends a query, the server spins up, processes the request, streams the response, and then scales back down. This is critical for cost efficiency at our current stage.

We built a modern web application with a premium interface. The frontend is fast, responsive, and professionally designed. It is not an afterthought or a demo UI — it is a production-quality application built for real users.

We deployed globally — the AI backend runs on GPU cloud infrastructure with high-performance hardware, and the frontend is served from a global content delivery network for fast load times anywhere in the world.

The Technology Stack

For those interested in the technical foundations, here is what powers NyayAI at a high level:

Component	Technology
AI Model	Qwen-3 4B, fine-tuned with LoRA on Indian legal instruction data
Embedding Model	BGE-M3 multilingual model for semantic search
Vector Database	FAISS with 354,293 indexed document chunks
Backend	FastAPI on Modal (serverless GPU cloud with L4 GPUs)
Frontend	Next.js 16 deployed on Vercel
Streaming	Server-Sent Events for real-time token streaming

Every component has been chosen deliberately — for performance, for cost efficiency, and for scalability. This is not a stack assembled from tutorials. It is a stack engineered for production-grade legal AI.
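
For readers unfamiliar with Server-Sent Events, here is a minimal sketch of how generated tokens can be framed for streaming. The function name `sse_events`, the `{"token": ...}` payload shape, and the final `done` event are assumptions made for illustration; the post does not describe NyayAI's actual wire format.

```python
import json
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap each token in a Server-Sent Events frame.

    Every frame is a "data:" line followed by a blank line. JSON-encoding
    the payload keeps newlines inside a token from breaking the framing.
    """
    for tok in tokens:
        yield f"data: {json.dumps({'token': tok})}\n\n"
    # Hypothetical end-of-stream marker so the client knows when to stop.
    yield "event: done\ndata: {}\n\n"


frames = list(sse_events(["Article", " 21", " guarantees"]))
for frame in frames:
    print(frame, end="")
```

In a FastAPI backend, a generator like this would typically be returned through `StreamingResponse` with `media_type="text/event-stream"`, and the browser would read it with `EventSource` or a streaming `fetch`.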

The Vision: Where NyayAI is Headed

What exists today is the foundation. The vision is much larger.

Multilingual legal access is at the top of the roadmap. India has 22 officially recognized languages and hundreds of dialects. The law should be accessible in all of them. We are working toward a future where a farmer in Tamil Nadu can ask a legal question in Tamil and receive an accurate, sourced answer — not a rough translation, but a genuine legal response in their own language.

High Court judgment coverage is the next major expansion of the corpus. India has 25 High Courts, each with its own body of case law. Adding High Court judgments will dramatically expand NyayAI's coverage and make it relevant for a much wider range of legal questions.

Tribunal and district court coverage will follow. Specialized tribunals — NCLT, NCLAT, ITAT, SAT, NGT, and others — handle an enormous volume of cases in specialized domains. District courts are where most litigation begins. Covering these courts will make NyayAI comprehensive.

Legal document drafting is a natural extension. Once the system understands the law deeply enough, it can assist in drafting legal notices, petitions, contracts, and other documents — grounded in actual legal provisions and precedents.

Case outcome prediction is an ambitious but achievable goal. By analyzing patterns in historical judgments — how similar cases were decided, which arguments succeeded, which factors were decisive — NyayAI can provide probabilistic assessments of likely outcomes.

Lawyer workflow tools will transform how legal professionals work. Brief generation, argument builders, precedent chains, counter-argument analysis — these are tools that can save lawyers hours of work on every case.

Vernacular access for citizens is perhaps the most important long-term goal. Most Indians are not lawyers. They are citizens who need to understand their rights, their obligations, and their options. NyayAI should be accessible to them — in their language, at their level of understanding, at a price they can afford.

API access for legal tech platforms will allow other developers and companies to build on top of NyayAI's infrastructure. The legal corpus, the retrieval engine, and the AI model can serve as the foundation for an ecosystem of legal technology applications.

The Competitive Landscape

I am often asked: "What if OpenAI builds this? What if Anthropic enters the Indian legal market? What if some well-funded Bengaluru startup beats you to it?"

These are fair questions. Here is my honest answer:

Even if all of them enter this space — and some of them will — most Indian courts, lawyers, and litigants STILL do not have AI-assisted research, retrieval systems, legal summarization, vernacular access, or affordable tooling.

That gap is not going to be filled by one company. The Indian legal system is enormous — 50 million pending cases, millions of legal professionals, 1.4 billion citizens. There is room for multiple players, and the market is so underserved that even modest penetration represents significant impact.

But more importantly, NyayAI has advantages that are difficult to replicate:

Domain specificity. We are not trying to be good at everything. We are trying to be the best at one thing: Indian law.
Curated corpus. Our legal corpus is not scraped from the internet. It is carefully curated, cleaned, and structured for legal AI applications.
Fine-tuned model. Our AI model is not generic. It has been trained specifically on Indian legal reasoning.
Ground-up architecture. Every component of the system — from the embedding pipeline to the retrieval engine to the user interface — has been designed for legal use cases.

NyayAI is purpose-built to fill a gap that generic tools cannot fill and that existing legal platforms have not addressed. That is our moat, and we are deepening it every day.

Progress So Far

NyayAI is approximately 75% complete on the journey from concept to market-ready product.

What we have already crossed:

✅ Data acquisition — sourcing 75 years of Supreme Court judgments, 858 Central Acts, and the Constitution
✅ Data cleaning and structuring — normalizing 1.52 GB of messy legal text into machine-readable format
✅ Document chunking — breaking legal documents into semantically meaningful segments
✅ Embedding generation — converting legal text into high-dimensional vector representations
✅ Retrieval engine — building semantic search over 354,293 document chunks
✅ AI model fine-tuning — training on Indian legal instruction pairs
✅ Inference serving — deploying the model on GPU infrastructure with streaming capabilities
✅ RAG pipeline — grounding AI responses in retrieved source text
✅ Streaming interface — real-time, word-by-word response delivery
✅ Citation grounding — linking every answer to verifiable sources
✅ Systems optimization — latency reduction, cost efficiency, scaling
✅ Frontend UX — professional, responsive, production-quality interface
✅ Global deployment — live and accessible

What remains:

🔲 Trust and reliability — ensuring consistent accuracy across edge cases
🔲 Distribution — reaching lawyers, law firms, courts, and citizens
🔲 Onboarding — making the first-use experience seamless
🔲 User retention — building habits and workflows around NyayAI
🔲 Legal partnerships — collaborating with bar associations, law schools, and legal aid organizations
🔲 Monetization — developing sustainable pricing models
🔲 Sales — building a go-to-market engine
🔲 Adoption loops — creating viral and referral mechanisms
🔲 Consistency — ensuring quality at scale

The hardest part of building a product is not the technology. It is everything that comes after — trust, distribution, adoption, and sustainability. We are now entering that phase.

About the Founder

NyayAI is built by Ashish Raj — solo founder, architect, and builder.

Every component of this platform — from the data pipelines that process raw legal text, to the model training infrastructure, to the retrieval engine, to the streaming backend, to the frontend interface you interact with — was built by one person.

I do not say this to boast. I say it because it speaks to conviction. When you believe that every Indian deserves access to justice, you do not wait for a team, a budget, or permission. You build.

I believe that the right AI, applied to the right domain, with the right data, can transform access to justice in India. Not incrementally. Fundamentally.

That is what NyayAI is. That is what I am building. And I am just getting started.

"Justice delayed is justice denied. But justice inaccessible is justice that never existed at all. NyayAI is being built to change that — one query, one citation, one answer at a time."

— Ashish Raj, Founder, NyayAI

NyayAI: Building an AI Legal Assistant for 1.4 Billion People — A Technical Deep Dive

Ashish Raj — Mon, 25 May 2026 07:02:24 +0000

I'm building a startup to make Indian law accessible to every lawyer, law student, and citizen in the country. Here's the technical story of how I went from zero to a working prototype — training a foundation model from scratch, fine-tuning on 4,000 instruction pairs, building a production-ready RAG pipeline, and shipping a premium SaaS product — all as a solo founder.

The Problem

India has 1.4 billion people and roughly 50 million cases pending in its courts. Lawyers spend hours, sometimes days, digging through bare acts, constitutional articles, and decades of Supreme Court judgments just to find relevant precedents for a single case. The Indian legal system operates across 25 High Courts, hundreds of tribunals, and a Supreme Court that has delivered judgments since 1950. The sheer volume is staggering.

And yet, the tooling available to lawyers is stuck in 2005. Paid databases like SCC Online and Manupatra charge thousands per month and still require manual keyword searches. Free resources like Indian Kanoon are search-only — no summaries, no analysis, no drafting. Generic AI tools like ChatGPT hallucinate case names, invent sections that don't exist, and have no depth in Indian law.

I wanted to change that.

NyayAI (न्याय = justice in Sanskrit) is an AI-powered legal assistant that understands Indian law — not superficially, but deeply. It can look up any section of any central act, summarize Supreme Court judgments, answer complex legal questions with grounded citations, and eventually draft legal documents. Think of it as ChatGPT, but one that actually passed the bar exam for Indian law.

Why Not Just Use ChatGPT?

This is the question I get asked most often. The answer is simple: a general-purpose model is broad intelligence; NyayAI is domain infrastructure for Indian law.

A general-purpose model like ChatGPT:

Does not maintain a live, structured legal retrieval index internally
Cannot guarantee exact citations from 43,000+ judgments
May compress or approximate precedent chains
May occasionally hallucinate paragraph numbers or holdings
Is optimized broadly across all domains, not specifically for Indian jurisprudence

NyayAI is specifically engineered for:

Indian legal retrieval — semantic search over the full corpus of Supreme Court judgments
Citation-grounded answers — every response is backed by actual legal text, not model memory
Statute + precedent linking — connecting Constitutional articles, Central Acts, and case law
Structured metadata retrieval — case title, bench, citation number, year, disposal type
Legal-domain-specific RAG — retrieval-augmented generation tuned for jurisprudence

The analogy is precise: GitHub Copilot is better than raw autocomplete for coding. Bloomberg exists despite Google. Westlaw exists despite search engines. NyayAI exists because Indian law deserves its own intelligence layer.

This blog post is a technical deep dive into everything I've built — the data pipelines, the model architecture decisions, the training infrastructure, the RAG pipeline, the production frontend, and the results. Every number, every decision, every failed experiment is documented here.

Phase 0: The 103M Parameter Experiment (The Learning Phase)

Before touching any pretrained model, I wanted to understand transformers at the deepest level. Not "import transformers and call .fit()" — I mean implementing a GPT-style transformer from scratch in PyTorch.

Architecture

I built a decoder-only transformer with the following specifications:

Parameter	Value
Total Parameters	103,457,280 (~103M)
Layers	9
Attention Heads	12
Embedding Dimension	768
Context Window	512 tokens
Vocabulary Size	50,257 (GPT-2 tokenizer)
Output Head	Weight-tied with embedding layer

The model was trained on 269 million tokens (1.25 GB) of Indian legal text — the same corpus I'd later use for the production pipeline. Training ran on NVIDIA A100 GPUs via Modal for 2 epochs across 59,000 gradient steps.
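
As a rough sanity check on the table above, the parameter count of a decoder-only transformer can be estimated from its dimensions. The sketch below assumes a standard GPT-2-style layout (learned positional embeddings, biases on all linear layers, a 4x MLP expansion, two LayerNorms per block plus a final one, and a tied output head that adds no parameters). Those layout details are assumptions; the post does not list them.

```python
def gpt2_style_param_count(n_layer: int, d_model: int, vocab_size: int, n_ctx: int) -> int:
    """Estimate parameters for a GPT-2-style decoder with a weight-tied output head."""
    token_emb = vocab_size * d_model
    pos_emb = n_ctx * d_model                      # learned positions (assumed)
    attn = 4 * d_model * d_model + 4 * d_model     # QKV + output projection, with biases
    mlp = 8 * d_model * d_model + 5 * d_model      # d -> 4d -> d, with biases
    norms = 4 * d_model                            # two LayerNorms per block (scale + shift)
    final_norm = 2 * d_model
    return token_emb + pos_emb + n_layer * (attn + mlp + norms) + final_norm


total = gpt2_style_param_count(n_layer=9, d_model=768, vocab_size=50257, n_ctx=512)
print(f"{total:,}")  # 102,782,976
```

Under these assumptions the estimate comes to 102,782,976, close to but not equal to the 103,457,280 in the table. The gap would come from layout details that are not spelled out here. Note that the 50,257 x 768 embedding matrix alone accounts for about 38.6M of the total, which is why weight tying matters at this scale.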

Results

Metric	Value
Final Validation Loss	2.46
Perplexity	11.7
Training Time	~8 hours

A perplexity of 11.7 on legal text means the model learned the structure and vocabulary of Indian legal language reasonably well. It could generate coherent legal-sounding text, but it was not a useful model — it had no instruction-following capability and no factual grounding. It was a learning exercise, and it served its purpose brilliantly.
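
The two numbers in the results table are linked: perplexity is the exponential of the mean cross-entropy loss. A quick check, assuming the validation loss is reported in nats (the default for PyTorch's cross-entropy):

```python
import math

val_loss = 2.46                      # final validation loss from the table
perplexity = math.exp(val_loss)      # perplexity = e^(mean cross-entropy)
print(round(perplexity, 1))          # 11.7
```

Intuitively, a perplexity of 11.7 means the model is, on average, about as uncertain as if it were choosing uniformly among roughly 12 candidate next tokens.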

Key Takeaway: Building a transformer from scratch taught me more about attention mechanisms, positional encoding, loss landscapes, and gradient dynamics than any course or paper ever could. If you're serious about ML, I strongly recommend doing this at least once.

Phase 1: Data Acquisition — The Foundation of Everything

A model is only as good as its data. For NyayAI, I needed three categories of legal text:

The Constitution of India — the supreme law, 395+ articles
Central Acts (Bare Acts) — the 858 laws passed by Parliament
Supreme Court Judgments — 75 years of case law (1950–2025)

1A. The Constitution of India

Source: A structured JSON file containing all articles with metadata (article number, title, description).

Pipeline: A straightforward JSON-to-text converter that:

Parses each article from the JSON
Cleans escaped newlines and normalizes whitespace
Preserves repealed articles with notation
Formats as structured text with Article N — Title headers
Separates each article with <|endoftext|> tokens for clean document boundaries
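
A minimal sketch of such a converter is shown below. The JSON key names (`article`, `title`, `description`) are assumptions based on the metadata fields listed above, and the handling of repealed articles is not modeled because the post does not describe the notation used.

```python
import re

EOT = "<|endoftext|>"
HEADER_SEP = " \u2014 "  # separator used in the "Article N <dash> Title" header


def format_article(article: dict) -> str:
    """Turn one article record into a header line plus cleaned body text."""
    number = str(article["article"]).strip()        # key name is an assumption
    title = str(article.get("title", "")).strip()
    body = str(article.get("description", ""))
    body = body.replace("\\n", "\n")                 # escaped newlines -> real newlines
    body = re.sub(r"[ \t]+", " ", body)              # collapse runs of spaces and tabs
    body = re.sub(r"\n\s*\n+", "\n\n", body).strip() # at most one blank line in a row
    return f"Article {number}{HEADER_SEP}{title}\n{body}"


def constitution_to_text(articles: list[dict]) -> str:
    """Join all articles, separated and terminated by the document-boundary token."""
    return f"\n{EOT}\n".join(format_article(a) for a in articles) + f"\n{EOT}\n"


sample = [
    {"article": "21", "title": "Protection of life and personal liberty",
     "description": "No person shall be deprived\\nof his life   or personal liberty."},
]
print(constitution_to_text(sample))
```

Keeping the boundary token on its own line makes it easy to split the file back into individual articles later.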

Output:

Metric	Value
Articles Processed	395+ (including amendments)
File Size	502 KB
Estimated Tokens	~106,000

The Constitution is small but dense — every article matters. The Preamble alone is one of the most frequently cited legal texts in Indian jurisprudence.

1B. Central Acts (858 Bare Acts)

This was significantly more complex. India has 858 central acts in force, ranging from the Indian Penal Code (1860) to the Digital Personal Data Protection Act (2023). These were stored as deeply nested JSON files with a schema that included:

Act Title, Act ID, Enactment Date, Act Definition
├── Chapters/Parts
│   ├── Sections
│   │   └── Paragraphs (strings or nested dicts with text/contains)
│   └── Subheadings
│       └── Sections
├── Schedules, Annexures, Appendix, Forms
└── Footnotes

Pipeline: A recursive JSON traversal engine that:

Handles BOM encoding — many Indian government JSON files contain a byte-order mark
Recursively extracts paragraphs — handles arbitrarily nested text/contains structures with proper indentation
Cleans legislative artifacts — removes footnote reference numbers, strips decorative markers
Sorts sections numerically — a custom sort function ensures Section 2 comes before Section 10
Processes chapters, subheadings, schedules, annexures, and footnotes — preserving the full hierarchical structure
Outputs with <|endoftext|> boundaries between each act
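
Three of the steps above (BOM handling, recursive paragraph extraction, and numeric section sorting) can be sketched as follows. The function names are hypothetical, and the treatment of the `text` and `contains` keys is an assumption based on the schema outline; the real files may carry more variants.

```python
import json
import re


def load_json_bytes(raw: bytes) -> dict:
    """Decode JSON bytes, dropping a leading byte-order mark if one is present."""
    return json.loads(raw.decode("utf-8-sig"))


def extract_paragraphs(node, depth: int = 0) -> list[str]:
    """Flatten strings, lists, and {"text", "contains"} dicts into indented lines."""
    indent = "  " * depth
    if isinstance(node, str):
        return [indent + node.strip()] if node.strip() else []
    if isinstance(node, list):
        lines = []
        for item in node:
            lines.extend(extract_paragraphs(item, depth))
        return lines
    if isinstance(node, dict):
        lines = []
        if node.get("text"):
            lines.extend(extract_paragraphs(node["text"], depth))
        if "contains" in node:
            # Children are indented one level deeper than their parent text.
            lines.extend(extract_paragraphs(node["contains"], depth + 1))
        return lines
    return []  # ignore numbers, None, and other unexpected values


def section_sort_key(label: str):
    """Sort "2" before "10", and "2A" right after "2"."""
    m = re.match(r"\s*(\d+)\s*([A-Za-z]*)", label)
    if not m:
        return (float("inf"), label)  # non-numeric labels go last
    return (int(m.group(1)), m.group(2).upper())


print(sorted(["10", "2", "2A", "1"], key=section_sort_key))  # ['1', '2', '2A', '10']
```

A plain string sort would place "10" before "2", which is why the key converts the leading digits to an integer before comparing.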

Output:

Metric	Value
Acts Processed	858
File Size	29.9 MB
Total Words	~5,076,000
Estimated Tokens	~6,600,000

1C. Supreme Court Judgments (1950–2025)

This was the heavy lift — and the most valuable data. The Supreme Court of India has delivered tens of thousands of judgments over 75 years. I sourced these from the AWS Open Data Registry (s3://indian-supreme-court-judgments), a public bucket containing judgment PDFs and metadata JSONs organized by year.

Step 1: Download

Uses boto3 with unsigned requests (public bucket, no auth needed)
Downloads English judgment tar files and metadata tar files for each year (1950–2025)
Implements resume support — skips files that already exist with correct size
Logs progress with download speed tracking
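
A minimal sketch of the download step with resume support is shown below. The bucket name comes from the post; the object keys and local directory layout are not given, so `key` and `dest_dir` are placeholders supplied by the caller, and the helper names are hypothetical. Progress logging is left out for brevity.

```python
import os

BUCKET = "indian-supreme-court-judgments"  # public bucket named in the post


def needs_download(local_path: str, remote_size: int) -> bool:
    """Resume support: skip files that already exist with the expected size."""
    return not (
        os.path.exists(local_path) and os.path.getsize(local_path) == remote_size
    )


def download_key(key: str, dest_dir: str) -> str:
    """Fetch one object anonymously unless an identical-size copy is already on disk."""
    # Imported lazily so the resume helper above works without boto3 installed.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    # Unsigned requests: the bucket is public, so no credentials are needed.
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    local_path = os.path.join(dest_dir, os.path.basename(key))
    remote_size = s3.head_object(Bucket=BUCKET, Key=key)["ContentLength"]
    if needs_download(local_path, remote_size):
        os.makedirs(dest_dir, exist_ok=True)
        s3.download_file(BUCKET, key, local_path)
    return local_path
```

Comparing sizes is a cheap resume check: a partially written file from an interrupted run will not match the remote size and gets fetched again.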

Step 2: Extract & Process

This is the most complex pipeline in the entire project. It:

Extracts metadata tars — unpacks year-by-year JSON metadata files
Parses metadata HTML — each judgment's metadata is stored as raw HTML. A dedicated parser extracts:
- Case title (petitioner vs respondent)
- Judges/Coram
- Decision date
- Case number
- Bench size
- Citation
- Disposal nature
Extracts text from PDFs — uses PyMuPDF (fitz) to extract text from judgment PDFs, then cleans:
- Page headers/footers ("SUPREME COURT REPORTS", standalone page numbers)
- Excessive whitespace
- Year-only lines (standalone "1950", "2023", etc.)
Matches PDFs to metadata — correlates each PDF with its extracted case metadata by path key
Formats each judgment as a structured document with a header block (title, citation, case number, date, bench, disposal) followed by the full judgment text
Processes year-by-year — streams output to avoid loading 1.5 GB of text into memory at once
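
The text-cleaning part of this pipeline can be sketched with a few regular expressions. This is an illustration under assumptions, not the actual cleaner: the patterns below are guesses at what "page headers", "standalone page numbers", and "year-only lines" look like in the extracted text, and the PyMuPDF extraction step itself is not shown.

```python
import re

# Lines that start with the reporter header are dropped (assumed header form).
HEADER_RE = re.compile(r"^\s*SUPREME COURT REPORTS.*$", re.IGNORECASE)
# Digit-only lines cover both standalone page numbers and year-only lines.
DIGITS_ONLY_RE = re.compile(r"^\s*\d{1,4}\s*$")


def clean_judgment_text(raw: str) -> str:
    """Remove page furniture and normalize whitespace in extracted PDF text."""
    kept = []
    for line in raw.splitlines():
        if HEADER_RE.match(line) or DIGITS_ONLY_RE.match(line):
            continue
        kept.append(re.sub(r"[ \t]+", " ", line).strip())  # collapse inner spacing
    text = "\n".join(kept)
    return re.sub(r"\n{3,}", "\n\n", text).strip()         # cap consecutive blank lines


raw_page = "SUPREME COURT REPORTS [1950]\n12\nThe   appeal is\n\n\n\nallowed.\n1950\n"
print(clean_judgment_text(raw_page))
```

One trade-off worth noting: a digit-only filter also removes any genuine body line that consists solely of a short number, so a production cleaner would likely need more context than this sketch uses.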

Output:

Metric	Value
Judgments Processed	43,324
File Size	1.49 GB (1,588,861,395 bytes)
Total Words	~261,000,000
Estimated Tokens	~339,300,000
Time Span	1950–2025 (75 years)

Total Corpus Summary

Source	File Size	Tokens (est.)
Constitution of India	502 KB	~106K
Central Acts (858 acts)	29.9 MB	~6.6M
SC Judgments (43,324 cases)	1.49 GB	~339.3M
Total	~1.52 GB	~346 Million

This is a genuinely massive legal corpus — 346 million tokens of structured, cleaned Indian legal text spanning 75 years of Supreme Court jurisprudence, the entire Constitution, and every central act in force.
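
The token estimates in the tables above are consistent with a simple heuristic of about 1.3 tokens per word. That ratio is inferred from the reported figures rather than stated in the post, so treat the sketch below as a reconstruction of the arithmetic, not a description of the method actually used.

```python
TOKENS_PER_WORD_X10 = 13  # 1.3 tokens per word, inferred from the tables (assumption)


def estimate_tokens(words: int) -> int:
    """Estimate token count from word count using integer math to avoid float drift."""
    return words * TOKENS_PER_WORD_X10 // 10


acts = estimate_tokens(5_076_000)          # Central Acts word count from the table
judgments = estimate_tokens(261_000_000)   # Supreme Court judgments word count
constitution = 106_000                     # taken directly; no word count is given
total = constitution + acts + judgments

print(f"{acts:,}")       # 6,598,800
print(f"{judgments:,}")  # 339,300,000
print(f"{total:,}")      # 346,004,800
```

The three sources sum to roughly 346 million tokens, matching the total row, with the judgments contributing about 98% of the corpus.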

Phase 1.5: Synthetic Instruction Dataset Generation

A language model that can continue legal text is interesting but not useful. To make it follow instructions — answer questions, summarize cases, compare sections — I needed an instruction-response dataset.

Creating thousands of high-quality legal Q&A pairs by hand was not feasible. Instead, I built a synthetic data generation pipeline using Google's Gemini API.

The Approach

Random chunk sampling — for each batch, randomly select a ~40,000 character chunk from one of the three source files, with a weighted distribution:
- 60% Supreme Court judgments (largest, most diverse)
- 30% Central Acts (statute-heavy, structured)
- 10% Constitution (fundamental, frequently referenced)
Structured prompting — each chunk is sent to gemini-3.1-flash-lite with a carefully crafted prompt that enforces:
- No hallucination — responses must be based strictly on the provided text excerpt
- Diversity in length and complexity — each batch of 5 pairs follows a prescribed format:
  - Task 1: Very Long (3-4 paragraph comprehensive summary/brief)
  - Task 2: Medium (legal argument/analysis)
  - Task 3: Medium (comparison of concepts)
  - Task 4: Short (direct factual question)
  - Task 5: Short (yes/no client question with explanation)
- Structured output — uses Pydantic models with response_mime_type: application/json for reliable parsing
Incremental saving — pairs are appended to a JSONL file as they're generated, with a running count. Supports resume (checks existing pair count on startup).
Rate limiting — 4-second sleep between requests to respect the free tier (15 RPM).
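The sampling and pacing parts of this approach can be sketched with the standard library alone. The names below (`pick_source`, `sample_chunk`) are hypothetical; the weights, chunk size, and 15 RPM figure come from the description above, and the Gemini call itself is left out.

```python
import random

SOURCE_WEIGHTS = {"judgments": 0.6, "acts": 0.3, "constitution": 0.1}
CHUNK_CHARS = 40_000
# 15 requests per minute on the free tier -> one request every 4 seconds.
MIN_INTERVAL_S = 60 / 15


def pick_source(rng: random.Random) -> str:
    """Choose which source file to sample from, using the 60/30/10 split."""
    names = list(SOURCE_WEIGHTS)
    weights = [SOURCE_WEIGHTS[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]


def sample_chunk(text: str, rng: random.Random, size: int = CHUNK_CHARS) -> str:
    """Return a random contiguous window of `size` characters (or the whole text)."""
    if len(text) <= size:
        return text
    start = rng.randrange(len(text) - size + 1)
    return text[start:start + size]
```

In a generation loop, a `time.sleep(MIN_INTERVAL_S)` after each request would keep the client at or below the free-tier limit.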

Output

Metric	Value
Generated Pairs	~4,000
File Size	2.09 MB
Source Distribution	60% judgments, 30% acts, 10% constitution
Generation Model	Gemini 3.1 Flash Lite
Cost	$0 (free tier API)

Training Data Distribution Analysis

After generation, I analyzed the response length distribution — this turned out to be a critical insight for understanding model behavior later:

Stat	Value
Median response	110 words (~150 tokens)
25th percentile	45 words
75th percentile	147 words (~200 tokens)
Max response	367 words (~490 tokens)
Under 100 words	45% of all training data
100-200 words	37%
200-400 words	18%
400+ words	0%

This distribution matters enormously: the model will learn to produce responses at the length distribution it was trained on. More on this in Phase 4B.
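A report like the table above can be produced in a few lines. This is a sketch under assumptions: word counts come from whitespace splitting, the bucket boundaries are treated as half-open, and `length_report` is a hypothetical helper rather than the script actually used.

```python
import statistics


def length_report(responses: list[str]) -> dict:
    """Word-count quartiles and bucket shares for a list of responses."""
    counts = sorted(len(r.split()) for r in responses)
    p25, median, p75 = statistics.quantiles(counts, n=4)
    n = len(counts)
    buckets = {
        "under_100": sum(c < 100 for c in counts) / n,
        "100_to_199": sum(100 <= c < 200 for c in counts) / n,
        "200_to_399": sum(200 <= c < 400 for c in counts) / n,
        "400_plus": sum(c >= 400 for c in counts) / n,
    }
    return {"p25": p25, "median": median, "p75": p75, "max": counts[-1], **buckets}
```

Running this over the `response` field of each JSONL record before training makes a short-answer bias visible early, while it is still cheap to regenerate data.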

The takeaway: the quality of your instruction data matters far more than quantity. The original Stanford Alpaca project used only 52K pairs to teach instruction-following to LLaMA. For a domain-specific model, 2,000-4,000 high-quality, grounded pairs can be enough, as long as they're diverse in task type and faithful to the source material.

Phase 2: Fine-Tuning — Teaching the Model Indian Law

With data in hand, it was time to take a state-of-the-art pretrained model and teach it to be an Indian legal expert.

Model Selection: Qwen-3 4B Instruct

After evaluating several sub-6B parameter models (Phi-4-mini, SmolLM3-3B, Gemma-3n-E2B), I chose Qwen-3 4B Instruct (2507 variant) for several reasons:

Factor	Why Qwen-3 4B
Reasoning	Exceptional chain-of-thought and instruction following
Multilingual	Strong Hindi support (critical for Indian legal market)
Architecture	Modern optimizations, efficient attention
Ecosystem	Massive HuggingFace community, well-documented
License	Apache 2.0 — fully commercial use
Size	4B parameters — fits in a single L4 GPU (24GB) in bfloat16

Training Infrastructure

Everything runs on Modal, a serverless GPU cloud that lets you define an entire training pipeline in a single Python file and run it with one command. Every stage, from data loading to checkpoint saving, executes remotely. Checkpoints are saved to a Modal Volume and automatically downloaded to my local machine after each epoch.

LoRA: Training Smart, Not Expensive

Fine-tuning all 4 billion parameters would require multiple GPUs and cost hundreds of dollars. Instead, I implemented LoRA (Low-Rank Adaptation) from scratch — no HuggingFace PEFT library, no Unsloth, no shortcuts.

How LoRA Works

Instead of updating the full weight matrix W (size d × d), LoRA decomposes the update into two small matrices:

W' = W + s · (A × B)
where A is (d × r), B is (r × d), r << d, and s is a scaling factor (conventionally α/r)

For rank r=16 and dimension d=768, instead of updating 589,824 parameters per layer, you're updating 16×768 + 16×768 = 24,576 parameters — a 24x reduction.
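The arithmetic in that example is easy to check directly. The helper name `lora_param_count` is just for illustration.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA pair: A is (d_in x rank), B is (rank x d_out)."""
    return rank * d_in + rank * d_out


full_update = 768 * 768                        # updating W directly
lora_update = lora_param_count(768, 768, 16)   # updating A and B instead
reduction = full_update // lora_update

print(full_update, lora_update, reduction)
```

The reduction factor grows with the layer width: for a square layer it is d / (2r), so wider layers benefit more at the same rank.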

Implementation

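import torch


class LORALayer(torch.nn.Module):
    """Low-rank branch: x -> alpha * (x @ A) @ B."""

    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        # A projects the input down to the low-rank space (default nn.Linear init).
        self.A = torch.nn.Linear(in_dim, rank, bias=False)
        # B starts at zero, so the branch contributes nothing at step 0.
        self.B = torch.nn.Parameter(torch.zeros(rank, out_dim))
        # Applied as-is; pass alpha / rank here for the conventional LoRA scaling.
        self.alpha = alpha

    def forward(self, x):
        return self.alpha * (self.A(x) @ self.B)


class LinearWithLoRA(torch.nn.Module):
    """Wraps an existing nn.Linear and adds the LoRA branch to its output."""

    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear  # base projection; its weights are frozen elsewhere
        self.lora = LORALayer(linear.in_features, linear.out_features, rank, alpha)

    def forward(self, x):
        return self.linear(x) + self.lora(x)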

The B matrix is initialized to zeros, so at the start of training, LoRA(x) = α × (A(x) @ 0) = 0. The model starts exactly where the pretrained model left off — no disruption. As training progresses, the LoRA layers learn domain-specific adaptations while the base model stays frozen.

Target Modules

LoRA adapters were injected into the attention layers only:

lora_target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]

Hyperparameters

Parameter	Value	Rationale
LoRA Rank	16	Sweet spot: enough capacity for domain adaptation without overfitting on ~4K pairs
LoRA Alpha	32	α/r = 2.0 scaling factor — standard choice
Peak Learning Rate	2e-5	Conservative — avoiding catastrophic forgetting of base model knowledge
Minimum Learning Rate	2e-6	10x decay from peak
Warmup Steps	50	Quick ramp to prevent early instability
Batch Size	4	Fits in L4 VRAM with gradient checkpointing
Max Sequence Length	8,192	Training-time cap; Qwen-3's native context window is considerably longer
Weight Decay	0.1	Standard regularization
Gradient Clipping	1.0 (max norm)	Prevents exploding gradients on long legal sequences
Optimizer	AdamW	Only over LoRA parameters
Precision	bfloat16	Native on L4, no precision loss for this scale
Epochs	2	Sufficient for convergence on this dataset size

Parameter Efficiency

Category	Count
Total Model Parameters	~4,000,000,000
Frozen (Base Model)	~3,988,200,000
Trainable (LoRA)	~11,800,000
Parameter Ratio	~0.30%

We're training less than 0.3% of the model's parameters. The LoRA adapter checkpoint is ~135 MB — compared to the full model's ~8 GB in bfloat16.
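The trainable-parameter figure can be reproduced from the projection shapes. The dimensions below are assumptions taken from my reading of the published Qwen-3 4B configuration (hidden size 2560, 36 layers, 32 query heads and 8 key/value heads of size 128); they are not stated in this post, so treat the result as a sanity check rather than an authoritative count.

```python
RANK = 16
NUM_LAYERS = 36  # assumed

# (in_features, out_features) for each adapted projection (assumed shapes).
PROJECTION_SHAPES = {
    "q_proj": (2560, 4096),
    "k_proj": (2560, 1024),
    "v_proj": (2560, 1024),
    "o_proj": (4096, 2560),
}

per_layer = sum(RANK * (d_in + d_out) for d_in, d_out in PROJECTION_SHAPES.values())
trainable = per_layer * NUM_LAYERS
adapters = len(PROJECTION_SHAPES) * NUM_LAYERS
share = 100 * trainable / 4_000_000_000  # percent of ~4B total parameters

print(f"{trainable:,} trainable parameters in {adapters} adapters ({share:.2f}%)")
```

Under these assumptions the count lands at about 11.8M parameters across 144 adapted projections, in line with the table.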

Data Formatting: ChatML

Every instruction-response pair is formatted in ChatML (the template Qwen expects):

<|im_start|>system
You are an expert Indian Legal Assistant.<|im_end|>
<|im_start|>user
What are the key provisions of Section 14 of the Hindu Succession Act?<|im_end|>
<|im_start|>assistant
Section 14 of the Hindu Succession Act, 1956, is a landmark provision...<|im_end|>
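A small helper along these lines could render each pair into that layout. `to_chatml` is a hypothetical name; in practice the tokenizer's built-in chat template can do the same job.

```python
SYSTEM_PROMPT = "You are an expert Indian Legal Assistant."


def to_chatml(instruction: str, response: str, system: str = SYSTEM_PROMPT) -> str:
    """Render one instruction-response pair as a three-turn ChatML string."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{instruction}<|im_end|>\n"
        f"<|im_start|>assistant\n{response}<|im_end|>"
    )
```

The closing `<|im_end|>` on the assistant turn matters for training: it is the token the model learns to emit when its answer is finished.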

Custom Collation: Dynamic Batch Padding

Rather than padding all sequences to the maximum model length (8,192 tokens), I implemented dynamic batch padding — each batch is padded only to the length of its longest sequence. This saves enormous amounts of compute. If a batch's longest sequence is 1,200 tokens, we're processing 1,200 × 4 = 4,800 tokens instead of 8,192 × 4 = 32,768 tokens. On average, this reduces compute by ~70-80%.
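The idea can be shown with plain Python lists standing in for tensors. This is a simplified sketch: `collate` and `pad_id=0` are illustrative choices, and a real collate function would also build labels and convert to tensors.

```python
def collate(batch: list[list[int]], pad_id: int = 0):
    """Pad every sequence only to the longest sequence in this batch."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # 1 marks real tokens, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return input_ids, attention_mask


ids, mask = collate([[5, 6, 7], [8], [9, 10]])
print(ids)   # each row now has length 3, not the model maximum
print(mask)
```

For the example in the text, a batch capped at 1,200 tokens processes 4,800 positions instead of 32,768, which is about an 85% saving for that batch.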

Learning Rate Schedule: Cosine with Linear Warmup

Linear warmup (0 → 2e-5 over 50 steps) — prevents early training instability
Cosine decay (2e-5 → 2e-6 over remaining steps) — smooth convergence without sharp drops
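Written as a function of the step number, the schedule might look like this. The constants are the ones from the hyperparameter table; `lr_at` is a hypothetical name and the exact step counting in the real loop may differ by one.

```python
import math

PEAK_LR = 2e-5
MIN_LR = 2e-6
WARMUP_STEPS = 50


def lr_at(step: int, total_steps: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay down to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    progress = min(1.0, progress)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))
```

With roughly 1,700 total steps, this gives a learning rate a little above 1.1e-5 at step 850, close to the mid-decay value reported in the results table.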

Memory Optimization: Gradient Checkpointing

With 4B parameters in bfloat16, the model alone takes ~8GB of VRAM. Add optimizer states, gradients, and activations for 8,192-token sequences, and you blow past 24GB easily. Gradient checkpointing trades ~30% more compute time for ~40% VRAM savings — the difference between fitting and OOM.

Fault-Tolerant Training: The Generator Pattern

Training on cloud GPUs can fail for many reasons — preemption, network issues, timeouts. The training loop uses Python's generator pattern (yield) to stream results back to the local machine after each epoch. This means even if training crashes after epoch 1, I already have the checkpoint downloaded locally.
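The pattern can be illustrated locally with an ordinary generator. Everything here is a stand-in: `train` fakes an epoch and simulates a crash, and `collect_checkpoints` plays the role of the local caller. The remote-execution plumbing is omitted.

```python
from typing import Iterator, Optional


def train(num_epochs: int, crash_before_epoch: Optional[int] = None) -> Iterator[dict]:
    """Stand-in training loop: yields one result per finished epoch."""
    for epoch in range(1, num_epochs + 1):
        if epoch == crash_before_epoch:
            raise RuntimeError("simulated preemption")
        # ... one epoch of real training would run here ...
        yield {"epoch": epoch, "checkpoint": f"adapter-epoch-{epoch}".encode()}


def collect_checkpoints(results: Iterator[dict]) -> dict:
    """Store each checkpoint as soon as it arrives, so a later crash loses nothing."""
    saved = {}
    try:
        for result in results:
            saved[result["epoch"]] = result["checkpoint"]  # e.g. write to local disk
    except RuntimeError as err:
        print(f"training stopped early: {err}")
    return saved


print(sorted(collect_checkpoints(train(2, crash_before_epoch=2))))
```

Because the consumer handles each yielded result before the next epoch starts, the epoch 1 checkpoint survives even when epoch 2 never completes.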

Training Results

Training ran for 2 full epochs on an NVIDIA L4 GPU (24GB VRAM) via Modal.

Metric	Epoch 1 End	Epoch 2 End (Final)
Training Loss	~1.05	~0.69
Validation Loss	~1.00	~0.92
Learning Rate	~1.2e-5 (mid-decay)	~2e-6 (minimum)
Tokens Processed	~4.5M	~9.0M
Global Steps	~850	~1,700

Key Observations

Smooth convergence — no loss spikes, no instability. The warmup + cosine schedule + gradient clipping combination worked perfectly.
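No significant overfitting: validation loss tracked training loss closely through epoch 1 (~1.05 vs ~1.00) and kept improving in epoch 2. The gap widened in epoch 2 (0.69 vs 0.92), which is expected on a ~4K-pair dataset but would be worth watching with more epochs.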
Rapid initial learning — the steepest loss drop happened in the first 200 steps of epoch 1, as the model quickly adapted to the legal domain's vocabulary and style.
Diminishing returns in epoch 2 — most of the learning happened in epoch 1. Epoch 2 provided refinement but the marginal improvement was smaller.

Phase 3: The Production RAG Pipeline — Architecture, Sharding, & Serving

A fine-tuned model knows how to talk like a legal expert, but it doesn't remember specific facts. When a lawyer asks "What does Section 34 of the Indian Trusts Act say?", a model might generate something that sounds legally plausible but is entirely fabricated.

To solve this, I designed and built a production-grade, highly optimized RAG (Retrieval-Augmented Generation) pipeline. This lookup mechanism allows our fine-tuned Qwen model to query a massive vector database of Indian law, extract the exact legal provisions, and generate answers strictly grounded in the source material with pinpoint citations.

3A. LoRA Adapter Merging

Running a model with active LoRA weights in production adds computational overhead and complicates serving. To achieve maximum inference speed and simplify deployment, I mathematically blend the LoRA weights directly into the base Qwen-3 4B parameters:

$$W_{\text{merged}} = W_{\text{base}} + \frac{\alpha}{r} (A \times B)$$

Result: Fused 144 adapter projection layers in exactly 20.4 seconds. The final standalone model (~7.5 GB in bfloat16 precision) was saved directly to the persistent Modal Volume.
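Why merging is safe can be checked on a toy example: the merged weight produces exactly the same output as the base layer plus the LoRA branch. The sketch below uses tiny integer matrices and plain Python so the equality is exact. It assumes weights laid out as (in × out) with y = x @ W, and takes the scaling factor as a parameter; `torch.nn.Linear` stores weights transposed, so a real merge would transpose the update accordingly.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge(W, A, B, scale):
    """Return W + scale * (A @ B), the standalone merged weight."""
    delta = matmul(A, B)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


W = [[1, 0], [0, 1]]   # base weight, 2 x 2
A = [[1], [2]]         # down-projection, 2 x 1 (rank 1)
B = [[3, 4]]           # up-projection, 1 x 2
scale = 2              # e.g. alpha / r = 32 / 16
x = [[1, 2]]

merged_out = matmul(x, merge(W, A, B, scale))
base_out = matmul(x, W)
lora_out = matmul(matmul(x, A), B)
adapter_out = [[b + scale * l for b, l in zip(base_out[0], lora_out[0])]]
print(merged_out, adapter_out)
```

Both paths give the same vector, which is why the adapter can be folded in once and discarded at inference time.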

3B. Structure-Aware Legal Chunking

Legal documents have natural, highly structured segmentations (articles, sections, subsections). Naive chunking (e.g., splitting every 500 characters blindly) splits legal clauses in half, completely ruining retrieval precision.

I built a structure-aware chunking pipeline that parses the three source document types into structured chunks while preserving critical legal metadata mappings:

Constitution of India: Split by Article bounds → 468 chunks (average 1,025 characters).
Central Acts: Split recursively by Section bounds → 23,152 chunks (average 1,364 characters).
Supreme Court Judgments: Split by structured paragraphs, with metadata headers (case title, citation, bench, year) prepended to each chunk → 330,673 chunks (average 4,756 characters).

Output: 354,293 chunks compiled into a single 1.6 GB file. Each chunk contains its text, chunk_id, and a metadata dictionary mapping its original source attributes (e.g., article_number, act_title, section, case_title, citation, bench, year).
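As an illustration of splitting on legal structure rather than character counts, here is a minimal sketch for the Constitution. It assumes each article begins on its own line as "Article N." which may not match the actual source formatting, and the `chunk_id` scheme is invented for the example.

```python
import re

# Zero-width match at the start of any line that opens a new article.
ARTICLE_START = re.compile(r"^(?=Article\s+\d+[A-Z]*\.)", re.MULTILINE)
ARTICLE_NUMBER = re.compile(r"Article\s+(\d+[A-Z]*)\.")


def chunk_constitution(text: str) -> list[dict]:
    """Split on article boundaries and attach the article number as metadata."""
    parts = [p.strip() for p in ARTICLE_START.split(text) if p.strip()]
    chunks = []
    for i, part in enumerate(parts):
        match = ARTICLE_NUMBER.match(part)
        chunks.append({
            "chunk_id": f"constitution-{i}",
            "text": part,
            "metadata": {
                "source": "constitution",
                "article_number": match.group(1) if match else None,
            },
        })
    return chunks
```

The same shape extends to acts (split on section boundaries) and judgments (split on paragraphs, with the case header prepended), with the metadata keys changing per source type.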

3C. Massively Parallel GPU Map-Reduce Embedding

Generating vector embeddings for 354,293 documents using a state-of-the-art multi-lingual model (BGE-M3) would take days on a single machine. To solve this, I built a highly distributed Map-Reduce pipeline using Modal.

graph TD
    A[354,293 chunks] --> B[Coordinator Function]
    B -->|Split into 32 shards| C[Shard Inputs]

    C -->|Shard 0| D1[L4 GPU Worker 1]
    C -->|Shard 1| D2[L4 GPU Worker 2]
    C -->|...| D3[L4 GPU Worker ...]
    C -->|Shard 31| D4[L4 GPU Worker 32]

    D1 -->|Embed FP16| E1[11,000 vectors]
    D2 -->|Embed FP16| E2[11,000 vectors]
    D3 -->|Embed FP16| E3[... vectors]
    D4 -->|Embed FP16| E4[11,000 vectors]

    E1 --> F[Reduce / Concatenate]
    E2 --> F
    E3 --> F
    E4 --> F

    F --> G[(FAISS Index FlatIP <br> 354,293 x 1024)]
    F --> H[(SQLite chunk_lookup.db)]

The Map Phase: The coordinator divides the 354K chunks into 32 shards (~11,000 chunks per shard). Modal automatically spins up 32 parallel L4 GPU containers in the cloud simultaneously.
Pre-Caching & Instant Boot: The BGE-M3 model weights are baked directly into the Docker image layer, bypassing HuggingFace downloads and enabling the GPU servers to boot instantly.
FP16 Inference: Each worker runs native PyTorch float16 inference over its 11,000 texts, generating normalized dense embeddings in a fraction of the time.
The Reduce Phase: The coordinator gathers the 32 output matrices and concatenates them in shard order, so each row still lines up with its chunk, into a single dense matrix of shape (354293, 1024).
FAISS Index Compilation: The combined embeddings are fed into a FAISS IndexFlatIP index and saved. Because the embeddings are normalized, inner product is equivalent to cosine similarity. Simultaneously, a SQLite lookup database is generated on the volume.
Compute Time: The entire parallel sharding run finished in roughly 20-30 minutes of wall time.
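The split and reduce steps are simple to sketch without any GPU code. The function names are hypothetical, and integers stand in for chunks and vectors; the point is that contiguous shards concatenated in shard order reproduce the original row order.

```python
def make_shards(items: list, num_shards: int) -> list[list]:
    """Split items into contiguous shards of near-equal size."""
    size = -(-len(items) // num_shards)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def reduce_in_order(shard_outputs: list[list]) -> list:
    """Concatenate per-shard outputs back into one list, preserving shard order."""
    return [row for shard in shard_outputs for row in shard]


chunk_ids = list(range(354_293))
shards = make_shards(chunk_ids, 32)
print(len(shards), len(shards[0]), len(shards[-1]))
```

Keeping row order stable is what lets a FAISS result index be used directly as a key into the chunk lookup table.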

3D. Production FastAPI Serving & Optimizations

To serve the RAG assistant, I built an extremely optimized FastAPI server hosted on Modal. It loads the merged Qwen model and BGE-M3 on a single cost-effective L4 GPU.

1. Zero-RAM SQLite Lookup Database (Startup Optimization)

The Problem: Reading the 1.6 GB chunk lookup JSON into container memory on boot takes almost 2 minutes and consumes 1.6 GB of RAM.
The Solution: On first startup, the server streams the JSON file line-by-line and compiles a local SQLite database directly on the persistent volume (took 92.6s). On subsequent boots, the JSON is completely bypassed.
The Result: The server opens a thread-safe SQLite connection instantly on boot (0.001 seconds) and consumes 0 MB of startup RAM overhead.
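A minimal version of this build-once, look-up-later pattern with the standard library is sketched below. It assumes the source file has one JSON object per line with a `chunk_id` field; the table layout and function names are hypothetical.

```python
import json
import sqlite3


def build_lookup(jsonl_path: str, db_path: str) -> int:
    """Stream the chunk file line by line into SQLite; returns the row count."""
    conn = sqlite3.connect(db_path)
    with conn:  # commits on success
        conn.execute(
            "CREATE TABLE IF NOT EXISTS chunks (chunk_id TEXT PRIMARY KEY, payload TEXT)"
        )
        with open(jsonl_path, encoding="utf-8") as fh:
            # Generator: only one line is held in memory at a time.
            rows = ((json.loads(line)["chunk_id"], line) for line in fh if line.strip())
            conn.executemany("INSERT OR REPLACE INTO chunks VALUES (?, ?)", rows)
    count = conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
    conn.close()
    return count


def open_lookup(db_path: str) -> sqlite3.Connection:
    # check_same_thread=False lets request-handler threads share the connection.
    return sqlite3.connect(db_path, check_same_thread=False)


def get_chunk(conn: sqlite3.Connection, chunk_id: str):
    row = conn.execute(
        "SELECT payload FROM chunks WHERE chunk_id = ?", (chunk_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None
```

Lookups read only the rows they need from disk, so memory use stays flat regardless of how large the chunk file grows.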

2. VRAM Autocasting & Thread-Safe Real-time Streaming

Autocasting: Inside the generation thread, both token lookup and model generation are wrapped in torch.inference_mode() and torch.autocast(device_type="cuda", dtype=torch.bfloat16) to guarantee zero memory spikes.
ASGI Protection: Real-time token streaming is exposed via Server-Sent Events (SSE) at /api/ask/stream. Because LLM token generation is CPU/GPU bound, running it synchronously inside an async FastAPI server freezes the async event loop. I wrapped the TextIteratorStreamer inside a separate native OS Thread and fed tokens into a synchronous streaming generator.
Strict EOS Enforcement: The system extracts the <|im_end|> and <|endoftext|> token IDs when the tokenizer loads and uses them as stop tokens, so generation halts at the end of the assistant turn instead of running on into unprompted extra text.
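The threading arrangement can be sketched with a queue and a plain generator. This is a stand-in, not the server code: `fake_generate` replaces the model, and a `queue.Queue` plays the part that TextIteratorStreamer plays in the real pipeline.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of generation


def stream_tokens(generate_fn, prompt: str):
    """Run blocking generation in a worker thread and yield SSE-formatted tokens."""
    tokens: queue.Queue = queue.Queue()

    def worker():
        try:
            for token in generate_fn(prompt):
                tokens.put(token)
        finally:
            tokens.put(_DONE)  # always unblock the consumer, even on error

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = tokens.get()
        if item is _DONE:
            break
        yield f"data: {item}\n\n"


def fake_generate(prompt: str):
    """Stand-in for the model: emits the prompt back word by word."""
    yield from prompt.split()


for event in stream_tokens(fake_generate, "Article 21 guarantees"):
    print(event, end="")
```

Because the consumer is a synchronous generator, a framework can iterate it in a worker thread of its own, leaving the async event loop free to serve other requests.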

3. Absolute Cost Safety

The server uses min_containers=0. When idle, it scales down to zero GPU containers, costing exactly $0.00 in hosting fees. Cold start boots in ~10 seconds.

3E. Verification & End-to-End Test Results

Both endpoints were verified against the active server. Representative results:

1. Blocking API Endpoint (`/api/ask`)

Query: "What does Article 21 of the Indian Constitution guarantee?"
Status: 200 OK
Total Latency: 5.34 seconds
Generated Answer: "Article 21 guarantees the right to life and personal liberty. The Supreme Court has interpreted this right expansively, noting that it is not limited to mere survival but encompasses the right to live with dignity. This includes the right to privacy, which is viewed as an inalienable component of personal liberty."
Sources Used:
1. [SC_JUDGMENTS] Supreme Court: K.S. Puttaswamy v. Union of India (2017)
2. [SC_JUDGMENTS] Supreme Court: Common Cause v. Union of India (2017)
3. [SC_JUDGMENTS] Supreme Court: X v. Union of India (2023)

2. Streaming API Endpoint (`/api/ask/stream`)

Query: "What are the grounds for divorce under the Hindu Marriage Act?"
Status: 200 OK
Stream Event 1 (Metadata Block): Source citations with full case metadata
Stream Event 2+ (Word-by-Word Tokens): Real-time legal analysis streaming

Phase 4A: The Full-Stack SaaS Product

With a working RAG backend, I built a complete production-grade web application — not a demo, not a Gradio wrapper, but a real SaaS product with authentication, streaming, and a premium UI.

Technology Stack

Layer	Technology
Frontend Framework	Next.js 16 (App Router)
Deployment	Vercel (via CLI, no GitHub push required)
Styling	CSS Modules with custom design system
Fonts	Playfair Display (headings) + Inter (body) via Google Fonts
Icons	Lucide React
Backend Proxy	Next.js API Routes → Modal FastAPI
Authentication	Access-code gate with server-side cookie validation

Design System

The UI uses a dark professional theme with deep navy (#0F1B2D) + gold (#C89D4A) branding — deliberately chosen for extended legal research sessions. No glassmorphism. Minimal, authoritative, and clean.

Token	Value	Usage
`--navy`	`#0F1B2D`	Primary background
`--navy-light`	`#162337`	Card/surface backgrounds
`--gold`	`#C89D4A`	Accent, branding, active states
`--white`	`#EAEAEA`	Primary text
`--gray-300`	`#B0B8C1`	Secondary text

Architecture Flow

graph LR
    A[User Browser] -->|HTTPS| B[Vercel CDN]
    B -->|Auth Cookie| C[Next.js API Route]
    C -->|SSE Stream| D[Modal FastAPI]
    D -->|FAISS Query| E[(Vector Index)]
    D -->|SQLite Lookup| F[(Chunk DB)]
    D -->|Qwen-3 4B| G[L4 GPU]
    G -->|Token Stream| D
    D -->|SSE Response| C
    C -->|Stream to Client| A

Key Components

Access Gate — access-code authentication with server-side cookie validation. Protects the chat interface from unauthorized access.
Chat Interface — real-time streaming chat with auto-scroll, message bubbles, and a loading state that progresses through multiple stages ("Searching 354,293 legal documents..." → "Analyzing relevant precedents..." → "Constructing legal context..." → "Warming up GPU inference engine...").
Citation Cards — each source citation is rendered as an expandable card showing:
- Source type badge (SC Judgment / Central Act / Constitution)
- Case title (with intelligent fallback extraction from chunk text)
- Year, citation number, bench composition
- Full metadata grid when expanded
- Actual chunk text (first 300 characters)
Collapsible Citations — citations are grouped by source type with a summary bar: "6 SC Judgments · 1 Constitution · 1 Central Act". Collapsed by default to keep the focus on the answer.
Confidence Bar — displays: "✓ 8 sources retrieved · Avg relevance: 87%"
Sample Prompts — curated legal questions on the empty state, tuned for strong demo performance.
Branded Favicon — custom SVG: gold "N" monogram with balanced scales of justice on deep navy background.

Phase 4B: The Refinement Layer — Production-Grade Tuning

After the initial deployment, I systematically addressed every production issue. This phase was the difference between "it works" and "it works well."

System Prompt Engineering

The original system prompt was 5 generic lines. I rewrote it into a 20-line structured instruction set that forces:

Exact case name citations (no "Supreme Court Judgment" placeholders)
Chronological ordering for historical/evolution queries
Bullet points for distinct legal holdings
No repetition across paragraphs
Senior legal researcher tone

Hierarchical Response Modes

A critical learning: the model was fine-tuned on responses with a median length of 110 words. It learned to emit EOS (the end-of-sequence token) at ~150-180 tokens regardless of max_new_tokens. The system prompt alone couldn't override this trained behavior.

The solution: hierarchical prompting with three modes.

Mode	`min_new_tokens`	System Prompt Instruction	Use Case
⚡ Concise	0 (no floor)	"Brief, direct answer in 2-4 sentences"	Quick factual lookups
📖 Detailed	150	"Detailed analysis with case references, chronological ordering"	Standard legal questions
🎓 Research	350	"Full legal research memo: case-by-case breakdown, reasoning, holdings, evolution, current position"	Deep analysis, investor demos

Each mode dynamically adjusts both the system prompt AND the min_new_tokens parameter in model.generate(). The user sees three pill buttons above the input field.

The insight: max_new_tokens is a ceiling, not a target. It says "generate at most this many tokens." But the model stops when it hits an EOS token. min_new_tokens tells the model: "you cannot stop generating until you've produced at least N tokens." Combined with a structured prompt that asks for detailed analysis, the model fills those extra tokens with actual substance.
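One way to wire the three modes together is a small lookup table that yields both the prompt suffix and the generation arguments. The floors and instructions are taken from the table above; `build_request`, the base prompt wording, and the default `max_new_tokens=1024` ceiling are assumptions for illustration.

```python
BASE_PROMPT = "You are an expert Indian Legal Assistant."

RESPONSE_MODES = {
    "concise": {
        "min_new_tokens": 0,
        "instruction": "Brief, direct answer in 2-4 sentences",
    },
    "detailed": {
        "min_new_tokens": 150,
        "instruction": "Detailed analysis with case references, chronological ordering",
    },
    "research": {
        "min_new_tokens": 350,
        "instruction": (
            "Full legal research memo: case-by-case breakdown, reasoning, "
            "holdings, evolution, current position"
        ),
    },
}


def build_request(mode: str, max_new_tokens: int = 1024) -> dict:
    """Return the system prompt and generate() arguments for one response mode."""
    cfg = RESPONSE_MODES[mode]
    if cfg["min_new_tokens"] > max_new_tokens:
        raise ValueError("floor cannot exceed ceiling")
    return {
        "system_prompt": f"{BASE_PROMPT} {cfg['instruction']}.",
        "generate_kwargs": {
            "min_new_tokens": cfg["min_new_tokens"],  # floor: EOS is blocked before this
            "max_new_tokens": max_new_tokens,         # ceiling: hard stop
        },
    }
```

The `generate_kwargs` dictionary can then be unpacked into `model.generate(...)`, keeping the mode logic in one place instead of scattered across the handler.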

Source-Aware Retrieval Routing

The original RAG pipeline returned the top-k nearest vectors from FAISS regardless of query intent. If you asked about Article 21 (Constitution), you might get 8 SC Judgment chunks and zero Constitution chunks — because judgment text is more verbose and often embeds better.

The fix: _enforce_source_diversity()

Query intent detection — regex-based analysis detects if the query targets Constitution articles, Central Acts, or SC Judgments
Over-retrieval — FAISS retrieves top_k * 2 candidates (16 instead of 8)
Intelligent reranking — if the query targets Constitution but results are all judgments, Constitution chunks are boosted in the reranking
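A simplified version of this routing could look like the following. It is a sketch only: the regex patterns are deliberately crude, the additive `boost` value is invented, and the real `_enforce_source_diversity()` may rerank quite differently. Candidates are assumed to be dictionaries with a `source` label and a similarity `score`.

```python
import re

# Checked in order; first match wins. Patterns are illustrative, not exhaustive.
INTENT_PATTERNS = {
    "constitution": re.compile(r"\barticle\s+\d+|\bconstitution\b", re.IGNORECASE),
    "central_act": re.compile(r"\bsection\s+\d+|\bact\b", re.IGNORECASE),
}


def detect_intent(query: str):
    """Guess which source type the query is asking about, or None."""
    for source, pattern in INTENT_PATTERNS.items():
        if pattern.search(query):
            return source
    return None


def enforce_source_diversity(candidates: list[dict], query: str,
                             top_k: int = 8, boost: float = 0.15) -> list[dict]:
    """Rerank over-retrieved candidates, favouring the source the query targets."""
    target = detect_intent(query)

    def adjusted(candidate: dict) -> float:
        bonus = boost if candidate["source"] == target else 0.0
        return candidate["score"] + bonus

    return sorted(candidates, key=adjusted, reverse=True)[:top_k]
```

Calling this on the `top_k * 2` FAISS results lets a slightly lower-scoring Constitution chunk displace a judgment chunk when the query names an article, while leaving untargeted queries ranked by raw similarity.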

Metadata-Grounded Citation Cards

A persistent bug: citation cards showed "Supreme Court Judgment" instead of the actual case title (e.g., "Pritam Singh v. The State"). The case_title metadata was sometimes missing from older chunks.

The fix: Two-layer fallback:

Backend now sends text_chunk (first 300 characters of chunk text) in the streaming sources payload
Frontend extracts the case title from the chunk text using regex: "Supreme Court of India — CASE TITLE" → "CASE TITLE"
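The extraction rule is shown here in Python for consistency with the other examples, although the production version lives in the frontend. The pattern assumes the chunk header has the form "Supreme Court of India — CASE TITLE" on a single line, and the fallback string mirrors the placeholder described above.

```python
import re

# Accept an em dash, en dash, or hyphen between the court name and the title.
TITLE_RE = re.compile(r"Supreme Court of India\s*[—–-]\s*(.+)")


def extract_case_title(text_chunk: str, fallback: str = "Supreme Court Judgment") -> str:
    """Pull the case title out of a chunk header, or return the generic fallback."""
    match = TITLE_RE.search(text_chunk)
    return match.group(1).strip() if match else fallback


print(extract_case_title("Supreme Court of India — Pritam Singh v. The State\nCitation: ..."))
```

Since `.` does not match newlines, the capture stops at the end of the header line and does not swallow the rest of the chunk.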

Comprehensive Modal Logging

Every inference request now logs:

User query and parameters
Response mode and min_tokens configuration
FAISS retrieval distances and case titles for each chunk
Source routing decisions
Prompt token count
Full model output text
Per-stage latency (retrieval, generation, total)
Generated token count

All visible in the Modal dashboard for live monitoring.

Mobile Responsive Design

Citation cards that were visually dominant on mobile screens were redesigned:

Compact padding and smaller text
Single-column metadata grid (instead of 2-column)
Scrollable chunk text with max-height: 200px
Source badges at reduced size

Infrastructure Costs

Phase	GPU	Time	Cost
Data Processing	CPU only	~2 hours	$0 (local)
Synthetic QA Generation	None (API)	~6 hours	$0 (Gemini free tier)
Fine-Tuning (2 epochs)	L4 (24GB)	~2 hours	~$3-5 (Modal)
Embedding (32x sharded)	L4 × 32	~30 min	~$3-4 (Modal)
Frontend Hosting	—	—	$0 (Vercel free tier)
Backend Hosting (idle)	—	—	$0 (Modal scales to zero)
Total			< $10

Read that again. The entire pipeline — from raw legal text to a fine-tuned 4B parameter model with RAG, streaming, and a production SaaS frontend — cost less than a meal.

Technical Specs Summary

Component	Specification
Base Model	Qwen-3 4B Instruct (2507)
Merged Model	Standalone `bfloat16` weights (~7.5 GB)
Embedding Model	BAAI/bge-m3 (dense vector, FP16 precision)
FAISS Vector Index	IndexFlatIP (Cosine Similarity, 1024 dimensions)
Total Database Chunks	354,293 chunks (1.6 GB corpus)
Lookup Engine	Thread-safe local SQLite database
Server Framework	FastAPI (with SSE token streaming)
Concurrency Model	Native multi-thread worker with TextIteratorStreamer
API Endpoints	`/api/ask` (Blocking), `/api/ask/stream` (SSE Streaming)
Response Modes	Concise / Detailed / Research (hierarchical prompting)
Frontend	Next.js 16 (App Router) on Vercel
Authentication	Access-code gate with server-side cookies
Hosting Platform	Modal (Backend) + Vercel (Frontend)
GPU Target	NVIDIA L4 (24GB VRAM)
Production Scale	`min_containers=0` (Scales to zero when idle for $0.00/hr)
E2E Average Latency	~5 seconds for full answer / real-time for streaming
Total Build Cost	< $10

Lessons Learned

1. Data Quality > Data Quantity

4,000 carefully structured instruction pairs, generated from real legal text with strict anti-hallucination prompting, taught the model more than 50,000 sloppy pairs would have. The key was enforcing diversity in both task type (summaries, comparisons, Q&A, yes/no) and length (1 sentence to 4 paragraphs).

2. Training Data Distribution Dictates Model Behavior

The model's output length is not controlled by max_new_tokens — it's dictated by the distribution of response lengths in the training data. With a median training response of 110 words, the model consistently hits EOS at ~150-180 tokens. The fix isn't bigger max_new_tokens — it's either retraining with longer responses or using min_new_tokens with structured prompts.

3. Hierarchical Prompting is High ROI

Instead of a one-size-fits-all prompt, implementing response modes (Concise/Detailed/Research) with mode-specific system prompts and min_new_tokens floors gives users control over response depth. This was suggested during product critique and turned out to be the single highest-ROI improvement for user experience.

4. Source Diversity Matters More Than Raw Similarity

FAISS returns the most semantically similar chunks, but similarity ≠ utility. A Constitution query returning 8 judgment chunks (because judgments embed better) is technically correct but practically useless. Source-aware reranking that considers query intent dramatically improves answer quality.

5. Standalone Merged Models are Faster and Cleaner

Merging the LoRA weights directly into the base parameters completely eliminated inference-time adapter overhead, trimmed memory footprints, and allowed the base model to load at peak native speeds.

6. Bypass JSON in Production with SQLite

Loading large JSON files (1.6GB+) is a silent killer for cloud instances. SQLite dropped boot overhead from 2 minutes to 0.001 seconds while consuming 0 MB of startup RAM.

7. GPU Sharding for Rapid Large-Scale Embeddings

Attempting to embed 354,000+ texts sequentially is a nightmare. 32 parallel L4 GPUs via Modal allowed us to embed the entire dataset in roughly 20-30 minutes for a few dollars.

8. Always Scale to Zero when Idle

For bootstrapped startups, min_containers=0 on serverless providers like Modal allows hosting a fully functional RAG prototype completely free of charge when idle.

9. Domain Infrastructure Beats General Intelligence

A general-purpose LLM is broad intelligence. NyayAI is domain infrastructure for Indian law. That's similar to how Bloomberg exists despite Google, or how Westlaw exists despite search engines. The value comes from Indian legal corpus specialization, retrieval grounding, citation accuracy, jurisprudence-focused indexing, and workflow optimization for lawyers.

The Stack

PyTorch · Modal · Qwen-3 4B · FAISS · BGE-M3 · SQLite · FastAPI · Next.js 16 · Vercel · Server-Sent Events · LoRA · Cosine Similarity

What's Already Built (75/100)

✅ Data acquisition · ✅ Cleaning · ✅ Chunking · ✅ Embeddings · ✅ Retrieval · ✅ Serving · ✅ Deployment · ✅ Fine-tuning · ✅ Streaming · ✅ Grounding · ✅ Systems optimization · ✅ UX · ✅ Source routing · ✅ Hierarchical prompting · ✅ Citation metadata · ✅ Frontend SaaS

What's Next (25/100)

🔲 Trust · 🔲 Distribution · 🔲 Onboarding · 🔲 User retention · 🔲 Legal partnerships · 🔲 Monetization · 🔲 Sales · 🔲 Adoption loops · 🔲 Reliability · 🔲 Consistency · 🔲 Multilingual access · 🔲 High Court coverage

Built with obsession by a solo founder who believes every Indian deserves access to justice — and that the right AI can make that happen.
