<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ashish Raj</title>
    <description>The latest articles on DEV Community by Ashish Raj (@ashish_raj_04).</description>
    <link>https://dev.to/ashish_raj_04</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3950115%2Fb3b813b6-462e-4463-9a23-2c699f1e4694.jpg</url>
      <title>DEV Community: Ashish Raj</title>
      <link>https://dev.to/ashish_raj_04</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ashish_raj_04"/>
    <language>en</language>
    <item>
      <title>NyayAI: AI-Powered Legal Intelligence for India</title>
      <dc:creator>Ashish Raj</dc:creator>
      <pubDate>Mon, 25 May 2026 07:26:49 +0000</pubDate>
      <link>https://dev.to/ashish_raj_04/nyayai-ai-powered-legal-intelligence-for-india-lc4</link>
      <guid>https://dev.to/ashish_raj_04/nyayai-ai-powered-legal-intelligence-for-india-lc4</guid>
      <description>&lt;h3&gt;
  
  
  &lt;em&gt;Making Indian law accessible, accurate, and affordable for 1.4 billion people.&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;By Ashish Raj&lt;/strong&gt; — Founder, NyayAI&lt;br&gt;&lt;br&gt;
&lt;em&gt;May 2026&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem We're Solving
&lt;/h2&gt;

&lt;p&gt;India is the world's largest democracy. It is home to 1.4 billion people, one of the oldest continuous legal traditions on the planet, and a Constitution that is widely regarded as one of the most comprehensive ever drafted. And yet, for most Indians, the law remains a black box — expensive to access, slow to navigate, and almost impossible to understand without professional help.&lt;/p&gt;

&lt;p&gt;Consider this: India currently has over &lt;strong&gt;50 million pending court cases&lt;/strong&gt;. Fifty million. That number is not a typo. It is a crisis — a slow-moving, systemic failure that affects every citizen, every business, and every institution in the country. Cases languish for years, sometimes decades. Litigants exhaust their savings. Justice, in too many cases, is not denied outright — it is simply delayed until it becomes meaningless.&lt;/p&gt;

&lt;p&gt;Behind those 50 million cases are lawyers — hundreds of thousands of them — who spend hours, sometimes days, manually searching through case law. They sift through volumes of Supreme Court Reports, flip through annotated statutes, and cross-reference precedents by memory or by keyword. The process is slow, error-prone, and exhausting. A single legal research task that should take minutes can consume an entire afternoon.&lt;/p&gt;

&lt;p&gt;The tools that exist today are either &lt;strong&gt;expensive or inadequate&lt;/strong&gt;. Platforms like SCC Online and Manupatra are the industry standard, but they come with steep subscription fees that put them out of reach for solo practitioners, junior advocates, and law students. More importantly, they are fundamentally &lt;strong&gt;keyword-based search tools&lt;/strong&gt; — you type in a phrase, and you get back a list of documents that contain that phrase. There is no intelligence. No understanding. No synthesis.&lt;/p&gt;

&lt;p&gt;Free alternatives like Indian Kanoon have done admirable work in making legal text available online, but they remain &lt;strong&gt;search-only platforms&lt;/strong&gt; — no analysis, no summarization, no contextual understanding, no citation linking, no structured output. You search, you read, you figure it out yourself.&lt;/p&gt;

&lt;p&gt;And then there are the general-purpose AI tools — ChatGPT, Claude, Gemini, and others. They are extraordinary pieces of technology. I use them every day. But when it comes to Indian law, they are &lt;strong&gt;dangerously unreliable&lt;/strong&gt;. They hallucinate case names. They invent statutory sections that do not exist. They cite judgments with confident authority — judgments that were never delivered. They lack depth in Indian jurisprudence, and they have no mechanism to verify or ground their answers in actual legal text.&lt;/p&gt;

&lt;p&gt;The result is a painful paradox: &lt;strong&gt;India has one of the richest legal traditions in the world, and yet most of its citizens, lawyers, and courts operate without any AI-assisted research, retrieval, summarization, vernacular access, or affordable tooling.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That gap is not hypothetical. It is real, it is massive, and it affects millions of people every single day.&lt;/p&gt;

&lt;p&gt;NyayAI exists to close that gap.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is NyayAI?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;NyayAI&lt;/strong&gt; — the name comes from &lt;em&gt;न्याय&lt;/em&gt; (Nyāya), the Sanskrit word for justice — is an AI-powered legal assistant specifically engineered for Indian jurisprudence.&lt;/p&gt;

&lt;p&gt;Let me be precise about what that means, because the distinction matters.&lt;/p&gt;

&lt;p&gt;NyayAI is &lt;strong&gt;not a chatbot wrapper&lt;/strong&gt;. It is not a thin interface on top of a general-purpose language model. It is not a weekend hackathon project with a legal skin. It is &lt;strong&gt;domain infrastructure for Indian law&lt;/strong&gt; — purpose-built from the ground up to understand, retrieve, and reason over Indian legal text with a level of precision that generic tools simply cannot match.&lt;/p&gt;

&lt;p&gt;Think of it this way: &lt;strong&gt;Bloomberg exists for finance. Westlaw exists for American law. NyayAI is being built to serve that same function for Indian law.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At its core, NyayAI is grounded in a curated corpus of &lt;strong&gt;354,293 legal documents&lt;/strong&gt; spanning 75 years of Supreme Court judgments (from 1950 to 2025), &lt;strong&gt;858 Central Acts&lt;/strong&gt; of the Indian Parliament, and the &lt;strong&gt;complete Constitution of India&lt;/strong&gt; — including all amendments, schedules, and articles. Every answer NyayAI produces is traceable back to an actual legal source. Every citation is real. Every reference can be verified.&lt;/p&gt;

&lt;p&gt;This is not artificial intelligence that &lt;em&gt;sounds&lt;/em&gt; legal. This is artificial intelligence that &lt;em&gt;is&lt;/em&gt; legal — grounded, sourced, and verifiable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Not Just Use ChatGPT?
&lt;/h2&gt;

&lt;p&gt;This is the question I hear most often, and it deserves a thorough answer — because the differentiation is critical.&lt;/p&gt;

&lt;p&gt;General-purpose large language models like ChatGPT, Claude, or Gemini are remarkable. They represent some of the most significant technological achievements of our generation. They can write poetry, debug code, summarize research papers, and hold conversations that feel genuinely human. I have enormous respect for the teams that built them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But they are not optimized for Indian legal workflows.&lt;/strong&gt; And in law, "not optimized" is not a minor inconvenience — it is a serious liability.&lt;/p&gt;

&lt;p&gt;Here is why:&lt;/p&gt;

&lt;p&gt;A general-purpose model like ChatGPT &lt;strong&gt;does not maintain a live, structured legal retrieval index internally&lt;/strong&gt;. It does not have a database of 43,000+ Supreme Court judgments that it can search through in real time. When you ask it a legal question, it generates an answer from its training data — which means it is reconstructing legal knowledge from memory, not retrieving it from verified sources.&lt;/p&gt;

&lt;p&gt;This leads to several critical problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;It cannot guarantee exact citations.&lt;/strong&gt; When ChatGPT cites a case, there is no mechanism to verify that the citation is accurate, that the case exists, or that the holding it describes is correct. It may be right. It may be wrong. You have no way to know without doing the research yourself — which defeats the entire purpose.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;It may compress or approximate precedent chains.&lt;/strong&gt; Legal reasoning depends on the precise chain of precedents — which case cited which, what principle was established, how it was distinguished or overruled. A general-purpose model may summarize this chain in a way that loses critical nuance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;It may hallucinate paragraph numbers, holdings, or even entire judgments.&lt;/strong&gt; This is not a theoretical risk. It happens regularly. I have personally tested dozens of legal queries on leading AI platforms and found fabricated case names, invented statutory sections, and confidently stated holdings that bear no resemblance to reality.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;It is optimized broadly across all domains&lt;/strong&gt;, not specifically for Indian jurisprudence. The same model that answers your legal question also writes marketing copy, generates recipes, and helps with algebra homework. That breadth is its strength in general use — and its weakness in specialized domains.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;NyayAI takes a fundamentally different approach.&lt;/strong&gt; It is specifically engineered for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Indian legal retrieval&lt;/strong&gt; — semantic search over a curated, structured corpus of Indian legal documents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Citation-grounded answers&lt;/strong&gt; — every claim is backed by a specific, verifiable source&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic search over Supreme Court judgments&lt;/strong&gt; — not keyword matching, but meaning-based retrieval that understands legal concepts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Statute and precedent linking&lt;/strong&gt; — connecting statutory provisions to the case law that interprets them&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured metadata retrieval&lt;/strong&gt; — bench composition, citation numbers, judgment dates, and more&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Legal-domain-specific Retrieval-Augmented Generation (RAG)&lt;/strong&gt; — a pipeline that ensures the AI's responses are anchored in real legal text, not generated from memory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The analogy I use most often is this: &lt;strong&gt;GitHub Copilot is better than raw autocomplete for coding. NyayAI is better than raw ChatGPT for Indian law.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Both are built on top of powerful AI. But one is generic, and the other is purpose-built. A generic LLM is broad intelligence. NyayAI is &lt;strong&gt;domain infrastructure&lt;/strong&gt; for Indian law.&lt;/p&gt;

&lt;p&gt;That is similar to how Bloomberg exists despite Google, or how Westlaw exists despite search engines. The general tool is powerful. The specialized tool is indispensable.&lt;/p&gt;




&lt;h2&gt;
  
  
  NyayAI's Core Features
&lt;/h2&gt;

&lt;p&gt;NyayAI is not a concept or a pitch deck. It is a working product — live, deployed, and functional. Here is what it does today:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Grounded Legal Answers
&lt;/h3&gt;

&lt;p&gt;Every response NyayAI produces is backed by &lt;strong&gt;actual legal sources&lt;/strong&gt; — not generated from memory, not reconstructed from training data, not hallucinated from statistical patterns. When NyayAI cites a case, that case exists. When it quotes a statutory provision, that provision is real. Citations are traceable, verifiable, and linked directly to the source text.&lt;/p&gt;

&lt;p&gt;This is the single most important feature of the platform. In law, an unverifiable claim is worse than no claim at all. NyayAI ensures that &lt;strong&gt;every answer has a paper trail&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. 354,293 Legal Documents Indexed
&lt;/h3&gt;

&lt;p&gt;NyayAI's knowledge base is not a small sample or a curated subset. It encompasses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Supreme Court Judgments (1950–2025)&lt;/strong&gt; — 75 years of the highest court's jurisprudence, covering constitutional law, criminal law, civil law, tax law, environmental law, labor law, and every other domain the Court has adjudicated&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;858 Central Acts&lt;/strong&gt; — the complete body of parliamentary legislation currently in force&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Constitution of India&lt;/strong&gt; — all articles, amendments, schedules, and provisions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a corpus of &lt;strong&gt;1.52 GB of structured legal text&lt;/strong&gt; — cleaned, chunked, embedded, and indexed for semantic retrieval.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Real-Time Streaming Responses
&lt;/h3&gt;

&lt;p&gt;NyayAI does not make you wait for a complete response before displaying it. Answers stream &lt;strong&gt;word by word, in real time&lt;/strong&gt;, just like the experience you are accustomed to with ChatGPT or other modern AI interfaces. This makes the interaction feel natural, responsive, and fast — even when the underlying analysis is complex.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Citation Cards with Metadata
&lt;/h3&gt;

&lt;p&gt;Each source citation in a NyayAI response is presented as a &lt;strong&gt;rich citation card&lt;/strong&gt; that includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Case title&lt;/strong&gt; (e.g., &lt;em&gt;Kesavananda Bharati v. State of Kerala&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Year of judgment&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bench composition&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Citation number&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The actual chunk of legal text&lt;/strong&gt; that was used to generate the answer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not a footnote with a case name. It is a complete, contextual reference that allows you to evaluate the source yourself.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Three Response Modes
&lt;/h3&gt;

&lt;p&gt;Different legal questions require different levels of depth. NyayAI offers three distinct response modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Concise&lt;/strong&gt; — Quick, 2–4 sentence answers for straightforward queries. Ideal when you need a fast answer and already have context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detailed&lt;/strong&gt; — Structured legal analysis with organized sections, relevant precedents, and statutory references. Suitable for most professional research tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Research&lt;/strong&gt; — A full legal research memo with case-by-case breakdown, comprehensive precedent analysis, and detailed statutory interpretation. Designed for complex legal questions that require thorough treatment.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  6. Collapsible Citation Interface
&lt;/h3&gt;

&lt;p&gt;When NyayAI retrieves sources, they are organized into a &lt;strong&gt;collapsible interface&lt;/strong&gt; grouped by source type — Supreme Court Judgments, Central Acts, and Constitution. Each group shows a &lt;strong&gt;summary count&lt;/strong&gt; of the number of sources retrieved, and you can expand or collapse each group to manage the information density. This keeps the interface clean while making every source accessible.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Confidence Scoring
&lt;/h3&gt;

&lt;p&gt;Each response includes &lt;strong&gt;relevance scores&lt;/strong&gt; for the retrieved sources. This allows you to assess how closely the source material matches your query. A high-relevance citation on a niche topic is more useful than a tangentially related one — and NyayAI makes that distinction visible.&lt;/p&gt;

&lt;h3&gt;
  
  
  8. Source-Aware Routing
&lt;/h3&gt;

&lt;p&gt;Not every legal question requires the same type of source material. A question about fundamental rights needs the Constitution. A question about criminal procedure needs the relevant statute. A question about judicial interpretation needs case law. NyayAI's retrieval system &lt;strong&gt;intelligently routes queries to the right type of legal document&lt;/strong&gt;, ensuring that the sources it retrieves are appropriate for the question being asked.&lt;/p&gt;

&lt;h3&gt;
  
  
  9. Dark Professional UI
&lt;/h3&gt;

&lt;p&gt;NyayAI features a &lt;strong&gt;deep navy and gold themed interface&lt;/strong&gt; designed specifically for extended legal research sessions. The dark theme reduces eye strain during long working hours, while the gold accents convey professionalism and authority. Every element of the interface — from typography to spacing to the citation cards — has been designed with legal professionals in mind.&lt;/p&gt;

&lt;h3&gt;
  
  
  10. Mobile Responsive
&lt;/h3&gt;

&lt;p&gt;Legal research does not always happen at a desk. NyayAI is fully responsive and works seamlessly on &lt;strong&gt;phones, tablets, and desktops&lt;/strong&gt;. Whether you are in a courtroom, in a meeting, or on a train, the full power of the platform is available in your pocket.&lt;/p&gt;

&lt;h3&gt;
  
  
  11. Secure Access Gate
&lt;/h3&gt;

&lt;p&gt;Access to NyayAI is protected by an &lt;strong&gt;access-code authentication system&lt;/strong&gt;. This ensures that the platform remains secure and that usage can be managed and monitored during the current phase of development and rollout.&lt;/p&gt;




&lt;h2&gt;
  
  
  What We Built — The Engineering Summary
&lt;/h2&gt;

&lt;p&gt;I want to give you a sense of what went into building NyayAI without diving into technical jargon. The engineering behind this platform is significant, and it is worth understanding at a high level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We built a custom legal corpus from scratch.&lt;/strong&gt; This meant acquiring, ingesting, cleaning, and structuring 1.52 GB of Indian legal text. Raw legal documents are messy — inconsistent formatting, OCR artifacts, encoding issues, missing metadata. Every document in our corpus has been processed, normalized, and structured for machine consumption.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We fine-tuned an AI model specifically on Indian legal instruction pairs.&lt;/strong&gt; This is not a generic model that happens to answer legal questions. It is a model that has been trained on thousands of examples of Indian legal reasoning — questions and answers, case analysis, statutory interpretation, and constitutional commentary. The model understands legal language, legal structure, and legal reasoning in a way that generic models do not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We built a semantic search engine over 354,293 legal documents.&lt;/strong&gt; This is not keyword search. When you ask NyayAI a question, it does not look for documents that contain the exact words you used. It understands the &lt;strong&gt;meaning&lt;/strong&gt; of your question and retrieves documents that are semantically relevant — even if they use different terminology.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We designed a retrieval-augmented generation pipeline&lt;/strong&gt; that grounds every answer in actual source text. The AI does not answer from memory. It retrieves relevant documents first, then generates its response based on those documents. This is what makes the answers verifiable and trustworthy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We built a streaming inference server that scales to zero when idle.&lt;/strong&gt; This means we are not paying for expensive GPU compute when no one is using the platform. When a user sends a query, the server spins up, processes the request, streams the response, and then scales back down. This is critical for cost efficiency at our current stage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We built a modern web application with a premium interface.&lt;/strong&gt; The frontend is fast, responsive, and professionally designed. It is not an afterthought or a demo UI — it is a production-quality application built for real users.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We deployed globally&lt;/strong&gt; — the AI backend runs on GPU cloud infrastructure with high-performance hardware, and the frontend is served from a global content delivery network for fast load times anywhere in the world.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Technology Stack
&lt;/h2&gt;

&lt;p&gt;For those interested in the technical foundations, here is what powers NyayAI at a high level:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Qwen-3 4B, fine-tuned with LoRA on Indian legal instruction data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Embedding Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;BGE-M3 multilingual model for semantic search&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vector Database&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;FAISS with 354,293 indexed document chunks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Backend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;FastAPI on Modal (serverless GPU cloud with L4 GPUs)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Frontend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Next.js 16 deployed on Vercel&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Streaming&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Server-Sent Events for real-time token streaming&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every component has been chosen deliberately — for performance, for cost efficiency, and for scalability. This is not a stack assembled from tutorials. It is a stack engineered for production-grade legal AI.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Vision: Where NyayAI is Headed
&lt;/h2&gt;

&lt;p&gt;What exists today is the foundation. The vision is much larger.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multilingual legal access&lt;/strong&gt; is at the top of the roadmap. India has 22 officially recognized languages and hundreds of dialects. The law should be accessible in all of them. We are working toward a future where a farmer in Tamil Nadu can ask a legal question in Tamil and receive an accurate, sourced answer — not a rough translation, but a genuine legal response in their own language.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;High Court judgment coverage&lt;/strong&gt; is the next major expansion of the corpus. India has 25 High Courts, each with its own body of case law. Adding High Court judgments will dramatically expand NyayAI's coverage and make it relevant for a much wider range of legal questions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tribunal and district court coverage&lt;/strong&gt; will follow. Specialized tribunals — NCLT, NCLAT, ITAT, SAT, NGT, and others — handle an enormous volume of cases in specialized domains. District courts are where most litigation begins. Covering these courts will make NyayAI comprehensive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Legal document drafting&lt;/strong&gt; is a natural extension. Once the system understands the law deeply enough, it can assist in drafting legal notices, petitions, contracts, and other documents — grounded in actual legal provisions and precedents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Case outcome prediction&lt;/strong&gt; is an ambitious but achievable goal. By analyzing patterns in historical judgments — how similar cases were decided, which arguments succeeded, which factors were decisive — NyayAI can provide probabilistic assessments of likely outcomes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lawyer workflow tools&lt;/strong&gt; will transform how legal professionals work. Brief generation, argument builders, precedent chains, counter-argument analysis — these are tools that can save lawyers hours of work on every case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vernacular access for citizens&lt;/strong&gt; is perhaps the most important long-term goal. Most Indians are not lawyers. They are citizens who need to understand their rights, their obligations, and their options. NyayAI should be accessible to them — in their language, at their level of understanding, at a price they can afford.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API access for legal tech platforms&lt;/strong&gt; will allow other developers and companies to build on top of NyayAI's infrastructure. The legal corpus, the retrieval engine, and the AI model can serve as the foundation for an ecosystem of legal technology applications.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Competitive Landscape
&lt;/h2&gt;

&lt;p&gt;I am often asked: &lt;em&gt;"What if OpenAI builds this? What if Anthropic enters the Indian legal market? What if some well-funded Bengaluru startup beats you to it?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;These are fair questions. Here is my honest answer:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Even if all of them enter this space — and some of them will — most Indian courts, lawyers, and litigants STILL do not have AI-assisted research, retrieval systems, legal summarization, vernacular access, or affordable tooling.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That gap is not going to be filled by one company. The Indian legal system is enormous — 50 million pending cases, millions of legal professionals, 1.4 billion citizens. There is room for multiple players, and the market is so underserved that even modest penetration represents significant impact.&lt;/p&gt;

&lt;p&gt;But more importantly, NyayAI has advantages that are difficult to replicate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Domain specificity.&lt;/strong&gt; We are not trying to be good at everything. We are trying to be the best at one thing: Indian law.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Curated corpus.&lt;/strong&gt; Our legal corpus is not scraped from the internet. It is carefully curated, cleaned, and structured for legal AI applications.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-tuned model.&lt;/strong&gt; Our AI model is not generic. It has been trained specifically on Indian legal reasoning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ground-up architecture.&lt;/strong&gt; Every component of the system — from the embedding pipeline to the retrieval engine to the user interface — has been designed for legal use cases.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;NyayAI is &lt;strong&gt;purpose-built&lt;/strong&gt; to fill a gap that generic tools cannot fill and that existing legal platforms have not addressed. That is our moat, and we are deepening it every day.&lt;/p&gt;




&lt;h2&gt;
  
  
  Progress So Far
&lt;/h2&gt;

&lt;p&gt;NyayAI is approximately &lt;strong&gt;75% complete&lt;/strong&gt; on the journey from concept to market-ready product.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What we have already crossed:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Data acquisition — sourcing 75 years of Supreme Court judgments, 858 Central Acts, and the Constitution&lt;/li&gt;
&lt;li&gt;✅ Data cleaning and structuring — normalizing 1.52 GB of messy legal text into machine-readable format&lt;/li&gt;
&lt;li&gt;✅ Document chunking — breaking legal documents into semantically meaningful segments&lt;/li&gt;
&lt;li&gt;✅ Embedding generation — converting legal text into high-dimensional vector representations&lt;/li&gt;
&lt;li&gt;✅ Retrieval engine — building semantic search over 354,293 document chunks&lt;/li&gt;
&lt;li&gt;✅ AI model fine-tuning — training on Indian legal instruction pairs&lt;/li&gt;
&lt;li&gt;✅ Inference serving — deploying the model on GPU infrastructure with streaming capabilities&lt;/li&gt;
&lt;li&gt;✅ RAG pipeline — grounding AI responses in retrieved source text&lt;/li&gt;
&lt;li&gt;✅ Streaming interface — real-time, word-by-word response delivery&lt;/li&gt;
&lt;li&gt;✅ Citation grounding — linking every answer to verifiable sources&lt;/li&gt;
&lt;li&gt;✅ Systems optimization — latency reduction, cost efficiency, scaling&lt;/li&gt;
&lt;li&gt;✅ Frontend UX — professional, responsive, production-quality interface&lt;/li&gt;
&lt;li&gt;✅ Global deployment — live and accessible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What remains:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🔲 Trust and reliability — ensuring consistent accuracy across edge cases&lt;/li&gt;
&lt;li&gt;🔲 Distribution — reaching lawyers, law firms, courts, and citizens&lt;/li&gt;
&lt;li&gt;🔲 Onboarding — making the first-use experience seamless&lt;/li&gt;
&lt;li&gt;🔲 User retention — building habits and workflows around NyayAI&lt;/li&gt;
&lt;li&gt;🔲 Legal partnerships — collaborating with bar associations, law schools, and legal aid organizations&lt;/li&gt;
&lt;li&gt;🔲 Monetization — developing sustainable pricing models&lt;/li&gt;
&lt;li&gt;🔲 Sales — building a go-to-market engine&lt;/li&gt;
&lt;li&gt;🔲 Adoption loops — creating viral and referral mechanisms&lt;/li&gt;
&lt;li&gt;🔲 Consistency — ensuring quality at scale&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The hardest part of building a product is not the technology. It is everything that comes after — trust, distribution, adoption, and sustainability. We are now entering that phase.&lt;/p&gt;




&lt;h2&gt;
  
  
  About the Founder
&lt;/h2&gt;

&lt;p&gt;NyayAI is built by &lt;strong&gt;Ashish Raj&lt;/strong&gt; — solo founder, architect, and builder.&lt;/p&gt;

&lt;p&gt;Every component of this platform — from the data pipelines that process raw legal text, to the model training infrastructure, to the retrieval engine, to the streaming backend, to the frontend interface you interact with — was built by one person.&lt;/p&gt;

&lt;p&gt;I do not say this to boast. I say it because it speaks to conviction. When you believe that every Indian deserves access to justice, you do not wait for a team, a budget, or permission. You build.&lt;/p&gt;

&lt;p&gt;I believe that the right AI, applied to the right domain, with the right data, can transform access to justice in India. Not incrementally. Fundamentally.&lt;/p&gt;

&lt;p&gt;That is what NyayAI is. That is what I am building. And I am just getting started.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Justice delayed is justice denied. But justice inaccessible is justice that never existed at all. NyayAI is being built to change that — one query, one citation, one answer at a time."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;— &lt;strong&gt;Ashish Raj&lt;/strong&gt;, Founder, NyayAI&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;© 2026 Ashish Raj. All rights reserved.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>startup</category>
      <category>rag</category>
      <category>llm</category>
    </item>
    <item>
      <title>NyayAI: Building an AI Legal Assistant for 1.4 Billion People — A Technical Deep Dive</title>
      <dc:creator>Ashish Raj</dc:creator>
      <pubDate>Mon, 25 May 2026 07:02:24 +0000</pubDate>
      <link>https://dev.to/ashish_raj_04/nyayai-building-an-ai-legal-assistant-for-14-billion-people-a-technical-deep-dive-328o</link>
      <guid>https://dev.to/ashish_raj_04/nyayai-building-an-ai-legal-assistant-for-14-billion-people-a-technical-deep-dive-328o</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;I'm building a startup to make Indian law accessible to every lawyer, law student, and citizen in the country. Here's the technical story of how I went from zero to a working prototype — training a foundation model from scratch, fine-tuning on 4,000 instruction pairs, building a production-ready RAG pipeline, and shipping a premium SaaS product — all as a solo founder.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;India has 1.4 billion people and roughly 50 million active legal cases pending in its courts. Lawyers spend hours — sometimes days — digging through bare acts, constitutional articles, and decades of Supreme Court judgments just to find relevant precedents for a single case. The Indian legal system operates across 25+ High Courts, hundreds of tribunals, and a Supreme Court that has delivered judgments since 1950. The sheer volume is staggering.&lt;/p&gt;

&lt;p&gt;And yet, the tooling available to lawyers is stuck in 2005. Paid databases like SCC Online and Manupatra charge thousands per month and still require manual keyword searches. Free resources like Indian Kanoon are search-only — no summaries, no analysis, no drafting. Generic AI tools like ChatGPT hallucinate case names, invent sections that don't exist, and have no depth in Indian law.&lt;/p&gt;

&lt;p&gt;I wanted to change that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NyayAI&lt;/strong&gt; (न्याय = justice in Sanskrit) is an AI-powered legal assistant that understands Indian law — not superficially, but deeply. It can look up any section of any central act, summarize Supreme Court judgments, answer complex legal questions with grounded citations, and eventually draft legal documents. Think of it as ChatGPT, but one that actually passed the bar exam for Indian law.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Not Just Use ChatGPT?
&lt;/h3&gt;

&lt;p&gt;This is the question I get asked most often. The answer is simple: &lt;strong&gt;a general-purpose model is broad intelligence; NyayAI is domain infrastructure for Indian law.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A general-purpose model like ChatGPT:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does not maintain a live, structured legal retrieval index internally&lt;/li&gt;
&lt;li&gt;Cannot guarantee exact citations from 43,000+ judgments&lt;/li&gt;
&lt;li&gt;May compress or approximate precedent chains&lt;/li&gt;
&lt;li&gt;May hallucinate paragraph numbers or holdings occasionally&lt;/li&gt;
&lt;li&gt;Is optimized broadly across all domains, not specifically for Indian jurisprudence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;NyayAI is specifically engineered for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Indian legal retrieval&lt;/strong&gt; — semantic search over the full corpus of Supreme Court judgments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Citation-grounded answers&lt;/strong&gt; — every response is backed by actual legal text, not model memory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Statute + precedent linking&lt;/strong&gt; — connecting Constitutional articles, Central Acts, and case law&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured metadata retrieval&lt;/strong&gt; — case title, bench, citation number, year, disposal type&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Legal-domain-specific RAG&lt;/strong&gt; — retrieval-augmented generation tuned for jurisprudence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The analogy is precise: GitHub Copilot is better than raw autocomplete for coding. Bloomberg exists despite Google. Westlaw exists despite search engines. &lt;strong&gt;NyayAI exists because Indian law deserves its own intelligence layer.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This blog post is a technical deep dive into everything I've built — the data pipelines, the model architecture decisions, the training infrastructure, the RAG pipeline, the production frontend, and the results. Every number, every decision, every failed experiment is documented here.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 0: The 103M Parameter Experiment (The Learning Phase)
&lt;/h2&gt;

&lt;p&gt;Before touching any pretrained model, I wanted to understand transformers at the deepest level. Not "import transformers and call &lt;code&gt;.fit()&lt;/code&gt;" — I mean &lt;strong&gt;implementing a GPT-style transformer from scratch in PyTorch&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;

&lt;p&gt;I built a decoder-only transformer with the following specifications:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total Parameters&lt;/td&gt;
&lt;td&gt;103,457,280 (~103M)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Layers&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Attention Heads&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Embedding Dimension&lt;/td&gt;
&lt;td&gt;768&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context Window&lt;/td&gt;
&lt;td&gt;512 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vocabulary Size&lt;/td&gt;
&lt;td&gt;50,257 (GPT-2 tokenizer)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output Head&lt;/td&gt;
&lt;td&gt;Weight-tied with embedding layer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The model was trained on &lt;strong&gt;269 million tokens&lt;/strong&gt; (1.25 GB) of Indian legal text — the same corpus I'd later use for the production pipeline. Training ran on &lt;strong&gt;NVIDIA A100 GPUs via Modal&lt;/strong&gt; for 2 epochs across 59,000 gradient steps.&lt;/p&gt;

&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Final Validation Loss&lt;/td&gt;
&lt;td&gt;2.46&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Perplexity&lt;/td&gt;
&lt;td&gt;11.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Training Time&lt;/td&gt;
&lt;td&gt;~8 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A perplexity of 11.7 on legal text means the model learned the structure and vocabulary of Indian legal language reasonably well. It could generate coherent legal-sounding text, but it was not a &lt;em&gt;useful&lt;/em&gt; model — it had no instruction-following capability and no factual grounding. It was a learning exercise, and it served its purpose brilliantly.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key Takeaway:&lt;/strong&gt; Building a transformer from scratch taught me more about attention mechanisms, positional encoding, loss landscapes, and gradient dynamics than any course or paper ever could. If you're serious about ML, I strongly recommend doing this at least once.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Phase 1: Data Acquisition — The Foundation of Everything
&lt;/h2&gt;

&lt;p&gt;A model is only as good as its data. For NyayAI, I needed three categories of legal text:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The Constitution of India&lt;/strong&gt; — the supreme law, 395+ articles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Central Acts (Bare Acts)&lt;/strong&gt; — the 858 laws passed by Parliament&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Supreme Court Judgments&lt;/strong&gt; — 75 years of case law (1950–2025)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  1A. The Constitution of India
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Source:&lt;/strong&gt; A structured JSON file containing all articles with metadata (article number, title, description).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pipeline:&lt;/strong&gt; A straightforward JSON-to-text converter that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Parses each article from the JSON&lt;/li&gt;
&lt;li&gt;Cleans escaped newlines and normalizes whitespace&lt;/li&gt;
&lt;li&gt;Preserves repealed articles with notation&lt;/li&gt;
&lt;li&gt;Formats as structured text with &lt;code&gt;Article N — Title&lt;/code&gt; headers&lt;/li&gt;
&lt;li&gt;Separates each article with &lt;code&gt;&amp;lt;|endoftext|&amp;gt;&lt;/code&gt; tokens for clean document boundaries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Articles Processed&lt;/td&gt;
&lt;td&gt;395+ (including amendments)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;File Size&lt;/td&gt;
&lt;td&gt;502 KB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Estimated Tokens&lt;/td&gt;
&lt;td&gt;~106,000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Constitution is small but dense — every article matters. The Preamble alone is one of the most frequently cited legal texts in Indian jurisprudence.&lt;/p&gt;

&lt;h3&gt;
  
  
  1B. Central Acts (858 Bare Acts)
&lt;/h3&gt;

&lt;p&gt;This was significantly more complex. India has 858 central acts in force, ranging from the Indian Penal Code (1860) to the Digital Personal Data Protection Act (2023). These were stored as deeply nested JSON files with a schema that included:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Act Title, Act ID, Enactment Date, Act Definition
├── Chapters/Parts
│   ├── Sections
│   │   └── Paragraphs (strings or nested dicts with text/contains)
│   └── Subheadings
│       └── Sections
├── Schedules, Annexures, Appendix, Forms
└── Footnotes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pipeline:&lt;/strong&gt; A recursive JSON traversal engine that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Handles BOM encoding&lt;/strong&gt; — many Indian government JSON files contain a byte-order mark&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recursively extracts paragraphs&lt;/strong&gt; — handles arbitrarily nested &lt;code&gt;text/contains&lt;/code&gt; structures with proper indentation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cleans legislative artifacts&lt;/strong&gt; — removes footnote reference numbers, strips decorative markers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sorts sections numerically&lt;/strong&gt; — a custom sort function ensures Section 2 comes before Section 10&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Processes chapters, subheadings, schedules, annexures, and footnotes&lt;/strong&gt; — preserving the full hierarchical structure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outputs with &lt;code&gt;&amp;lt;|endoftext|&amp;gt;&lt;/code&gt; boundaries&lt;/strong&gt; between each act&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Acts Processed&lt;/td&gt;
&lt;td&gt;858&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;File Size&lt;/td&gt;
&lt;td&gt;29.9 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total Words&lt;/td&gt;
&lt;td&gt;~5,076,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Estimated Tokens&lt;/td&gt;
&lt;td&gt;~6,600,000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  1C. Supreme Court Judgments (1950–2025)
&lt;/h3&gt;

&lt;p&gt;This was the heavy lift — and the most valuable data. The Supreme Court of India has delivered tens of thousands of judgments over 75 years. I sourced these from the &lt;strong&gt;AWS Open Data Registry&lt;/strong&gt; (&lt;code&gt;s3://indian-supreme-court-judgments&lt;/code&gt;), a public bucket containing judgment PDFs and metadata JSONs organized by year.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Download&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses &lt;code&gt;boto3&lt;/code&gt; with unsigned requests (public bucket, no auth needed)&lt;/li&gt;
&lt;li&gt;Downloads English judgment tar files and metadata tar files for each year (1950–2026)&lt;/li&gt;
&lt;li&gt;Implements &lt;strong&gt;resume support&lt;/strong&gt; — skips files that already exist with correct size&lt;/li&gt;
&lt;li&gt;Progress logging with download speed tracking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Extract &amp;amp; Process&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the most complex pipeline in the entire project. It:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Extracts metadata tars&lt;/strong&gt; — unpacks year-by-year JSON metadata files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parses metadata HTML&lt;/strong&gt; — each judgment's metadata is stored as raw HTML. A dedicated parser extracts:

&lt;ul&gt;
&lt;li&gt;Case title (petitioner vs respondent)&lt;/li&gt;
&lt;li&gt;Judges/Coram&lt;/li&gt;
&lt;li&gt;Decision date&lt;/li&gt;
&lt;li&gt;Case number&lt;/li&gt;
&lt;li&gt;Bench size&lt;/li&gt;
&lt;li&gt;Citation&lt;/li&gt;
&lt;li&gt;Disposal nature&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extracts text from PDFs&lt;/strong&gt; — uses &lt;strong&gt;PyMuPDF (fitz)&lt;/strong&gt; to extract text from judgment PDFs, then cleans:

&lt;ul&gt;
&lt;li&gt;Page headers/footers ("SUPREME COURT REPORTS", standalone page numbers)&lt;/li&gt;
&lt;li&gt;Excessive whitespace&lt;/li&gt;
&lt;li&gt;Year-only lines (standalone "1950", "2023", etc.)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Matches PDFs to metadata&lt;/strong&gt; — correlates each PDF with its extracted case metadata by path key&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Formats each judgment&lt;/strong&gt; as a structured document with a header block (title, citation, case number, date, bench, disposal) followed by the full judgment text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Processes year-by-year&lt;/strong&gt; — streams output to avoid loading 1.5 GB of text into memory at once&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Judgments Processed&lt;/td&gt;
&lt;td&gt;43,324&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;File Size&lt;/td&gt;
&lt;td&gt;1.49 GB (1,588,861,395 bytes)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total Words&lt;/td&gt;
&lt;td&gt;~261,000,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Estimated Tokens&lt;/td&gt;
&lt;td&gt;~339,300,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time Span&lt;/td&gt;
&lt;td&gt;1950–2025 (75 years)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Total Corpus Summary
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;File Size&lt;/th&gt;
&lt;th&gt;Tokens (est.)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Constitution of India&lt;/td&gt;
&lt;td&gt;502 KB&lt;/td&gt;
&lt;td&gt;~106K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Central Acts (858 acts)&lt;/td&gt;
&lt;td&gt;29.9 MB&lt;/td&gt;
&lt;td&gt;~6.6M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SC Judgments (43,324 cases)&lt;/td&gt;
&lt;td&gt;1.49 GB&lt;/td&gt;
&lt;td&gt;~339.3M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~1.52 GB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~346 Million&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is a genuinely massive legal corpus — 346 million tokens of structured, cleaned Indian legal text spanning 75 years of Supreme Court jurisprudence, the entire Constitution, and every central act in force.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 1.5: Synthetic Instruction Dataset Generation
&lt;/h2&gt;

&lt;p&gt;A language model that can continue legal text is interesting but not useful. To make it follow instructions — answer questions, summarize cases, compare sections — I needed an &lt;strong&gt;instruction-response dataset&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Creating thousands of high-quality legal Q&amp;amp;A pairs by hand was not feasible. Instead, I built a &lt;strong&gt;synthetic data generation pipeline&lt;/strong&gt; using Google's Gemini API.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Approach
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Random chunk sampling&lt;/strong&gt; — for each batch, randomly select a ~40,000 character chunk from one of the three source files, with a weighted distribution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;60% Supreme Court judgments (largest, most diverse)&lt;/li&gt;
&lt;li&gt;30% Central Acts (statute-heavy, structured)&lt;/li&gt;
&lt;li&gt;10% Constitution (fundamental, frequently referenced)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Structured prompting&lt;/strong&gt; — each chunk is sent to &lt;code&gt;gemini-3.1-flash-lite&lt;/code&gt; with a carefully crafted prompt that enforces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No hallucination&lt;/strong&gt; — responses must be based &lt;em&gt;strictly&lt;/em&gt; on the provided text excerpt&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Diversity in length and complexity&lt;/strong&gt; — each batch of 5 pairs follows a prescribed format:

&lt;ul&gt;
&lt;li&gt;Task 1: Very Long (3-4 paragraph comprehensive summary/brief)&lt;/li&gt;
&lt;li&gt;Task 2: Medium (legal argument/analysis)&lt;/li&gt;
&lt;li&gt;Task 3: Medium (comparison of concepts)&lt;/li&gt;
&lt;li&gt;Task 4: Short (direct factual question)&lt;/li&gt;
&lt;li&gt;Task 5: Short (yes/no client question with explanation)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured output&lt;/strong&gt; — uses Pydantic models with &lt;code&gt;response_mime_type: application/json&lt;/code&gt; for reliable parsing&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Incremental saving&lt;/strong&gt; — pairs are appended to a JSONL file as they're generated, with a running count. Supports resume (checks existing pair count on startup).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rate limiting&lt;/strong&gt; — 4-second sleep between requests to respect the free tier (15 RPM).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Output
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Generated Pairs&lt;/td&gt;
&lt;td&gt;~4,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;File Size&lt;/td&gt;
&lt;td&gt;2.09 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Source Distribution&lt;/td&gt;
&lt;td&gt;60% judgments, 30% acts, 10% constitution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generation Model&lt;/td&gt;
&lt;td&gt;Gemini 3.1 Flash Lite&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;$0 (free tier API)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Training Data Distribution Analysis
&lt;/h3&gt;

&lt;p&gt;After generation, I analyzed the response length distribution — this turned out to be a critical insight for understanding model behavior later:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stat&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Median response&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;110 words (~150 tokens)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;25th percentile&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;45 words&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;75th percentile&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;147 words (~200 tokens)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Max response&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;367 words (~490 tokens)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Under 100 words&lt;/td&gt;
&lt;td&gt;45% of all training data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100-200 words&lt;/td&gt;
&lt;td&gt;37%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;200-400 words&lt;/td&gt;
&lt;td&gt;18%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;400+ words&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This distribution matters enormously: &lt;strong&gt;the model will learn to produce responses at the length distribution it was trained on&lt;/strong&gt;. More on this in Phase 4B.&lt;/p&gt;

&lt;p&gt;The critical insight here: &lt;strong&gt;the quality of your instruction data matters far more than quantity&lt;/strong&gt;. The original Stanford Alpaca paper used only 52K pairs to teach instruction-following to LLaMA. For a domain-specific model, 2,000-4,000 high-quality, grounded pairs are more than enough — as long as they're diverse in task type and faithful to the source material.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 2: Fine-Tuning — Teaching the Model Indian Law
&lt;/h2&gt;

&lt;p&gt;With data in hand, it was time to take a state-of-the-art pretrained model and teach it to be an Indian legal expert.&lt;/p&gt;

&lt;h3&gt;
  
  
  Model Selection: Qwen-3 4B Instruct
&lt;/h3&gt;

&lt;p&gt;After evaluating several sub-6B parameter models (Phi-4-mini, SmolLM3-3B, Gemma-3n-E2B), I chose &lt;strong&gt;Qwen-3 4B Instruct (2507 variant)&lt;/strong&gt; for several reasons:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Factor&lt;/th&gt;
&lt;th&gt;Why Qwen-3 4B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reasoning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Exceptional chain-of-thought and instruction following&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multilingual&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Strong Hindi support (critical for Indian legal market)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Modern optimizations, efficient attention&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ecosystem&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Massive HuggingFace community, well-documented&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;License&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Apache 2.0 — fully commercial use&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4B parameters — fits in a single L4 GPU (24GB) in bfloat16&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Training Infrastructure
&lt;/h3&gt;

&lt;p&gt;Everything runs on &lt;strong&gt;Modal&lt;/strong&gt; — a serverless GPU cloud that lets you define your entire training pipeline in a single Python file and run it with one command. The entire training pipeline — from data loading to checkpoint saving — executes remotely on Modal. Checkpoints are saved to a Modal Volume and automatically downloaded to my local machine after each epoch.&lt;/p&gt;

&lt;h3&gt;
  
  
  LoRA: Training Smart, Not Expensive
&lt;/h3&gt;

&lt;p&gt;Fine-tuning all 4 billion parameters would require multiple GPUs and cost hundreds of dollars. Instead, I implemented &lt;strong&gt;LoRA (Low-Rank Adaptation)&lt;/strong&gt; from scratch — no HuggingFace PEFT library, no Unsloth, no shortcuts.&lt;/p&gt;

&lt;h4&gt;
  
  
  How LoRA Works
&lt;/h4&gt;

&lt;p&gt;Instead of updating the full weight matrix W (size &lt;code&gt;d × d&lt;/code&gt;), LoRA decomposes the update into two small matrices:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;W' = W + α(A × B)
where A is (d × r) and B is (r × d), and r &amp;lt;&amp;lt; d
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For rank r=16 and dimension d=768, instead of updating 589,824 parameters per layer, you're updating 16×768 + 16×768 = 24,576 parameters — a &lt;strong&gt;24x reduction&lt;/strong&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Implementation
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LORALayer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;in_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;in_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Parameter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out_dim&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;A&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LinearWithLoRA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;linear&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linear&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;linear&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lora&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LORALayer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;linear&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;in_features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;linear&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;out_features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lora&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;B&lt;/code&gt; matrix is initialized to zeros, so at the start of training, &lt;code&gt;LoRA(x) = α × (A(x) @ 0) = 0&lt;/code&gt;. The model starts exactly where the pretrained model left off — no disruption. As training progresses, the LoRA layers learn domain-specific adaptations while the base model stays frozen.&lt;/p&gt;

&lt;h4&gt;
  
  
  Target Modules
&lt;/h4&gt;

&lt;p&gt;LoRA adapters were injected into the &lt;strong&gt;attention layers only&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;lora_target_modules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;k_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;o_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Hyperparameters
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Rationale&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LoRA Rank&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;Sweet spot: enough capacity for domain adaptation without overfitting on ~4K pairs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LoRA Alpha&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;α/r = 2.0 scaling factor — standard choice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Peak Learning Rate&lt;/td&gt;
&lt;td&gt;2e-5&lt;/td&gt;
&lt;td&gt;Conservative — avoiding catastrophic forgetting of base model knowledge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Minimum Learning Rate&lt;/td&gt;
&lt;td&gt;2e-6&lt;/td&gt;
&lt;td&gt;10x decay from peak&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Warmup Steps&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;Quick ramp to prevent early instability&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Batch Size&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Fits in L4 VRAM with gradient checkpointing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max Sequence Length&lt;/td&gt;
&lt;td&gt;8,192&lt;/td&gt;
&lt;td&gt;Full context window of Qwen-3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Weight Decay&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;Standard regularization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gradient Clipping&lt;/td&gt;
&lt;td&gt;1.0 (max norm)&lt;/td&gt;
&lt;td&gt;Prevents exploding gradients on long legal sequences&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Optimizer&lt;/td&gt;
&lt;td&gt;AdamW&lt;/td&gt;
&lt;td&gt;Only over LoRA parameters&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Precision&lt;/td&gt;
&lt;td&gt;bfloat16&lt;/td&gt;
&lt;td&gt;Native on L4, no precision loss for this scale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Epochs&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Sufficient for convergence on this dataset size&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Parameter Efficiency
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total Model Parameters&lt;/td&gt;
&lt;td&gt;~4,000,000,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Frozen (Base Model)&lt;/td&gt;
&lt;td&gt;~3,988,200,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Trainable (LoRA)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~11,800,000&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Parameter Ratio&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~0.30%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We're training less than 0.3% of the model's parameters. The LoRA adapter checkpoint is &lt;strong&gt;~135 MB&lt;/strong&gt; — compared to the full model's ~8 GB in bfloat16.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Formatting: ChatML
&lt;/h3&gt;

&lt;p&gt;Every instruction-response pair is formatted in &lt;strong&gt;ChatML&lt;/strong&gt; (the template Qwen expects):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;&lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="na"&gt;im_start&lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;system
You are an expert Indian Legal Assistant.&lt;span class="nt"&gt;&amp;lt;&lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="na"&gt;im_end&lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;&lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="na"&gt;im_start&lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;user
What are the key provisions of Section 14 of the Hindu Succession Act?&lt;span class="nt"&gt;&amp;lt;&lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="na"&gt;im_end&lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;&lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="na"&gt;im_start&lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;assistant
Section 14 of the Hindu Succession Act, 1956, is a landmark provision...&lt;span class="nt"&gt;&amp;lt;&lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="na"&gt;im_end&lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Custom Collation: Dynamic Batch Padding
&lt;/h3&gt;

&lt;p&gt;Rather than padding all sequences to the maximum model length (8,192 tokens), I implemented &lt;strong&gt;dynamic batch padding&lt;/strong&gt; — each batch is padded only to the length of its longest sequence. This saves enormous amounts of compute. If a batch's longest sequence is 1,200 tokens, we're processing 1,200 × 4 = 4,800 tokens instead of 8,192 × 4 = 32,768 tokens. On average, this reduces compute by ~70-80%.&lt;/p&gt;

&lt;h3&gt;
  
  
  Learning Rate Schedule: Cosine with Linear Warmup
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Linear warmup&lt;/strong&gt; (0 → 2e-5 over 50 steps) — prevents early training instability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cosine decay&lt;/strong&gt; (2e-5 → 2e-6 over remaining steps) — smooth convergence without sharp drops&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Memory Optimization: Gradient Checkpointing
&lt;/h3&gt;

&lt;p&gt;With 4B parameters in bfloat16, the model alone takes ~8GB of VRAM. Add optimizer states, gradients, and activations for 8,192-token sequences, and you blow past 24GB easily. &lt;strong&gt;Gradient checkpointing&lt;/strong&gt; trades ~30% more compute time for ~40% VRAM savings — the difference between fitting and OOM.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fault-Tolerant Training: The Generator Pattern
&lt;/h3&gt;

&lt;p&gt;Training on cloud GPUs can fail for many reasons — preemption, network issues, timeouts. The training loop uses Python's &lt;strong&gt;generator pattern&lt;/strong&gt; (&lt;code&gt;yield&lt;/code&gt;) to stream results back to the local machine after each epoch. This means even if training crashes after epoch 1, I already have the checkpoint downloaded locally.&lt;/p&gt;

&lt;h3&gt;
  
  
  Training Results
&lt;/h3&gt;

&lt;p&gt;Training ran for &lt;strong&gt;2 full epochs&lt;/strong&gt; on an &lt;strong&gt;NVIDIA L4 GPU (24GB VRAM)&lt;/strong&gt; via Modal.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Epoch 1 End&lt;/th&gt;
&lt;th&gt;Epoch 2 End (Final)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Training Loss&lt;/td&gt;
&lt;td&gt;~1.05&lt;/td&gt;
&lt;td&gt;~0.69&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Validation Loss&lt;/td&gt;
&lt;td&gt;~1.00&lt;/td&gt;
&lt;td&gt;~0.92&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Learning Rate&lt;/td&gt;
&lt;td&gt;~1.2e-5 (mid-decay)&lt;/td&gt;
&lt;td&gt;~2e-6 (minimum)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tokens Processed&lt;/td&gt;
&lt;td&gt;~4.5M&lt;/td&gt;
&lt;td&gt;~9.0M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Global Steps&lt;/td&gt;
&lt;td&gt;~850&lt;/td&gt;
&lt;td&gt;~1,700&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Key Observations
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Smooth convergence&lt;/strong&gt; — no loss spikes, no instability. The warmup + cosine schedule + gradient clipping combination worked perfectly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No overfitting&lt;/strong&gt; — validation loss tracked training loss closely throughout. The gap widened slightly in epoch 2 (0.69 vs 0.92), which is expected and healthy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rapid initial learning&lt;/strong&gt; — the steepest loss drop happened in the first 200 steps of epoch 1, as the model quickly adapted to the legal domain's vocabulary and style.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Diminishing returns in epoch 2&lt;/strong&gt; — most of the learning happened in epoch 1. Epoch 2 provided refinement but the marginal improvement was smaller.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Phase 3: The Production RAG Pipeline — Architecture, Sharding, &amp;amp; Serving
&lt;/h2&gt;

&lt;p&gt;A fine-tuned model knows &lt;em&gt;how to talk&lt;/em&gt; like a legal expert, but it doesn't &lt;em&gt;remember&lt;/em&gt; specific facts. When a lawyer asks "What does Section 34 of the Indian Trusts Act say?", a model might generate something that sounds legally plausible but is entirely fabricated.&lt;/p&gt;

&lt;p&gt;To solve this, I designed and built a production-grade, highly optimized &lt;strong&gt;RAG (Retrieval-Augmented Generation)&lt;/strong&gt; pipeline. This lookup mechanism allows our fine-tuned Qwen model to query a massive vector database of Indian law, extract the exact legal provisions, and generate answers strictly grounded in the source material with pinpoint citations.&lt;/p&gt;




&lt;h3&gt;
  
  
  3A. LoRA Adapter Merging
&lt;/h3&gt;

&lt;p&gt;Running a model with active LoRA weights in production adds computational overhead and complicates serving. To achieve maximum inference speed and simplify deployment, I mathematically blend the LoRA weights directly into the base Qwen-3 4B parameters:&lt;/p&gt;

&lt;p&gt;$$W_{\text{merged}} = W_{\text{base}} + \frac{\alpha}{r} (A \times B)$$&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Result&lt;/strong&gt;: Fused 144 adapter projection layers in exactly &lt;strong&gt;20.4 seconds&lt;/strong&gt;. The final standalone model (~7.5 GB in &lt;code&gt;bfloat16&lt;/code&gt; precision) was saved directly to the persistent Modal Volume.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  3B. Structure-Aware Legal Chunking
&lt;/h3&gt;

&lt;p&gt;Legal documents have natural, highly structured segmentations (articles, sections, subsections). Naive chunking (e.g., splitting every 500 characters blindly) splits legal clauses in half, completely ruining retrieval precision.&lt;/p&gt;

&lt;p&gt;I built a structure-aware chunking pipeline that parses the three source document types into structured chunks while preserving critical legal metadata mappings:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Constitution of India&lt;/strong&gt;: Split by Article bounds → &lt;strong&gt;468 chunks&lt;/strong&gt; (average 1,025 characters).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Central Acts&lt;/strong&gt;: Split recursively by Section bounds → &lt;strong&gt;23,152 chunks&lt;/strong&gt; (average 1,364 characters).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Supreme Court Judgments&lt;/strong&gt;: Split by structured paragraphs, with metadata headers (case title, citation, bench, year) prepended to each chunk → &lt;strong&gt;330,673 chunks&lt;/strong&gt; (average 4,756 characters).&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Output&lt;/strong&gt;: &lt;strong&gt;354,293 chunks&lt;/strong&gt; compiled into a single &lt;strong&gt;1.6 GB&lt;/strong&gt; file. Each chunk contains its text, &lt;code&gt;chunk_id&lt;/code&gt;, and a metadata dictionary mapping its original source attributes (e.g., &lt;code&gt;article_number&lt;/code&gt;, &lt;code&gt;act_title&lt;/code&gt;, &lt;code&gt;section&lt;/code&gt;, &lt;code&gt;case_title&lt;/code&gt;, &lt;code&gt;citation&lt;/code&gt;, &lt;code&gt;bench&lt;/code&gt;, &lt;code&gt;year&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  3C. Massively Parallel GPU Map-Reduce Embedding
&lt;/h3&gt;

&lt;p&gt;Generating vector embeddings for &lt;strong&gt;354,293&lt;/strong&gt; documents using a state-of-the-art multi-lingual model (&lt;strong&gt;BGE-M3&lt;/strong&gt;) would take days on a single machine. To solve this, I built a highly distributed &lt;strong&gt;Map-Reduce pipeline&lt;/strong&gt; using &lt;strong&gt;Modal&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;graph TD
    A[354,293 chunks] --&amp;gt; B[Coordinator Function]
    B --&amp;gt;|Split into 32 shards| C[Shard Inputs]

    C --&amp;gt;|Shard 0| D1[L4 GPU Worker 1]
    C --&amp;gt;|Shard 1| D2[L4 GPU Worker 2]
    C --&amp;gt;|...| D3[L4 GPU Worker ...]
    C --&amp;gt;|Shard 31| D4[L4 GPU Worker 32]

    D1 --&amp;gt;|Embed FP16| E1[11,000 vectors]
    D2 --&amp;gt;|Embed FP16| E2[11,000 vectors]
    D3 --&amp;gt;|Embed FP16| E3[... vectors]
    D4 --&amp;gt;|Embed FP16| E4[11,000 vectors]

    E1 --&amp;gt; F[Reduce / Concatenate]
    E2 --&amp;gt; F
    E3 --&amp;gt; F
    E4 --&amp;gt; F

    F --&amp;gt; G[(FAISS Index FlatIP &amp;lt;br&amp;gt; 354,293 x 1024)]
    F --&amp;gt; H[(SQLite chunk_lookup.db)]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The Map Phase&lt;/strong&gt;: The coordinator divides the 354K chunks into 32 shards (~11,000 chunks per shard). Modal automatically spins up &lt;strong&gt;32 parallel L4 GPU containers&lt;/strong&gt; in the cloud simultaneously.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-Caching &amp;amp; Instant Boot&lt;/strong&gt;: The BGE-M3 model weights are baked directly into the Docker image layer, bypassing HuggingFace downloads and enabling the GPU servers to boot instantly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FP16 Inference&lt;/strong&gt;: Each worker runs native PyTorch &lt;code&gt;float16&lt;/code&gt; inference over its 11,000 texts, generating normalized dense embeddings in a fraction of the time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Reduce Phase&lt;/strong&gt;: The coordinator gathers the 32 output matrices, concatenating them in chronological order into a single dense matrix of shape &lt;code&gt;(354293, 1024)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FAISS Index Compilation&lt;/strong&gt;: The combined embeddings are fed into a FAISS &lt;code&gt;IndexFlatIP&lt;/code&gt; (Cosine similarity) index and saved. Simultaneously, a SQLite lookup database is generated on the volume.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute Time&lt;/strong&gt;: The entire parallel sharding execution finished in &lt;strong&gt;under 20-30 minutes&lt;/strong&gt; of total wall time.&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  3D. Production FastAPI Serving &amp;amp; Optimizations
&lt;/h3&gt;

&lt;p&gt;To serve the RAG assistant, I built an extremely optimized FastAPI server hosted on Modal. It loads the merged Qwen model and BGE-M3 on a single cost-effective L4 GPU.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Zero-RAM SQLite Lookup Database (Startup Optimization)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Problem&lt;/strong&gt;: Reading the 1.6 GB chunk lookup JSON into container memory on boot takes almost &lt;strong&gt;2 minutes&lt;/strong&gt; and consumes &lt;strong&gt;1.6 GB of RAM&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Solution&lt;/strong&gt;: On first startup, the server streams the JSON file line-by-line and compiles a local SQLite database directly on the persistent volume (took 92.6s). On subsequent boots, the JSON is completely bypassed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Result&lt;/strong&gt;: The server opens a thread-safe SQLite connection instantly on boot (&lt;strong&gt;0.001 seconds&lt;/strong&gt;) and consumes &lt;strong&gt;0 MB of startup RAM overhead&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  2. VRAM Autocasting &amp;amp; Thread-Safe Real-time Streaming
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Autocasting&lt;/strong&gt;: Inside the generation thread, both token lookup and model generation are wrapped in &lt;code&gt;torch.inference_mode()&lt;/code&gt; and &lt;code&gt;torch.autocast(device_type="cuda", dtype=torch.bfloat16)&lt;/code&gt; to guarantee zero memory spikes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ASGI Protection&lt;/strong&gt;: Real-time token streaming is exposed via Server-Sent Events (SSE) at &lt;code&gt;/api/ask/stream&lt;/code&gt;. Because LLM token generation is CPU/GPU bound, running it synchronously inside an async FastAPI server freezes the async event loop. I wrapped the &lt;code&gt;TextIteratorStreamer&lt;/code&gt; inside a separate native OS &lt;code&gt;Thread&lt;/code&gt; and fed tokens into a synchronous streaming generator.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strict EOS Enforcement&lt;/strong&gt;: The system dynamically extracts the &lt;code&gt;&amp;lt;|im_end|&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;|endoftext|&amp;gt;&lt;/code&gt; token IDs at tokenizer boot to strictly enforce early stopping and prevent hallucinations.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  3. Absolute Cost Safety
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;The server uses &lt;code&gt;min_containers=0&lt;/code&gt;. When idle, it &lt;strong&gt;scales down to zero GPU containers&lt;/strong&gt;, costing &lt;strong&gt;exactly $0.00 in hosting fees&lt;/strong&gt;. Cold start boots in ~10 seconds.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  3E. Verification &amp;amp; End-to-End Test Results
&lt;/h3&gt;

&lt;p&gt;Both endpoints were verified against the active server. The results are spectacular and highly accurate:&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Blocking API Endpoint (&lt;code&gt;/api/ask&lt;/code&gt;)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Query&lt;/strong&gt;: &lt;em&gt;"What does Article 21 of the Indian Constitution guarantee?"&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Status&lt;/strong&gt;: &lt;code&gt;200 OK&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total Latency&lt;/strong&gt;: &lt;strong&gt;5.34 seconds&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generated Answer&lt;/strong&gt;:
&amp;gt; Article 21 guarantees the right to life and personal liberty. The Supreme Court has interpreted this right expansively, noting that it is not limited to mere survival but encompasses the right to live with dignity. This includes the right to privacy, which is viewed as an inalienable component of personal liberty.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sources Used&lt;/strong&gt;:

&lt;ol&gt;
&lt;li&gt;&lt;code&gt;[SC_JUDGMENTS] Supreme Court: K.S. Puttaswamy v. Union of India (2017)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;[SC_JUDGMENTS] Supreme Court: Common Cause v. Union of India (2017)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;[SC_JUDGMENTS] Supreme Court: X v. Union of India (2023)&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  2. Streaming API Endpoint (&lt;code&gt;/api/ask/stream&lt;/code&gt;)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Query&lt;/strong&gt;: &lt;em&gt;"What are the grounds for divorce under the Hindu Marriage Act?"&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Status&lt;/strong&gt;: &lt;code&gt;200 OK&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stream Event 1 (Metadata Block)&lt;/strong&gt;: Source citations with full case metadata&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stream Event 2+ (Word-by-Word Tokens)&lt;/strong&gt;: Real-time legal analysis streaming&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Phase 4A: The Full-Stack SaaS Product
&lt;/h2&gt;

&lt;p&gt;With a working RAG backend, I built a complete &lt;strong&gt;production-grade web application&lt;/strong&gt; — not a demo, not a Gradio wrapper, but a real SaaS product with authentication, streaming, and a premium UI.&lt;/p&gt;

&lt;h3&gt;
  
  
  Technology Stack
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Frontend Framework&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Next.js 16 (App Router)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Vercel (via CLI, no GitHub push required)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Styling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CSS Modules with custom design system&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Fonts&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Playfair Display (headings) + Inter (body) via Google Fonts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Icons&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Lucide React&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Backend Proxy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Next.js API Routes → Modal FastAPI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Authentication&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Access-code gate with server-side cookie validation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Design System
&lt;/h3&gt;

&lt;p&gt;The UI uses a &lt;strong&gt;dark professional theme&lt;/strong&gt; with deep navy (&lt;code&gt;#0F1B2D&lt;/code&gt;) + gold (&lt;code&gt;#C89D4A&lt;/code&gt;) branding — deliberately chosen for extended legal research sessions. No glassmorphism. Minimal, authoritative, and clean.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Token&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Usage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--navy&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;#0F1B2D&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Primary background&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--navy-light&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;#162337&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Card/surface backgrounds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--gold&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;#C89D4A&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Accent, branding, active states&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--white&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;#EAEAEA&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Primary text&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--gray-300&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;#B0B8C1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Secondary text&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Architecture Flow
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;graph LR
    A[User Browser] --&amp;gt;|HTTPS| B[Vercel CDN]
    B --&amp;gt;|Auth Cookie| C[Next.js API Route]
    C --&amp;gt;|SSE Stream| D[Modal FastAPI]
    D --&amp;gt;|FAISS Query| E[(Vector Index)]
    D --&amp;gt;|SQLite Lookup| F[(Chunk DB)]
    D --&amp;gt;|Qwen-3 4B| G[L4 GPU]
    G --&amp;gt;|Token Stream| D
    D --&amp;gt;|SSE Response| C
    C --&amp;gt;|Stream to Client| A
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key Components
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Access Gate&lt;/strong&gt; — access-code authentication with server-side cookie validation. Protects the chat interface from unauthorized access.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Chat Interface&lt;/strong&gt; — real-time streaming chat with auto-scroll, message bubbles, and a loading state that progresses through multiple stages ("Searching 354,293 legal documents..." → "Analyzing relevant precedents..." → "Constructing legal context..." → "Warming up GPU inference engine...").&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Citation Cards&lt;/strong&gt; — each source citation is rendered as an expandable card showing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Source type badge (SC Judgment / Central Act / Constitution)&lt;/li&gt;
&lt;li&gt;Case title (with intelligent fallback extraction from chunk text)&lt;/li&gt;
&lt;li&gt;Year, citation number, bench composition&lt;/li&gt;
&lt;li&gt;Full metadata grid when expanded&lt;/li&gt;
&lt;li&gt;Actual chunk text (first 300 characters)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Collapsible Citations&lt;/strong&gt; — citations are grouped by source type with a summary bar: &lt;em&gt;"6 SC Judgments · 1 Constitution · 1 Central Act"&lt;/em&gt;. Collapsed by default to keep the focus on the answer.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Confidence Bar&lt;/strong&gt; — displays: &lt;em&gt;"✓ 8 sources retrieved · Avg relevance: 87%"&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sample Prompts&lt;/strong&gt; — curated legal questions on the empty state, tuned for strong demo performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Branded Favicon&lt;/strong&gt; — custom SVG: gold "N" monogram with balanced scales of justice on deep navy background.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Phase 4B: The Refinement Layer — Production-Grade Tuning
&lt;/h2&gt;

&lt;p&gt;After the initial deployment, I systematically addressed every production issue. This phase was the difference between "it works" and "it works &lt;em&gt;well&lt;/em&gt;."&lt;/p&gt;

&lt;h3&gt;
  
  
  System Prompt Engineering
&lt;/h3&gt;

&lt;p&gt;The original system prompt was 5 generic lines. I rewrote it into a &lt;strong&gt;20-line structured instruction set&lt;/strong&gt; that forces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exact case name citations (no "Supreme Court Judgment" placeholders)&lt;/li&gt;
&lt;li&gt;Chronological ordering for historical/evolution queries&lt;/li&gt;
&lt;li&gt;Bullet points for distinct legal holdings&lt;/li&gt;
&lt;li&gt;No repetition across paragraphs&lt;/li&gt;
&lt;li&gt;Senior legal researcher tone&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Hierarchical Response Modes
&lt;/h3&gt;

&lt;p&gt;A critical learning: the model was fine-tuned on responses with a &lt;strong&gt;median length of 110 words&lt;/strong&gt;. It learned to hit EOS (end of sentence) at ~150-180 tokens regardless of &lt;code&gt;max_new_tokens&lt;/code&gt;. The system prompt alone couldn't override this trained behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The solution: hierarchical prompting with three modes.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;&lt;code&gt;min_new_tokens&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;System Prompt Instruction&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;⚡ &lt;strong&gt;Concise&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;0 (no floor)&lt;/td&gt;
&lt;td&gt;"Brief, direct answer in 2-4 sentences"&lt;/td&gt;
&lt;td&gt;Quick factual lookups&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;📖 &lt;strong&gt;Detailed&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;150&lt;/td&gt;
&lt;td&gt;"Detailed analysis with case references, chronological ordering"&lt;/td&gt;
&lt;td&gt;Standard legal questions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🎓 &lt;strong&gt;Research&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;350&lt;/td&gt;
&lt;td&gt;"Full legal research memo: case-by-case breakdown, reasoning, holdings, evolution, current position"&lt;/td&gt;
&lt;td&gt;Deep analysis, investor demos&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each mode dynamically adjusts both the system prompt AND the &lt;code&gt;min_new_tokens&lt;/code&gt; parameter in &lt;code&gt;model.generate()&lt;/code&gt;. The user sees three pill buttons above the input field.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The insight&lt;/strong&gt;: &lt;code&gt;max_new_tokens&lt;/code&gt; is a ceiling, not a target. It says "generate at most this many tokens." But the model stops when it hits an EOS token. &lt;code&gt;min_new_tokens&lt;/code&gt; tells the model: "you cannot stop generating until you've produced at least N tokens." Combined with a structured prompt that asks for detailed analysis, the model fills those extra tokens with actual substance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Source-Aware Retrieval Routing
&lt;/h3&gt;

&lt;p&gt;The original RAG pipeline returned the top-k nearest vectors from FAISS regardless of query intent. If you asked about Article 21 (Constitution), you might get 8 SC Judgment chunks and zero Constitution chunks — because judgment text is more verbose and often embeds better.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix: &lt;code&gt;_enforce_source_diversity()&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Query intent detection&lt;/strong&gt; — regex-based analysis detects if the query targets Constitution articles, Central Acts, or SC Judgments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Over-retrieval&lt;/strong&gt; — FAISS retrieves &lt;code&gt;top_k * 2&lt;/code&gt; candidates (16 instead of 8)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intelligent reranking&lt;/strong&gt; — if the query targets Constitution but results are all judgments, Constitution chunks are boosted in the reranking&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Metadata-Grounded Citation Cards
&lt;/h3&gt;

&lt;p&gt;A persistent bug: citation cards showed "Supreme Court Judgment" instead of the actual case title (e.g., "Pritam Singh v. The State"). The &lt;code&gt;case_title&lt;/code&gt; metadata was sometimes missing from older chunks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: Two-layer fallback:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Backend now sends &lt;code&gt;text_chunk&lt;/code&gt; (first 300 characters of chunk text) in the streaming sources payload&lt;/li&gt;
&lt;li&gt;Frontend extracts the case title from the chunk text using regex: &lt;code&gt;"Supreme Court of India — CASE TITLE"&lt;/code&gt; → &lt;code&gt;"CASE TITLE"&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Comprehensive Modal Logging
&lt;/h3&gt;

&lt;p&gt;Every inference request now logs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User query and parameters&lt;/li&gt;
&lt;li&gt;Response mode and min_tokens configuration&lt;/li&gt;
&lt;li&gt;FAISS retrieval distances and case titles for each chunk&lt;/li&gt;
&lt;li&gt;Source routing decisions&lt;/li&gt;
&lt;li&gt;Prompt token count&lt;/li&gt;
&lt;li&gt;Full model output text&lt;/li&gt;
&lt;li&gt;Per-stage latency (retrieval, generation, total)&lt;/li&gt;
&lt;li&gt;Generated token count&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All visible in the Modal dashboard for live monitoring.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mobile Responsive Design
&lt;/h3&gt;

&lt;p&gt;Citation cards that were visually dominant on mobile screens were redesigned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compact padding and smaller text&lt;/li&gt;
&lt;li&gt;Single-column metadata grid (instead of 2-column)&lt;/li&gt;
&lt;li&gt;Scrollable chunk text with &lt;code&gt;max-height: 200px&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Source badges at reduced size&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Infrastructure Costs
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;GPU&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Data Processing&lt;/td&gt;
&lt;td&gt;CPU only&lt;/td&gt;
&lt;td&gt;~2 hours&lt;/td&gt;
&lt;td&gt;$0 (local)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Synthetic QA Generation&lt;/td&gt;
&lt;td&gt;None (API)&lt;/td&gt;
&lt;td&gt;~6 hours&lt;/td&gt;
&lt;td&gt;$0 (Gemini free tier)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fine-Tuning (2 epochs)&lt;/td&gt;
&lt;td&gt;L4 (24GB)&lt;/td&gt;
&lt;td&gt;~2 hours&lt;/td&gt;
&lt;td&gt;~$3-5 (Modal)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Embedding (32x sharded)&lt;/td&gt;
&lt;td&gt;L4 × 32&lt;/td&gt;
&lt;td&gt;~30 min&lt;/td&gt;
&lt;td&gt;~$3-4 (Modal)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Frontend Hosting&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;$0 (Vercel free tier)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Backend Hosting (idle)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;$0 (Modal scales to zero)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&amp;lt; $10&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Read that again. The entire pipeline — from raw legal text to a fine-tuned 4B parameter model with RAG, streaming, and a production SaaS frontend — cost less than a meal.&lt;/p&gt;




&lt;h2&gt;
  
  
  Technical Specs Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Specification&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Base Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Qwen-3 4B Instruct (2507)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Merged Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Standing &lt;code&gt;bfloat16&lt;/code&gt; standalone weights (~7.5 GB)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Embedding Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;BAAI/bge-m3 (dense vector, FP16 precision)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FAISS Vector Index&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;IndexFlatIP (Cosine Similarity, 1024 dimensions)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total Database Chunks&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;354,293 chunks (1.6 GB corpus)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lookup Engine&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Thread-safe local SQLite database&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Server Framework&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;FastAPI (with SSE token streaming)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Concurrence Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native multi-thread worker with TextIteratorStreamer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API Endpoints&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;/api/ask&lt;/code&gt; (Blocking), &lt;code&gt;/api/ask/stream&lt;/code&gt; (SSE Streaming)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Response Modes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Concise / Detailed / Research (hierarchical prompting)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Frontend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Next.js 16 (App Router) on Vercel&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Authentication&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Access-code gate with server-side cookies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hosting Platform&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Modal (Backend) + Vercel (Frontend)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPU Target&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;NVIDIA L4 (24GB VRAM)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Production Scale&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;min_containers=0&lt;/code&gt; (Scales to zero when idle for $0.00/hr)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;E2E Average Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~5 seconds for full answer / real-time for streaming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total Build Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt; $10&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Data Quality &amp;gt; Data Quantity
&lt;/h3&gt;

&lt;p&gt;4,000 carefully structured instruction pairs, generated from real legal text with strict anti-hallucination prompting, taught the model more than 50,000 sloppy pairs would have. The key was enforcing diversity in both task type (summaries, comparisons, Q&amp;amp;A, yes/no) and length (1 sentence to 4 paragraphs).&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Training Data Distribution Dictates Model Behavior
&lt;/h3&gt;

&lt;p&gt;The model's output length is not controlled by &lt;code&gt;max_new_tokens&lt;/code&gt; — it's dictated by the &lt;strong&gt;distribution of response lengths in the training data&lt;/strong&gt;. With a median training response of 110 words, the model consistently hits EOS at ~150-180 tokens. The fix isn't bigger &lt;code&gt;max_new_tokens&lt;/code&gt; — it's either retraining with longer responses or using &lt;code&gt;min_new_tokens&lt;/code&gt; with structured prompts.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Hierarchical Prompting is High ROI
&lt;/h3&gt;

&lt;p&gt;Instead of a one-size-fits-all prompt, implementing response modes (Concise/Detailed/Research) with mode-specific system prompts and &lt;code&gt;min_new_tokens&lt;/code&gt; floors gives users control over response depth. This was suggested during product critique and turned out to be the single highest-ROI improvement for user experience.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Source Diversity Matters More Than Raw Similarity
&lt;/h3&gt;

&lt;p&gt;FAISS returns the most semantically similar chunks, but similarity ≠ utility. A Constitution query returning 8 judgment chunks (because judgments embed better) is technically correct but practically useless. Source-aware reranking that considers query intent dramatically improves answer quality.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Standalone Merged Models are Faster and Cleaner
&lt;/h3&gt;

&lt;p&gt;Merging the LoRA weights directly into the base parameters completely eliminated inference-time adapter overhead, trimmed memory footprints, and allowed the base model to load at peak native speeds.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Bypass JSON in Production with SQLite
&lt;/h3&gt;

&lt;p&gt;Loading large JSON files (1.6GB+) is a silent killer for cloud instances. SQLite dropped boot overhead from 2 minutes to &lt;strong&gt;0.001 seconds&lt;/strong&gt; while consuming &lt;strong&gt;0 MB of startup RAM&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. GPU Sharding for Rapid Large-Scale Embeddings
&lt;/h3&gt;

&lt;p&gt;Attempting to embed 354,000+ texts sequentially is a nightmare. 32 parallel L4 GPUs via Modal allowed us to embed the entire dataset in &lt;strong&gt;~20 minutes&lt;/strong&gt; for under a few dollars.&lt;/p&gt;

&lt;h3&gt;
  
  
  8. Always Scale to Zero when Idle
&lt;/h3&gt;

&lt;p&gt;For bootstrapped startups, &lt;code&gt;min_containers=0&lt;/code&gt; on serverless providers like Modal allows hosting a fully functional RAG prototype completely &lt;strong&gt;free of charge&lt;/strong&gt; when idle.&lt;/p&gt;

&lt;h3&gt;
  
  
  9. Domain Infrastructure Beats General Intelligence
&lt;/h3&gt;

&lt;p&gt;A general-purpose LLM is broad intelligence. NyayAI is domain infrastructure for Indian law. That's similar to how Bloomberg exists despite Google, or how Westlaw exists despite search engines. The value comes from Indian legal corpus specialization, retrieval grounding, citation accuracy, jurisprudence-focused indexing, and workflow optimization for lawyers.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Stack
&lt;/h2&gt;

&lt;p&gt;PyTorch · Modal · Qwen-3 4B · FAISS · BGE-M3 · SQLite · FastAPI · Next.js 16 · Vercel · Server-Sent Events · LoRA · Cosine Similarity&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Already Built (75/100)
&lt;/h2&gt;

&lt;p&gt;✅ Data acquisition · ✅ Cleaning · ✅ Chunking · ✅ Embeddings · ✅ Retrieval · ✅ Serving · ✅ Deployment · ✅ Fine-tuning · ✅ Streaming · ✅ Grounding · ✅ Systems optimization · ✅ UX · ✅ Source routing · ✅ Hierarchical prompting · ✅ Citation metadata · ✅ Frontend SaaS&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next (25/100)
&lt;/h2&gt;

&lt;p&gt;🔲 Trust · 🔲 Distribution · 🔲 Onboarding · 🔲 User retention · 🔲 Legal partnerships · 🔲 Monetization · 🔲 Sales · 🔲 Adoption loops · 🔲 Reliability · 🔲 Consistency · 🔲 Multilingual access · 🔲 High Court coverage&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with obsession by a solo founder who believes every Indian deserves access to justice — and that the right AI can make that happen.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;© 2026 Ashish Raj. All rights reserved.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>rag</category>
      <category>law</category>
    </item>
  </channel>
</rss>
