DEV Community: Tilak Raj

Why I Switched From GPT-4 to Small Language Models for Two of My Products

Tilak Raj — Wed, 25 Mar 2026 22:27:16 +0000

GPT-4 and Claude Sonnet are not always the right model for the job. After 18 months of running AI products in production, I've moved two of my products from frontier models to small language models — and the results have been better latency, lower cost, and in one case, higher accuracy on the specific task. Here is exactly what I did and why.

Background: The Two Products That Changed

Product 1: AgriIntel — Crop recommendation classification

AgriIntel uses AI to classify incoming sensor data events and route them to the appropriate recommendation workflow.

The classification task is:

Given a set of sensor readings (soil moisture, temperature, nutrient levels, weather forecast), classify what type of agronomic decision is needed:

Irrigation
Fertilization
Pest management
Harvest timing
No action

This is a classification task with a fixed taxonomy. GPT-4o handled it well, but at $0.005 per classification and 15,000+ classifications per day, that comes to roughly $75 per day or more.

Latency was also 800ms–1.2s for a task where users expect near-instant feedback.

Product 2: CanadaCompliance — Regulation change impact classification

CanadaCompliance.ai monitors regulatory changes and classifies each change by:

Industry sector affected
Type of obligation (new requirement, amendment, repeal)
Urgency level (immediate action, planning horizon, informational)

Again — fixed taxonomy classification with high volume.

Why Small Language Models Made Sense

The key insight:

Frontier models are optimized for general capability. For specific classification tasks, that capability is overkill — and you pay for it in cost and latency.

Small language models (Phi-3, Mistral 7B, Llama 3.2) are:

Much faster (50–200ms vs 800ms–2s)
Much cheaper (10–100× lower cost)
Fine-tuneable to specific tasks
Privately hostable for data residency needs

The Fine-Tuning Process for AgriIntel

Step 1: Build training dataset

I used GPT-4o, the model I was replacing, to label 3,000 examples.

This is a common pattern, often called distillation: use a strong model to generate training data for a smaller one.

Example workflow:

Generate labeled examples
Format JSONL dataset
Prepare training pipeline
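
The "Format JSONL dataset" step can be sketched as follows. This is a minimal illustration, assuming the chat-style fine-tuning format (one JSON object per line with a `messages` array). The `LabeledExample` shape, the label strings, and the system prompt are hypothetical and not taken from AgriIntel.

```typescript
// Labels mirror the five-way taxonomy described earlier in the article.
type Label =
  | "irrigation"
  | "fertilization"
  | "pest_management"
  | "harvest_timing"
  | "no_action";

// Hypothetical shape of one labeled example; field names are illustrative.
interface LabeledExample {
  readings: Record<string, number>;
  label: Label;
}

// Hypothetical system prompt shared by every training example.
const SYSTEM_PROMPT =
  "Classify the sensor readings into exactly one agronomic decision label.";

// Convert labeled examples into chat-format JSONL: one JSON object per line,
// each holding a system, a user, and an assistant message.
function toFineTuneJsonl(examples: LabeledExample[]): string {
  return examples
    .map((ex) =>
      JSON.stringify({
        messages: [
          { role: "system", content: SYSTEM_PROMPT },
          { role: "user", content: JSON.stringify(ex.readings) },
          { role: "assistant", content: ex.label },
        ],
      })
    )
    .join("\n");
}

const sampleJsonl = toFineTuneJsonl([
  { readings: { soil_moisture: 0.12, temperature_c: 31 }, label: "irrigation" },
  { readings: { soil_moisture: 0.41, temperature_c: 22 }, label: "no_action" },
]);
```

Keeping the assistant message to the bare label is what teaches the smaller model the exact output structure.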

Step 2: Fine-tune the model

I fine-tuned GPT-4o-mini using OpenAI’s fine-tuning API.

Why GPT-4o-mini?

It is smaller and cheaper than GPT-4o, can match or beat it on a narrow task once fine-tuned, and keeps the simplicity of the OpenAI API.

Step 3: Benchmark results

Before switching production traffic, I tested both models on a 500-example dataset:

Results:

GPT-4o:

Accuracy: 96.2%
Latency: 1100ms
Cost: $0.0048 per call

Fine-tuned GPT-4o-mini:

Accuracy: 97.1%
Latency: 280ms
Cost: $0.00048 per call

Improvements:

Cost reduction: 90%
Latency reduction: 75%
Accuracy improvement: +0.9 percentage points
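
For readers who want to reproduce this kind of comparison, here is a small sketch of the arithmetic. The `BenchmarkRow` shape and helper names are assumptions for illustration. The only numbers used are the ones reported above.

```typescript
// Hypothetical shape of one benchmark result row.
interface BenchmarkRow {
  expected: string;
  predicted: string;
  latencyMs: number;
}

// Share of rows where the prediction matches the expected label, in percent.
function accuracyPct(rows: BenchmarkRow[]): number {
  if (rows.length === 0) return 0;
  const correct = rows.filter((r) => r.predicted === r.expected).length;
  return (correct / rows.length) * 100;
}

// Relative reduction from `before` to `after`, in percent.
function reductionPct(before: number, after: number): number {
  return ((before - after) / before) * 100;
}

// Figures reported in the benchmark above.
const costReduction = reductionPct(0.0048, 0.00048); // about 90
const latencyReduction = reductionPct(1100, 280); // about 74.5, rounded to 75 in the text
const accuracyDeltaPoints = 97.1 - 96.2; // about 0.9 percentage points
```

The accuracy change is a difference between two percentages, so it is expressed in percentage points, not as a relative percentage.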

Why the Fine-Tuned Model Performed Better

GPT-4o tries to be helpful and nuanced, which on a fixed-taxonomy task sometimes adds unnecessary complexity when the pipeline needs exactly one label.

The fine-tuned model learned:

Exact taxonomy
Expected output structure
Domain edge cases

For structured classification tasks, precision beats general capability.

Fine-tuning teaches the model how to apply its knowledge to your specific domain.

When NOT to Use Small Language Models

This approach does NOT work for:

Open-ended generation (reports, documents)
Complex reasoning tasks
Low-volume workloads
Rapidly changing taxonomies

Use frontier models when flexibility matters more than cost.

Decision Framework

Use fine-tuned SLM when:

Volume > 1,000 calls/day
Fixed taxonomy
Stable task definition
Latency matters
Cost matters
You have training data

Use frontier models when:

Volume is low
Task requires reasoning
Task changes frequently
No training data exists
Output quality variance is risky
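
The framework above can be written down as a small rule function. This is a sketch under stated assumptions: the `Workload` fields and the choice to require every condition at once are illustrative, and only the 1,000 calls/day threshold comes from the list.

```typescript
// Hypothetical description of a workload; field names are illustrative.
interface Workload {
  callsPerDay: number;
  fixedTaxonomy: boolean;
  stableTaskDefinition: boolean;
  hasTrainingData: boolean;
  needsOpenEndedReasoning: boolean;
}

type ModelChoice = "fine_tuned_slm" | "frontier_model";

// Recommend a fine-tuned small model only when every condition holds;
// otherwise fall back to a frontier model.
function chooseModel(w: Workload): ModelChoice {
  const slmFits =
    w.callsPerDay > 1_000 &&
    w.fixedTaxonomy &&
    w.stableTaskDefinition &&
    w.hasTrainingData &&
    !w.needsOpenEndedReasoning;
  return slmFits ? "fine_tuned_slm" : "frontier_model";
}

// A workload shaped like the AgriIntel classifier described earlier.
const agriIntelChoice = chooseModel({
  callsPerDay: 15_000,
  fixedTaxonomy: true,
  stableTaskDefinition: true,
  hasTrainingData: true,
  needsOpenEndedReasoning: false,
});
```

Treating the criteria as a strict conjunction is a conservative choice. A real decision would also weigh how much latency and cost matter for the product.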

Results Summary

AgriIntel improvements:

Cost reduction: 90%
Latency reduction: 75%
Accuracy improvement: +0.9 percentage points

Monthly savings:
$3,100/month (~$37,000/year)
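
As a rough sanity check on the savings figure, the arithmetic can be sketched like this. The 30-day month and the helper names are assumptions. The per-call prices are the benchmark figures, and 15,000 calls/day is the lower bound quoted earlier.

```typescript
// Monthly savings from switching models, given daily volume and per-call costs.
function monthlySavingsUsd(
  callsPerDay: number,
  costBeforeUsd: number,
  costAfterUsd: number,
  daysPerMonth = 30
): number {
  return callsPerDay * (costBeforeUsd - costAfterUsd) * daysPerMonth;
}

// Daily volume that would produce a given monthly saving at these prices.
function impliedCallsPerDay(
  monthlySavings: number,
  costBeforeUsd: number,
  costAfterUsd: number,
  daysPerMonth = 30
): number {
  return monthlySavings / ((costBeforeUsd - costAfterUsd) * daysPerMonth);
}

// At the 15,000 calls/day floor: about $1,944 per month.
const savingsAtFloor = monthlySavingsUsd(15_000, 0.0048, 0.00048);

// Volume matching the $3,100/month figure: about 23,900 calls per day.
const impliedVolume = impliedCallsPerDay(3_100, 0.0048, 0.00048);
```

At these per-call prices, the $3,100/month figure corresponds to a daily volume well above the 15,000 floor, which the "15,000+" wording allows for.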

About the Author

Tilak Raj is CEO & Founder of Brainfy AI, building vertical AI SaaS products across agriculture, insurance, aviation compliance, and real estate.

Website:
https://tilakraj.info

Projects:
https://tilakraj.info/projects

Building AI Agents That Actually Work in Production: My Technical Approach

Tilak Raj — Wed, 25 Mar 2026 22:24:59 +0000

Building an AI agent that works in a demo is easy. Building one that works reliably in production is a completely different engineering challenge.

Production systems must handle real users, real data, and real consequences when things fail.

This is the production agent architecture I use across Brainfy AI and Navlyt, along with real code patterns and failure modes I design around.

What Makes Production Agents Different From Demo Agents

Demo agents optimize for the happy path.

Production agents must handle:

Real data variance: production inputs are messy, ambiguous, and full of edge cases.
Concurrent executions: multiple agent instances running simultaneously with shared state.
Long-running tasks: agents that may take minutes or hours, requiring durable execution state.
Cost management: confused agents making unnecessary tool calls can become expensive quickly.
Observability: you must understand exactly what the agent decided and why.

The Core Architecture: Durable Agent State

The most important production decision:

Keep agent state in a database — not in memory.

In-memory state:

Dies with the server
Cannot scale horizontally
Cannot be audited

Database state:

Survives restarts
Enables horizontal scaling
Provides observability
Enables debugging

Example schema:

-- Agent execution state table
CREATE TABLE agent_executions (
  id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
  user_id UUID REFERENCES auth.users NOT NULL,
  agent_type TEXT NOT NULL,
  status TEXT NOT NULL DEFAULT 'pending',
  CONSTRAINT valid_status CHECK (
    status IN (
      'pending',
      'running',
      'completed',
      'failed',
      'cancelled',
      'awaiting_review'
    )
  ),
  input_data JSONB NOT NULL,
  state JSONB DEFAULT '{}',
  result JSONB,
  error TEXT,
  step_count INTEGER DEFAULT 0,
  token_count INTEGER DEFAULT 0,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ DEFAULT NOW(),
  completed_at TIMESTAMPTZ
);

-- Tool call log for observability
CREATE TABLE agent_tool_calls (
  id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
  execution_id UUID REFERENCES agent_executions NOT NULL,
  step_number INTEGER NOT NULL,
  tool_name TEXT NOT NULL,
  tool_input JSONB NOT NULL,
  tool_output JSONB,
  status TEXT NOT NULL DEFAULT 'pending',
  latency_ms INTEGER,
  error TEXT,
  called_at TIMESTAMPTZ DEFAULT NOW()
);

The Agent Loop With Production Safeguards

Production agents need hard limits.

Example safeguards:

Step limits
Token limits
Timeout limits
Failure conditions

Example TypeScript loop:

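// lib/agents/production-agent.ts
import type { SupabaseClient } from '@supabase/supabase-js'

// loadExecution, updateStatus, failWithReason, persistState, callModel,
// and TOOLS are assumed to be defined elsewhere in the codebase.

const AGENT_LIMITS = {
  maxSteps: 25,
  maxTokens: 50_000,
  stepTimeoutMs: 30_000, // per-step timeout; not enforced in this excerpt
  totalTimeoutMs: 300_000
}

export async function runAgent(
  executionId: string,
  supabase: SupabaseClient
): Promise<void> {
  const startTime = Date.now()
  const execution = await loadExecution(executionId, supabase)

  // Assumption: the conversation history lives in the persisted state column.
  const messages = execution.state?.messages ?? []

  await updateStatus(executionId, 'running', supabase)

  while (true) {
    const elapsed = Date.now() - startTime

    // Check every hard limit before spending more tokens.
    if (execution.step_count >= AGENT_LIMITS.maxSteps) {
      await failWithReason(executionId, 'MAX_STEPS_EXCEEDED', supabase)
      return
    }

    if (execution.token_count >= AGENT_LIMITS.maxTokens) {
      await failWithReason(executionId, 'MAX_TOKENS_EXCEEDED', supabase)
      return
    }

    if (elapsed >= AGENT_LIMITS.totalTimeoutMs) {
      await failWithReason(executionId, 'TOTAL_TIMEOUT', supabase)
      return
    }

    const response = await callModel(messages, TOOLS)

    execution.step_count++
    execution.token_count += response.usage?.total_tokens ?? 0

    // Persist after every step so a restart can resume from the database.
    await persistState(executionId, execution, supabase)

    // Tool execution and message appends are omitted from this excerpt.
    // Assumption: callModel returns toolCalls, and an empty list means
    // the agent has finished.
    if (!response.toolCalls?.length) {
      await updateStatus(executionId, 'completed', supabase)
      return
    }
  }
}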
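
The three limit checks in the loop above can also be pulled into a pure function, which makes them easy to unit test without a database or a model call. This is a minimal sketch. `checkLimits` and its types are hypothetical names, and the limit values mirror `AGENT_LIMITS`.

```typescript
type LimitFailure =
  | "MAX_STEPS_EXCEEDED"
  | "MAX_TOKENS_EXCEEDED"
  | "TOTAL_TIMEOUT";

interface Limits {
  maxSteps: number;
  maxTokens: number;
  totalTimeoutMs: number;
}

// Column names follow the agent_executions table.
interface Counters {
  step_count: number;
  token_count: number;
}

// Return the first limit that has been hit, or null if the agent may continue.
function checkLimits(
  counters: Counters,
  elapsedMs: number,
  limits: Limits
): LimitFailure | null {
  if (counters.step_count >= limits.maxSteps) return "MAX_STEPS_EXCEEDED";
  if (counters.token_count >= limits.maxTokens) return "MAX_TOKENS_EXCEEDED";
  if (elapsedMs >= limits.totalTimeoutMs) return "TOTAL_TIMEOUT";
  return null;
}

const exampleLimits: Limits = {
  maxSteps: 25,
  maxTokens: 50_000,
  totalTimeoutMs: 300_000,
};
```

The loop would then call `checkLimits` once per iteration and pass any non-null result to `failWithReason`.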

The Human-in-the-Loop Gate

For actions that are difficult to reverse, I require human approval.

The agent:

Prepares the action
Sets status to awaiting_review
Stops execution
Waits for approval

Example:

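const APPROVAL_REQUIRED_TOOLS = [
  'send_email',
  'update_customer_record',
  'generate_compliance_document',
  'submit_to_regulator'
]

// Thrown to unwind the agent loop while the execution waits for a reviewer.
class AgentPausedError extends Error {}

// updateStatus and callTool are assumed to be defined elsewhere.
// Assumption: toolCall carries the tool name and its parsed arguments.
async function executeToolCall(
  toolCall: { name: string; args: Record<string, unknown> },
  executionId: string,
  supabase: SupabaseClient
) {
  const { name, args } = toolCall

  if (APPROVAL_REQUIRED_TOOLS.includes(name)) {
    // Persist the paused status first so the pending action survives restarts.
    await updateStatus(executionId, 'awaiting_review', supabase)

    throw new AgentPausedError('Human approval required')
  }

  return await callTool(name, args)
}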
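
One way to keep the pause and resume flow honest is an explicit table of allowed status transitions. The sketch below is an assumption, not the author's implementation. The status names come from the `valid_status` constraint, while the transition table and the `statusAfterReview` mapping are hypothetical.

```typescript
type ExecutionStatus =
  | "pending"
  | "running"
  | "completed"
  | "failed"
  | "cancelled"
  | "awaiting_review";

// Hypothetical transition table: terminal states have no outgoing transitions.
const ALLOWED_TRANSITIONS: Record<ExecutionStatus, ExecutionStatus[]> = {
  pending: ["running", "cancelled"],
  running: ["completed", "failed", "cancelled", "awaiting_review"],
  awaiting_review: ["running", "cancelled"],
  completed: [],
  failed: [],
  cancelled: [],
};

function canTransition(from: ExecutionStatus, to: ExecutionStatus): boolean {
  return ALLOWED_TRANSITIONS[from].indexOf(to) !== -1;
}

type ReviewDecision = "approved" | "rejected";

// Approval resumes the run; rejection cancels it (an illustrative policy).
function statusAfterReview(decision: ReviewDecision): ExecutionStatus {
  return decision === "approved" ? "running" : "cancelled";
}
```

A guard like `canTransition` in front of every status update stops a paused execution from being marked completed without passing through review.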

Monitoring: What I Track in Production

Metrics I monitor:

Step efficiency
Tool success rate
Human review escalation rate
Token cost per completion
Completion rate

Example health query:

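// agent_health_metrics is a Postgres function called through Supabase RPC.
const sevenDaysAgo = new Date(
  Date.now() - 7 * 24 * 60 * 60 * 1000
).toISOString()

const { data, error } = await supabase.rpc('agent_health_metrics', {
  agent_type: 'compliance_document_generator',
  since: sevenDaysAgo
})

// Surface query failures instead of silently reading undefined data.
if (error) throw error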

Typical results:

Completion rate: 94%
Avg steps: 8.3
Human review rate: 3.1%
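
The same three numbers can be derived from raw execution rows. This is a sketch under assumptions: the `ExecutionSummary` shape and the `wentToReview` flag are hypothetical, and the real `agent_health_metrics` function may define these rates differently.

```typescript
// Hypothetical per-execution summary; not the actual table shape.
interface ExecutionSummary {
  status: "completed" | "failed" | "cancelled";
  stepCount: number;
  wentToReview: boolean;
}

interface HealthMetrics {
  completionRate: number; // fraction of executions that completed
  avgSteps: number; // mean steps per execution
  humanReviewRate: number; // fraction that paused for human review
}

function summarizeHealth(rows: ExecutionSummary[]): HealthMetrics {
  if (rows.length === 0) {
    return { completionRate: 0, avgSteps: 0, humanReviewRate: 0 };
  }
  const completed = rows.filter((r) => r.status === "completed").length;
  const reviewed = rows.filter((r) => r.wentToReview).length;
  const totalSteps = rows.reduce((sum, r) => sum + r.stepCount, 0);
  return {
    completionRate: completed / rows.length,
    avgSteps: totalSteps / rows.length,
    humanReviewRate: reviewed / rows.length,
  };
}
```

A rising average step count with a flat completion rate is usually the first sign that an agent is getting confused and burning tokens.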

Key Lessons

Production agents require:

Durable state
Hard execution limits
Observability
Cost controls
Human approval gates

Most failures come from missing safeguards, not model quality.


Compound AI Systems: How I Connect Multiple Models in a Single Production Product

Tilak Raj — Wed, 25 Mar 2026 22:22:06 +0000

Why Single-Model AI Is Not Enough

Single-model AI calls are increasingly insufficient for production AI products.

The most capable AI systems today combine multiple models, retrievers, validators, and tools working together.

This is the compound AI architecture I've settled on after building across multiple production products, along with real patterns from systems that have shipped.

What Is a Compound AI System?

A compound AI system routes different parts of a task to the most appropriate component instead of sending everything to a single model.

These components typically include:

Multiple language models (different models for different subtasks)
Retrieval systems (vector databases, search, structured queries)
Code executors (data analysis, calculations, transformations)
External tool calls (APIs, databases, file systems)
Validation and checking components

The orchestration layer decides:

Which components handle each part of the task
How context flows between components
How outputs are combined into a final response

The Architecture I Use: Orchestrator + Specialist Pattern

Across my products, I've found the orchestrator + specialist pattern to be the most reliable compound architecture.

Orchestrator

A planning model that:

Receives the full task
Breaks it into subtasks
Decides which specialist handles each subtask

Typical models I use:

GPT-4o
Claude Sonnet

Specialists

Purpose-built components for specific subtasks.

These may include:

AI models
Deterministic backend code
Retrieval systems
Processing pipelines

Validator

A lightweight checking component that:

Validates outputs
Catches likely hallucinations
Ensures format correctness
Confirms requirements before returning results

Example TypeScript Architecture

Here is a simplified version of how I structure compound AI orchestration.

// types/compound-ai.ts

interface Task {
  id: string;
  input: string;
  context: Record<string, unknown>;
  requiredOutputType: string;
}

interface SubTask {
  id: string;
  parentTaskId: string;
  description: string;
  specialistType: SpecialistType;
  input: string;
  dependsOn: string[];
}

type SpecialistType =
  | "rag_retrieval"
  | "document_extraction"
  | "compliance_check"
  | "draft_generation"
  | "validation"
  | "code_execution"
  | "structured_extraction";

interface SpecialistResult {
  subTaskId: string;
  result: string;
  confidenceScore?: number;
}
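
The `dependsOn` field implies an execution order. Here is a minimal sketch of how an orchestrator might group subtasks into waves that can run in parallel. `planWaves` and `PlannedSubTask` are hypothetical names, and this is one possible scheduling approach, not necessarily the one used in production.

```typescript
// Only the fields needed for scheduling; mirrors SubTask.id and SubTask.dependsOn.
interface PlannedSubTask {
  id: string;
  dependsOn: string[];
}

// Group subtasks into waves: every subtask in a wave has all of its
// dependencies satisfied by earlier waves, so a wave can run in parallel.
function planWaves(subTasks: PlannedSubTask[]): string[][] {
  const done = new Set<string>();
  let remaining = [...subTasks];
  const waves: string[][] = [];

  while (remaining.length > 0) {
    const ready = remaining.filter((t) =>
      t.dependsOn.every((dep) => done.has(dep))
    );
    if (ready.length === 0) {
      // Nothing can run: the plan has a cycle or references an unknown id.
      throw new Error("Cyclic or missing dependency among subtasks");
    }
    waves.push(ready.map((t) => t.id));
    for (const t of ready) done.add(t.id);
    remaining = remaining.filter((t) => !done.has(t.id));
  }
  return waves;
}

const exampleWaves = planWaves([
  { id: "retrieve", dependsOn: [] },
  { id: "extract", dependsOn: [] },
  { id: "draft", dependsOn: ["retrieve", "extract"] },
  { id: "validate", dependsOn: ["draft"] },
]);
```

Failing fast on a cyclic plan matters here because the plan itself comes from a model and cannot be assumed to be well-formed.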

Why This Architecture Works

This pattern works because it mirrors how real engineering systems scale:

Instead of forcing one model to do everything, you:

Break problems into smaller parts
Assign the right tool to each task
Validate before merging results
Keep orchestration logic separate

In my experience, this improves:

Reliability
Cost efficiency
Latency
Output quality

Key Lesson

The biggest improvement in AI systems doesn't come from better prompts.

It comes from better architecture.

The teams that win with AI products are not the ones using the newest models.

They are the ones building repeatable compound systems that combine models, tools, and validation layers effectively.


Next.js + Supabase + OpenAI. The exact stack I use to ship AI SaaS in 30 days

Tilak Raj — Wed, 25 Mar 2026 22:14:50 +0000

I have shipped 8 production AI SaaS products using this stack. This is not a beginner tutorial. This is the production architecture I use after learning what fails at scale.

Why this stack

Next.js App Router
Server components remove data fetching complexity. API routes stay close to features. TypeScript everywhere. Simple Vercel deployment.

Supabase
PostgreSQL. Auth. Storage. Realtime. Row Level Security. One platform instead of five services.

OpenAI
Reliable API. Structured outputs. Function calling. I also use Claude and other models but OpenAI patterns remain my baseline.

The main reason is familiarity. I know the failure points and scaling limits. That removes decision overhead.

Project structure

my-ai-saas/

app/
 (auth)/
   login/page.tsx
   signup/page.tsx

 (dashboard)/
   layout.tsx
   page.tsx

 [feature]/
   page.tsx
   _components/

 api/
   ai/
     generate/route.ts
     stream/route.ts

   webhooks/
     stripe/route.ts

lib/
 supabase/
   client.ts
   server.ts

 ai/
   client.ts
   prompts/

types/
 database.types.ts
 api.types.ts

supabase/
 migrations/
 seed.sql

Supabase patterns that matter

Row level security from day one

Every table with user data gets RLS before first production data.

ALTER TABLE documents ENABLE ROW LEVEL SECURITY;

CREATE POLICY users_select_own_documents
ON documents
FOR SELECT
USING (auth.uid() = user_id);

This closes off a whole class of cross-user data leaks before they can happen.

Server client pattern

Different Supabase clients for server and browser.

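// lib/supabase/server.ts
import { createServerClient } from '@supabase/ssr'
import { cookies } from 'next/headers'

export async function createClient() {
  const cookieStore = await cookies()

  return createServerClient(
    process.env.NEXT_PUBLIC_SUPABASE_URL!,
    process.env.NEXT_PUBLIC_SUPABASE_ANON_KEY!,
    {
      cookies: {
        getAll() {
          return cookieStore.getAll()
        },
        // Needed so refreshed session cookies can be written back.
        setAll(cookiesToSet) {
          try {
            cookiesToSet.forEach(({ name, value, options }) =>
              cookieStore.set(name, value, options)
            )
          } catch {
            // Called from a Server Component, where cookies are read-only.
            // Safe to ignore when middleware refreshes the session.
          }
        }
      }
    }
  )
}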

OpenAI patterns I always use

Structured outputs

Never parse text manually. Always validate with schema.

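import { z } from 'zod'

// Schema the model's structured output must satisfy before it is used.
const ClaimDataSchema = z.object({
  claimant_name: z.string(),
  policy_number: z.string(),
  date_of_loss: z.string(),
  description: z.string()
})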

This removed parsing bugs across products.

Streaming responses

If generation takes more than 2 seconds I stream.

Users prefer progressive output instead of waiting.

Authentication flow

Supabase middleware protects dashboard routes.

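// middleware.ts (excerpt). `user` comes from supabase.auth.getUser() on a
// server client created earlier in the middleware.
// Note: route groups such as (dashboard) do not appear in the URL, so this
// check assumes the protected pages are served under /dashboard.
if (!user && request.nextUrl.pathname.startsWith('/dashboard')) {
  return NextResponse.redirect(new URL('/login', request.url))
}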

Simple and repeatable across products.

Production checklist before launch

Before every launch I verify:

RLS enabled on every table.
Environment variables secured.
React error boundaries added.
AI output validation enabled.
Rate limiting on AI routes.
Logging for prompts and tokens.
Database indexes on foreign keys.
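
For the "rate limiting on AI routes" item, a minimal sliding-window limiter looks like the sketch below. It is an illustration under assumptions: the class name and API are hypothetical, and the state is held in memory, so it only works within a single process. Serverless or multi-instance deployments would need a shared store instead.

```typescript
// In-memory sliding-window rate limiter keyed by user id or IP address.
class SlidingWindowLimiter {
  private readonly hits = new Map<string, number[]>();
  private readonly maxRequests: number;
  private readonly windowMs: number;

  constructor(maxRequests: number, windowMs: number) {
    this.maxRequests = maxRequests;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed and records it.
  // `now` is injectable so the limiter can be tested deterministically.
  allow(key: string, now: number = Date.now()): boolean {
    const cutoff = now - this.windowMs;
    // Keep only the timestamps that still fall inside the window.
    const recent = (this.hits.get(key) ?? []).filter((t) => t > cutoff);
    if (recent.length >= this.maxRequests) {
      this.hits.set(key, recent);
      return false;
    }
    recent.push(now);
    this.hits.set(key, recent);
    return true;
  }
}

// Example: at most 2 AI calls per user per second.
const aiRouteLimiter = new SlidingWindowLimiter(2, 1_000);
```

In a route handler, a rejected call would typically return HTTP 429 before any tokens are spent.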

Key lesson

Speed comes from repeatable architecture. Not from chasing new tools.

Using the same stack across products lets me ship faster and with fewer mistakes.


How I Built an AI Compliance System for Charter Aviation Using RAG and Pinecone

Tilak Raj — Wed, 25 Mar 2026 22:06:30 +0000

Building RAG for compliance-critical domains is not the same as building RAG for a general-purpose chatbot. When the outputs affect an operator's certification status, wrong answers have real consequences.

This is the full architecture walkthrough of Navlyt — the compliance operating system I built for charter aviation operators.

Navlyt tracks FAA, Transport Canada, and EASA regulatory requirements for small charter aviation operators. It answers compliance questions, monitors obligation status, and generates required documentation.

The challenge: regulatory documents are dense, cross-referenced, and version-controlled in ways that break standard RAG approaches.

This article covers the complete technical implementation — chunking strategy, retrieval architecture, answer generation with citations, and the accuracy validation approach I use for a compliance-critical domain.

The Problem With Standard RAG for Regulatory Content

Regulatory documents have properties that make standard paragraph-level chunking produce poor retrieval results:

Cross-references
A requirement in one section may reference definitions in another. A chunk containing only the requirement produces incomplete context.

Applicability conditions
Whether a regulation applies depends on conditions defined elsewhere. Standard chunking separates requirements from applicability criteria.

Version control
Regulatory documents are amended over time. Retrieval must be version-aware.

Term definitions
Regulatory language uses precise definitions. For example, "air taxi" has a specific legal meaning.

The Chunking Strategy

After significant experimentation, I settled on a four-tier chunking approach:

Tier 1 — Section level chunks

Complete sections defining terms or applicability remain intact (200-800 tokens).

Tier 2 — Paragraph level chunks

Individual requirements are chunked at paragraph level with metadata:

Section number
Regulation name
Version
Applicability category

Tier 3 — Manual summary chunks

Some requirements span multiple sections.

For the most queried requirements I created manual summary chunks combining relevant provisions.

Expensive — but critical for accuracy.

Tier 4 — Cross reference chunks

For chunks with cross-references I create composite chunks including referenced content.

This removes the most common failure:
retrieving a rule without its definition.
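
The Tier 4 idea can be sketched in a few lines. This is an illustration under assumptions: the `RegChunk` shape, the `[Referenced ...]` marker format, and the placeholder ids and text are all hypothetical and do not quote any real regulation.

```typescript
// Hypothetical chunk shape: a section id, its text, and the ids it references.
interface RegChunk {
  id: string;
  text: string;
  references: string[];
}

// Build a composite chunk: the original text followed by the text of every
// referenced chunk that exists in the corpus. Unknown ids are skipped.
function buildCompositeChunk(
  chunk: RegChunk,
  byId: Map<string, RegChunk>
): string {
  const referenced = chunk.references
    .map((id) => byId.get(id))
    .filter((c): c is RegChunk => c !== undefined);

  if (referenced.length === 0) return chunk.text;

  const appendix = referenced
    .map((c) => `[Referenced ${c.id}] ${c.text}`)
    .join("\n");
  return `${chunk.text}\n\n${appendix}`;
}

// Placeholder corpus; ids and text are made up for illustration.
const definitionChunk: RegChunk = {
  id: "sec-001",
  text: "Placeholder definition of a term.",
  references: [],
};
const requirementChunk: RegChunk = {
  id: "sec-120",
  text: "Placeholder requirement that uses the defined term.",
  references: ["sec-001", "sec-999"],
};
const corpus = new Map<string, RegChunk>([
  [definitionChunk.id, definitionChunk],
  [requirementChunk.id, requirementChunk],
]);
const compositeText = buildCompositeChunk(requirementChunk, corpus);
```

Composite chunks trade index size for recall: the definition is embedded alongside the rule, so a single retrieval hit carries both.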

Pinecone Index Architecture

I use a single Pinecone index with namespace separation by regulation type.

Example:
A Transport Canada operator asking about pilot currency does not need FAR Part 135 results.

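import { Pinecone } from '@pinecone-database/pinecone'

// embedQuery, resolveApplicableNamespaces, mergeAndRerankResults, and the
// OperatorContext type are assumed to be defined elsewhere.
// Assumption: the API key is provided via the PINECONE_API_KEY env variable.
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! })

// One namespace per regulation type, all inside a single index.
const NAMESPACES = {
  transport_canada: 'tc_cars',
  faa_part_135: 'faa_135',
  faa_part_91: 'faa_91',
  easa_cs23: 'easa_cs23',
  operator_specific: 'ops_spec'
}

async function retrieveCompliance(
  query: string,
  operatorContext: OperatorContext
) {
  // Only search the namespaces that apply to this operator.
  const targetNamespaces = resolveApplicableNamespaces(operatorContext)

  const queryEmbedding = await embedQuery(query)

  // Query each applicable namespace in parallel.
  const results = await Promise.all(
    targetNamespaces.map(ns =>
      pinecone
        .index('navlyt-regulations')
        .namespace(ns)
        .query({
          vector: queryEmbedding,
          topK: 5,
          includeMetadata: true,
          // Version-aware: current provisions for this operator's categories only.
          filter: {
            is_current: { '$eq': true },
            applicability_categories: {
              '$in': operatorContext.certificateCategories
            }
          }
        })
    )
  )

  return mergeAndRerankResults(results, query)
}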
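
The merge step after the per-namespace queries can be sketched as a simple deduplicate-and-sort. This is a stand-in, not the article's `mergeAndRerankResults`, which also receives the query and may apply a real reranker. The `Match` shape and `mergeByScore` name are hypothetical.

```typescript
// Hypothetical, simplified match shape.
interface Match {
  id: string;
  score: number;
  namespace: string;
}

// Merge per-namespace result lists, keep the best score for each id,
// and return the top `limit` matches ordered by descending score.
function mergeByScore(perNamespace: Match[][], limit = 5): Match[] {
  const best = new Map<string, Match>();
  for (const matches of perNamespace) {
    for (const m of matches) {
      const existing = best.get(m.id);
      if (existing === undefined || m.score > existing.score) {
        best.set(m.id, m);
      }
    }
  }
  return Array.from(best.values())
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}

const mergedMatches = mergeByScore(
  [
    [
      { id: "a", score: 0.82, namespace: "tc_cars" },
      { id: "b", score: 0.64, namespace: "tc_cars" },
    ],
    [
      { id: "c", score: 0.91, namespace: "ops_spec" },
      { id: "a", score: 0.7, namespace: "ops_spec" },
    ],
  ],
  2
);
```

Similarity scores from different namespaces are only roughly comparable, which is one reason a dedicated reranking pass over the merged candidates can help.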

The Answer Generation Pipeline

Compliance RAG differs from normal RAG.

Every answer must:

Cite regulatory provisions
State applicability conditions
Flag ambiguity
Never speculate
Admit uncertainty

const COMPLIANCE_SYSTEM_PROMPT = `
You are a regulatory compliance assistant.

RULES:

1. Cite the regulation sections that support each statement.
2. If the regulations do not clearly address the question, say so.
3. State the applicability conditions.
4. Never speculate.
5. Flag ambiguity.
6. Remind the user to verify requirements with the relevant regulator (for example, Transport Canada).
`

Accuracy Validation

Standard RAG metrics are insufficient.

I built a regulatory validation framework:

Human expert validation

Worked with a Transport Canada aviation consultant.

Built a 200 question validation set.

Current accuracy: 94.2%

Confidence scoring

Based on:

Retrieval similarity
Direct relevance
Regulatory ambiguity

Human review triggers

Automatic review when:

Confidence < 0.75
Regulations unclear
Recently amended sections

interface ComplianceAnswer{

 answer:string

 citations:RegulatoryCitation[]

 confidence:number

 requiresHumanReview:boolean

 applicabilityNote?:string

 ambiguityWarning?:string

 lastRegUpdateCheck:string
}
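
The review triggers listed above map naturally onto a small function that fills `requiresHumanReview`. This is a sketch under assumptions: the input shape, the reason strings, and the function name are hypothetical, and only the 0.75 threshold comes from the article.

```typescript
// Threshold from the review triggers above.
const REVIEW_CONFIDENCE_THRESHOLD = 0.75;

// Hypothetical inputs to the review decision.
interface ReviewInputs {
  confidence: number; // 0..1
  regulationsUnclear: boolean;
  citesRecentlyAmendedSection: boolean;
}

// Collect every trigger that fired; review is required if any did.
function reviewReasons(inputs: ReviewInputs): string[] {
  const reasons: string[] = [];
  if (inputs.confidence < REVIEW_CONFIDENCE_THRESHOLD) {
    reasons.push("low_confidence");
  }
  if (inputs.regulationsUnclear) {
    reasons.push("regulations_unclear");
  }
  if (inputs.citesRecentlyAmendedSection) {
    reasons.push("recently_amended");
  }
  return reasons;
}

function requiresHumanReview(inputs: ReviewInputs): boolean {
  return reviewReasons(inputs).length > 0;
}
```

Returning the reasons, not just a boolean, gives the reviewer context on why an answer was escalated.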

Key Lessons For Building Compliance RAG

Lesson 1 — Domain experts are mandatory

I could not have built this without aviation compliance experts.

Budget for this.

Lesson 2 — Chunk quality matters most

Biggest gains came from improving chunk quality.

Not embedding models.

Lesson 3 — "I don't know" is correct sometimes

Wrong confident answers are dangerous.

Build strong non-answer logic.

Lesson 4 — Regulations require maintenance

Regulations change constantly.

Corpus updates must be part of the system.

Results

Accuracy: 94.2%
Latency: 1.8s
Human review rate: 6.3%

Navlyt is live at navlyt.com

More architecture writing:
tilakraj.info/blog

About the Author

Tilak Raj is CEO & Founder of Brainfy AI.

Building vertical AI SaaS across:

Agriculture
Insurance
Aviation compliance
Real estate

Shipped 8 AI products.

Writing about AI engineering and SaaS architecture.

Dev.to: dev.to/tilakraj
Website: tilakraj.info