DEV Community: Priyanshu

Why Groq Feels Like Cheating

Priyanshu — Tue, 23 Jun 2026 06:44:52 +0000

I've been building a multi-agent LangGraph pipeline recently, and like most people stitching together free-tier LLM providers, I ended up comparing Groq against the usual suspects. The difference wasn't subtle. Other providers felt like a normal API call — you send a request, you wait, text streams back at a pace you've gotten used to. Groq felt like cheating. A 70B-parameter model returning a full response faster than I could finish reading the prompt I'd just typed.

My first assumption was the boring one: Groq must just have better GPUs, or more of them, or some clever load balancing trick. That assumption is wrong, and the actual answer is a lot more interesting than "more compute."

Groq doesn't run GPUs at all. They built their own chip from scratch, called the LPU — Language Processing Unit — and the entire thing exists because of a bet that the way GPUs are designed is fundamentally mismatched to what LLM inference actually needs.

The wrong intuition: "just better hardware"

It's worth sitting with why "better GPUs" feels like the obvious answer. Every AI performance story for the last five years has been a GPU story — bigger clusters, more H100s, better interconnects. Nvidia's dominance is so total that "AI chip" and "GPU" have basically become synonymous in casual conversation.

But GPUs were never actually designed for this job. They were designed for graphics rendering, then repurposed for deep learning because their parallel architecture happened to suit matrix multiplication well — which is most of what neural networks are made of. That repurposing worked brilliantly for training, where you're crunching enormous batches of data and raw parallel throughput is what matters.

Inference, especially single-request, real-time token generation, is a different problem. And it turns out GPUs carry a lot of training-oriented baggage that actively works against them here.

The actual bottleneck: the memory wall

The real constraint in LLM inference isn't how many calculations a chip can do — it's how fast it can feed those calculations with data. This is sometimes called the "memory wall," and it's the whole reason Groq's architecture exists.

On a standard GPU, model weights live in HBM — High Bandwidth Memory — which sits physically separate from the compute cores, on its own memory stack. HBM is fast compared to a hard drive, but it's still a clear hop away from where the actual math happens. Every time the chip needs a weight, it has to reach across that gap. During training, with massive batch sizes, that overhead gets amortized away. During single-request inference, where you're generating one token at a time, the compute units spend a huge fraction of their time just waiting on data to arrive.

Groq's answer was almost stubbornly simple: stop putting memory off-chip. The LPU uses SRAM — the same type of memory normally reserved for tiny CPU caches — as its primary weight storage, not a cache layer sitting in front of slower memory. Hundreds of megabytes of it, sitting directly on the compute die.

The bandwidth difference this produces is enormous. Groq's on-chip SRAM delivers memory bandwidth upwards of 80 terabytes per second, compared to roughly 8 terabytes per second for GPU off-chip HBM — about a 10x gap before you even account for anything else. Independent teardowns put the access-speed advantage of SRAM over HBM at roughly 20x once you factor in latency, not just raw bandwidth.
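A back-of-the-envelope sketch makes the bandwidth argument concrete. It assumes single-stream decoding has to read every weight once per generated token, so the speed ceiling is roughly bandwidth divided by model size in bytes. The function name and the 2-bytes-per-parameter (FP16) figure are assumptions for illustration, and the results are rough ceilings, not benchmarks.

```python
def decode_ceiling_tokens_per_sec(n_params, bytes_per_param, bandwidth_bytes_per_sec):
    """Upper bound on single-stream decode speed when weight reads are the bottleneck."""
    model_bytes = n_params * bytes_per_param
    return bandwidth_bytes_per_sec / model_bytes


TB = 1e12  # bytes in a terabyte (decimal)

# 70B parameters at an assumed 2 bytes each, with the two bandwidth figures quoted above.
hbm = decode_ceiling_tokens_per_sec(70e9, 2, 8 * TB)
sram = decode_ceiling_tokens_per_sec(70e9, 2, 80 * TB)

print(f"HBM-bound ceiling:  {hbm:.0f} tokens/s")
print(f"SRAM-bound ceiling: {sram:.0f} tokens/s")
```

The absolute numbers depend heavily on quantization and on how the model is split across chips, but the ratio between the two ceilings is just the bandwidth ratio.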

The second piece: throwing out unpredictability

Memory bandwidth alone doesn't fully explain Groq's numbers. The other half of the story is determinism, and it's the part that took me longer to actually appreciate.

GPUs are general-purpose chips, which means they rely on dynamic, runtime scheduling — hardware queues, cache hit/miss decisions, arbitration between competing tasks — all figured out on the fly, while the chip is running. That flexibility is a feature for the huge range of workloads a GPU has to support. But it has a cost: the timing of any given operation isn't fully knowable in advance, and when hundreds of cores need to synchronize, any delay in one place propagates through the whole system.

Groq's compiler takes the opposite approach. Because the LPU's architecture has no caches, no dynamic memory allocation, and no runtime scheduling decisions to make, the entire execution graph — every operation, every instruction, every chip-to-chip handoff — can be calculated ahead of time, down to individual clock cycles. The chip isn't deciding what to do next while it runs; it's executing a schedule that was already fully solved before a single token was generated.
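Here's a toy sketch of what "solving the schedule ahead of time" means. It is not Groq's compiler: it simply assumes every operation has a fixed, known cycle count and that there are enough functional units to run independent operations in parallel. The operation names and cycle counts are made up for illustration.

```python
from graphlib import TopologicalSorter


def static_schedule(durations, deps):
    """Compute (start_cycle, end_cycle) for every op before anything runs.

    durations: {op: cycles}, deps: {op: set of ops it must wait for}.
    """
    schedule = {}
    # static_order() yields each op only after all of its dependencies.
    for op in TopologicalSorter(deps).static_order():
        start = max((schedule[d][1] for d in deps.get(op, ())), default=0)
        schedule[op] = (start, start + durations[op])
    return schedule


# Hypothetical ops and cycle counts for one small layer.
durations = {"load_w": 4, "matmul": 10, "add_bias": 2, "activation": 3}
deps = {"matmul": {"load_w"}, "add_bias": {"matmul"}, "activation": {"add_bias"}}

plan = static_schedule(durations, deps)
for op, (start, end) in plan.items():
    print(f"{op:<10} cycles {start:>2} to {end:>2}")
```

Because nothing in the plan depends on runtime events like cache misses, the end-to-end latency is known exactly before execution starts.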

This sounds like it should just be "less flexible," and it is — but for a workload as repetitive and predictable as transformer inference, that traded-away flexibility buys you the removal of almost all the overhead that makes GPU inference timing fuzzy. Groq's own engineering material describes this as a software-first design philosophy: rather than adapting a chip built for something else, they built hardware specifically to make the compiler's job of scheduling deterministic and complete.

What this actually buys you, in real numbers

The architecture story is nice, but the numbers are what make it land. Independent benchmarking has shown Llama 2 70B running at around 300 tokens per second on Groq, versus 30–40 tokens per second on an Nvidia H100 — roughly a 10x difference. For a smaller model like Llama 3 8B, Groq has demonstrated 1,300+ tokens per second against roughly 100 tokens per second on an H100.

There's also a counterintuitive energy story here. You'd expect a chip doing the same work faster to use more power in the moment, and it does — but it spends so much less time doing the work that the total energy cost per token actually comes out lower. Groq reports something in the range of 1–3 joules per token, compared to 10–30 joules per token on H100-based systems. Less data movement (the expensive part, energetically) plus less total time in a high-power active state nets out to a real efficiency win, not just a speed one.

The catch — because there's always a catch

None of this is free, and the tradeoffs are exactly where the SRAM decision bites back.

SRAM is fast, but it's also physically large and expensive per bit compared to HBM or DRAM — that's precisely why every other chip in the industry only uses it for tiny cache layers instead of primary storage. Groq's bet means each individual LPU can only hold a relatively small amount of memory on-chip. A 70B-parameter model doesn't fit on one chip; it doesn't really fit on a handful of chips either. Serving something that size at scale reportedly requires hundreds of LPUs working in close coordination, wired together with a custom high-speed interconnect Groq calls a plesiosynchronous protocol, designed specifically so chip-to-chip communication stays just as deterministic as everything happening inside a single chip.
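A quick sketch shows where "hundreds of LPUs" comes from. The 230 MB of SRAM per chip is an assumption taken from Groq's public first-generation specs (the text above only says "hundreds of megabytes"), and the estimate counts weights only, ignoring KV cache and activations.

```python
import math


def chips_needed(n_params, bytes_per_param, sram_bytes_per_chip):
    """Minimum number of chips required just to hold the model weights on-chip."""
    return math.ceil(n_params * bytes_per_param / sram_bytes_per_chip)


SRAM_PER_CHIP = 230e6  # assumed ~230 MB per chip; not stated in this post

fp16_chips = chips_needed(70e9, 2, SRAM_PER_CHIP)  # 2 bytes per parameter
int8_chips = chips_needed(70e9, 1, SRAM_PER_CHIP)  # 1 byte per parameter

print(f"70B at FP16: at least {fp16_chips} chips")
print(f"70B at INT8: at least {int8_chips} chips")
```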

That's a lot of physical silicon for one model, and it shows up directly in the economics: estimates put Groq's hardware cost meaningfully higher than a GPU setup of equivalent throughput. Groq's bet is that for latency-critical workloads, this is the right trade to make. When an application genuinely needs sub-300ms response times, the comparison isn't "Groq versus a cheaper GPU option," it's "Groq versus the feature not working at all."

It's also why Groq isn't trying to train the next frontier model. The LPU's entire design is optimized around the fixed, predictable computation pattern of running an already-trained model. Training is a much messier, less predictable workload — exactly the kind of job determinism doesn't help with. Groq has made peace with this division of labor: run other labs' open-weight models (Llama, Mixtral, DeepSeek, Qwen) as fast and cheaply as possible, rather than compete on building the model itself.

Where this is heading

The most interesting recent development here is that this is no longer framed as "Groq vs. Nvidia." Nvidia's upcoming Vera Rubin platform actually pairs Groq's LPUs alongside Rubin GPUs in a single heterogeneous system — GPUs handle the prefill stage and attention computation during decode, while LPUs take over the latency-sensitive feed-forward and mixture-of-experts execution. Rather than one architecture replacing the other, the industry seems to be converging on using each chip for what it's actually good at: GPUs for the parallel, throughput-heavy parts of inference, LPUs for the sequential, latency-sensitive parts.

That's a pretty satisfying resolution to the "wait, how is this so fast" question I started with. It's not that Groq found a trick GPU makers missed — it's that they asked a genuinely different question. Nvidia optimized for "how do we do as much computation as possible." Groq optimized for "how do we make sure compute is never waiting on anything." For training, the first question matters more. For the kind of real-time, conversational, agentic systems a lot of us are building right now, the second one might matter more than we've been giving it credit for.

I write about what I'm learning while building AI systems — currently a multi-agent financial due-diligence pipeline on LangGraph, and a RAG-based healthcare triage tool. More on both soon.

Building AshaPulse — An AI-Powered Health Assistant for India's Frontline Warriors

Priyanshu — Sun, 31 May 2026 07:23:37 +0000

NiDaan: Building an Offline AI Diagnostic Assistant for Rural Health Workers in India

Building AI that works without internet in places where it matters most

Introduction

In rural India, a child with a fever isn't just a medical concern — it's a race against time. ASHA workers (Accredited Social Health Activists) are often the first and sometimes only line of healthcare for 1000+ patients each. They carry a limited medicine kit, have basic training, and no access to instant medical consultation.

I'm Priyanshu, a final-year computer science student from West Bengal. In May 2025, I started building NiDaan — an AI diagnostic assistant designed specifically for these health workers. No internet required. No expensive infrastructure. Just a laptop and a phone.

This is the story of why I built it, what I learned, and how you can adapt this approach for underserved communities anywhere.

The Problem: Healthcare in Absence

Why This Matters

According to India's health ministry data:

70% of Indians live in rural areas
1 ASHA worker serves 1000+ people
Average PHC (Primary Health Centre) is 10-15km away
Most areas have unreliable internet connectivity

ASHA workers are trained, dedicated, but isolated from medical expertise. When a mother brings a child with symptoms, the ASHA worker must decide: home treatment or PHC referral?

Get it wrong and:

Delay in serious cases = life-threatening complications
Over-referral = wasted resources, patient burden, loss of trust
Lack of structured guidance = inconsistent treatment

The Traditional Solution Doesn't Work

Existing diagnostic apps:

Require constant internet (unavailable in rural areas)
Built for urban/English-speaking users
Heavy UI, poor offline support
No integration with local drug availability
Don't follow MOHFW (Ministry of Health & Family Welfare) guidelines

I needed something different.

The Solution: NiDaan

What is NiDaan?

NiDaan (Hindi for "diagnosis") is an offline-capable AI diagnostic assistant that:

Accepts symptoms in Hindi/Hinglish, e.g. "bacche ko bukhaar hai, khaana nahi kha raha" ("the child has a fever and isn't eating")
Retrieves relevant medical knowledge from official MOHFW guidelines
Classifies severity into low/medium/high with structured reasoning
Recommends PHC referral or home care with specific medicines from ASHA drug kit
Provides advice in simple Hindi for patient/family communication

Key principle: The system synthesizes, it doesn't invent. All recommendations come from retrieved medical guidelines, not hallucinated knowledge.

The Name & Tagline

NiDaan won an internal naming competition over "ChatGPT for ASHA workers."

Tagline: "Sahi waqt par, sahi salah" — Right advice, at the right time.

Architecture: Local Network, Zero Internet

┌─────────────────────────────────────────────────────────┐
│  TIER 1: Patient's Phone (Streamlit Web Browser)        │
│  - Hindi symptom input                                  │
│  - Connects via local WiFi (no internet)                │
└──────────────────────┬──────────────────────────────────┘
                       │ (WiFi hotspot)
                       │
┌──────────────────────▼──────────────────────────────────┐
│  TIER 2: PHC Laptop (Backend Services)                  │
│  - FastAPI server (port 8000)                           │
│  - LangChain RAG pipeline                               │
│  - ChromaDB vector store                                │
│  - Offline, no internet needed                          │
└──────────────────────┬──────────────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────────────┐
│  TIER 3: Same Laptop (LLM Runtime)                      │
│  - Ollama + DeepSeek R1:7b (for offline demo)           │
│  - OR Groq/NVIDIA NIM API (for development)            │
└─────────────────────────────────────────────────────────┘

Why this architecture?

On-device LLMs on Android were ruled out by RAM (phones have 2-4GB, while the available laptop has 16GB)
Web-based frontend works on any phone/tablet
Central backend handles heavy lifting
Zero internet in production (uses Ollama), flexible for testing (Groq/NIM)

Tech Stack

Frontend:        Streamlit (pure Python)
Backend:         FastAPI + uvicorn
AI/RAG:          LangChain
Vector DB:       ChromaDB (persistent, local)
Embeddings:      sentence-transformers/all-MiniLM-L6-v2 (80MB, offline)
LLM Options:     
  - Groq (testing): llama-3.1-8b-instant (12 sec/response)
  - NVIDIA NIM (quality): Mistral Large 3 (45-70 sec/response)
  - Ollama (offline): DeepSeek R1:7b (2-5 min/response, shows reasoning)
PHC Storage:     SQLite (structured lookup, haversine distance)
Data Format:     Pydantic models for strict output validation

Key decision: Swappable LLM infrastructure. Changing 1 line switches between Groq → NIM → Ollama.
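A minimal sketch of how a one-line switch like this could be wired up. The class, the registry, and the function name are hypothetical, not the project's actual code; the NIM model id is left as a placeholder because the post doesn't give it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LLMConfig:
    provider: str
    model: str
    needs_internet: bool


# Hypothetical registry; model names for Groq and Ollama come from the stack list above.
LLM_CONFIGS = {
    "groq": LLMConfig("groq", "llama-3.1-8b-instant", True),
    "nim": LLMConfig("nvidia-nim", "<nim-model-id>", True),  # placeholder id
    "deepseek": LLMConfig("ollama", "deepseek-r1:7b", False),
}

MODE = "groq"  # the single line you change


def get_llm_config(mode=MODE):
    """Look up the provider settings for a mode, failing loudly on typos."""
    try:
        return LLM_CONFIGS[mode]
    except KeyError:
        raise ValueError(
            f"Unknown MODE {mode!r}; expected one of {sorted(LLM_CONFIGS)}"
        ) from None


print(get_llm_config())
```

Keeping the provider details in one table means the rest of the chain only ever asks for "the current LLM" and never branches on provider names.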

Data Collection & Knowledge Base

Medical Documents Ingested

Document	Pages	Clinical Focus
ASHA Module 6 & 7	165	Symptom recognition, danger signs
F-IMNCI Chart Booklet	39	Pediatric severity classification
Standard Treatment Guidelines	431	Medication protocols, dosages
NLEM 2022	135	Essential medicines list
NVBDCP Guidelines	3	Malaria/vector-borne diseases
Total	773 pages	~1825 chunks

How We Built the Knowledge Base

Downloaded PDFs from official MOHFW website (Ministry of Health & Family Welfare)
Parsed with PyMuPDF: extracted text, maintained metadata
Chunked with fixed-size windows: 1000 chars per chunk, 200 char overlap
Embedded with all-MiniLM-L6-v2: 80MB, English-optimized but workable for Hinglish queries
Stored in ChromaDB: persistent vector database on disk
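The chunking step (1000 characters per chunk, 200 characters of overlap) can be sketched in plain Python. This is an approximation for illustration: the real pipeline may well use a LangChain text splitter, which also tries to break on paragraph and sentence boundaries. The function name is hypothetical.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into fixed-size character windows that overlap by `overlap` chars."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once this window has reached the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


sample = "".join(str(i % 10) for i in range(2500))
pieces = chunk_text(sample)
print([len(p) for p in pieces])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.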

PHC Directory System

Built a district-level PHC database with 19 verified Primary Health Centers across 5 West Bengal districts:

{
  "id": "WB-PWB-001",
  "name": "Andal PHC",
  "block": "Andal",
  "services": ["OPD", "Maternal & Child Health", "Malaria Testing"],
  "latitude": 23.5937,
  "longitude": 87.1824,
  "open_24hr": false,
  "doctor_timing": "9AM-4PM Mon-Sat"
}

Proximity-based referral will use the haversine distance formula. It is not implemented in V1, but the stored coordinates make the architecture ready for Phase 2.
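For reference, a minimal sketch of the haversine calculation over records shaped like the one above. The helper names, the Earth radius constant, and the second PHC entry are assumptions for illustration, not project code.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_phc(lat, lon, phcs):
    """Return the PHC record closest to the given location."""
    return min(phcs, key=lambda p: haversine_km(lat, lon, p["latitude"], p["longitude"]))


phcs = [
    {"name": "Andal PHC", "latitude": 23.5937, "longitude": 87.1824},
    {"name": "Example PHC (hypothetical)", "latitude": 24.0, "longitude": 88.0},
]
print(nearest_phc(23.60, 87.18, phcs)["name"])
```

Straight-line distance ignores roads and rivers, so a real referral ranking would probably treat it as a first-pass filter.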

Challenges Faced

Challenge 1: Response Latency

Problem: NVIDIA NIM responses took 45-70 seconds.

Why it mattered: In a medical consultation, a health worker expects near-instant feedback. Long waits erode trust.

Solutions tried:

Switched to Groq (llama-3.1-8b-instant) → 12 seconds ✅
Reduced retrieval from k=5 to k=2 chunks
Limited max_tokens from 4096 to 2048

Lesson: Speed ≠ quality. Groq's smaller model is fast but sometimes less clinically precise. NIM is better but slow. For production with health workers, I'd recommend Groq + aggressive prompt optimization.

Challenge 2: Memory Constraints on Railway

Problem: Deployed on Railway (free tier: 512MB RAM). App crashed with "out of memory."

Root cause:

sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 (500MB alone)
ChromaDB (~50MB)
FastAPI + LangChain (~150MB)
Total: ~700MB > 512MB limit

Solutions:

Switched embedding model to all-MiniLM-L6-v2 (80MB) ✅
Rebuilt ChromaDB with lightweight embeddings
Committed ChromaDB to GitHub (ephemeral filesystem issue)
Reduced k=5 → k=3 retrievals

Trade-off: Lost Hinglish-specific embedding quality but gained Railway compatibility.

Lesson: In constrained environments, simpler models often outperform fancy ones. English embeddings work fine for medical terminology (universal across languages).
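The memory arithmetic is simple enough to check in a few lines. The figures are the rough estimates listed above, not measured process memory, and the names are hypothetical.

```python
LIMIT_MB = 512  # Railway free-tier RAM
SHARED_MB = {"ChromaDB": 50, "FastAPI + LangChain": 150}


def total_mb(embedding_model_mb):
    """Estimated footprint: embedding model plus everything else in the process."""
    return embedding_model_mb + sum(SHARED_MB.values())


before = total_mb(500)  # paraphrase-multilingual-MiniLM-L12-v2
after = total_mb(80)    # all-MiniLM-L6-v2

for label, mb in (("before", before), ("after", after)):
    verdict = "fits" if mb <= LIMIT_MB else "over limit"
    print(f"{label}: ~{mb}MB ({verdict})")
```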

Challenge 3: Image Assets Broken in Deployment

Problem: React logos working locally (/src/assets/Nidaan.png) broke on deployment.

Why: Vite dev server serves /src/ directly. Production doesn't.

Solution: Moved assets to public/ folder, changed path to /Nidaan.png.

Lesson: Always test deployment paths locally. Static file serving is environment-specific.

Challenge 4: RAG Retrieval Quality

Problem: Querying "postpartum bleeding" returned irrelevant chunks (contributor lists, title pages).

Why: PDF front matter wasn't filtered; chunking strategy naïve.

Solutions implemented:

Increased chunk size to capture more context
Added metadata filtering (skip pages 1-3 of each PDF)
Improved prompt to weight clinical terms higher

Still pending: Better chunking strategy, page-level filtering during ingest.
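Filtering front matter at ingest time could look roughly like this. The chunk dictionary shape and the function name are hypothetical, and the sketch assumes page numbers are 1-based (PyMuPDF's own page indices are 0-based, so they would need shifting).

```python
def drop_front_matter(chunks, skip_pages=3):
    """Keep only chunks that come from pages after the first `skip_pages` pages."""
    return [c for c in chunks if c["page"] > skip_pages]


chunks = [
    {"source": "guideline.pdf", "page": 1, "text": "Title page"},
    {"source": "guideline.pdf", "page": 3, "text": "List of contributors"},
    {"source": "guideline.pdf", "page": 4, "text": "Danger signs after delivery"},
    {"source": "guideline.pdf", "page": 10, "text": "When to refer"},
]

kept = drop_front_matter(chunks)
print([c["page"] for c in kept])
```

A fixed page cutoff is crude, since front matter length varies by document. A per-document cutoff stored alongside each PDF would be a natural next step.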

Lesson: RAG quality depends 70% on retrieval, 30% on LLM. Garbage in = garbage out, no matter how good the LLM.

Challenge 5: Prompt Instability Across LLMs

Problem: Same prompt behaved differently on Groq vs NIM vs Ollama.

Groq over-generalized criticality (fever = MEDIUM too often)
NIM took too long
Ollama (R1:7b) was excellent but 2-5 min per response

Solution: Built LLM-agnostic prompt with:

Explicit decision trees (HIGH → MEDIUM → LOW, stop at first match)
Medicine lookup tables (model scans and picks, no inference)
Concrete examples for every severity level
Danger sign normalization (Hindi terms → clinical terms)

Result: 95%+ consistency across all three LLMs.

Lesson: For safety-critical domains (medical), explicit structured prompts beat few-shot examples alone. Give the model rules, not vibes.
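In NiDaan this logic lives in the prompt, but the same "check HIGH first, stop at the first match" structure is easy to see as plain Python. Everything below is an illustrative placeholder, not clinical guidance: the rule sets, the phrase map, and the function names are all assumptions.

```python
# Hindi/Hinglish phrase -> normalized clinical term (placeholder entries).
HINGLISH_TO_CLINICAL = {
    "bukhaar": "fever",
    "khaana nahi kha raha": "not eating",
}

# Ordered from most to least severe; the first matching rule wins.
SEVERITY_RULES = [
    ("HIGH", {"convulsions", "unconscious"}),
    ("MEDIUM", {"fever", "not eating"}),
]


def normalize(text):
    """Map known phrases in free text to normalized clinical terms."""
    text = text.lower()
    return {term for phrase, term in HINGLISH_TO_CLINICAL.items() if phrase in text}


def classify(signs):
    """Return the first severity level whose triggers overlap the observed signs."""
    for level, triggers in SEVERITY_RULES:
        if signs & triggers:
            return level
    return "LOW"


signs = normalize("bacche ko bukhaar hai, khaana nahi kha raha")
print(sorted(signs), classify(signs))
```

Ordering the rules from most to least severe means a case with both a danger sign and a mild symptom can never be downgraded by the milder match.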

Challenge 6: Hinglish Support Without Compromising Speed

Problem: Multilingual embeddings were heavy (500MB). English-only were fast but lost Hinglish nuance.

Solution: all-MiniLM-L6-v2 (80MB). It is English-optimized but still works for Hinglish because:

Medical PDFs are English
User input is Hinglish/Hindi
LLM (Groq) understands Hinglish natively
Embeddings just need to match terms to docs, not understand nuance

Trade-off: Retrieval quality dropped ~5-10% but acceptable for medical context (symptoms are universal).

Lesson: Don't over-engineer embedding models. For domain-specific RAG, a smaller model + good prompt beats a heavyweight multilingual one.

Solutions & Lessons Learned

What Worked

LLM abstraction layer — One MODE variable switches between 3 different LLMs without changing chain logic
Pydantic schemas — Enforced strict output structure; prevented hallucinations
Decision tree prompting — Explicit IF/THEN rules beat complex reasoning for medical safety
Offline-first architecture — Demo works without internet; deployment flexibility
RAG over fine-tuning — Faster iteration, no retraining needed
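To show what strict output validation buys you, here's a standard-library stand-in for the Pydantic schema idea. The field names, the drug kit contents, and the function name are hypothetical. The point is that a response with an unexpected severity or a medicine outside the kit gets rejected instead of shown to a health worker.

```python
import json
from dataclasses import dataclass

ALLOWED_SEVERITY = {"low", "medium", "high"}


@dataclass(frozen=True)
class Diagnosis:
    severity: str
    refer_to_phc: bool
    medicines: tuple
    advice_hindi: str

    def __post_init__(self):
        # Reject anything outside the expected structure.
        if self.severity not in ALLOWED_SEVERITY:
            raise ValueError(f"invalid severity: {self.severity!r}")
        if not isinstance(self.refer_to_phc, bool):
            raise ValueError("refer_to_phc must be a boolean")
        if not self.advice_hindi.strip():
            raise ValueError("advice_hindi must not be empty")


def parse_diagnosis(raw_json, drug_kit):
    """Parse an LLM response and reject medicines that are not in the kit."""
    data = json.loads(raw_json)
    diagnosis = Diagnosis(
        severity=data["severity"],
        refer_to_phc=data["refer_to_phc"],
        medicines=tuple(data["medicines"]),
        advice_hindi=data["advice_hindi"],
    )
    unknown = set(diagnosis.medicines) - set(drug_kit)
    if unknown:
        raise ValueError(f"medicines not in drug kit: {sorted(unknown)}")
    return diagnosis


DRUG_KIT = {"ORS", "Paracetamol"}  # illustrative subset only
raw = '{"severity": "medium", "refer_to_phc": false, "medicines": ["ORS"], "advice_hindi": "Paani pilate rahein"}'
print(parse_diagnosis(raw, DRUG_KIT))
```

Validation like this catches malformed or out-of-schema output. It can't tell whether a well-formed recommendation is clinically right, which is why retrieval quality still matters.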

What Didn't

Over-engineered embedding models — Multilingual models added complexity without proportional benefit
Cloud-first assumptions — Didn't account for ephemeral filesystems on Railway
Generic RAG retrieval — No filtering for PDF front matter led to irrelevant chunks
Prompt optimism — Expected one prompt to work identically across all LLMs

Metrics & Results

Performance

Metric	Value
Response time (Groq)	10-12 seconds
Response time (NIM)	30-45 seconds
Response time (Ollama)	2-5 minutes
Knowledge base	1825 chunks, 773 pages
PHC coverage	19 facilities, 5 districts
Diagnostic accuracy	~88% (self-assessed; not yet validated with real users)
Deployment	Railway (free tier) + GitHub

Diagnostic Output Quality

Tested on 50+ symptom descriptions:

HIGH severity: 94% correctly identified danger signs
MEDIUM severity: 87% accurate, sometimes over-conservative
LOW severity: 92% accurate, rarely misclassified as higher

How to Reproduce This Project

1. Clone & Setup

git clone https://github.com/PriyanshuPaul79/NiDaan
cd NiDaan  # git clone creates a directory named after the repo
python -m venv asha
source asha/bin/activate  # on Windows: asha\Scripts\activate
pip install -r requirements.txt

2. Download Knowledge Base

# PDFs already in Docs/ folder
# Build ChromaDB:
python backend/ingest.py

3. Set Environment Variables

# .env file in project root
GROQ_API_KEY=your_groq_key          # console.groq.com
NVIDIA_NIM_API_KEY=your_nim_key     # build.nvidia.com

4. Run Backend

uvicorn backend.main:app --reload --port 8000
# Test: curl http://localhost:8000/health

5. Run Frontend

cd frontend
streamlit run app.py
# Opens on http://localhost:8501

6. Switch LLM

Edit backend/chain.py:

MODE = "groq"  # or "nim" or "deepseek"

Deployment

Railway (Production)

git push  # Railway auto-deploys
# URL: https://nidaan-api.onrender.com

Local (Offline Demo with Ollama)

# Terminal 1: Start Ollama
ollama serve

# Terminal 2: Pull model (one-time)
ollama pull deepseek-r1:7b

# Terminal 3: Run NiDaan
MODE=deepseek python backend/main.py

What's Next: Phase 2 Roadmap

Planned Features

District input from user — location-aware PHC recommendations
PHC service matching — refer only to centers with relevant services
Distance-based ranking — haversine + service matching score
Tiered referral logic — PHC → CHC → District Hospital based on criticality
Offline Streamlit UI — works completely without internet
Mobile-optimized design — tested on 2G networks

Long-term Vision

Scale to 5+ states (more PHC data, localization)
Integration with HMIS (Health Management Information System)
Real-time case tracking for health workers
Telemetry for public health dashboards
Open-source model weights (if fine-tuning becomes necessary)

Lessons for Other Builders

If You're Building AI for Underserved Communities

Offline-first thinking — Design assuming no internet. Internet becomes a bonus.
Regulatory alignment — Build with official guidelines, not against them. I used MOHFW docs, not personal judgment.
Simple > Smart — Decision trees beat transformer magic when lives are at stake.
Local infrastructure — Work with what exists (PHC laptops, ASHA phones). Don't demand new hardware.
Test with users — My 95% accuracy was self-reported. Real ASHA workers will find edge cases.
Document everything — Medical AI needs audit trails. Every recommendation is traceable to a guideline.

Technical Decisions That Scaled

Pydantic for validation — Caught hallucinations early
ChromaDB for RAG — Persistent, no external dependencies
FastAPI for backend — Small, fast, easy to deploy
Streamlit for frontend — Built in 2 hours, works on any browser
LLM abstraction — Tested 3 models without rewriting core logic

Challenges I'd Approach Differently

Start with smaller scope — I built the full system. Phase 1 could have been just diagnosis, Phase 2 add PHC matching.
User research first — Built with assumptions. Should have interviewed ASHA workers before coding.
Data quality obsession — Spent time on irrelevant chunks instead of filtering during ingest.
Prompt engineering rigorously — Needed A/B testing framework, not trial-and-error.

Open Questions I'm Still Solving

Can deployment work on 2G networks? (Streamlit is heavy, need investigation)
What's the optimal embedding model for medical Hinglish? (trade-off: size vs accuracy)
How do we get PHC coordinates for remaining 15 locations? (Grok research pending)
Should this be fine-tuned on medical domain? (costly, vs better prompting)

Repository & Demo

GitHub: github.com/PriyanshuPaul79/NiDaan


Tech Stack Summary:

Python 3.12, FastAPI, LangChain, ChromaDB
Groq API (development), NVIDIA NIM (quality testing), Ollama (offline)
Streamlit frontend, SQLite PHC directory
Deployed on Railway (production) + local development

Call to Action

If you're building healthcare tech, AI for emerging markets, or medical decision support systems:

Drop a comment — What would you build differently?
Star the repo — Help other builders find this approach
Test it — Use NiDaan with Groq API (free tier). Report bugs.
Adapt it — This architecture works for any medical RAG system (mental health, nutrition, maternity care, etc.)

The biggest insight: You don't need state-of-the-art models to solve real problems. You need:

Good data (medical guidelines, not blog posts)
Clear logic (decision trees, not neural mysticism)
Offline capability (work without internet)
User feedback (real ASHA workers, not assumptions)

Acknowledgments

MOHFW for publishing free, high-quality medical guidelines
Anthropic for Claude, Groq for the API, NVIDIA for NIM access
My college for supporting independent projects
ASHA workers across India for inspiring this work (though I haven't tested with real users yet)

Built with patience, curiosity, and way too much chai ☕

If NiDaan helps even one child get the right diagnosis at the right time, the 3 months of debugging was worth it.