Most RAG stacks rely on paid embeddings, paid vector DBs, and GPT-4 for everything.
It works, but it's expensive and over-engineered for many real-world use cases.
I wanted something different:
a production-grade multilingual RAG system that costs pennies to run.
Here's the architecture.
Self-hosted multilingual embeddings (cost: $0 per query, no per-usage charges)
A sentence-transformers model running locally:
- 50–100ms per embedding
- Zero marginal cost
- Smaller vectors → faster search
- Works offline
- Caching eliminates repeated work
This removes all embedding-related spend.
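The caching layer can be sketched as a thin wrapper around any embedding function. This is a minimal in-memory version; the class name is mine, and in production the embed function would be a sentence-transformers model (shown in the trailing comment) and the cache might live in Redis or Postgres instead of a dict.

```python
import hashlib
from typing import Callable, Dict, List

class CachedEmbedder:
    """Wraps any embedding function with an in-memory cache so repeated
    texts are embedded only once (illustrative sketch)."""

    def __init__(self, embed_fn: Callable[[str], List[float]]):
        self.embed_fn = embed_fn
        self.cache: Dict[str, List[float]] = {}
        self.hits = 0

    def _key(self, text: str) -> str:
        # Normalize before hashing so trivial variants share a cache entry.
        return hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()

    def embed(self, text: str) -> List[float]:
        key = self._key(text)
        if key in self.cache:
            self.hits += 1          # repeated work becomes a dict lookup
        else:
            self.cache[key] = self.embed_fn(text)
        return self.cache[key]

# With sentence-transformers the embed function would be, e.g.:
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
#   embedder = CachedEmbedder(lambda t: model.encode(t).tolist())
```

The model name in the comment is one common multilingual choice, not necessarily the one used here.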
pgvector instead of Pinecone/Qdrant (cost: $0)
Everything lives inside Postgres:
- vectors
- metadata
- filtering
- HNSW similarity search
For <100K vectors, queries land in 50–200ms.
One database → one deployment → no sync issues.
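The "everything in Postgres" setup boils down to a small amount of SQL. A sketch, assuming a 384-dimension local model and psycopg-style `%s` placeholders; the table and column names are illustrative, while the `vector` type, HNSW index, and `<=>` cosine-distance operator are pgvector's documented syntax.

```python
# Schema sketch: vectors, metadata, and the HNSW index all in one database.
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS chunks (
    id        bigserial PRIMARY KEY,
    content   text NOT NULL,
    metadata  jsonb NOT NULL DEFAULT '{}',
    embedding vector(384)   -- must match the local embedding model's dimension
);

-- HNSW index for fast approximate cosine-similarity search.
CREATE INDEX IF NOT EXISTS chunks_embedding_idx
    ON chunks USING hnsw (embedding vector_cosine_ops);
"""

# <=> is pgvector's cosine-distance operator; lower distance = more similar.
# Metadata filtering is plain jsonb, in the same query, same database.
SEARCH_SQL = """
SELECT content, metadata, embedding <=> %s::vector AS distance
FROM chunks
WHERE metadata @> %s
ORDER BY embedding <=> %s::vector
LIMIT 5;
"""

def to_pgvector(vec):
    """Format a Python list as a pgvector literal, e.g. '[0.1,0.2,0.3]'."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"
```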
GPT-4o-mini for generation (~$0.00027–$0.00045)
Optimized RAG queries typically use:
- ~600 input tokens
- ~300 output tokens
At GPT-4o-mini pricing, that's roughly:
~$0.00027 per question.
Even with larger contexts (1000 input + 500 output), cost stays under:
~$0.0005 per question.
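The arithmetic behind those numbers, assuming GPT-4o-mini's commonly quoted rates of $0.15 per 1M input tokens and $0.60 per 1M output tokens (check current pricing before relying on this):

```python
# Assumed GPT-4o-mini rates; verify against the current pricing page.
INPUT_PER_TOKEN = 0.15 / 1_000_000
OUTPUT_PER_TOKEN = 0.60 / 1_000_000

def query_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in dollars for one RAG question."""
    return input_tokens * INPUT_PER_TOKEN + output_tokens * OUTPUT_PER_TOKEN

typical = query_cost(600, 300)    # optimized query  -> ~$0.00027
large = query_cost(1000, 500)     # larger context   -> ~$0.00045
```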
A minimal, cost-efficient design
- FastAPI backend
- Postgres + pgvector
- Lightweight cache
- Static frontend
No GPUs.
No microservices.
No expensive managed services.
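How the pieces compose can be sketched as one function with the three dependencies injected; every name here is illustrative, and the FastAPI wiring is shown only in comments since the post doesn't specify its routes.

```python
from typing import Callable, Dict, List, Optional

def answer(
    question: str,
    embed: Callable[[str], List[float]],         # local sentence-transformers
    search: Callable[[List[float]], List[str]],  # pgvector top-k lookup
    generate: Callable[[str], str],              # GPT-4o-mini call
    cache: Optional[Dict[str, str]] = None,      # lightweight answer cache
) -> str:
    """End-to-end RAG flow: cache -> embed -> retrieve -> generate."""
    if cache is not None and question in cache:
        return cache[question]       # repeated queries become fast lookups
    chunks = search(embed(question))
    prompt = (
        "Answer using only this context:\n"
        + "\n".join(chunks)
        + f"\n\nQ: {question}"
    )
    result = generate(prompt)
    if cache is not None:
        cache[question] = result
    return result

# FastAPI wiring (illustrative):
#   app = FastAPI()
#   @app.post("/ask")
#   def ask(q: Question):
#       return {"answer": answer(q.text, embed, search, generate, cache)}
```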
Why it works
- One unified system → fewer failure modes
- Caching turns repeated queries into 100ms lookups
- Multilingual embeddings remove translation cost + latency
- GPT-4o-mini is the perfect balance of speed, quality, and cost
The bottom line
You don't need Pinecone, GPT-4, GPUs, or a complex microservices architecture.
You need:
pgvector + sentence-transformers + GPT-4o-mini + a caching layer.
Thatโs how you get to ~$0.0003 per question.
Written by Ricardo Saumeth
Senior Front-End Engineer | Real-Time UX Specialist
