Ricardo Saumeth

๐—›๐—ผ๐˜„ ๐—œ ๐—•๐˜‚๐—ถ๐—น๐˜ ๐—ฎ ๐— ๐˜‚๐—น๐˜๐—ถ๐—น๐—ถ๐—ป๐—ด๐˜‚๐—ฎ๐—น ๐—ฅ๐—”๐—š ๐—ฆ๐˜†๐˜€๐˜๐—ฒ๐—บ ๐—ณ๐—ผ๐—ฟ ๐—จ๐—ป๐—ฑ๐—ฒ๐—ฟ $๐Ÿฌ.๐Ÿฌ๐Ÿฌ๐Ÿฌ๐Ÿฑ ๐—ฝ๐—ฒ๐—ฟ ๐—ค๐˜‚๐—ฒ๐˜€๐˜๐—ถ๐—ผ๐—ป

Most RAG stacks rely on paid embeddings, paid vector DBs, and GPT-4 for everything.
It works, but it's expensive and over-engineered for many real-world use cases.

I wanted something different:
a production-grade multilingual RAG system that costs pennies to run.

Here's the architecture.

๐—ฆ๐—ฒ๐—น๐—ณโ€‘๐—ต๐—ผ๐˜€๐˜๐—ฒ๐—ฑ ๐—บ๐˜‚๐—น๐˜๐—ถ๐—น๐—ถ๐—ป๐—ด๐˜‚๐—ฎ๐—น ๐—ฒ๐—บ๐—ฏ๐—ฒ๐—ฑ๐—ฑ๐—ถ๐—ป๐—ด๐˜€ (๐—ฐ๐—ผ๐˜€๐˜: $๐Ÿฌ ๐—ฝ๐—ฒ๐—ฟ ๐—พ๐˜‚๐—ฒ๐—ฟ๐˜† โ€” ๐—ป๐—ผ ๐—ฝ๐—ฒ๐—ฟโ€‘๐˜‚๐˜€๐—ฎ๐—ด๐—ฒ ๐—ฐ๐—ต๐—ฎ๐—ฟ๐—ด๐—ฒ๐˜€)
A sentenceโ€‘transformers model running locally:

  • 50–100ms per embedding
  • Zero marginal cost
  • Smaller vectors → faster search
  • Works offline
  • Caching eliminates repeated work

This removes all embedding-related spend.
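The embed-with-cache step can be sketched as a thin wrapper around any encode function. The cache shape and the model name in the comment are my assumptions; `paraphrase-multilingual-MiniLM-L12-v2` is one common multilingual sentence-transformers choice, not necessarily the one used here.

```python
import hashlib

class CachedEmbedder:
    """Wraps any encode function with a hash-keyed in-memory cache,
    so repeated texts never hit the model twice."""

    def __init__(self, encode_fn):
        self.encode_fn = encode_fn  # e.g. SentenceTransformer(...).encode
        self.cache = {}

    def embed(self, text: str):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self.cache:
            self.cache[key] = self.encode_fn(text)
        return self.cache[key]

# Typical wiring (requires the sentence-transformers package):
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
#   embedder = CachedEmbedder(model.encode)
#   vec = embedder.embed("¿Cómo funciona el sistema?")
```

Injecting the encode function keeps the cache logic independent of any particular model.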

๐—ฝ๐—ด๐˜ƒ๐—ฒ๐—ฐ๐˜๐—ผ๐—ฟ ๐—ถ๐—ป๐˜€๐˜๐—ฒ๐—ฎ๐—ฑ ๐—ผ๐—ณ ๐—ฃ๐—ถ๐—ป๐—ฒ๐—ฐ๐—ผ๐—ป๐—ฒ/๐—ค๐—ฑ๐—ฟ๐—ฎ๐—ป๐˜ (๐—ฐ๐—ผ๐˜€๐˜: $๐Ÿฌ)
Everything lives inside Postgres:

  • vectors
  • metadata
  • filtering
  • HNSW similarity search

For <100K vectors, queries land in 50–200ms.
One database → one deployment → no sync issues.
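With pgvector, the schema and the similarity search are plain SQL. Here is a sketch of what that can look like; the table name, column names, and the 384-dimension size (matching a MiniLM-class model) are my assumptions:

```python
# Illustrative pgvector schema: <=> is pgvector's cosine distance
# operator when the HNSW index is built with vector_cosine_ops.
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
    id BIGSERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    lang TEXT,
    embedding vector(384)  -- must match the local model's dimension
);
CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
    ON chunks USING hnsw (embedding vector_cosine_ops);
"""

def search_sql(k: int = 5) -> str:
    """Top-k cosine similarity search with a metadata filter,
    parameterized for use with a Postgres driver such as psycopg."""
    return (
        "SELECT id, content, embedding <=> %(query)s::vector AS distance "
        "FROM chunks "
        "WHERE lang = %(lang)s "
        f"ORDER BY embedding <=> %(query)s::vector LIMIT {int(k)}"
    )
```

Because vectors and metadata share one table, the language filter and the similarity search run in a single query.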

๐—š๐—ฃ๐—งโ€‘๐Ÿฐ๐—ผโ€‘๐—บ๐—ถ๐—ป๐—ถ ๐—ณ๐—ผ๐—ฟ ๐—ด๐—ฒ๐—ป๐—ฒ๐—ฟ๐—ฎ๐˜๐—ถ๐—ผ๐—ป (~$๐Ÿฌ.๐Ÿฌ๐Ÿฌ๐Ÿฌ๐Ÿฎ๐Ÿณโ€“$๐Ÿฌ.๐Ÿฌ๐Ÿฌ๐Ÿฌ๐Ÿฐ๐Ÿฑ)
Optimized RAG queries typically use:

  • ~600 input tokens
  • ~300 output tokens

At GPT-4o-mini pricing, that's roughly:
~$0.00027 per question.
Even with larger contexts (1000 input + 500 output), cost stays under:
~$0.0005 per question.
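Those figures check out against GPT-4o-mini's published per-token pricing, which at the time of writing was about $0.15 per 1M input tokens and $0.60 per 1M output tokens:

```python
# GPT-4o-mini list pricing, USD per token (as of the time of writing).
INPUT_PRICE = 0.15 / 1_000_000
OUTPUT_PRICE = 0.60 / 1_000_000

def query_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one generated answer."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

typical = query_cost(600, 300)   # ≈ $0.00027
large = query_cost(1000, 500)    # ≈ $0.00045
```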

๐—” ๐—บ๐—ถ๐—ป๐—ถ๐—บ๐—ฎ๐—น, ๐—ฐ๐—ผ๐˜€๐˜โ€‘๐—ฒ๐—ณ๐—ณ๐—ถ๐—ฐ๐—ถ๐—ฒ๐—ป๐˜ ๐—ฑ๐—ฒ๐˜€๐—ถ๐—ด๐—ป

  • FastAPI backend
  • Postgres + pgvector
  • Lightweight cache
  • Static frontend

No GPUs.
No microservices.
No expensive managed services.

๐—ช๐—ต๐˜† ๐—ถ๐˜ ๐˜„๐—ผ๐—ฟ๐—ธ๐˜€

  • One unified system → fewer failure modes
  • Caching turns repeated queries into 100ms lookups
  • Multilingual embeddings remove translation cost + latency
  • GPT-4o-mini is the perfect balance of speed, quality, and cost

๐—ง๐—ต๐—ฒ ๐—ฏ๐—ผ๐˜๐˜๐—ผ๐—บ ๐—น๐—ถ๐—ป๐—ฒ
You donโ€™t need Pinecone, GPTโ€‘4, GPUs, or a complex microservices architecture.

You need:
pgvector + sentence-transformers + GPT-4o-mini + a caching layer.

That's how you get to ~$0.0003 per question.
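Wired together, the whole answer path is a handful of calls: cache lookup, else embed → retrieve → generate. The function and parameter names below are placeholders of mine, not from the post:

```python
def answer(question, *, cache, embed, search, generate):
    """End-to-end RAG answer path. All collaborators are injected so
    each layer (cache, embedder, vector store, LLM) stays swappable."""
    cached = cache.get(question)
    if cached is not None:
        return cached                       # repeat query: ~100ms lookup
    query_vec = embed(question)             # local sentence-transformers
    chunks = search(query_vec, k=5)         # pgvector HNSW top-k
    prompt = "\n\n".join(chunks) + "\n\nQuestion: " + question
    result = generate(prompt)               # GPT-4o-mini call
    cache.put(question, result)
    return result
```

Only the `generate` step costs money, and the cache ensures it runs at most once per distinct question.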

๐—ช๐—ฟ๐—ถ๐˜๐˜๐—ฒ๐—ป ๐—ฏ๐˜† ๐—ฅ๐—ถ๐—ฐ๐—ฎ๐—ฟ๐—ฑ๐—ผ ๐—ฆ๐—ฎ๐˜‚๐—บ๐—ฒ๐˜๐—ต
๐—ฆ๐—ฒ๐—ป๐—ถ๐—ผ๐—ฟ ๐—™๐—ฟ๐—ผ๐—ป๐˜โ€‘๐—˜๐—ป๐—ฑ ๐—˜๐—ป๐—ด๐—ถ๐—ป๐—ฒ๐—ฒ๐—ฟ | ๐—ฅ๐—ฒ๐—ฎ๐—นโ€‘๐—ง๐—ถ๐—บ๐—ฒ ๐—จ๐—œ ๐—ฆ๐—ฝ๐—ฒ๐—ฐ๐—ถ๐—ฎ๐—น๐—ถ๐˜€๐˜