Last month, I showed you how my V2 beat $200 enterprise RAG systems with hybrid search and reranking. The response was incredible, but one comment stuck with me:
"This is great for text, but what about images? My team has thousands of screenshots and diagrams we can't search."
So I rebuilt it again. This time, I gave my RAG system vision.
Who Should Read This?
- Freelancers/agencies managing client screenshots, bug reports, design files
- Teams drowning in Slack screenshots that are unsearchable
- Anyone tired of "what was that dashboard screenshot called again?"
- Developers wanting multimodal RAG without $100/month bills
Why This Matters: The Cost Reality
| Provider | 10K Images/Month | What You Get |
|---|---|---|
| OpenAI Vision API | $100/month | Vision only (no search, no storage) |
| Google Vertex AI | $15/month | Vision only (no embeddings) |
| AWS Rekognition | $12/month | Labels only (no semantic search) |
| Pinecone + OpenAI | $120/month | Vision + Search (separate services) |
| This Project | $5/month | Vision + OCR + Embeddings + Hybrid Search + Storage |
The hidden cost: Most solutions charge separately for vision, embeddings, and search. This project includes everything on Cloudflare's edge.
The Problem With V2
V2 could find anything in text documents. But when clients uploaded:
- Dashboard screenshots
- Technical diagrams
- Scanned documents
- Error message screenshots
The system was blind. It could only search by filename (screenshot-2026-01-15.png - useless) or manually added descriptions (which nobody bothers writing).
In 2026, text-only search feels like using a flip phone.
The Upgrade: Making RAG "See"
I added Llama 4 Scout (Meta's 17B multimodal model) to the stack. Now when you upload an image:
- Llama 4 Scout analyzes the pixels - generates a detailed description
- OCR extracts visible text - captures button labels, error messages, code
- BGE creates embeddings - makes it all searchable
- Stores in the same index - no separate image database needed
How It Works:
┌──────────────────────────────────────────────────┐
│                Image Upload (40KB)                │
└─────────────────────────┬────────────────────────┘
                          │
                          ▼
               ┌─────────────────────┐
               │    Llama 4 Scout    │
               │ (Multimodal Model)  │
               └──────────┬──────────┘
                          │
             ┌────────────┴────────────┐
             ▼                         ▼
     ┌───────────────┐         ┌───────────────┐
     │   Semantic    │         │   OCR Text    │
     │  Description  │         │  Extraction   │
     │ (1,865 chars) │         │ (1,043 chars) │
     └───────┬───────┘         └───────┬───────┘
             │                         │
             └────────────┬────────────┘
                          ▼
                 ┌─────────────────┐
                 │  BGE Embedding  │
                 │   (384 dims)    │
                 └────────┬────────┘
                          ▼
               ┌─────────────────────┐
               │   Vectorize + D1    │
               │   (Single Index)    │
               └──────────┬──────────┘
                          ▼
               ┌─────────────────────┐
               │     Searchable!     │
               │  • By meaning       │
               │  • By text          │
               │  • By similarity    │
               └─────────────────────┘
Processing time: 7.9 seconds
Search time: 900ms (first) → 0ms (cached)
The Code:
// The magic: Images become searchable text
const visionResponse = await env.AI.run('@cf/meta/llama-4-scout-17b-16e-instruct', {
messages: [{
role: 'user',
content: [
{ type: 'image_url', image_url: { url: `data:image/png;base64,${base64Image}` }},
{ type: 'text', text: 'Describe this image in detail for search indexing.' }
]
}]
});
// Combine semantic description + extracted text
const searchableContent = `${description}\n\nVisible Text: ${ocrText}`;
// Store in the same 384-dim index as regular documents
await env.VECTORIZE.upsert([{
id: imageId,
values: embedding,
metadata: { content: searchableContent, isImage: true }
}]);
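One detail the snippet glosses over: where `embedding` comes from. It's the same BGE call used for text documents - a minimal sketch, assuming the standard Workers AI embedding response shape (adjust if the model schema differs):
// Embed the combined description + OCR text with the same 384-dim BGE
// model used for regular documents, so images share the index
const embeddingResponse = await env.AI.run('@cf/baai/bge-small-en-v1.5', {
  text: [searchableContent]
});
const embedding = embeddingResponse.data[0]; // 384-dimensional vector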
Why this works:
- ✅ Single unified index (images + text coexist)
- ✅ Hybrid search still applies (Vector + BM25 - see the sketch below)
- ✅ OCR makes screenshots searchable by visible text
- ✅ Same $5/month cost
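On the query side, that hybrid path is unchanged from V2. A minimal sketch of what it looks like with images in the mix - assuming an illustrative D1 keyword table and a simple score boost instead of the repo's actual BM25 fusion (schema.sql defines the real structure):
// 1. Vector side: embed the query with the same BGE model and search Vectorize
const queryEmbedding = (await env.AI.run('@cf/baai/bge-small-en-v1.5', {
  text: [query]
})).data[0];
const vectorResults = await env.VECTORIZE.query(queryEmbedding, {
  topK: 10,
  returnMetadata: true
});
// 2. Keyword side: look the terms up in D1 (illustrative table; stands in for BM25)
const keywordRows = await env.DB.prepare(
  'SELECT doc_id FROM documents WHERE content LIKE ?'
).bind(`%${query}%`).all();
// 3. Merge: boost anything that matched on both signals, then re-sort
const keywordIds = new Set(keywordRows.results.map(r => r.doc_id));
const merged = vectorResults.matches
  .map(m => ({ ...m, score: keywordIds.has(m.id) ? m.score + 0.1 : m.score }))
  .sort((a, b) => b.score - a.score);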
Why Not Use Separate OCR?
Short answer: Llama 4 Scout does OCR + semantic understanding in one call.
Long answer:
- Tesseract can't run on Workers - needs native binaries, breaks serverless
- Context matters - Llama 4 understands table structures, headers, layouts (Tesseract just dumps text linearly)
- Efficiency - One API call vs. two (vision + OCR separately)
- Fallback resilience - If OCR fails, the semantic description still makes the image searchable (see the sketch after this list)
Trade-off: Dedicated OCR might be 2-3% more accurate on printed text, but Llama 4's multimodal understanding gives better search results.
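That fallback is just a guard around how the searchable content gets assembled - a minimal sketch, with illustrative variable names:
// Fallback: if OCR came back empty, index the semantic description alone -
// the image is still findable by meaning, just not by its visible text
const ocrText = (extractedText || '').trim();
const searchableContent = ocrText.length > 0
  ? `${description}\n\nVisible Text: ${ocrText}`
  : description;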
Performance: Before vs After
| Metric | V2 | V3 (Multimodal) |
|---|---|---|
| Search Types | Text only | Text + Images + Scanned Docs |
| Image Ingestion | ❌ | ✅ 7.9s (Llama 4 + OCR) |
| OCR Extraction | ❌ | ✅ 1,000+ chars (receipts, forms, diagrams) |
| Reverse Image Search | ❌ | ✅ 8s |
| Latency (text search) | ~900ms | ~900ms (unchanged!) |
| Latency (cached) | N/A | 0ms (new cache) |
| Cost | $5/month | $5/month |
| Document Types | Text, Code, Markdown | + Screenshots, Receipts, Forms, Diagrams |
Yes, the cost didn't change. Cloudflare's edge deployment means you're not paying for idle GPU time.
Real-World Test: Visual Bug Reports
I tested with actual use cases from my consulting work:
Test 1: Screenshot Search
Uploaded: Dashboard screenshot with metrics cards
Search query: "Find dashboards with performance metrics"
Result: ✅ Found 3 similar screenshots in 1.1s
What it matched:
- Description: "dashboard interface with metrics cards"
- OCR text: "Response Time: 847ms", "Throughput: 2.4K/s"
Test 2: Error Message Recognition
Uploaded: Screenshot of React error in browser console
Search query: "TypeError undefined property"
Result: ✅ Matched via OCR text extraction
OCR captured:
TypeError: Cannot read property 'map' of undefined
at ProductList.jsx:42
Test 3: Diagram Discovery
Uploaded: Architecture diagram with boxes and arrows
Search query: "microservices architecture"
Result: ✅ Matched via semantic description
Llama 4 described it as:
"Architecture diagram showing microservices pattern with API Gateway, Service Discovery, and multiple backend services connected via message queue"
Real-World Test: Financial Document Search
To prove this isn't just for tech screenshots, I threw it a real challenge: a Nigerian bank receipt with mixed English/abbreviations, account numbers, and structured financial data (40KB JPEG).
What Llama 4 Scout Extracted:
OCR (1,043 characters):
Transaction Amount: N30,000
Transaction Type: INTER-BANK
Sender: CHUKWUDI NWANERI
Beneficiary: BOBMANUEL CECILIA OGECHI
Account: 3113880181
Bank: First Bank of Nigeria
Reference: NXG000014260102194419228984379203
Semantic Description (1,865 characters):
"Transaction receipt from Access Bank, detailing a successful inter-bank transfer. The receipt is structured into sections with header, transaction details, sender/beneficiary information..."
Processing Time: 7.9 seconds (vision + OCR + embedding)
Search Results - 5 Different Queries:
I tested every searchable element. Every query found the receipt as the #1 result:
| Query | Result | Time |
|---|---|---|
| "N30000 transfer" | β #1 match | 1.2s |
| "BOBMANUEL CECILIA" | β #1 match | 609ms |
| "Access Bank transaction" | β #1 match | 527ms |
| "NXG000014260102194419228984379203" | β #1 match | 601ms |
| "Access Bank transaction" (repeat) | β #1 match | 0ms (cached!) |
This proves:
- ✅ Semantic search - "N30000 transfer" matched without exact text
- ✅ Name extraction - Found partial name "BOBMANUEL CECILIA"
- ✅ Exact text matching - 30-character transaction reference found instantly
- ✅ Cache working - Repeat query eliminated all latency
Use Cases This Enables:
Receipt Management
Upload scanned receipts, invoices, bills. Search by:
- Amount: "show me transfers over N25000"
- Vendor: "find all Access Bank transactions"
- Date: "transactions from January 2026"
Financial Audit Trails
- Search transactions by reference number
- Find transfers by recipient name
- Track spending patterns across documents
Compliance & Bookkeeping
- Searchable transaction history without manual data entry
- Automated document categorization by bank/type
- Audit-ready record keeping with instant retrieval
Privacy-First
Your financial documents never leave Cloudflare's network. No OpenAI API calls, no Google Cloud uploads - just edge processing.
All for $5/month.
The "I Tried CLIP and It Failed" Story
Initially, I wanted to use CLIP (OpenAI's vision-language model) for "true" visual embeddings. The plan was beautiful:
Image → CLIP → Visual embedding (512 dims) → Separate index
Problem: Cloudflare Workers AI doesn't support CLIP.
Error code 5018: "This account is not allowed to access this model."
After wasting a weekend on this, I realized something: For RAG use cases, descriptions work better than visual embeddings anyway.
Why?
- Descriptions are searchable by meaning ("red button") and text ("Submit")
- Visual embeddings only match pixel similarity (good for "find similar images", bad for "find the login screen")
- Single index is simpler than dual-index systems
Lesson learned: Sometimes the "clever" solution is worse than the simple one.
How This Compares to Multimodal Alternatives
| Feature | OpenAI Vision API | Google Vertex AI | This Project |
|---|---|---|---|
| Base cost | $0.01/image | $0.0015/image | Included in $5/month |
| OCR | Not included | Separate API ($1.50/1K pages) | Built-in |
| Hybrid search | No | No | ✅ Vector + BM25 |
| Reranking | No | No | ✅ Cross-encoder |
| Edge latency | 200-500ms | 300-600ms | ~900ms (first), 0ms (cached) |
| Data leaves network | ❌ Yes | ❌ Yes | ✅ No (Cloudflare only) |
| Setup complexity | API integration | Complex SDK | wrangler deploy |
| Storage included | No (S3 separate) | No (GCS separate) | ✅ D1 + Vectorize |
At scale (10K images/month):
- OpenAI Vision: $100/month (just for vision, excluding embeddings & storage)
- Google Vertex AI: $15/month (vision only) + $10/month (embeddings) + storage
- AWS Rekognition: $12/month (labels only) + separate search solution
- This stack: $5/month (everything included)
At scale (100K images/month):
- OpenAI: $1,000/month
- Google: $250/month
- This stack: ~$50/month (still 20x cheaper)
New Features in V3
1. Image Ingestion Endpoint
curl -X POST https://your-worker.dev/ingest-image \
-F "id=dashboard-001" \
-F "image=@screenshot.png" \
-F "category=ui-screenshots"
Response:
{
"success": true,
"documentId": "dashboard-001",
"description": "Dashboard interface with...",
"extractedText": "API Key\nEnter your API key\nTest...",
"performance": {
"multimodalProcessing": "4852ms",
"totalTime": "7737ms"
}
}
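On the Worker side, the handler behind this endpoint mostly just parses the multipart form and hands the bytes to the pipeline shown earlier. A rough sketch, not the repo's exact code (field names match the curl call above):
// Sketch of the /ingest-image handler: parse the form, base64-encode the
// image, then run the vision -> OCR -> embedding -> upsert pipeline
async function handleIngestImage(request, env) {
  const form = await request.formData();
  const file = form.get('image');                 // the -F "image=@screenshot.png" part
  const imageId = form.get('id') || crypto.randomUUID();
  const bytes = new Uint8Array(await file.arrayBuffer());
  let binary = '';
  for (const b of bytes) binary += String.fromCharCode(b);
  const base64Image = btoa(binary);               // data URL payload for Llama 4 Scout
  // ...vision call, OCR/description split, BGE embedding, VECTORIZE.upsert...
  return Response.json({ success: true, documentId: imageId });
}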
2. Reverse Image Search
Upload an image, find visually similar ones:
curl -X POST https://your-worker.dev/find-similar-images \
-F "image=@query.png" \
-F "topK=5"
Use cases:
- "Find screenshots that look like this"
- "Match product photos"
- "Locate similar diagrams"
3. 60-Second Cache (New!)
After rebuilding, I added caching. Same query within 60s? 0ms response.
First search: 929ms
Cached search: 0ms ✨
Real log output:
POST /search - Ok @ 3:33:00 PM
POST /search - Ok @ 3:33:02 PM
(log) Cache hit!
How it works:
- In-memory cache (not Workers KV - that adds 50-100ms latency)
- Caches final search results (~5KB per query)
- 60-second TTL (queries expire after 1 minute - balances freshness vs performance)
- Uses <1MB of Worker's 128MB RAM
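A minimal version of that cache is just a module-scoped Map with a TTL check; it survives between requests only as long as the Worker isolate stays warm (names here are illustrative, not the repo's exact code):
// Module-scoped cache: lives as long as the Worker isolate stays warm,
// resets on cold start - fine for a 60-second TTL
const searchCache = new Map();
const CACHE_TTL_MS = 60_000;
function getCachedResults(key) {
  const entry = searchCache.get(key);
  if (!entry) return null;
  if (Date.now() - entry.storedAt > CACHE_TTL_MS) {
    searchCache.delete(key);                      // expired entry - drop it
    return null;
  }
  console.log('Cache hit!');
  return entry.results;
}
function setCachedResults(key, results) {
  searchCache.set(key, { results, storedAt: Date.now() });
}
On each /search request the handler would check getCachedResults(query) first and only fall through to the full hybrid search (then setCachedResults) on a miss.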
4. Batch Embeddings (Optimization)
V2 generated embeddings sequentially (slow). V3 uses Promise.all():
Before: 3 chunks → 3 seconds (sequential)
After: 3 chunks → 1.2 seconds (parallel)
// V2: Sequential (slow)
for (const chunk of chunks) {
const emb = await env.AI.run('@cf/baai/bge-small-en-v1.5', {text: chunk});
vectors.push(emb);
}
// V3: Parallel (fast)
const embeddings = await Promise.all(
chunks.map(chunk => env.AI.run('@cf/baai/bge-small-en-v1.5', {text: chunk}))
);
The Tech Stack (Updated)
All still on Cloudflare's edge:
- Workers - Runtime (serverless, globally distributed)
- Vectorize - Vector database (384 dims, single unified index)
- D1 - SQL database for BM25 keywords
- Workers AI:
  - @cf/meta/llama-4-scout-17b-16e-instruct (vision + OCR)
  - @cf/baai/bge-small-en-v1.5 (embeddings - 384 dims)
  - @cf/baai/bge-reranker-base (cross-encoder reranking)
Why 384 dimensions?
- Tested: 384 dims achieves 66.43% MRR@5 vs 56.72% for semantic-only
- Upgrading to 768 dims only improves to ~68% (2% gain)
- But doubles cost and latency
- Better to use the reranker (adds 9.3 percentage points for minimal cost - sketched below)
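For reference, that reranking step is a single Workers AI call. A minimal sketch, assuming `candidates` holds the hybrid-search matches and that the model takes a query plus contexts and returns scored indices (check the model schema if it differs):
// Rerank the hybrid-search candidates with the cross-encoder, then reorder
// by the returned relevance scores (id indexes back into the contexts array)
const reranked = await env.AI.run('@cf/baai/bge-reranker-base', {
  query,
  contexts: candidates.map(c => ({ text: c.metadata.content }))
});
const ordered = reranked.response
  .sort((a, b) => b.score - a.score)
  .map(r => candidates[r.id]);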
No external APIs. No data leaving Cloudflare's network.
Deployment (Still 10 Minutes)
git clone https://github.com/dannwaneri/vectorize-mcp-worker.git
cd vectorize-mcp-worker
npm install
# Create resources
wrangler vectorize create mcp-knowledge-base --dimensions=384
wrangler d1 create mcp-knowledge-db
# Update wrangler.toml with database_id, then:
wrangler d1 execute mcp-knowledge-db --remote --file=./schema.sql
wrangler deploy
Test it:
# Upload an image
curl -X POST https://your-worker.dev/ingest-image \
-F "id=test-001" \
-F "image=@screenshot.png"
# Search by text
curl -X POST https://your-worker.dev/search \
-H "Content-Type: application/json" \
-d '{"query": "dashboard metrics", "topK": 5}'
What This Can't Do (Yet)
- Handwriting recognition - Llama 4 Scout struggles with cursive
- Complex math equations - LaTeX rendering in images isn't perfect
- Video analysis - Only processes static images (frame extraction coming in V4?)
- Non-Latin scripts - Haven't tested Arabic/Chinese/Cyrillic thoroughly
If your use case needs these, let me know in the comments - might prioritize them.
What's Next?
Considering:
- ✅ Multimodal search (Done!)
- ✅ 60-second cache (Done!)
- ✅ Batch embeddings (Done!)
- Face/object detection (if there's demand)
- Video frame analysis
- PDF image extraction
But honestly? This covers 95% of multimodal RAG use cases.
Try It Live
- Dashboard: vectorize-mcp-worker.fpl-test.workers.dev/dashboard
- GitHub: github.com/dannwaneri/vectorize-mcp-worker
Upload a screenshot and search for it. You'll see why 2026 is the year RAG gets eyes.
Need help deploying this for your team? Hire me on Upwork
⭐ Star the repo if this helps your project!
Questions? Drop them in the comments.
Top comments (7)
Great timing. I just spent my evening building an image analyzer with AWS Rekognition and Lambda. It is interesting to see how you tackled the 'giving code eyes' problem with Cloudflare and Llama instead.
The pivot from CLIP to text descriptions for RAG is a smart move for accuracy.
What I love most: You added images and you kept costs very low for what it can do!
The section about "CLIP failing" is the most valuable part here for me.
We usually only see the polished wins, not the dead ends.
I just battled some S3 event triggers and encoding bugs myself tonight. Debugging these integration edges is where the real learning happens. I am not used to coding myself since I come from a background of system integration, not programming. It was very hard to get it running the way it was supposed to.
Great work @dannwaneri
thanks ali. really appreciate you following from chrome tabs to this.
the rekognition + lambda approach is solid - aws has the accuracy edge on pure OCR for sure. curious though: are you planning to make those analyzed images searchable later? that's where i hit the cost wall (rekognition analysis + bedrock embeddings + opensearch was pushing $150/month for what i needed).
the clip → text description pivot was frustrating (wasted a weekend on it) but yeah, for RAG use cases descriptions work better than visual embeddings anyway
how's rekognition handling complex layouts? receipts, forms, diagrams with mixed text/graphics? that's where llama 4's multimodal understanding surprised me. it gets context, not just character extraction
keep building
That $150/month metric is a huge red flag for me. Thanks for the heads-up. I am strictly optimizing for Free Tier right now, so "OpenSearch" is out of the question.
If I make them searchable later, I would probably start small by just dumping the JSON labels into DynamoDB for basic filtering before looking at vector databases.
Regarding complex layouts: I haven't stress-tested that yet. Today was purely detect_labels (identifying objects like 'Laptop', 'Chair') just to prove the concept and see if I could get the pipeline running.
For receipts and forms, I suspect you are right: standard Rekognition/Textract might give me the text, but Llama likely wins on understanding the "semantic glue" between the fields without custom logic.
That was beyond the scope of what I tried out today, but definitely worth considering for a future project.
I will keep the cost wall in mind moving forward. Saving money on projects is critical for me.
the free tier optimization constraint is real. i hit the same wall which is why i went all-in on cloudflare.
the dynamodb filtering approach makes sense for basic queries but yeah, the moment you need "find images with similar layouts" or "dashboards showing performance metrics" you're back to needing embeddings.
the rekognition → textract path gives you accurate OCR but you're right about the semantic glue. llama 4 understanding "this is a receipt header vs line item vs total" without custom parsing logic is the unlock
if you ever want to test multimodal search without leaving free tier constraints, the stack i documented is basically: upload image → llama 4 scout (free on workers ai) → bge embeddings (free) → vectorize (free tier: 10M vectors). only costs when you scale past free limits
the $150/month metric came from real estimates when i priced out rekognition + bedrock + opensearch for a client project. aws pricing forced me to find alternatives
keep optimizing for free tier. that's how you learn what's actually essential vs nice-to-have
That specific stack breakdown (Cloudflare + Vectorize with 10M vectors) is a goldmine.
I am currently deep in the AWS ecosystem for my certification journey, but ignoring that kind of Free Tier value would be foolish. Thanks for validating the 'semantic glue' theory regarding Llama vs. Textract.
I will definitely bookmark your article for when I hit the limits of my current JSON/DynamoDB approach. Real-world client estimates like your $150 example are the best reality check.
Consultants would take real money and a lot of it for your kind of intel! Thank you VERY much Daniel
appreciate that ali.
yeah the free tier values on workers ai are wild. cloudflare is basically subsidizing the learning curve right now. 10M vectors in vectorize before you pay anything is unreal compared to pinecone/qdrant pricing.
the aws cert journey makes total sense.
when you do hit the dynamodb filtering limits (and you will. everyone does around 5-10k images), the migration path is clean. your rekognition labels → llama 4 descriptions is mostly just swapping the vision model, embeddings flow the same way.
good luck with the cert! aws knowledge transfers well, you're just learning the cheaper edge compute alternative alongside it
Good to know about that 5k-10k threshold. It is always better to know where the ceiling is before you hit your head on it.
I appreciate the insights today. It is rare to get this level of practical architectural advice in a comment section.
I just sent you a connection request on LinkedIn.
Would be great to keep up with your work there.