Last month, I showed you how my V2 beat $200 enterprise RAG systems with hybrid search and reranking. The response was incredible, but one comment stuck with me:
"This is great for text, but what about images? My team has thousands of screenshots and diagrams we can't search."
So I rebuilt it again. This time, I gave my RAG system vision.
Who Should Read This?
- Freelancers/agencies managing client screenshots, bug reports, design files
- Teams drowning in Slack screenshots that are unsearchable
- Anyone tired of "what was that dashboard screenshot called again?"
- Developers wanting multimodal RAG without $100/month bills
Why This Matters: The Cost Reality
| Provider | 10K Images/Month | What You Get |
|---|---|---|
| OpenAI Vision API | $100/month | Vision only (no search, no storage) |
| Google Vertex AI | $15/month | Vision only (no embeddings) |
| AWS Rekognition | $12/month | Labels only (no semantic search) |
| Pinecone + OpenAI | $120/month | Vision + Search (separate services) |
| This Project | $5/month | Vision + OCR + Embeddings + Hybrid Search + Storage |
The hidden cost: Most solutions charge separately for vision, embeddings, and search. This project includes everything on Cloudflare's edge.
The Problem With V2
V2 could find anything in text documents. But when clients uploaded:
- Dashboard screenshots
- Technical diagrams
- Scanned documents
- Error message screenshots
The system was blind. It could only search by filename (screenshot-2026-01-15.png - useless) or manually added descriptions (which nobody bothers writing).
In 2026, text-only search feels like using a flip phone.
The Upgrade: Making RAG "See"
I added Llama 4 Scout (Meta's 17B multimodal model) to the stack. Now when you upload an image:
- Llama 4 Scout analyzes the pixels - generates a detailed description
- OCR extracts visible text - captures button labels, error messages, code
- BGE creates embeddings - makes it all searchable
- Stores in the same index - no separate image database needed
How It Works:
┌──────────────────────────────────────────────────┐
│                Image Upload (40KB)                │
└─────────────────────────┬────────────────────────┘
                          │
                          ▼
               ┌─────────────────────┐
               │    Llama 4 Scout    │
               │ (Multimodal Model)  │
               └──────────┬──────────┘
                          │
             ┌────────────┴────────────┐
             ▼                         ▼
     ┌───────────────┐         ┌───────────────┐
     │   Semantic    │         │   OCR Text    │
     │  Description  │         │  Extraction   │
     │ (1,865 chars) │         │ (1,043 chars) │
     └───────┬───────┘         └───────┬───────┘
             │                         │
             └────────────┬────────────┘
                          ▼
                 ┌─────────────────┐
                 │  BGE Embedding  │
                 │   (384 dims)    │
                 └────────┬────────┘
                          ▼
               ┌─────────────────────┐
               │   Vectorize + D1    │
               │   (Single Index)    │
               └──────────┬──────────┘
                          ▼
               ┌─────────────────────┐
               │     Searchable!     │
               │  • By meaning       │
               │  • By text          │
               │  • By similarity    │
               └─────────────────────┘
Processing time: 7.9 seconds
Search time: 900ms (first) → 0ms (cached)
The Code:
// The magic: Images become searchable text
const visionResponse = await env.AI.run('@cf/meta/llama-4-scout-17b-16e-instruct', {
messages: [{
role: 'user',
content: [
{ type: 'image_url', image_url: { url: `data:image/png;base64,${base64Image}` }},
{ type: 'text', text: 'Describe this image in detail for search indexing.' }
]
}]
});
// Combine semantic description + extracted text
const searchableContent = `${description}\n\nVisible Text: ${ocrText}`;
// Store in the same 384-dim index as regular documents
await env.VECTORIZE.upsert([{
id: imageId,
values: embedding,
metadata: { content: searchableContent, isImage: true }
}]);
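One detail the snippet glosses over: where `embedding` comes from. It's the same BGE call used for text documents - a minimal sketch, assuming the standard Workers AI embedding response shape (adjust if the model schema differs):
// Embed the combined description + OCR text with the same 384-dim BGE
// model used for regular documents, so images share the index
const embeddingResponse = await env.AI.run('@cf/baai/bge-small-en-v1.5', {
  text: [searchableContent]
});
const embedding = embeddingResponse.data[0]; // 384-dimensional vector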
Why this works:
- ✅ Single unified index (images + text coexist)
- ✅ Hybrid search still applies (Vector + BM25 - see the sketch below)
- ✅ OCR makes screenshots searchable by visible text
- ✅ Same $5/month cost
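On the query side, that hybrid path is unchanged from V2. A minimal sketch of what it looks like with images in the mix - assuming an illustrative D1 keyword table and a simple score boost instead of the repo's actual BM25 fusion (schema.sql defines the real structure):
// 1. Vector side: embed the query with the same BGE model and search Vectorize
const queryEmbedding = (await env.AI.run('@cf/baai/bge-small-en-v1.5', {
  text: [query]
})).data[0];
const vectorResults = await env.VECTORIZE.query(queryEmbedding, {
  topK: 10,
  returnMetadata: true
});
// 2. Keyword side: look the terms up in D1 (illustrative table; stands in for BM25)
const keywordRows = await env.DB.prepare(
  'SELECT doc_id FROM documents WHERE content LIKE ?'
).bind(`%${query}%`).all();
// 3. Merge: boost anything that matched on both signals, then re-sort
const keywordIds = new Set(keywordRows.results.map(r => r.doc_id));
const merged = vectorResults.matches
  .map(m => ({ ...m, score: keywordIds.has(m.id) ? m.score + 0.1 : m.score }))
  .sort((a, b) => b.score - a.score);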
Why Not Use Separate OCR?
Short answer: Llama 4 Scout does OCR + semantic understanding in one call.
Long answer:
- Tesseract can't run on Workers - needs native binaries, breaks serverless
- Context matters - Llama 4 understands table structures, headers, layouts (Tesseract just dumps text linearly)
- Efficiency - One API call vs. two (vision + OCR separately)
- Fallback resilience - If OCR fails, the semantic description still makes the image searchable (see the sketch after this list)
Trade-off: Dedicated OCR might be 2-3% more accurate on printed text, but Llama 4's multimodal understanding gives better search results.
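That fallback is just a guard around how the searchable content gets assembled - a minimal sketch, with illustrative variable names:
// Fallback: if OCR came back empty, index the semantic description alone -
// the image is still findable by meaning, just not by its visible text
const ocrText = (extractedText || '').trim();
const searchableContent = ocrText.length > 0
  ? `${description}\n\nVisible Text: ${ocrText}`
  : description;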
Performance: Before vs After
| Metric | V2 | V3 (Multimodal) |
|---|---|---|
| Search Types | Text only | Text + Images + Scanned Docs |
| Image Ingestion | ❌ | ✅ 7.9s (Llama 4 + OCR) |
| OCR Extraction | ❌ | ✅ 1,000+ chars (receipts, forms, diagrams) |
| Reverse Image Search | ❌ | ✅ 8s |
| Latency (text search) | ~900ms | ~900ms (unchanged!) |
| Latency (cached) | N/A | 0ms (new cache) |
| Cost | $5/month | $5/month |
| Document Types | Text, Code, Markdown | + Screenshots, Receipts, Forms, Diagrams |
Yes, the cost didn't change. Cloudflare's edge deployment means you're not paying for idle GPU time.
Real-World Test: Visual Bug Reports
I tested with actual use cases from my consulting work:
Test 1: Screenshot Search
Uploaded: Dashboard screenshot with metrics cards
Search query: "Find dashboards with performance metrics"
Result: ✅ Found 3 similar screenshots in 1.1s
What it matched:
- Description: "dashboard interface with metrics cards"
- OCR text: "Response Time: 847ms", "Throughput: 2.4K/s"
Test 2: Error Message Recognition
Uploaded: Screenshot of React error in browser console
Search query: "TypeError undefined property"
Result: ✅ Matched via OCR text extraction
OCR captured:
TypeError: Cannot read property 'map' of undefined
at ProductList.jsx:42
Test 3: Diagram Discovery
Uploaded: Architecture diagram with boxes and arrows
Search query: "microservices architecture"
Result: ✅ Matched via semantic description
Llama 4 described it as:
"Architecture diagram showing microservices pattern with API Gateway, Service Discovery, and multiple backend services connected via message queue"
Real-World Test: Financial Document Search
To prove this isn't just for tech screenshots, I threw it a real challenge: a Nigerian bank receipt with mixed English/abbreviations, account numbers, and structured financial data (40KB JPEG).
What Llama 4 Scout Extracted:
OCR (1,043 characters):
Transaction Amount: N30,000
Transaction Type: INTER-BANK
Sender: CHUKWUDI NWANERI
Beneficiary: BOBMANUEL CECILIA OGECHI
Account: 3113880181
Bank: First Bank of Nigeria
Reference: NXG000014260102194419228984379203
Semantic Description (1,865 characters):
"Transaction receipt from Access Bank, detailing a successful inter-bank transfer. The receipt is structured into sections with header, transaction details, sender/beneficiary information..."
Processing Time: 7.9 seconds (vision + OCR + embedding)
Search Results - 5 Different Queries:
I tested every searchable element. Every query found the receipt as the #1 result:
| Query | Result | Time |
|---|---|---|
| "N30000 transfer" | β #1 match | 1.2s |
| "BOBMANUEL CECILIA" | β #1 match | 609ms |
| "Access Bank transaction" | β #1 match | 527ms |
| "NXG000014260102194419228984379203" | β #1 match | 601ms |
| "Access Bank transaction" (repeat) | β #1 match | 0ms (cached!) |
This proves:
- ✅ Semantic search - "N30000 transfer" matched without exact text
- ✅ Name extraction - Found partial name "BOBMANUEL CECILIA"
- ✅ Exact text matching - 30-character transaction reference found instantly
- ✅ Cache working - Repeat query eliminated all latency
Use Cases This Enables:
Receipt Management
Upload scanned receipts, invoices, bills. Search by:
- Amount: "show me transfers over N25000"
- Vendor: "find all Access Bank transactions"
- Date: "transactions from January 2026"
Financial Audit Trails
- Search transactions by reference number
- Find transfers by recipient name
- Track spending patterns across documents
Compliance & Bookkeeping
- Searchable transaction history without manual data entry
- Automated document categorization by bank/type
- Audit-ready record keeping with instant retrieval
Privacy-First
Your financial documents never leave Cloudflare's network. No OpenAI API calls, no Google Cloud uploads - just edge processing.
All for $5/month.
The "I Tried CLIP and It Failed" Story
Initially, I wanted to use CLIP (OpenAI's vision-language model) for "true" visual embeddings. The plan was beautiful:
Image → CLIP → Visual embedding (512 dims) → Separate index
Problem: Cloudflare Workers AI doesn't support CLIP.
Error code 5018: "This account is not allowed to access this model."
After wasting a weekend on this, I realized something: For RAG use cases, descriptions work better than visual embeddings anyway.
Why?
- Descriptions are searchable by meaning ("red button") and text ("Submit")
- Visual embeddings only match pixel similarity (good for "find similar images", bad for "find the login screen")
- Single index is simpler than dual-index systems
Lesson learned: Sometimes the "clever" solution is worse than the simple one.
How This Compares to Multimodal Alternatives
| Feature | OpenAI Vision API | Google Vertex AI | This Project |
|---|---|---|---|
| Base cost | $0.01/image | $0.0015/image | Included in $5/month |
| OCR | Not included | Separate API ($1.50/1K pages) | Built-in |
| Hybrid search | No | No | ✅ Vector + BM25 |
| Reranking | No | No | ✅ Cross-encoder |
| Edge latency | 200-500ms | 300-600ms | ~900ms (first), 0ms (cached) |
| Data leaves network | ❌ Yes | ❌ Yes | ✅ No (Cloudflare only) |
| Setup complexity | API integration | Complex SDK | wrangler deploy |
| Storage included | No (S3 separate) | No (GCS separate) | ✅ D1 + Vectorize |
At scale (10K images/month):
- OpenAI Vision: $100/month (just for vision, excluding embeddings & storage)
- Google Vertex AI: $15/month (vision only) + $10/month (embeddings) + storage
- AWS Rekognition: $12/month (labels only) + separate search solution
- This stack: $5/month (everything included)
At scale (100K images/month):
- OpenAI: $1,000/month
- Google: $250/month
- This stack: ~$50/month (still 20x cheaper)
New Features in V3
1. Image Ingestion Endpoint
curl -X POST https://your-worker.dev/ingest-image \
-F "id=dashboard-001" \
-F "image=@screenshot.png" \
-F "category=ui-screenshots"
Response:
{
"success": true,
"documentId": "dashboard-001",
"description": "Dashboard interface with...",
"extractedText": "API Key\nEnter your API key\nTest...",
"performance": {
"multimodalProcessing": "4852ms",
"totalTime": "7737ms"
}
}
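On the Worker side, the handler behind this endpoint mostly just parses the multipart form and hands the bytes to the pipeline shown earlier. A rough sketch, not the repo's exact code (field names match the curl call above):
// Sketch of the /ingest-image handler: parse the form, base64-encode the
// image, then run the vision -> OCR -> embedding -> upsert pipeline
async function handleIngestImage(request, env) {
  const form = await request.formData();
  const file = form.get('image');                 // the -F "image=@screenshot.png" part
  const imageId = form.get('id') || crypto.randomUUID();
  const bytes = new Uint8Array(await file.arrayBuffer());
  let binary = '';
  for (const b of bytes) binary += String.fromCharCode(b);
  const base64Image = btoa(binary);               // data URL payload for Llama 4 Scout
  // ...vision call, OCR/description split, BGE embedding, VECTORIZE.upsert...
  return Response.json({ success: true, documentId: imageId });
}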
2. Reverse Image Search
Upload an image, find visually similar ones:
curl -X POST https://your-worker.dev/find-similar-images \
-F "image=@query.png" \
-F "topK=5"
Use cases:
- "Find screenshots that look like this"
- "Match product photos"
- "Locate similar diagrams"
3. 60-Second Cache (New!)
After rebuilding, I added caching. Same query within 60s? 0ms response.
First search: 929ms
Cached search: 0ms ✨
Real log output:
POST /search - Ok @ 3:33:00 PM
POST /search - Ok @ 3:33:02 PM
(log) Cache hit!
How it works:
- In-memory cache (not Workers KV - that adds 50-100ms latency)
- Caches final search results (~5KB per query)
- 60-second TTL (queries expire after 1 minute - balances freshness vs performance)
- Uses <1MB of Worker's 128MB RAM
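A minimal version of that cache is just a module-scoped Map with a TTL check; it survives between requests only as long as the Worker isolate stays warm (names here are illustrative, not the repo's exact code):
// Module-scoped cache: lives as long as the Worker isolate stays warm,
// resets on cold start - fine for a 60-second TTL
const searchCache = new Map();
const CACHE_TTL_MS = 60_000;
function getCachedResults(key) {
  const entry = searchCache.get(key);
  if (!entry) return null;
  if (Date.now() - entry.storedAt > CACHE_TTL_MS) {
    searchCache.delete(key);                      // expired entry - drop it
    return null;
  }
  console.log('Cache hit!');
  return entry.results;
}
function setCachedResults(key, results) {
  searchCache.set(key, { results, storedAt: Date.now() });
}
On each /search request the handler would check getCachedResults(query) first and only fall through to the full hybrid search (then setCachedResults) on a miss.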
4. Batch Embeddings (Optimization)
V2 generated embeddings sequentially (slow). V3 uses Promise.all():
Before: 3 chunks → 3 seconds (sequential)
After: 3 chunks → 1.2 seconds (parallel)
// V2: Sequential (slow)
for (const chunk of chunks) {
const emb = await env.AI.run('@cf/baai/bge-small-en-v1.5', {text: chunk});
vectors.push(emb);
}
// V3: Parallel (fast)
const embeddings = await Promise.all(
chunks.map(chunk => env.AI.run('@cf/baai/bge-small-en-v1.5', {text: chunk}))
);
The Tech Stack (Updated)
All still on Cloudflare's edge:
- Workers - Runtime (serverless, globally distributed)
- Vectorize - Vector database (384 dims, single unified index)
- D1 - SQL database for BM25 keywords
- Workers AI:
  - @cf/meta/llama-4-scout-17b-16e-instruct (vision + OCR)
  - @cf/baai/bge-small-en-v1.5 (embeddings - 384 dims)
  - @cf/baai/bge-reranker-base (cross-encoder reranking)
Why 384 dimensions?
- Tested: 384 dims achieves 66.43% MRR@5 vs 56.72% for semantic-only
- Upgrading to 768 dims only improves to ~68% (2% gain)
- But doubles cost and latency
- Better to use the reranker (adds 9.3 percentage points for minimal cost - sketched below)
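For reference, that reranking step is a single Workers AI call. A minimal sketch, assuming `candidates` holds the hybrid-search matches and that the model takes a query plus contexts and returns scored indices (check the model schema if it differs):
// Rerank the hybrid-search candidates with the cross-encoder, then reorder
// by the returned relevance scores (id indexes back into the contexts array)
const reranked = await env.AI.run('@cf/baai/bge-reranker-base', {
  query,
  contexts: candidates.map(c => ({ text: c.metadata.content }))
});
const ordered = reranked.response
  .sort((a, b) => b.score - a.score)
  .map(r => candidates[r.id]);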
No external APIs. No data leaving Cloudflare's network.
Deployment (Still 10 Minutes)
git clone https://github.com/dannwaneri/vectorize-mcp-worker.git
cd vectorize-mcp-worker
npm install
# Create resources
wrangler vectorize create mcp-knowledge-base --dimensions=384
wrangler d1 create mcp-knowledge-db
# Update wrangler.toml with database_id, then:
wrangler d1 execute mcp-knowledge-db --remote --file=./schema.sql
wrangler deploy
Test it:
# Upload an image
curl -X POST https://your-worker.dev/ingest-image \
-F "id=test-001" \
-F "image=@screenshot.png"
# Search by text
curl -X POST https://your-worker.dev/search \
-H "Content-Type: application/json" \
-d '{"query": "dashboard metrics", "topK": 5}'
What This Can't Do (Yet)
- Handwriting recognition - Llama 4 Scout struggles with cursive
- Complex math equations - LaTeX rendering in images isn't perfect
- Video analysis - Only processes static images (frame extraction coming in V4?)
- Non-Latin scripts - Haven't tested Arabic/Chinese/Cyrillic thoroughly
If your use case needs these, let me know in the comments - might prioritize them.
What's Next?
Considering:
- ✅ Multimodal search (Done!)
- ✅ 60-second cache (Done!)
- ✅ Batch embeddings (Done!)
- Face/object detection (if there's demand)
- Video frame analysis
- PDF image extraction
But honestly? This covers 95% of multimodal RAG use cases.
Try It Live
- Dashboard: vectorize-mcp-worker.fpl-test.workers.dev/dashboard
- GitHub: github.com/dannwaneri/vectorize-mcp-worker
Upload a screenshot and search for it. You'll see why 2026 is the year RAG gets eyes.
Need help deploying this for your team? Hire me on Upwork
⭐ Star the repo if this helps your project!
Questions? Drop them in the comments.
Top comments (7)
Great timing. I just spent my evening building an image analyzer with AWS Rekognition and Lambda. It is interesting to see how you tackled the 'giving code eyes' problem with Cloudflare and Llama instead.
The pivot from CLIP to text descriptions for RAG is a smart move for accuracy.
What I love most: You added images and you kept costs very low for what it can do!
The section about "CLIP failing" is the most valuable part here for me.
We usually only see the polished wins, not the dead ends.
I just battled some S3 event triggers and encoding bugs myself tonight. Debugging these integration edges is where the real learning happens. I am not used to coding myself since I come from a background of system integration, not programming. It was very hard to get it running the way it was supposed to.
Great work @dannwaneri
thanks ali. really appreciate you following from chrome tabs to this.
the rekognition + lambda approach is solid - aws has the accuracy edge on pure OCR for sure. curious though: are you planning to make those analyzed images searchable later? that's where i hit the cost wall (rekognition analysis + bedrock embeddings + opensearch was pushing $150/month for what i needed).
the clip → text description pivot was frustrating (wasted a weekend on it) but yeah, for RAG use cases descriptions work better than visual embeddings anyway
how's rekognition handling complex layouts? receipts, forms, diagrams with mixed text/graphics? that's where llama 4's multimodal understanding surprised me. it gets context, not just character extraction
keep building
That $150/month metric is a huge red flag for me. Thanks for the heads-up. I am strictly optimizing for Free Tier right now, so "OpenSearch" is out of the question.
If I make them searchable later, I would probably start small by just dumping the JSON labels into DynamoDB for basic filtering before looking at vector databases.
Regarding complex layouts: I haven't stress-tested that yet. Today was purely detect_labels (identifying objects like 'Laptop', 'Chair') just to prove the concept and see if I could get the pipeline running.
For receipts and forms, I suspect you are right: standard Rekognition/Textract might give me the text, but Llama likely wins on understanding the "semantic glue" between the fields without custom logic.
That was beyond the scope of what I tried out today, but definitely worth considering for a future project.
I will keep the cost wall in mind moving forward. Saving money on projects is critical for me.
the free tier optimization constraint is real. i hit the same wall which is why i went all-in on cloudflare.
the dynamodb filtering approach makes sense for basic queries but yeah, the moment you need "find images with similar layouts" or "dashboards showing performance metrics" you're back to needing embeddings.
the rekognition → textract path gives you accurate OCR but you're right about the semantic glue. llama 4 understanding "this is a receipt header vs line item vs total" without custom parsing logic is the unlock
if you ever want to test multimodal search without leaving free tier constraints, the stack i documented is basically: upload image → llama 4 scout (free on workers ai) → bge embeddings (free) → vectorize (free tier: 10M vectors). only costs when you scale past free limits
the $150/month metric came from real estimates when i priced out rekognition + bedrock + opensearch for a client project. aws pricing forced me to find alternatives
keep optimizing for free tier. that's how you learn what's actually essential vs nice-to-have
That specific stack breakdown (Cloudflare + Vectorize with 10M vectors) is a goldmine.
I am currently deep in the AWS ecosystem for my certification journey, but ignoring that kind of Free Tier value would be foolish. Thanks for validating the 'semantic glue' theory regarding Llama vs. Textract.
I will definitely bookmark your article for when I hit the limits of my current JSON/DynamoDB approach. Real-world client estimates like your $150 example are the best reality check.
Consultants would take real money and a lot of it for your kind of intel! Thank you VERY much Daniel
appreciate that ali.
yeah the free tier values on workers ai are wild. cloudflare is basically subsidizing the learning curve right now. 10M vectors in vectorize before you pay anything is unreal compared to pinecone/qdrant pricing.
the aws cert journey makes total sense.
when you do hit the dynamodb filtering limits (and you will. everyone does around 5-10k images), the migration path is clean. your rekognition labels → llama 4 descriptions is mostly just swapping the vision model, embeddings flow the same way.
good luck with the cert! aws knowledge transfers well, you're just learning the cheaper edge compute alternative alongside it
Good to know about that 5k-10k threshold. It is always better to know where the ceiling is before you hit your head on it.
I appreciate the insights today. It is rare to get this level of practical architectural advice in a comment section.
I just sent you a connection request on LinkedIn.
Would be great to keep up with your work there.