DEV Community: Vitalii

When NOT to use RAG (lessons from building a Claude-powered support bot)

Vitalii — Tue, 28 Apr 2026 17:26:36 +0000

Every tutorial about building AI chatbots reaches for the same starter pack: vector database, embeddings model, similarity search, RAG. I did too. Then I ran the numbers on prompt caching and threw most of it out.

Here's what happened.

The setup

I'm building a customer support bot for a B2C SaaS product. It hooks into Crisp (live chat), reads incoming customer messages, looks up answers in a knowledge base, and replies — escalating to a human when it can't help.

The stack:

Bun server (one file, ~300 lines)
Claude Sonnet 4.6 for the LLM
Supabase pgvector for vector search (initially)
OpenAI text-embedding-3-small for query embeddings

The knowledge base: 16 markdown articles covering account, billing, and technical topics. ~250 tokens per article, ~4,000 tokens total.

The "obvious" architecture: RAG

Standard playbook:

Chunk each article into ~500-token pieces
Embed each chunk with OpenAI
Store in Supabase pgvector
On each customer message: embed the message, do a similarity search, retrieve top 3 chunks, inject into the LLM prompt
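
For a sense of what the retrieval step in that playbook does, here is a minimal in-memory sketch of a top-K cosine-similarity lookup. It is not the Supabase RPC from my bot (pgvector does this server-side); the `Chunk` type and the `cosineSimilarity` and `topK` names are hypothetical, and the embeddings are assumed to be precomputed.

```typescript
// Hypothetical in-memory stand-in for a vector-database similarity search.
type Chunk = { id: string; text: string; embedding: number[] };

// Cosine similarity between two vectors; returns 0 if either has zero length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every chunk against the query embedding and keep the k best.
function topK(query: number[], chunks: Chunk[], k = 3): Chunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}
```

With 16 articles, a brute-force scan like this is already fast; the vector database earns its keep only at much larger sizes.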

Per request:

1 OpenAI embedding call (~200 ms)
1 Supabase RPC (~150 ms)
1 Claude API call (~800 ms)
~1,150 ms before the customer sees a single character

It worked. But it kept feeling over-engineered for what is fundamentally 4,000 tokens of static text.

The thing I kept circling back to

Anthropic's prompt caching. The deal:

Pay 1.25× the normal input rate to write a prefix into cache
Pay 0.1× (90% off) for every subsequent read within 5 minutes
The TTL refreshes on every read — so an active session keeps the cache warm indefinitely

Minimum cacheable block on Sonnet 4.6: 1,024 tokens. My KB is 4,000+. Comfortably above.
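
To make the "TTL refreshes on every read" rule concrete, here is a small simulation sketch. It assumes a single cached prefix, the 5-minute TTL described above, and that a request arriving exactly at expiry counts as a miss; `simulateCacheHits` is a hypothetical helper, not part of any SDK.

```typescript
const CACHE_TTL_MS = 5 * 60 * 1000; // 5-minute TTL from the pricing rules above

// Given request timestamps (ms, ascending), report which requests would
// find the prefix still cached. Every request, hit or miss, restarts the TTL:
// a miss writes the cache, a hit refreshes it.
function simulateCacheHits(timestampsMs: number[]): boolean[] {
  let expiresAt = -Infinity;
  return timestampsMs.map((t) => {
    const hit = t < expiresAt;
    expiresAt = t + CACHE_TTL_MS;
    return hit;
  });
}
```

Messages at minutes 0, 4, 8, and 14 would come out as miss, hit, hit, miss: the 6-minute gap before the last message lets the cache expire.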

What if I just stuffed the whole KB into the system prompt and cached it?

Conventional wisdom says no: too many tokens per call. But the math with caching is different from the math without.

Running the numbers

Scenario: 10 customers per day, 10 messages each. 100 messages across 10 sessions.

RAG (the original setup):

1,380 input tokens per call (system + retrieved chunks + history + user msg)
0% cache hit rate, because the ~600-token system prompt is below the 1,024-token cache minimum
~$17/month total

All-in-context with prompt caching:

4,330 input tokens per call (system + entire KB + history + user msg)
~90% cache hit rate after the first message of each session
~$18/month total

About a buck different. The interesting part isn't the cost — it's everything else.
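
If you want to redo this kind of comparison for your own traffic, a rough sketch of the input-side arithmetic looks like the following. It is a simplification, not the spreadsheet behind my numbers: it assumes a fixed-size cached prefix, a fixed number of uncached tokens per message (history growth is ignored), and a cache that stays warm for the whole session. `billedInputTokens` is a hypothetical helper; multiply its result by your model's per-token input price.

```typescript
const CACHE_WRITE_MULTIPLIER = 1.25; // first request writes the prefix
const CACHE_READ_MULTIPLIER = 0.1;   // later requests read it at 90% off

// Returns input tokens for one session expressed in "full-price token" units.
function billedInputTokens(
  prefixTokens: number, // system prompt + KB (the cacheable part)
  freshTokens: number,  // history + user message (assumed constant here)
  messages: number,
  cached: boolean,
): number {
  if (messages <= 0) return 0;
  if (!cached) return messages * (prefixTokens + freshTokens);
  const first = prefixTokens * CACHE_WRITE_MULTIPLIER + freshTokens;
  const rest =
    (messages - 1) * (prefixTokens * CACHE_READ_MULTIPLIER + freshTokens);
  return first + rest;
}
```

Output tokens and embedding calls are billed separately and are not modeled here.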

The trade-off table

	RAG	All-in-context + cache
Cost (100 msgs/day)	$17/mo	$18/mo
First-token latency (cache hit)	~1,150 ms	~700 ms
Lines of code	~250	~150
External services	Supabase + OpenAI + Anthropic	Anthropic only
Retrieval failures	Possible (threshold tuning hell)	Impossible — KB always visible
Cross-article reasoning	Limited to top-K chunks	Sees the whole KB
Scales to 100 articles	Yes	Yes
Scales to 10,000 articles	Yes	No — context window limit

For my use case — small KB, conversational chat, frequent sessions — all-in-context is faster, simpler, and handles harder cross-article questions better. The cost difference is in the noise.

The implementation

I made it switchable via one env var so I could A/B compare:

const KB_MODE = Bun.env.KB_MODE === "inline" ? "inline" : "rag";
const PROMPT_FILE = KB_MODE === "inline" ? "CRISP-inline.md" : "CRISP-rag.md";

const SYSTEM_PROMPT = (await Bun.file(PROMPT_FILE).text()).trim();

const response = await anthropic.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  system: [{
    type: "text",
    text: SYSTEM_PROMPT,
    cache_control: { type: "ephemeral" }, // ← the magic line
  }],
  messages,
});

CRISP-rag.md is just the persona/rules (~600 tokens — too small to cache).
CRISP-inline.md is persona + entire KB baked in (~2,500 tokens — caches happily).

In inline mode, searchKB() is never called. No embedding round-trip, no Supabase query. The KB is sitting in the system prompt, cached on Anthropic's side, ready to be reused for every subsequent message.

The cache logging proves it:

[claude] response in 1432ms (in: 47, out: 92, cache_create: 2447, cache_read: 0)    ← 1st msg
[claude] response in 712ms  (in: 47, out: 88, cache_create: 0,    cache_read: 2447) ← cached
[claude] response in 689ms  (in: 47, out: 95, cache_create: 0,    cache_read: 2447) ← cached

After the first message, every reply is ~50% faster, and the cached prefix is billed at 90% off the normal input rate.
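
A log line like the ones above can be produced from the `usage` object on the Messages API response. This is a sketch rather than my exact logging code: `formatUsage` and `cachedShare` are hypothetical helpers, and the `Usage` type lists only the fields used here.

```typescript
// Subset of the `usage` object returned by the Anthropic Messages API.
type Usage = {
  input_tokens: number;
  output_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
};

// Build a one-line summary of latency and token accounting.
function formatUsage(elapsedMs: number, usage: Usage): string {
  const created = usage.cache_creation_input_tokens ?? 0;
  const read = usage.cache_read_input_tokens ?? 0;
  return (
    `[claude] response in ${elapsedMs}ms ` +
    `(in: ${usage.input_tokens}, out: ${usage.output_tokens}, ` +
    `cache_create: ${created}, cache_read: ${read})`
  );
}

// Fraction of all input tokens that were served from cache (0 to 1).
function cachedShare(usage: Usage): number {
  const created = usage.cache_creation_input_tokens ?? 0;
  const read = usage.cache_read_input_tokens ?? 0;
  const total = usage.input_tokens + created + read;
  return total === 0 ? 0 : read / total;
}
```

Tracking `cachedShare` over time is a cheap way to notice when an edit to the prompt file or a quiet period has started causing cache misses.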

What I learned

RAG isn't free. It adds two API hops, an embedding model, a vector database, chunking logic, threshold tuning, and an entire class of "the right chunk wasn't retrieved" failure modes. It's the right answer above some KB size — but that size is way bigger than most tutorials assume.

Prompt caching changes the break-even point. Without caching, stuffing 4,000 tokens into every request is wasteful. With caching, it's nearly free after the first call. The 1,024-token minimum is the only real gate.

A rough heuristic:

KB size	Recommendation
< 50k tokens	Start with all-in-context + caching. You probably don't need a vector DB.
50k–200k tokens	Hybrid — cache a "core" set of always-relevant content, RAG the long tail.
> 200k tokens	RAG is mandatory (context window limit).

For the typical "I have a few dozen markdown files" scenario, you almost certainly don't need a vector database.
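
The heuristic table can be written down as a tiny function, which is handy if you want a startup check that warns when the KB outgrows its mode. The thresholds are the rough ones from the table, and `recommendStrategy` is a hypothetical name.

```typescript
type Strategy = "all-in-context" | "hybrid" | "rag";

// Map a knowledge-base size (in tokens) to the rough recommendation above.
function recommendStrategy(kbTokens: number): Strategy {
  if (kbTokens < 50_000) return "all-in-context";
  if (kbTokens <= 200_000) return "hybrid";
  return "rag";
}
```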

Caveats

This isn't a universal "RAG is dead" take. RAG still wins when:

KB is genuinely large (thousands of articles)
KB updates constantly (every edit invalidates every cache)
Different customers need different KB subsets (caching is org-scoped, not user-scoped)
You need precise per-chunk attribution

But for a small product with a small KB? Reach for prompt caching first. It's a one-line change (cache_control: { type: "ephemeral" }) with measurable wins, and you can always add RAG later when your KB grows into it.

The most useful thing I did was build a switch. Don't make this a religious choice — measure both for your specific traffic shape and let the numbers decide.

From Netdata Inspiration to SaaS MVP: Server Monitoring with Bun + Claude Code Opus 4.6

Vitalii — Sat, 25 Apr 2026 17:53:50 +0000

If you've ever set up Netdata, you know that feeling — hundreds of real-time charts, per-second granularity, metrics you didn't even know your kernel exposed. It's a wonderful piece of software, genuinely one of the best open-source monitoring tools out there.

But here's the thing: I run a small fleet of CDN servers. I don't need 2,000 charts. I need to glance at a single dashboard and know: are my servers healthy or not?

So I built my own lightweight version. And my co-pilot for this entire build was Claude Code Opus 4.6.

This is the story of how it went — from reading /proc files with zero npm dependencies to a working SaaS-ready monitoring dashboard.

What I Built

The system has three components:

1. cdn-agent — A tiny Bun process that runs on each server. It reads Linux /proc files every 10 seconds and POSTs the metrics to my backend. Zero npm dependencies.

2. Backend API — A Bun server that ingests metrics into PostgreSQL (Supabase) and serves aggregated time-series data to the dashboard.

3. Dashboard — A React SPA with live gauges, alert cards, and 10 historical charts per server.

cdn-agent (10s) ──POST──> Backend API ──> Supabase (PostgreSQL)
                                               │
Dashboard (React) <──── GET aggregated data ───┘

The Agent: Zero Dependencies, Pure `/proc`

This is probably my favorite part. The monitoring agent has no runtime dependencies — just Bun reading the Linux virtual filesystem directly.

Here's the entire main loop:

import { collectCpu } from './collectors/cpu'
import { collectMemory } from './collectors/memory'
import { collectDisks } from './collectors/disk'
import { collectNetwork } from './collectors/network'
import { collectProcesses } from './collectors/processes'
import { collectSystem } from './collectors/system'
import { sendMetrics } from './sender'

const INTERVAL_MS = 10_000

async function collect() {
    const [cpu, memory, drives, network, processes, system] =
        await Promise.all([
            collectCpu(),
            collectMemory(),
            collectDisks(),
            collectNetwork(),
            collectProcesses(),
            collectSystem(),
        ])

    return { ...cpu, ...memory, ...system, drives, network, ...processes }
}

async function loop() {
    // First collection is a warmup — CPU/disk/network
    // deltas need a previous snapshot to calculate rates
    await collect()
    console.log('[cdn-agent] Warmup done, starting main loop')

    while (true) {
        await Bun.sleep(INTERVAL_MS)
        const data = await collect()
        await sendMetrics(data)
    }
}

loop()

Six collectors run in parallel via Promise.all, each responsible for one slice of the system:

Collector	Source	What it reports
CPU	`/proc/stat`	Usage %, I/O Wait %
Memory	`/proc/meminfo`	Used/Total/Cached RAM, Swap
Disk	`df` + `/proc/diskstats`	Per-drive usage, read/write MB/s
Network	`/proc/net/dev`	Per-interface RX/TX MB/s, errors, drops
Processes	`/proc/[pid]/stat`	Top 5 by CPU, Top 5 by memory
System	`/proc/loadavg`, `/proc/uptime`	Load avg, uptime, TCP connections

Notice the warmup pattern: the first collection runs but its results are thrown away. Why? Because metrics like CPU usage and network throughput are calculated as deltas between two snapshots. The first run has no "previous" to compare against, so it cannot report a real rate (the CPU collector below, which starts from zeroed counters, would report the average since boot instead of the current load). One dummy collection solves that.
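
The same delta idea applies to any monotonically increasing kernel counter, such as bytes received on an interface. As a sketch of an alternative to the throwaway warmup run, a tracker can return `null` until it has two samples. `createRateTracker` is a hypothetical helper, not code from the agent.

```typescript
// Turns successive readings of a growing counter into a per-second rate.
// Returns null when no meaningful rate exists yet: on the first sample,
// when time has not advanced, or when the counter went backwards (reset).
function createRateTracker() {
  let prev: { value: number; atMs: number } | null = null;

  return (value: number, atMs: number): number | null => {
    const last = prev;
    prev = { value, atMs };
    if (last === null || atMs <= last.atMs || value < last.value) return null;
    const seconds = (atMs - last.atMs) / 1000;
    return (value - last.value) / seconds;
  };
}
```

The warmup run keeps the collectors simpler because they never have to return `null`; the tracker version makes the "no data yet" case explicit instead.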

Here's how the CPU collector works — 33 lines, no dependencies:

let prevIdle = 0
let prevIowait = 0
let prevTotal = 0

export async function collectCpu() {
    const stat = await Bun.file('/proc/stat').text()
    const parts = stat.split('\n')[0]!.split(/\s+/).slice(1).map(Number)

    const idle = parts[3]! + parts[4]!   // idle + iowait
    const iowait = parts[4]!
    const total = parts.slice(0, 8).reduce((a, b) => a + b, 0)  // user..steal only; guest time is already counted in user/nice

    const diffIdle = idle - prevIdle
    const diffIowait = iowait - prevIowait
    const diffTotal = total - prevTotal

    prevIdle = idle
    prevIowait = iowait
    prevTotal = total

    if (diffTotal === 0) return { cpu_percent: 0, iowait_percent: 0 }

    return {
        cpu_percent:
            Math.round(((diffTotal - diffIdle) / diffTotal) * 100 * 100) / 100,
        iowait_percent:
            Math.round((diffIowait / diffTotal) * 100 * 100) / 100,
    }
}

Bun.file('/proc/stat').text() — that's all it takes to read kernel CPU counters. No child_process, no exec, no parsing library. Just read the file and do the math.
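
The memory collector follows the same read-and-parse approach. The sketch below is not the agent's actual collector: it takes the contents of `/proc/meminfo` as a string (so it can be tested anywhere), assumes "used" means `MemTotal - MemAvailable`, and uses hypothetical output field names.

```typescript
// Parse /proc/meminfo text into a few headline numbers, in MB.
function parseMeminfo(text: string) {
  const kb = new Map<string, number>();
  for (const line of text.split("\n")) {
    // Lines look like "MemTotal:       16384256 kB"
    const match = /^(\w+):\s+(\d+)/.exec(line);
    if (match) kb.set(match[1]!, Number(match[2]));
  }

  const get = (key: string) => kb.get(key) ?? 0;
  const toMb = (kilobytes: number) => Math.round(kilobytes / 1024);

  return {
    mem_total_mb: toMb(get("MemTotal")),
    mem_used_mb: toMb(get("MemTotal") - get("MemAvailable")),
    mem_cached_mb: toMb(get("Cached")),
    swap_used_mb: toMb(get("SwapTotal") - get("SwapFree")),
  };
}
```

In the agent this would be fed with `await Bun.file('/proc/meminfo').text()`, mirroring the CPU collector.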

The Dashboard: 10 Charts, One Page

The server detail page packs a lot of information into a single view:

Circular gauges for CPU and RAM (green/yellow/red based on thresholds)
Live stats for network throughput, load average, connections
10 historical charts — CPU, memory, network TX/RX, disk I/O, connections, load avg, utilization, errors
Time range selector — 15min, 1h, 6h, 24h, 7 days

Smart Time-Range Aggregation

One of the trickier problems: how do you show 7 days of data collected every 10 seconds without drowning the browser in 60,000+ data points?

The backend handles this with in-memory bucketing:

const rangeConfig = {
    '15m': { minutes: 15,    bucketSeconds: 10 },    // raw data
    '1h':  { minutes: 60,    bucketSeconds: 10 },    // raw data
    '6h':  { minutes: 360,   bucketSeconds: 60 },    // 1-min averages
    '24h': { minutes: 1440,  bucketSeconds: 300 },   // 5-min averages
    '7d':  { minutes: 10080, bucketSeconds: 1800 },   // 30-min averages
}

For short ranges (15m, 1h), the raw 10-second data goes straight to the chart. For longer ranges, the backend fetches all raw rows, groups them into time buckets, and averages the numeric fields:

// Bucket metrics by time intervals
for (const row of rows) {
    const t = new Date(row.ts).getTime()
    const bucketKey =
        Math.floor(t / (config.bucketSeconds * 1000))
        * (config.bucketSeconds * 1000)

    if (!buckets.has(bucketKey)) {
        buckets.set(bucketKey, {
            ts: new Date(bucketKey).toISOString(),
            points: [],
        })
    }
    buckets.get(bucketKey)!.points.push(row.data)
}

// Average each bucket
const aggregated = Array.from(buckets.values()).map((bucket) => ({
    ts: bucket.ts,
    data: averageMetrics(bucket.points),
}))

No pre-aggregation tables, no materialized views, no time-series database. Just PostgreSQL with a JSONB column and a few lines of bucketing logic. For a handful of servers, this works perfectly — and it's one less thing to maintain.
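
The `averageMetrics` helper isn't shown above. A minimal sketch of what such a function could look like, assuming it only averages top-level numeric fields and skips nested values like the per-drive and per-interface arrays (the real one may treat those differently):

```typescript
type MetricPoint = Record<string, unknown>;

// Average every top-level numeric field across the points in one bucket.
// Non-numeric fields (arrays, strings, nested objects) are ignored.
function averageMetrics(points: MetricPoint[]): Record<string, number> {
  const sums = new Map<string, { total: number; count: number }>();

  for (const point of points) {
    for (const key of Object.keys(point)) {
      const value = point[key];
      if (typeof value !== "number" || !Number.isFinite(value)) continue;
      const entry = sums.get(key) ?? { total: 0, count: 0 };
      entry.total += value;
      entry.count += 1;
      sums.set(key, entry);
    }
  }

  const averaged: Record<string, number> = {};
  sums.forEach((entry, key) => {
    // Round to two decimals, matching the precision the collectors report.
    averaged[key] = Math.round((entry.total / entry.count) * 100) / 100;
  });
  return averaged;
}
```

Each field is divided by the number of points that actually contain it, so a metric missing from a few rows doesn't get dragged toward zero.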

Dynamic Network Charts

Another nice pattern: the network charts build themselves based on whatever interfaces the server actually has. No hardcoded eth0 or ens3:

const networkData = metrics.map(m => {
    const row: Record<string, unknown> = { ts: m.ts }
    for (const n of m.data.network) {
        row[`${n.iface}_tx`] = n.tx_mb_s
        row[`${n.iface}_rx`] = n.rx_mb_s
    }
    return row
})

If a server has eth0 and eth1, you get two lines on the chart. If another server has ens3, that's what shows up. The dashboard adapts to whatever the agent reports.
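
To draw one line per interface, the chart needs the list of series keys present in those rows. A small sketch of deriving them, assuming the row shape built above (`seriesKeys` is a hypothetical helper):

```typescript
// Collect every key except `ts` that appears in any row, sorted for a
// stable legend order. Each key becomes one line on the chart.
function seriesKeys(rows: Array<Record<string, unknown>>): string[] {
  const keys = new Set<string>();
  for (const row of rows) {
    for (const key of Object.keys(row)) {
      if (key !== "ts") keys.add(key);
    }
  }
  return Array.from(keys).sort();
}
```

Scanning all rows rather than just the first one means an interface that appears partway through the time range still gets a line.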

Alert System

The overview page shows all servers as cards with color-coded borders and alert badges.

Thresholds are explicit and layered:

Metric	Warning	Critical
CPU	> 80%	> 95%
RAM	> 85%	> 95%
Disk	> 90%	> 95%
Swap	> 50%	> 80%
I/O Wait	> 20%	> 40%
Offline	—	last seen > 30s ago

The "online" check is probably the simplest pattern in the whole system, and one I'm quite happy with:

online: server.last_seen_at
    ? Date.now() - new Date(server.last_seen_at).getTime() < 30_000
    : false

No heartbeat daemon, no WebSocket connection tracking. The agent sends metrics every 10 seconds — if we haven't heard from it in 30 seconds, it's offline. Computed on-the-fly, never stored.
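
The threshold table maps naturally onto a lookup plus two comparisons. A sketch with the values from the table above (the `THRESHOLDS` object and `classify` function are hypothetical names, not the dashboard's actual code):

```typescript
type Severity = "ok" | "warning" | "critical";

// Percent thresholds copied from the table; a value must exceed the
// threshold (strictly greater than) to trigger that level.
const THRESHOLDS = {
  cpu: { warning: 80, critical: 95 },
  ram: { warning: 85, critical: 95 },
  disk: { warning: 90, critical: 95 },
  swap: { warning: 50, critical: 80 },
  iowait: { warning: 20, critical: 40 },
} as const;

type Metric = keyof typeof THRESHOLDS;

function classify(metric: Metric, percent: number): Severity {
  const limits = THRESHOLDS[metric];
  if (percent > limits.critical) return "critical";
  if (percent > limits.warning) return "warning";
  return "ok";
}
```

Keeping the numbers in one object is also a first step toward making them configurable per server.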

The SaaS Angle

This started as a tool for my own infrastructure, but I realized it has legs as a product. If you're running 2-10 servers — maybe a small startup, a side project with a VPS, or a self-hosted setup — you probably don't want to set up Prometheus + Grafana or pay for Datadog.

What you want is:

A single Bun script you can scp to your server
A dashboard that shows red/yellow/green at a glance
Historical charts for when something goes wrong at 3am
7-day retention so you can spot trends

That's what this is. The agent deploys in 3 commands:

scp -r cdn-agent root@server:/opt/cdn-agent
# On the server (printf expands \n in every shell; bash's echo does not without -e):
cd /opt/cdn-agent && printf 'AGENT_KEY=xxx\nAGENT_ENDPOINT=https://api.example.com/api/metrics-ingest\n' > .env
pm2 start bun --name cdn-agent -- run src/index.ts

Building with Claude Code Opus 4.6

I want to be transparent: this feature was built almost entirely in collaboration with Claude Code Opus 4.6. Not as a code autocomplete — as an actual architectural partner.

Here's what that looked like in practice:

Architecture decisions: I described what I wanted ("a lightweight Netdata for my CDN servers"), and we iterated on the three-component design together. The in-memory bucketing approach instead of a time-series DB was Claude's suggestion after I explained my scale (~5 servers, 7-day retention).
The /proc collectors: Claude knew the exact format of /proc/stat, /proc/meminfo, /proc/net/dev and how to parse them. The delta-based calculation pattern for CPU and network throughput came out correct on the first try.
The warmup pattern: When I noticed the first data point was always zero, Claude immediately identified the cause (no previous snapshot for delta calculation) and suggested the warmup loop — a clean solution I might not have thought of as quickly.
Speed: The entire feature — agent, backend endpoints, dashboard with 10 charts — came together in a focused session. That's not weeks of development compressed into hours. It's a different way of working, where you're constantly iterating on a working system instead of staring at a blank file.

It's not perfect. Some of the chart styling needed manual tweaking. The alert thresholds are currently hardcoded (they should be configurable). But as a tool for going from idea to working product, Claude Code is genuinely impressive.

What's Next

WebSocket for real-time updates — Currently the dashboard polls every 60s. Live streaming would make it feel more like Netdata.
Configurable alert thresholds — Per-server, via the dashboard UI.
Notifications — Telegram/email alerts when a server goes critical.
Public SaaS launch — If there's enough interest, I'd love to open this up.

If you're building monitoring tools, or if you've used Claude Code for a full-feature build, I'd love to hear about your experience in the comments.

Built with Bun, React, Recharts, Supabase, and Claude Code Opus 4.6.