DEV Community

Zen

Building a Durable Message Queue on SQLite for AI Agent Orchestration

I'm building a local AI agent orchestration tool. Multiple agents run asynchronously — user messages come in, executors return results, approvals fire callbacks. All of these are events, and they need to flow reliably through the system.

The key word is reliably. An agent crashes mid-task. The process gets killed. The machine restarts. When things come back up, pending events must still be there, ready to retry. Losing an event means losing work.

So I needed a durable message queue. But here's the constraint: this is a local developer tool, not a cloud service. Users install it with npm install and it should just work. Requiring them to run Redis or RabbitMQ — or any external process — is a non-starter. SQLite was already in the stack for persistent storage — and it has strong transactional guarantees. Could it also be the queue?

Why not just a "jobs" table?

The first instinct is to create a table with a status column and poll it:

```sql
SELECT * FROM jobs WHERE status = 'pending' ORDER BY created LIMIT 1;
-- process it, then:
UPDATE jobs SET status = 'done' WHERE id = ?;
```

This works until it doesn't. Two consumers grab the same row. A consumer crashes after SELECT but before UPDATE — the job vanishes. You start adding locked_by, locked_at, retry counters, timeout checks... and you realize you're reinventing a message queue, poorly.
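The double-delivery failure is easy to reproduce in miniature. Here's an illustrative, in-memory sketch (plain TypeScript, no database) of why the two-step SELECT-then-UPDATE races:

```typescript
// Two consumers polling a naive jobs table. Each does its SELECT first
// and its UPDATE later; nothing makes the two steps atomic.
type Job = { id: number; status: 'pending' | 'done' }
const jobs: Job[] = [{ id: 1, status: 'pending' }]

// Both consumers run their SELECT before either runs its UPDATE:
const seenByA = jobs.find(j => j.status === 'pending') // consumer A's SELECT
const seenByB = jobs.find(j => j.status === 'pending') // consumer B's SELECT

console.log(seenByA === seenByB) // true: both claimed job 1 and will process it twice
```

The fix has to make "find the next job" and "mark it claimed" a single atomic step, which is exactly what the design below does.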

I found goqite, a Go library that does exactly this — a durable message queue on top of SQLite, modeled after Amazon SQS. It's a proven design. But there was nothing like it in the TypeScript ecosystem. So I built sqliteq.

The design

Single table, all queues

Every queue shares one table. No migrations when you add a queue — just pass a different name:

```typescript
const emails = new Queue(db, 'emails')
const sync   = new Queue(db, 'sync-jobs')
```

The schema:

```sql
create table if not exists sqliteq (
  id       text primary key default ('m_' || lower(hex(randomblob(16)))),
  created  text not null default (strftime('%Y-%m-%dT%H:%M:%fZ')),
  queue    text not null,
  body     text not null,
  timeout  text not null default (strftime('%Y-%m-%dT%H:%M:%fZ')),
  received integer not null default 0,
  priority integer not null default 0
) strict;
```

No status column. The timeout timestamp is the status. If timeout is in the past, the message is available. If it's in the future, it's claimed by a consumer and invisible to everyone else. This is the SQS visibility timeout model.
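A detail worth noting: the schema stores timestamps as fixed-width ISO-8601 text, so plain string comparison doubles as time comparison. A minimal sketch of the availability rule (the names here are illustrative, not sqliteq's API):

```typescript
// Minimal in-memory model of the visibility-timeout rule described above.
// `Message` and `isAvailable` are illustrative names, not part of sqliteq.
interface Message {
  id: string
  timeout: string // ISO-8601 text, the same format the schema's strftime produces
  received: number
}

// A message is available iff its timeout is not in the future.
// Fixed-width ISO-8601 strings sort lexicographically in time order,
// which is why the SQL predicate `? >= timeout` works on text columns.
function isAvailable(msg: Message, now: string): boolean {
  return now >= msg.timeout
}

const now = '2024-01-01T12:00:00.000Z'
const fresh: Message   = { id: 'm_1', timeout: '2024-01-01T11:00:00.000Z', received: 0 }
const claimed: Message = { id: 'm_2', timeout: '2024-01-01T12:30:00.000Z', received: 1 }

console.log(isAvailable(fresh, now))   // true: visible to consumers
console.log(isAvailable(claimed, now)) // false: invisible until the timeout passes
```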

Atomic claim — no locks needed

Receiving a message is a single SQL statement:

```sql
update sqliteq
set timeout = ?, received = received + 1
where id = (
  select id from sqliteq
  where queue = ? and ? >= timeout and received < ?
  order by priority desc, created
  limit 1
)
returning id, body, received
```

The subquery finds the next eligible message. The outer UPDATE claims it by bumping the timeout into the future — atomically. No advisory locks. No two-step SELECT-then-UPDATE race. SQLite's single-writer guarantee does the rest.
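To make the claim semantics concrete, here's an in-memory simulation of what that one statement does: find the next eligible row, push its timeout into the future, and bump `received`. This mirrors the SQL above but is illustrative only, not sqliteq's implementation:

```typescript
// Simulates the claim: select next eligible, bump timeout, increment received.
interface Row {
  id: string
  created: string
  queue: string
  body: string
  timeout: string
  received: number
  priority: number
}

function claim(rows: Row[], queue: string, now: string, visibilityMs: number, maxReceive: number): Row | undefined {
  const eligible = rows
    .filter(r => r.queue === queue && now >= r.timeout && r.received < maxReceive)
    .sort((a, b) => b.priority - a.priority || a.created.localeCompare(b.created))
  const next = eligible[0]
  if (!next) return undefined
  // "Claiming" = moving the timeout into the future, so the row is
  // invisible to every other consumer until the visibility window expires.
  next.timeout = new Date(Date.parse(now) + visibilityMs).toISOString()
  next.received += 1
  return next
}

const now = '2024-01-01T12:00:00.000Z'
const rows: Row[] = [
  { id: 'm_a', created: '2024-01-01T11:00:00.000Z', queue: 'tasks', body: '{}', timeout: '2024-01-01T11:00:00.000Z', received: 0, priority: 0 },
  { id: 'm_b', created: '2024-01-01T11:30:00.000Z', queue: 'tasks', body: '{}', timeout: '2024-01-01T11:30:00.000Z', received: 0, priority: 5 },
]

const first = claim(rows, 'tasks', now, 30_000, 3)
console.log(first?.id)  // 'm_b': higher priority wins despite being newer
const second = claim(rows, 'tasks', now, 30_000, 3)
console.log(second?.id) // 'm_a': m_b is now invisible (timeout in the future)
```

In SQLite the whole thing runs under the single-writer lock, so no two consumers can execute the claim concurrently; the in-memory version only shows the selection and mutation logic.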

Fencing with received

Here's a subtle problem: consumer A receives a message, takes too long processing it, and the visibility timeout expires. Consumer B receives the same message (redelivery). Now A finishes and deletes the message — but it's B's message now.

The received column solves this. It increments on every receive. When you delete or extend a message, you must pass the received value you got:

```typescript
const msg = queue.receive()
// msg.received === 1

// later:
queue.delete(msg.id, msg.received)
// DELETE ... WHERE id = ? AND received = ?
```

If the message was redelivered to someone else, received no longer matches, the DELETE affects zero rows, and you get false back. Deletes and extensions are conditional on both id and received, so a stale consumer can at most fail to delete a message — never delete one it doesn't own. This is the same idea as a fencing token in distributed locking.
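The fencing check can be sketched in memory too (illustrative, not sqliteq's internals): the row is removed only when both `id` and `received` match, and the boolean result tells the caller whether it still owned the message.

```typescript
// A delete guarded by a fencing token: succeeds only if the caller's
// `received` value is still the latest one.
type Owned = { id: string; received: number }

function fencedDelete(rows: Owned[], id: string, received: number): boolean {
  const idx = rows.findIndex(r => r.id === id && r.received === received)
  if (idx === -1) return false // stale token: redelivered to someone else, or already gone
  rows.splice(idx, 1)
  return true
}

// The message timed out on consumer A and was redelivered to B, so received is 2.
const rows: Owned[] = [{ id: 'm_1', received: 2 }]

const stale = fencedDelete(rows, 'm_1', 1) // consumer A, token from the first delivery
const owned = fencedDelete(rows, 'm_1', 2) // consumer B, the current owner
console.log(stale, owned) // false true
```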

Dead letters

After maxReceive deliveries (default 3), a message is no longer eligible for receive. It's dead-lettered — still in the table but won't be delivered again. You can inspect it, requeue it, or purge it:

```typescript
const dead = queue.deadLetters()
// [{ id, body, received: 3 }]
```
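Requeuing a dead letter amounts to resetting the delivery counter and making the message visible again. A hedged in-memory sketch of that idea (the `requeue` function here is illustrative; check sqliteq's actual API for the real method):

```typescript
// What requeuing a dead letter boils down to: reset `received` so the
// maxReceive check passes, and set the timeout to "now" so the message
// is immediately eligible for delivery again.
type Msg = { id: string; received: number; timeout: string }

function requeue(msg: Msg, now: string): void {
  msg.received = 0
  msg.timeout = now // timeout not in the future = available
}

const dead: Msg = { id: 'm_9', received: 3, timeout: '2024-01-01T12:00:00.000Z' }
requeue(dead, '2024-01-02T00:00:00.000Z')
console.log(dead.received) // 0: eligible for delivery again
```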

Benchmarks

On a MacBook Pro M-series, file-backed SQLite (not :memory:), with better-sqlite3:

| Operation | Throughput | Latency |
| --- | --- | --- |
| send + receive + delete | ~20,000 ops/sec | 49 μs/op |
| send only | ~31,000 ops/sec | |
| receive + delete (100K rows, 10 queues) | ~18,000 ops/sec | 55 μs/op |
| sendBatch (100 msgs/tx) | ~120,000 ops/sec | |

For a local tool or single-machine service, this is more than enough. The bottleneck in my AI agent system is the LLM API call (seconds), not the queue (microseconds).

Quick start

```shell
npm install @minnzen/sqliteq better-sqlite3
```
```typescript
import Database from 'better-sqlite3'
import { Queue } from '@minnzen/sqliteq'

const db = new Database('app.db')
const q = new Queue(db, 'tasks')

// Send
q.send({ type: 'process', payload: 'data' })

// Receive and acknowledge
const msg = q.receive()
if (msg) {
  console.log(msg.body) // { type: 'process', payload: 'data' }
  q.delete(msg.id, msg.received)
}
```

There's also a Processor class for long-running consumers with auto-retry and auto-extend:

```typescript
import { Processor } from '@minnzen/sqliteq'

const processor = new Processor(q, {
  handler(msg) {
    // return normally = auto-delete
    // throw = leave for retry after timeout
  },
  concurrency: 4,
})

processor.start()
```

Wrapping up

sqliteq is MIT licensed, has zero runtime dependencies, and works with both better-sqlite3 and bun:sqlite. It's written in TypeScript with full generic support (Queue&lt;T&gt;).

The queue is currently powering the event-driven runtime of an AI agent orchestration project I'm working on, where agent crashes, timeouts, and retries are not edge cases but the normal operating mode. Having a durable queue that's just SQLite, with no external processes to manage, has been exactly the right trade-off. In that world, the queue is the backbone — everything else is optional.

If you have a similar "I just need a reliable local queue" situation, give it a try.
