Most developers don't think about queues until the day their API catches fire. You build an endpoint, it works fine, then someone sends a spike of traffic and suddenly everything slows down. Requests hang, CPU spikes, retries add even more load. It turns into chaos really fast.
This was us at OpenPanel. Our ingestion endpoint handles a lot of traffic. It grew slowly at first, then a few customers started sending big bursts and everything fell apart. Not gradually. It went from perfectly fine to completely overwhelmed in minutes. We didn't fix it by scaling the API. We fixed it by offloading the work.
Queues are not over-engineering
People like to say queues are only for giant enterprise systems. I don't really buy that. A queue is basically an async function that runs later instead of right now. That's all it is. You don't need microservices or some huge distributed setup to use one. You just need a spot where you push work so your API doesn't do everything inside the request lifecycle. It is simple, and it can completely change how your system behaves under load.
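To make that concrete, here is the whole idea in two lines. The sendWelcomeEmail function and the queue are just stand-ins for whatever your app does:

// Inside the request: the user waits for the email to go out
await sendWelcomeEmail(user)

// With a queue: push a job and return, a worker sends it later
await queue.add('send-welcome-email', { userId: user.id })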
When queues actually help
There are a lot of good use cases, but these are the clearest ones.
Heavy background tasks. Sending emails, pushing notifications, generating PDFs. Anything slow that the user doesn't need to wait for.
Recurring jobs. Queues are a great way to run cron logic inside your own code instead of managing some external service. You get more control over retries and timing. There's a small example of this after the BullMQ setup below.
Better user experience. This one feels small but it is amazing in practice. Imagine a user writes a comment. Without a queue you would save the comment, then check who needs a notification, then send every single one. The user sits there waiting for all of that to finish. With a queue you save the comment and push a job. The API returns right away. Notifications get sent in the background. Huge UX win for basically no effort. There's a short sketch of this right after this list.
Ingestion endpoints. This is the big one for us. Our tracking endpoint used to validate events, transform them, fetch extra data, and insert everything. All inside one request. That works until someone sends 20 times the normal traffic. Then it stops working pretty fast.
Now the endpoint does one thing. It pushes the event into a queue and returns. Done. The user only cares that we received the request. They do not care what we do after that. This change made the whole ingestion layer stable even when we get huge spikes. If the workers fall behind it just means a delay. We do not lose data. The API stays fast. The queue absorbs the blow instead of the server falling over.
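For the comment example above, the handler ends up looking something like this. It is only a sketch, assuming an Express-style route, a BullMQ queue called notificationQueue, and saveComment standing in for your own persistence logic:

app.post('/comments', async (req, res) => {
  const comment = await saveComment(req.body)

  // Don't make the user wait for the notification fan-out
  await notificationQueue.add('comment-created', { commentId: comment.id })

  res.status(201).json(comment)
})

A worker picks up the job, figures out who should be notified, and sends everything while the user has already moved on.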
A simple BullMQ example
Here is a bare-bones setup you can add to any Node API. The readBody helper and processEvent function below are just placeholders for your own parsing and processing logic.
import { Queue, Worker } from 'bullmq'
import Redis from 'ioredis'
import http from 'http'

// BullMQ workers need maxRetriesPerRequest set to null on the ioredis connection
const redis = new Redis({ maxRetriesPerRequest: null })
const queue = new Queue('jobs', { connection: redis })

// Collect the raw request body and parse it as JSON
const readBody = (req) => new Promise((resolve, reject) => {
  let data = ''
  req.on('data', (chunk) => (data += chunk))
  req.on('end', () => {
    try { resolve(JSON.parse(data)) } catch (err) { reject(err) }
  })
  req.on('error', reject)
})

async function processEvent(event) {
  // validate, transform, and store the event here
}

// The API does one thing: enqueue the event and return immediately
http.createServer(async (req, res) => {
  if (req.url === '/track' && req.method === 'POST') {
    const body = await readBody(req)
    await queue.add('track', body)
    res.writeHead(200)
    res.end('ok')
    return
  }
  res.writeHead(404)
  res.end()
}).listen(3000)

// The worker picks jobs off the queue and does the heavy lifting
new Worker('jobs', async (job) => {
  await processEvent(job.data)
}, { connection: redis })
You can even run the worker in the same process while prototyping. When you go to production you normally split them.
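Splitting them just means moving the Worker into its own entry point and running it as a separate process. A minimal sketch, with a hypothetical worker.js and process-event.js:

// worker.js - start with `node worker.js`, separately from the API
import { Worker } from 'bullmq'
import Redis from 'ioredis'
import { processEvent } from './process-event.js'

const connection = new Redis({ maxRetriesPerRequest: null })

new Worker('jobs', async (job) => {
  await processEvent(job.data)
}, { connection })

Need more throughput? Run more worker processes against the same queue. The API never changes.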
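And for the recurring jobs I mentioned earlier, BullMQ can re-add a job on a schedule using repeatable jobs. A rough sketch, assuming a recent BullMQ version (older versions use cron instead of pattern in the repeat options):

// Runs every night at 03:00; the worker branches on job.name to pick the logic
await queue.add(
  'cleanup',
  {},
  { repeat: { pattern: '0 3 * * *' } }
)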
Using queues in a serverless world
Serverless makes this trickier because you can't keep long-running workers alive, but there are solutions.
Trigger.dev gives you background jobs and retries without needing long running processes. It works with a bunch of runtimes and solves the exact problem that queues solve for traditional servers.
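For a rough feel of what that looks like, here is a task sketched with Trigger.dev's v3 SDK. Treat the import path and options as assumptions on my part and check their docs for the current API:

import { task } from '@trigger.dev/sdk/v3'

// Defined once, runs on Trigger.dev's workers with retries handled for you
export const processEvent = task({
  id: 'process-event',
  run: async (payload) => {
    // validate, transform, and store the event
  },
})

// From your API route, this behaves a lot like queue.add:
// await processEvent.trigger({ name: 'page_view' })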
Vercel's Workflow API is also interesting. You write background workflows as simple functions. They run outside the request flow and persist on their own. If you are already using Vercel this feels like cheating. You get a lot of the benefits of a queue without running Redis or workers yourself.
Why queues mattered for us
Before we added queues, any traffic spike would make our ingestion API slow or unstable. After we added queues it became boring. And boring is exactly what you want in production.
Queues let us keep the API tiny and predictable. Workers do the heavy work. Traffic spikes turn into a backlog instead of downtime. This one change made OpenPanel a lot more reliable for everyone using it.
A quick note about OpenPanel
Queues have become a core part of how we keep OpenPanel fast. Our ingestion API gets a lot of traffic every day and queues turned it from something fragile into something we don't need to think about anymore. We use them for event ingestion, cron tasks, cleanup jobs, and anything that should not block the user. It is one of the reasons OpenPanel can handle heavy load without falling over.
If you want an open source analytics platform that can survive traffic without feeling fragile, take a look at it. It is fully self hosted and built for real world workloads.
In the next article I will talk about GroupMQ, our custom-built queue library. It is a drop-in alternative to BullMQ with first-class support for grouped queues and ordered processing. These features saved us more than once, and I think a lot of teams will find them useful.
