Aakash Gour

Designing a Content Microservice: Architecture Decisions and Trade-offs

The first version of PostAll's content generation system wasn't a microservice. It was a Next.js API route that called OpenAI, wrote the result to a database, and returned a response. It took 8 seconds per article. Our first beta user queued 200 articles and the server timed out after the 11th.

That's the moment you stop thinking about "how do I generate content" and start thinking about "how do I build a system that generates content." They're completely different problems.

This is the architecture I landed on after three rewrites — what I chose, why I chose it, and where each approach broke before I got there.


Why content generation is a bad fit for synchronous APIs

Most API routes work like this: request comes in, thing happens, response goes out. The whole exchange is sub-second. Content generation breaks this model in three ways.

Latency. A GPT-4o call for a 1,000-word article takes 8–15 seconds. That's not a timeout edge case — that's the normal case. No HTTP client should sit open for 15 seconds waiting.

Failure modes. LLM APIs fail. Rate limits hit. Tokens run out mid-response. At low volume, you catch these with a try/catch and a retry. At scale, you need state — you need to know which of 500 jobs failed, why, and whether it's safe to retry.

Fan-out. A user submitting 100 articles isn't submitting 100 requests — they're submitting 1 request that should spawn 100 coordinated jobs. Synchronous APIs can't express that without blocking.

The solution to all three is the same: decouple submission from execution with a job queue.
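
At the API boundary, that decoupling looks roughly like this. The route shapes and the enqueue/getJobStatus helpers here are placeholders for whichever queue implementation follows, not a finished API:

import express from 'express';

const app = express();
app.use(express.json());

app.post('/generate', async (req, res) => {
  // Validate cheaply, enqueue, hand back an ID. No LLM call happens here.
  const jobId = await enqueue(req.body);
  res.status(202).json({ jobId, status: 'pending' });
});

app.get('/jobs/:id', async (req, res) => {
  // Clients can check back, or better, receive a webhook (more on that later)
  const job = await getJobStatus(req.params.id);
  if (!job) return res.status(404).end();
  res.json(job);
});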


The queue design that actually held up

I went through two queue implementations before landing on something stable.

First attempt: in-memory queue with setInterval.

// What I started with — don't do this
import express from 'express';
import { randomUUID as uuid } from 'crypto';

const app = express();
app.use(express.json());

const jobQueue = []; // lives in process memory; gone on restart

app.post('/generate', (req, res) => {
  const jobId = uuid();
  jobQueue.push({ id: jobId, ...req.body });
  res.json({ jobId });
});

// Poll the array every 100ms and process one job at a time.
// processJob is the function that actually calls the LLM.
setInterval(async () => {
  const job = jobQueue.shift();
  if (job) await processJob(job);
}, 100);

This lasted until the first server restart. Everything in the queue evaporated. Lesson learned: queue state needs to survive the process.

Second attempt: database as queue.

// Using Postgres as a job queue — better, but read the gotchas below
import { randomUUID as uuid } from 'crypto';
// `db` is a shared pg Pool instance (from the 'pg' package)

async function enqueueJob(payload) {
  const { rows } = await db.query(
    `INSERT INTO content_jobs (id, status, payload, priority, created_at)
     VALUES ($1, 'pending', $2, $3, NOW())
     RETURNING id`,
    [uuid(), JSON.stringify(payload), payload.priority ?? 0]
  );
  return rows[0].id;
}

async function claimNextJob(workerId) {
  // SELECT FOR UPDATE SKIP LOCKED is critical here —
  // without it, multiple workers race on the same row
  const { rows } = await db.query(`
    UPDATE content_jobs
    SET status = 'processing', worker_id = $1, claimed_at = NOW()
    WHERE id = (
      SELECT id FROM content_jobs
      WHERE status = 'pending'
      ORDER BY priority DESC, created_at ASC
      FOR UPDATE SKIP LOCKED
      LIMIT 1
    )
    RETURNING *
  `);
  return rows[0] ?? null;
}

The FOR UPDATE SKIP LOCKED clause is what makes this safe for concurrent workers. Without it, two workers claim the same job, you get duplicated output, and debugging it is a nightmare. I hit this at 3 workers — it got worse fast.
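
For completeness, here is a sketch of the polling loop that would sit on top of claimNextJob. The 'done'/'failed' status values and the error and finished_at columns are my assumptions about the content_jobs schema, and processJob stands in for whatever actually calls the LLM:

// Minimal polling worker for the Postgres queue (sketch)
async function runWorker(workerId) {
  for (;;) {
    const job = await claimNextJob(workerId);
    if (!job) {
      await new Promise((resolve) => setTimeout(resolve, 1000)); // idle backoff
      continue;
    }
    try {
      await processJob(job);
      await db.query(
        `UPDATE content_jobs SET status = 'done', finished_at = NOW() WHERE id = $1`,
        [job.id]
      );
    } catch (err) {
      // Keep the failure visible in the table instead of swallowing it
      await db.query(
        `UPDATE content_jobs SET status = 'failed', error = $2 WHERE id = $1`,
        [job.id, String(err)]
      );
    }
  }
}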

What I use now: dedicated queue (BullMQ + Redis).

For teams already running Redis, BullMQ solves priority queues, retries, rate limiting per worker, and dead-letter queues out of the box. The database-as-queue approach is legitimate for low volumes, but past ~50 jobs/minute the polling overhead shows up in your database metrics.

import { Queue, Worker } from 'bullmq';
import Redis from 'ioredis';

// BullMQ workers require maxRetriesPerRequest: null on the ioredis connection
const connection = new Redis(process.env.REDIS_URL, {
  maxRetriesPerRequest: null,
});

export const contentQueue = new Queue('content-generation', {
  connection,
  defaultJobOptions: {
    attempts: 3,
    backoff: { type: 'exponential', delay: 2000 },
    removeOnComplete: { count: 1000 }, // keep last 1000 for debugging
    removeOnFail: false,               // keep all failures — you'll want to inspect them
  },
});

// Priority lanes. Note: BullMQ treats LOWER numbers as HIGHER priority,
// the opposite of the ORDER BY in the Postgres version above.
// 1 = high (single article), 10 = low (bulk batch)
export async function enqueue(payload, priority = 10) {
  return contentQueue.add('generate', payload, { priority });
}

The removeOnFail: false line is important. Failed jobs tell you things — which prompts consistently fail, which users hit edge cases, where rate limiting is biting you. Don't throw them away.
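
BullMQ keeps those failed jobs queryable. A quick sketch of pulling recent failures and grouping them by reason (the grouping is just one convenient way to slice them):

// Summarize recent failures by reason: systematic problems show up as
// one reason dominating the count, one-off hiccups stay scattered.
async function summarizeFailures(limit = 200) {
  const failed = await contentQueue.getFailed(0, limit - 1);
  const byReason = {};
  for (const job of failed) {
    const reason = job.failedReason ?? 'unknown';
    byReason[reason] = (byReason[reason] ?? 0) + 1;
  }
  return byReason;
}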


Worker architecture: three lanes, not one

The instinct is to have one worker that does everything: calls the LLM, formats the output, validates it, writes to the database. This works until you need to scale generation without scaling formatting, or you want to add a quality scoring step without touching the generation logic.

I split workers into three types:

Generation worker  → calls LLM, stores raw output
Formatting worker  → transforms raw → target format (HTML, Markdown, JSON)
Post-processing    → SEO checks, quality scoring, webhook dispatch

Each one pulls from its own queue. Generation output triggers a formatting job. Formatting completion triggers post-processing. This means:

  • You can scale generation horizontally when you're rate-limit-bound
  • You can swap the formatting logic without touching generation
  • A formatting bug doesn't take down generation

// Generation worker — one responsibility.
// generateContent, formattingQueue, and db are defined elsewhere in the service.
const generationWorker = new Worker(
  'content-generation',
  async (job) => {
    const { keyword, tone, maxTokens, outputFormat } = job.data;

    const raw = await generateContent({ keyword, tone, maxTokens });

    // Store raw output immediately — if formatting fails, we have the source
    const articleId = await db.saveRawContent({
      jobId: job.id,
      raw,
      metadata: { keyword, tone, outputFormat },
    });

    // Trigger next stage
    await formattingQueue.add('format', { articleId, outputFormat });

    return { articleId };
  },
  {
    connection,
    concurrency: 5, // tune based on your API tier rate limits
    limiter: {
      max: 50,       // OpenAI tier 2: 5000 RPM → ~83 RPS; leave headroom
      duration: 60000,
    },
  }
);

The concurrency: 5 isn't arbitrary. On OpenAI's Tier 2 with GPT-4o, each call takes 8–12 seconds. At 5 concurrent workers, you're making 25–37 requests per minute — well within limits. Push to 15 and you'll start seeing 429s, which trigger retries, which increase your effective request count, which makes the 429s worse. Find your actual throughput ceiling before tuning this.
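
When a 429 does slip through, BullMQ has a manual rate-limiting escape hatch that pauses the worker's lane instead of burning retry attempts. A sketch of how that fits into the processor; the err.status and retry-after checks depend on your OpenAI client version, so treat them as assumptions:

const rateLimitAwareWorker = new Worker(
  'content-generation',
  async (job) => {
    try {
      return await generateContent(job.data);
    } catch (err) {
      if (err.status === 429) {
        // Respect Retry-After when the provider sends it; otherwise back off 30s
        const waitMs = (Number(err.headers?.['retry-after']) || 30) * 1000;
        await rateLimitAwareWorker.rateLimit(waitMs);
        // Special error: moves the job back to waiting instead of failing it
        throw Worker.RateLimitError();
      }
      throw err;
    }
  },
  { connection, concurrency: 5, limiter: { max: 50, duration: 60000 } }
);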


The caching decision I almost got wrong

My first instinct was to cache LLM responses by prompt hash. If the same exact prompt comes in twice, return the cached result. Simple, right?

The problem: content generation prompts are almost never identical. Even "write a blog post about React hooks" varies by tone parameter, target length, style instruction. My cache hit rate after a week was 2.3%. The Redis overhead wasn't worth it.

What is worth caching:

Prompt templates. If your prompts are built from templates + variables, cache the rendered template structure, not the final output. Template rendering is cheap but if you're doing it 500 times for a bulk job, it adds up.

External data fetches. If your prompts include fetched data (product descriptions, SEO keywords from an API), cache those aggressively. An article generation job that fetches 3 external APIs per article doesn't need to make 1,500 API calls for a batch of 500.

Quality scoring results. If you run a readability scorer or SEO checker on output, cache by content hash. The same output will score the same way every time.

// `redis` here is the shared ioredis client from earlier
async function getPromptTemplate(templateId) {
  const cacheKey = `template:${templateId}`;
  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);

  const template = await db.getTemplate(templateId);
  // Templates change infrequently — 1 hour TTL is safe
  await redis.setex(cacheKey, 3600, JSON.stringify(template));
  return template;
}
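
The scoring cache follows the same shape, keyed by a hash of the content itself. scoreContent below is a stand-in for whatever readability or SEO checker you run:

import { createHash } from 'crypto';

async function getQualityScore(content) {
  const hash = createHash('sha256').update(content).digest('hex');
  const cacheKey = `score:${hash}`;

  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);

  const score = await scoreContent(content); // hypothetical scorer
  // Identical content scores identically, so a long TTL is safe
  await redis.setex(cacheKey, 86400, JSON.stringify(score));
  return score;
}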

Monitoring: the metrics that actually matter

I spent two weeks looking at the wrong things. CPU and memory on the worker boxes stayed flat. What I should have been watching:

Queue depth over time. If your queue is growing faster than it's draining, you have a capacity problem. You want to alert when pending_jobs > (worker_count × average_job_duration × 2).

Job age at processing time. This is different from queue depth. A queue of 100 jobs with 5-second average age is healthy. A queue of 100 jobs with 45-minute average age means your throughput has been underwater for an hour.

Token cost per job. LLM costs are the primary cost driver for PostAll. Tracking average token usage per job type lets you catch prompt bloat before it shows up in your OpenAI invoice.

Dead-letter queue size. If your DLQ grows, something systematic is failing — not just occasional network hiccups. I check this daily; spikes always point to something actionable.

// Expose these as a /metrics endpoint — pipe to Grafana or Datadog
async function getQueueMetrics() {
  const [waiting, active, failed, delayed] = await Promise.all([
    contentQueue.getWaitingCount(),
    contentQueue.getActiveCount(),
    contentQueue.getFailedCount(),
    contentQueue.getDelayedCount(),
  ]);

  const oldestJob = await contentQueue.getJobs(['waiting'], 0, 0);
  const jobAgeSeconds = oldestJob[0]
    ? (Date.now() - oldestJob[0].timestamp) / 1000
    : 0;

  return { waiting, active, failed, delayed, jobAgeSeconds };
}
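
To close the loop on the alert rule from above, a sketch of evaluating it against those counts. WORKER_COUNT and AVG_JOB_SECONDS are values you would measure for your own deployment, not constants to copy:

const WORKER_COUNT = 8;      // generation workers
const AVG_JOB_SECONDS = 10;  // measured average job duration

async function checkQueueHealth() {
  const metrics = await getQueueMetrics();
  // The heuristic from above: a backlog beyond what the workers clear in
  // roughly twice the average job duration means throughput has fallen behind
  const threshold = WORKER_COUNT * AVG_JOB_SECONDS * 2;
  if (metrics.waiting > threshold) {
    console.warn(`Queue backlog ${metrics.waiting} exceeds threshold ${threshold}`);
  }
  return metrics;
}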

What I'd do differently

Start with the database queue, not BullMQ. BullMQ is great, but it's an additional dependency with its own failure modes. For the first 10,000 jobs, Postgres as a queue is completely fine and removes Redis as a point of failure. Migrate when you have a concrete reason.

Design the webhook contract on day one. PostAll's clients originally had to poll for job status; I built the webhook system as an afterthought and had to retrofit it. The async nature of the system means clients need a push notification — don't make them poll. Define the webhook payload schema before you write the first worker.
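
For illustration, the kind of shape that contract might take. This field set is a sketch, not PostAll's actual schema:

// Example webhook payload for a finished job. Version it from day one so
// the schema can evolve without breaking consumers.
const webhookPayload = {
  schemaVersion: 1,
  event: 'job.completed',      // or 'job.failed'
  jobId: 'b4f1c2d0',           // the ID returned at submission time
  articleId: 'art_8821',       // null until generation succeeds
  status: 'completed',
  error: null,                 // failure reason when event is 'job.failed'
  finishedAt: new Date().toISOString(),
};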

Add job metadata fields you don't need yet. organization_id, triggered_by, cost_usd, model_used. These fields are free to store and priceless when you're debugging a production issue at 11pm and need to know which tenant generated a specific piece of content and how much it cost.
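
In practice that just means widening the payload at enqueue time. A sketch, where org and user stand in for whatever your request context carries:

await enqueue({
  ...payload,
  organization_id: org.id,  // which tenant: the first question in any incident
  triggered_by: user.id,    // user, cron job, or API key that kicked it off
  model_used: 'gpt-4o',     // pin it per job; account defaults change over time
  cost_usd: null,           // filled in by the worker after the LLM call
});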


What it handles now

PostAll's current architecture processes 500 articles per hour across 8 generation workers, 4 formatting workers, and 2 post-processing workers. P95 end-to-end latency — from API submission to webhook delivery — is 47 seconds for standard articles. That's not fast, but it's reliable, it's observable, and when something breaks, the failure is contained to one stage.

The clients get a job ID immediately. The rest is async. That's the only architecture that makes sense for workloads this slow.


What's your queue strategy for async workloads? I've seen teams reach for Kafka for this and pay a complexity tax they didn't need. Curious where that decision point actually is — drop it in the comments.
