Aakash Gour

The Hidden Complexity of "Simple" Text Generation at Scale

What developers don't realize until their queue is on fire.

I thought I understood the problem.
You've got a list of inputs. You send each one to an LLM API. You get text back. You store it. Done — right?

That's what I believed until I tried to run it at scale. Not "scale" in the Silicon Valley sense. Just: 400 product descriptions for a client's e-commerce migration, needed by Thursday, with a two-person team and a $30 API budget.

The naive version worked fine for the first 40. Then it quietly started failing in ways that took me four hours to even name.
This is what bulk text generation actually involves — the parts that don't show up in any "getting started with the OpenAI API" tutorial.

The four problems nobody warns you about

1. Rate limits are not what you think they are
The OpenAI rate limit docs say: "You can make X requests per minute." That sounds like a traffic light — green until you hit the limit, then red. It's not a traffic light. It's more like a leaky bucket with multiple dimensions.

You're rate-limited by requests per minute, tokens per minute, and (depending on your tier) tokens per day. These limits apply independently. You can be under the RPM limit and still get a 429 if your average request is 3,000 tokens and you're sending 20 of them in 60 seconds.

Here's what this looks like in practice. You fire off 50 requests concurrently. The first 30 succeed. Requests 31-50 hit the token-per-minute ceiling and fail with a 429. Your retry logic catches them and queues them for the next minute.

But by then, the first 30 have already written to your database — so when the retries run, you've got logic decisions to make: do you re-check for existing records? Do you deduplicate by input hash? Do you trust that your storage layer handled concurrent writes correctly?

The rate limit isn't just a throttle. It's a consistency problem wearing a throttle costume.
Here's the retry wrapper I ended up with after two failed simpler versions:

import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function generateWithRetry(prompt, options = {}) {
  const {
    maxRetries = 5,
    baseDelay = 1000,
    maxDelay = 32000,
  } = options;

  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const response = await openai.chat.completions.create({
        model: "gpt-4o",
        messages: [{ role: "user", content: prompt }],
        max_tokens: 1000,
      });

      return response.choices[0].message.content;

    } catch (error) {
      if (error.status === 429) {
        // Exponential backoff with jitter — without jitter, all retrying
        // requests hit the API at the same time and cause a retry thunderstorm
        const jitter = Math.random() * 500;
        const delay = Math.min(baseDelay * Math.pow(2, attempt) + jitter, maxDelay);

        console.warn(`Rate limited. Retry ${attempt + 1}/${maxRetries} in ${Math.round(delay)}ms`);
        await sleep(delay);
        continue;
      }

      // Don't retry on non-rate-limit errors
      throw error;
    }
  }

  throw new Error(`Failed after ${maxRetries} retries`);
}

function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

The jitter is the part I got wrong first. Without it, if 20 requests all fail at the same time and all wait exactly 2 seconds before retrying, you've created a synchronized retry storm that hits the API at the exact moment your rate limit window resets — and you'll 429 again immediately. Add randomness. It's not elegant, but it works.
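The retry wrapper handles individual failures, but the other half of the fix was not firing the whole batch at once. Here's a minimal sketch of how you might drive inputs through generateWithRetry with a bounded number of in-flight requests; runBatch and the concurrency default are my names and numbers, not anything from the API:

// Worker-pool sketch: `concurrency` workers pull from a shared queue, so at
// most that many requests are in flight at once. Results are not ordered.
async function runBatch(inputs, buildPromptFn, concurrency = 5) {
  const queue = [...inputs];
  const results = [];

  async function worker() {
    while (queue.length > 0) {
      const input = queue.shift();
      try {
        const text = await generateWithRetry(buildPromptFn(input));
        results.push({ input, text });
      } catch (error) {
        // Record the failure and keep going instead of killing the whole batch
        results.push({ input, error: error.message });
      }
    }
  }

  await Promise.all(Array.from({ length: concurrency }, () => worker()));
  return results;
}

This only bounds the request count, not tokens per minute, so size the concurrency against both limits for your tier.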

2. Context windows are a constraint you plan around, not react to
The GPT-4o context window is 128,000 tokens. That sounds like infinity. It's not.
Here's a real scenario: You're generating long-form blog posts from an outline + brand guidelines + source material + few-shot examples. You want to maintain consistent tone across 50 posts in the same batch, so you include the style guide in every prompt.

The style guide is 2,000 tokens. Your source material per article is 1,500 tokens. Your few-shot examples are 3,000 tokens. Your actual prompt instructions are 500 tokens.

That's 7,000 tokens of input before you've written a single word of output. Multiply by 50 requests, and you're paying for 350,000 input tokens that never appear in the final output.
This is the moment you realize you need to think about token budgets the way you think about memory budgets.

What I ended up doing: Compressing the style guide into a structured format, caching it as a system message, and treating it as a constant cost rather than a per-request variable.

Then auditing every few-shot example to ask: "Is this example earning its tokens?" Some examples that looked useful turned out to add almost nothing to output quality — just token spend.
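The style-guide part of that is structurally boring, which is the point: one constant system message that is byte-identical across every request in the batch. A sketch (the names are mine, and the content is a stand-in):

// Built once, reused verbatim on every request. Keeping it byte-identical also
// gives provider-side prompt caching a chance to kick in where it's available.
const STYLE_GUIDE_MESSAGE = {
  role: "system",
  content: "You write product descriptions. Tone: plain, confident, no superlatives. ...",
};

function buildMessages(input) {
  return [
    STYLE_GUIDE_MESSAGE,
    { role: "user", content: `Product data: ${JSON.stringify(input)}\n\nWrite the description.` },
  ];
}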
The practical tool here is a token counting function that runs before every API call:

import { encoding_for_model } from "tiktoken";

function countTokens(text, model = "gpt-4o") {
  const enc = encoding_for_model(model);
  const tokens = enc.encode(text);
  enc.free(); // Important: free the encoder or you'll leak memory in long-running processes
  return tokens.length;
}

function buildPrompt(input, styleGuide, examples) {
  const components = {
    instructions: `Generate a product description for: "${input.name}"\n\n`,
    styleGuide: `Style guidelines:\n${styleGuide}\n\n`,
    examples: `Examples:\n${examples.join("\n\n")}\n\n`,
    inputData: `Product data: ${JSON.stringify(input)}\n\nDescription:`,
  };

  const total = Object.values(components).reduce(
    (sum, text) => sum + countTokens(text), 0
  );

  // Reserve 1000 tokens for output
  if (total > 127000) {
    // Trim examples first — they're the most compressible
    // This is where you make product decisions about what to sacrifice
    throw new Error(`Prompt too large: ${total} tokens. Trim examples.`);
  }

  return Object.values(components).join("");
}

The enc.free() call is the thing nobody mentions. Run this without freeing the encoder in a batch job and you'll have a memory leak that's genuinely confusing to debug.
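If you're counting tokens thousands of times in one run, creating and freeing an encoder per call also adds overhead. A variant that builds the encoder once and frees it when the batch is done uses the same tiktoken API; just don't forget the single free() at the end:

import { encoding_for_model } from "tiktoken";

// One encoder for the whole batch instead of one per call
const encoder = encoding_for_model("gpt-4o");

function countTokensReusing(text) {
  return encoder.encode(text).length;
}

// Call exactly once, after the batch finishes; this replaces the per-call free()
function releaseEncoder() {
  encoder.free();
}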

3. Chunking is harder than splitting by character count
The obvious approach to generating long content is: split the input, generate in chunks, concatenate.
The problem is that LLMs don't produce coherent output when stitched mechanically. Section 3 doesn't know what Section 1 decided. The conclusion doesn't know what premises got established in the introduction. You end up with text that passes a grammar check and fails a coherence check.

There are a few approaches here. None of them are perfect.
Option A: Sequential generation with context carry-forward. Generate section 1. Pass section 1's output as context when generating section 2. Continue. This maintains coherence, but it's slow, and token spend climbs fast because each request carries the previously generated sections as input.

Option B: Outline-first, then expand. Generate a detailed outline first. Then generate each section using the outline as a reference. This is faster and keeps sections more consistent, but it depends on a good outline-generation step, which is its own challenge.

Option C: One-shot long generation. If your content fits in a single API call, just... do it that way. Resist the urge to chunk prematurely. I chunked things that didn't need chunking for weeks before realizing I was adding complexity for no reason.

async function generateLongContent(topic, targetLength = 1000) {
  // Option B: outline-first approach
  const outline = await generateWithRetry(
    `Create a detailed outline for a ${targetLength}-word article about "${topic}".
     Return as JSON: { sections: [{ title: string, keyPoints: string[] }] }`
  );

  let parsedOutline;
  try {
    // Strip markdown code fences if the model returns them — it often does
    const cleaned = outline.replace(/```json|```/g, "").trim();
    parsedOutline = JSON.parse(cleaned);
  } catch {
    throw new Error("Outline generation returned invalid JSON. Retry or adjust prompt.");
  }

  const sections = [];

  for (const section of parsedOutline.sections) {
    const content = await generateWithRetry(`
      You're writing a section of an article about "${topic}".

      Full outline for context:
      ${JSON.stringify(parsedOutline.sections)}

      Write the "${section.title}" section now.
      Cover these points: ${section.keyPoints.join(", ")}.
      Length: approximately ${Math.round(targetLength / parsedOutline.sections.length)} words.
      Do not include a header — just the prose.
    `);

    sections.push({ title: section.title, content });

    // Brief delay between sections to stay under RPM limits
    await sleep(500);
  }

  return sections;
}

The thing that surprised me: passing the full outline on every section request felt wasteful (more tokens, more cost), but removing it produced noticeably inconsistent section lengths and tone. The outline acts as an implicit contract that keeps the whole piece coherent. Those extra tokens are earning their keep.

4. Deduplication is a business logic problem, not a technical one
When you're running bulk generation with retries, you will generate the same content more than once. A request that appeared to fail (network timeout, for instance) may have succeeded on the API side — you just never got the response. If you retry it, you've now queued a second generation for the same input.

The naive fix is: check if this input already exists in the database before generating. But this doesn't work reliably when you're running concurrent workers, because two workers can both check for the same record at the same time, both get "not found," and both generate.

The fix is idempotency at the database layer, not the application layer.

// Using PostgreSQL's INSERT ... ON CONFLICT DO NOTHING
// This makes the write atomic — if two workers race, one wins and one no-ops

import pg from "pg";

// Connection settings come from the standard PG* environment variables
const db = new pg.Pool();

async function saveGeneratedContent(inputHash, content, metadata) {
  const result = await db.query(
    `INSERT INTO generated_content (input_hash, content, metadata, created_at)
     VALUES ($1, $2, $3, NOW())
     ON CONFLICT (input_hash) DO NOTHING
     RETURNING id`,
    [inputHash, content, metadata]
  );

  // result.rows.length === 0 means a duplicate — the other worker won
  // This is fine. Don't retry. Don't throw. Just move on.
  return result.rows[0] ?? null;
}

// Generate a deterministic hash from the input
import { createHash } from "crypto";

function hashInput(input) {
  return createHash("sha256")
    .update(JSON.stringify(input))
    .digest("hex");
}

Two workers racing to write the same content: one inserts successfully, one gets a no-op, both continue processing their queues. No duplicates. No errors. No extra logic needed at the application level.

The thing I wish I'd known earlier: the deduplication question is really a question about what "the same input" means for your use case. Is {"name": "Blue Sweater", "color": "blue"} the same as {"color": "blue", "name": "Blue Sweater"}? For JSON objects, key ordering isn't guaranteed to be consistent across your application. JSON.stringify may produce different strings from the same logical object depending on how the object was constructed. Normalize before you hash — sort keys, lowercase strings, whatever's appropriate for your domain.
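Here's a minimal version of that normalization for flat objects, sorting keys before stringifying so logically identical inputs produce the same hash. The lowercasing and one-level sort are illustrative choices; nested objects would need a recursive version:

import { createHash } from "crypto";

// Sort keys so { name, color } and { color, name } serialize identically.
// Handles flat objects only; recurse if your inputs nest.
function canonicalize(input) {
  const sorted = {};
  for (const key of Object.keys(input).sort()) {
    const value = input[key];
    sorted[key] = typeof value === "string" ? value.trim().toLowerCase() : value;
  }
  return JSON.stringify(sorted);
}

function hashInputNormalized(input) {
  return createHash("sha256").update(canonicalize(input)).digest("hex");
}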

What I wish I'd known before writing the first version
The failure modes are quiet. A 429 that gets swallowed by a try/catch without logging looks identical to a successful request that returned nothing. Add observability first. Not after things break — before.
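Even a blunt per-batch tally is enough to separate those two cases. A sketch with counter names of my own choosing; you'd bump rateLimited inside the 429 branch of the retry wrapper and record every final outcome:

// Per-batch counters: a request that "succeeded" but returned nothing and a
// request that was rate limited into oblivion are different problems.
const stats = { success: 0, emptyResponse: 0, failed: 0, rateLimited: 0 };

function recordResult(text) {
  if (!text || text.trim() === "") stats.emptyResponse++;
  else stats.success++;
}

function recordFailure(error) {
  stats.failed++;
  console.error(`Generation failed: ${error.message}`);
}

// Print at the end of every run so the quiet failure modes show up in the logs
function logBatchStats() {
  console.log(`Batch summary: ${JSON.stringify(stats)}`);
}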

Token budgets compound. A style guide that's 500 tokens "too long" costs you $0.003 extra per request. Across 10,000 requests, that's $30. Optimize token costs the same way you'd optimize database queries — not obsessively, but intentionally.

Output length is not output quality. Every team I've talked to has had a phase where they equate longer generation with better generation. It's not true. A 300-word product description can be excellent. A 1,200-word product description can be redundant padding. Set max_tokens deliberately, not generously.

Batching and streaming solve different problems. If your use case is offline batch generation (like my e-commerce project), optimize for throughput and cost. If your use case is interactive generation where a user is waiting, optimize for time-to-first-token with streaming. These are different architectures and it's worth deciding which one you're building before you start.
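For contrast, the interactive shape looks like this with the OpenAI Node SDK: set stream: true and consume chunks as they arrive, which is what buys you a fast time-to-first-token. A sketch, with error handling omitted:

// Streaming: the caller starts seeing text as soon as the first tokens arrive,
// instead of waiting for the whole completion to finish.
async function generateStreaming(prompt, onToken) {
  const stream = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: prompt }],
    stream: true,
  });

  let full = "";
  for await (const chunk of stream) {
    const delta = chunk.choices[0]?.delta?.content ?? "";
    if (delta) {
      full += delta;
      onToken(delta); // e.g. forward to the client over SSE or a websocket
    }
  }
  return full;
}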

Why I built something around this
After running into these same four problems on multiple projects — for myself and for clients — I got tired of rebuilding the same infrastructure. That led me to build CAT (Content Automation Tool), which handles rate limiting, chunking, context management, and deduplication as core features rather than afterthoughts.

But honestly? You don't need a tool to solve these problems. You need to understand them first. The code above is the foundation — handle your rate limits properly, count tokens before you send them, think about chunking strategy upfront, and make your writes idempotent. That's 90% of the complexity.

The last 10% is where things get interesting. But that's a different post.

What's your scale?
I've described this from the perspective of hundreds-to-low-thousands of requests per batch. The architecture changes meaningfully above 10,000 requests/hour — you start needing proper job queues, worker pools, and different rate limit strategies.

Where are you hitting complexity in your generation pipelines? Specifically: rate limits, output quality, or something else entirely?
