Every developer eventually faces the same problem: you need to generate images programmatically. Maybe it is social media posts that need to go out daily, marketing assets for dozens of product variants, or dynamic Open Graph images for a blog. Hiring a designer for every permutation is not scalable. Using Canva templates feels brittle. What you actually want is a pipeline — AI generates the base image, your code handles the rest.
In this article, we will build exactly that. We will use Google's Gemini API to generate images from structured prompts, then Sharp to post-process them: circular logo overlays, text via SVG, format conversion, and multi-platform sizing. Everything in TypeScript, everything you can drop into a CI job or cron.
Architecture Overview
The pipeline has three stages:
[Structured Prompt] → [Gemini API] → [Raw PNG Buffer]
                                            ↓
                                     [Sharp Pipeline]
                                            ↓
                      ┌─────────────────────┼─────────────────────┐
                      ↓                     ↓                     ↓
                [1080x1080]           [1080x1920]           [1000x1500]
               Instagram Feed        Stories / Reels         Pinterest
Gemini generates a base image. Sharp takes that buffer and composites logos, text overlays, and crops it into multiple aspect ratios. No temporary files on disk — everything stays in memory as buffers until the final write.
Project Setup
Initialize a TypeScript project and install the dependencies:
mkdir image-pipeline && cd image-pipeline
npm init -y
npm install sharp @google/genai
npm install -D typescript @types/node
npx tsc --init
In your tsconfig.json, set the essentials:
{
"compilerOptions": {
"target": "ES2022",
"module": "Node16",
"moduleResolution": "Node16",
"outDir": "./dist",
"strict": true,
"esModuleInterop": true,
"skipLibCheck": true
},
"include": ["src/**/*"]
}
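One thing npx tsc --init will not do for you: the code below uses ESM imports with .js extensions, which under "module": "Node16" means Node must treat the compiled output as ES modules. Add this to your package.json:

```json
{
  "type": "module"
}
```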
Type Definitions
Start with clear types. This keeps the pipeline predictable:
// src/types.ts
export interface GenerationRequest {
prompt: string;
style?: string;
negativePrompt?: string;
}
export interface ProcessingOptions {
logo?: {
path: string;
size: number;
position: 'top-left' | 'top-right' | 'bottom-left' | 'bottom-right';
margin: number;
};
textOverlay?: {
text: string;
fontSize: number;
color: string;
position: 'top' | 'bottom';
backgroundColor?: string;
};
outputs: OutputFormat[];
}
export interface OutputFormat {
name: string;
width: number;
height: number;
format: 'png' | 'jpeg' | 'webp';
quality?: number;
}
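As a concrete example, the three targets from the architecture diagram map directly onto OutputFormat. The names are arbitrary labels; they become the keys of the result map later in the pipeline:

```typescript
// The three platform targets from the architecture diagram.
// In the project this would be typed as OutputFormat[] from './types.js'.
const PLATFORM_OUTPUTS = [
  { name: 'instagram-feed', width: 1080, height: 1080, format: 'jpeg', quality: 85 },
  { name: 'stories', width: 1080, height: 1920, format: 'jpeg', quality: 85 },
  { name: 'pinterest', width: 1000, height: 1500, format: 'png' },
] as const;

console.log(PLATFORM_OUTPUTS.map((o) => `${o.name}: ${o.width}x${o.height}`));
```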
Calling Gemini's Image Generation API
The Gemini API accepts a text prompt and returns image data as base64. The key insight: your prompt engineering matters more than any post-processing. Spend time here.
// src/generate.ts
import { GoogleGenAI, type GenerateContentResponse } from '@google/genai';
import type { GenerationRequest } from './types.js';
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY! });
export async function generateBaseImage(
request: GenerationRequest
): Promise<Buffer> {
const prompt = buildPrompt(request);
const response: GenerateContentResponse = await ai.models.generateContent({
model: 'gemini-2.0-flash-preview-image-generation',
contents: prompt,
config: {
responseModalities: ['TEXT', 'IMAGE'],
},
});
const parts = response.candidates?.[0]?.content?.parts;
if (!parts) {
throw new Error('No content in Gemini response');
}
const imagePart = parts.find(
(part) => part.inlineData?.mimeType?.startsWith('image/')
);
if (!imagePart?.inlineData?.data) {
throw new Error('No image data in response.');
}
return Buffer.from(imagePart.inlineData.data, 'base64');
}
function buildPrompt(request: GenerationRequest): string {
const parts = [
request.prompt,
request.style ? `Style: ${request.style}` : 'Style: clean, modern, professional',
'The image should have a clean composition with space for text overlays.',
'Do not include any text or lettering in the image.',
];
if (request.negativePrompt) {
parts.push(`Avoid: ${request.negativePrompt}`);
}
return parts.join('\n');
}
Two things to note. First, we explicitly ask the model not to include text in the image — AI-generated text in images is notoriously unreliable, and we will overlay our own text with Sharp where we have pixel-perfect control. Second, we ask for clean composition with space for overlays.
Post-Processing with Sharp
This is where the pipeline gets interesting. Sharp operates on streams and buffers, which makes it fast and memory-efficient.
Circular Logo Overlay
Logos on images look better when they are circular with a subtle border. Sharp does not have a "circle crop" function, but you can achieve it with an SVG mask:
// src/process.ts
import sharp from 'sharp';
import { readFile } from 'fs/promises';
import type { ProcessingOptions, OutputFormat } from './types.js';
async function createCircularLogo(
logoPath: string,
diameter: number
): Promise<Buffer> {
const logoBuffer = await readFile(logoPath);
const radius = diameter / 2;
const mask = Buffer.from(
`<svg width="${diameter}" height="${diameter}">
<circle cx="${radius}" cy="${radius}" r="${radius}" fill="white"/>
</svg>`
);
const circularLogo = await sharp(logoBuffer)
.resize(diameter, diameter, { fit: 'cover' })
.composite([{ input: mask, blend: 'dest-in' }])
.png()
.toBuffer();
const borderWidth = Math.max(2, Math.round(diameter * 0.02));
const borderSvg = Buffer.from(
`<svg width="${diameter}" height="${diameter}">
<circle cx="${radius}" cy="${radius}" r="${radius - borderWidth / 2}"
fill="none" stroke="white" stroke-width="${borderWidth}" opacity="0.9"/>
</svg>`
);
return sharp(circularLogo)
.composite([{ input: borderSvg, blend: 'over' }])
.png()
.toBuffer();
}
Text Overlay via SVG
Sharp's built-in text support (the Pango-backed text input) depends on how libvips was compiled and which fonts exist on the host, so it is not reliable across environments. The portable approach is to generate an SVG containing the text and composite it onto the image:
function createTextOverlaySvg(
width: number,
text: string,
fontSize: number,
color: string,
backgroundColor?: string
): Buffer {
  const escaped = text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
const padding = Math.round(fontSize * 0.8);
const lineHeight = Math.round(fontSize * 1.4);
const bgHeight = lineHeight + padding * 2;
const bgRect = backgroundColor
? `<rect x="0" y="0" width="${width}" height="${bgHeight}" fill="${backgroundColor}" opacity="0.75" rx="0"/>`
: '';
return Buffer.from(
`<svg width="${width}" height="${bgHeight}">
${bgRect}
<text x="${width / 2}" y="${padding + lineHeight * 0.75}"
font-family="Arial, Helvetica, sans-serif"
font-size="${fontSize}" font-weight="bold"
fill="${color}" text-anchor="middle"
letter-spacing="0.5">
${escaped}
</text>
</svg>`
);
}
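Because the SVG is parsed as XML, the escaping step is not optional: a stray & or < in the overlay text would make the whole composite fail. A standalone sketch of a correct escaper (ampersands must be replaced first so the other entities are not double-escaped):

```typescript
// Escape the three characters that break SVG/XML text content.
// The & replacement must run first, or &lt; would become &amp;lt;.
function escapeForSvg(text: string): string {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

console.log(escapeForSvg('Sale: 50% off <today> & tomorrow'));
// → Sale: 50% off &lt;today&gt; &amp; tomorrow
```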
Handling Variable Image Dimensions
AI models do not always return images at the exact dimensions you expect. Gemini might return 1024x1024, or 1536x1024, or something else. Your pipeline must handle this gracefully:
async function normalizeBaseImage(
imageBuffer: Buffer,
targetSize: number = 2048
): Promise<{ buffer: Buffer; width: number; height: number }> {
const metadata = await sharp(imageBuffer).metadata();
const { width, height } = metadata;
if (!width || !height) {
throw new Error('Cannot read image dimensions from AI output');
}
const scale = targetSize / Math.max(width, height);
const normalized = await sharp(imageBuffer)
.resize({
width: Math.round(width * scale),
height: Math.round(height * scale),
fit: 'fill',
kernel: sharp.kernel.lanczos3,
})
.png()
.toBuffer();
const newMeta = await sharp(normalized).metadata();
return {
buffer: normalized,
width: newMeta.width!,
height: newMeta.height!,
};
}
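The scaling arithmetic in normalizeBaseImage is pure, so it can be checked without decoding any pixels. A standalone sketch of the same math (the function name is just for illustration):

```typescript
// Pure sketch of the resize math in normalizeBaseImage:
// scale the long edge to targetSize, preserving aspect ratio.
function normalizedDims(
  width: number,
  height: number,
  targetSize = 2048
): { width: number; height: number } {
  const scale = targetSize / Math.max(width, height);
  return {
    width: Math.round(width * scale),
    height: Math.round(height * scale),
  };
}

console.log(normalizedDims(1536, 1024)); // → { width: 2048, height: 1365 }
```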
The Full Processing Pipeline
export async function processImage(
rawBuffer: Buffer,
options: ProcessingOptions
): Promise<Map<string, Buffer>> {
const { buffer, width, height } = await normalizeBaseImage(rawBuffer);
const composites: sharp.OverlayOptions[] = [];
if (options.logo) {
const { path, size, position, margin } = options.logo;
const circularLogo = await createCircularLogo(path, size);
const positionMap: Record<string, { left: number; top: number }> = {
'top-left': { left: margin, top: margin },
'top-right': { left: width - size - margin, top: margin },
'bottom-left': { left: margin, top: height - size - margin },
'bottom-right': { left: width - size - margin, top: height - size - margin },
};
composites.push({ input: circularLogo, ...positionMap[position] });
}
if (options.textOverlay) {
const { text, fontSize, color, position, backgroundColor } = options.textOverlay;
const textSvg = createTextOverlaySvg(width, text, fontSize, color, backgroundColor);
composites.push({
input: textSvg,
top: position === 'top' ? 0 : height - Math.round(fontSize * 1.4 + fontSize * 1.6), // bgHeight in createTextOverlaySvg: lineHeight + padding * 2
left: 0,
});
}
const composited = composites.length > 0
? await sharp(buffer).composite(composites).png().toBuffer()
: buffer;
const results = new Map<string, Buffer>();
for (const output of options.outputs) {
const result = await sharp(composited)
.resize(output.width, output.height, {
fit: 'cover',
position: sharp.strategy.attention,
})
.toFormat(output.format, { quality: output.quality ?? 85 })
.toBuffer();
results.set(output.name, result);
}
return results;
}
Notice the use of sharp.strategy.attention for cropping. This strategy centers the crop on the region with the highest luminance frequency, color saturation, and presence of skin tones, which is usually the most interesting part of the image, and keeps it inside every aspect ratio.
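The corner placement in processImage is also pure arithmetic and worth sanity-checking in isolation. A standalone sketch of the positionMap logic (the helper name is illustrative, not part of the pipeline files):

```typescript
type Corner = 'top-left' | 'top-right' | 'bottom-left' | 'bottom-right';

// Same arithmetic as the positionMap in processImage:
// offset from the near edge by margin, from the far edge by size + margin.
function logoPosition(
  corner: Corner,
  width: number,
  height: number,
  size: number,
  margin: number
): { left: number; top: number } {
  return {
    left: corner.endsWith('left') ? margin : width - size - margin,
    top: corner.startsWith('top') ? margin : height - size - margin,
  };
}

console.log(logoPosition('bottom-right', 2048, 2048, 200, 48));
// → { left: 1800, top: 1800 }
```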
Performance Tips
Batch your composites. Sharp applies only one composite() per pipeline (a second call replaces the first), and round-tripping through toBuffer() between overlays forces a full decode and re-encode each time. Pass all overlays in a single array.
Process output formats in parallel:
const results = await Promise.all(
options.outputs.map(async (output) => {
const buffer = await sharp(composited)
.resize(output.width, output.height, { fit: 'cover', position: sharp.strategy.attention })
.toFormat(output.format, { quality: output.quality ?? 85 })
.toBuffer();
return [output.name, buffer] as const;
})
);
return new Map(results);
Cache the circular logo. If you are generating many images with the same logo, compute the circular mask once and reuse the buffer.
Limit Sharp concurrency in high-throughput scenarios:
import sharp from 'sharp';
sharp.concurrency(2); // Limit to 2 threads on constrained environments
Error Handling
AI image generation is inherently unreliable. Wrap the generation step with retries:
async function generateWithRetry(
request: GenerationRequest,
maxRetries: number = 3
): Promise<Buffer> {
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
return await generateBaseImage(request);
} catch (error) {
const message = error instanceof Error ? error.message : String(error);
console.warn(`Attempt ${attempt}/${maxRetries} failed: ${message}`);
if (attempt === maxRetries) throw error;
await new Promise((r) => setTimeout(r, 2000 * Math.pow(2, attempt - 1)));
}
}
throw new Error('Unreachable');
}
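The delay formula yields a standard exponential backoff from a 2-second base. Only failed non-final attempts wait (the last attempt rethrows), so with maxRetries = 3 there are two pauses:

```typescript
// Waits after failed attempts 1 and 2; attempt 3 rethrows instead of waiting.
const delaysMs = [1, 2].map((attempt) => 2000 * Math.pow(2, attempt - 1));
console.log(delaysMs); // → [ 2000, 4000 ]
```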
Also validate the AI output before processing:
async function validateImage(buffer: Buffer): Promise<void> {
try {
const meta = await sharp(buffer).metadata();
if (!meta.width || !meta.height) {
throw new Error('Image has no dimensions');
}
if (meta.width < 256 || meta.height < 256) {
throw new Error(`Image too small: ${meta.width}x${meta.height}`);
}
} catch (error) {
throw new Error(`Invalid image from AI: ${error}`);
}
}
Conclusion
The pattern here is straightforward but powerful: let AI do the creative work, let code handle the mechanical work. Gemini generates a base image that would take a designer thirty minutes. Sharp crops, composites, and exports it into five platform-ready formats in under two seconds.
You can extend this in several directions. Add a queue (BullMQ) to process generation requests asynchronously. Store outputs in S3 and return signed URLs. Use a template system to swap logos and text per client. The compositing logic stays the same regardless of scale.
If you are building anything that needs images at scale — social media automation, e-commerce product cards, dynamic OG images — this architecture will save you from both designer bottlenecks and brittle template systems.