xiangbin zhuang

Posted on • Originally published at forgemoji.com

How I Built an AI Emoji Generator with Next.js 15 & Cloudflare Workers AI

Every emoji tool I could find did the same thing: let you pick from a fixed set of combos.

Emoji Kitchen pulls in 50K+ monthly visitors precisely because people love emoji combinations. But it's just a lookup table — Google pre-rendered ~40,000 combinations and serves them as static images. There's no AI, no creativity, and no support for combinations that don't exist in the dataset.

I wanted to build something different: type two emoji, get a brand-new AI-generated image that's never existed before, with a transparent background so you can actually use it anywhere.

Here's how Forgemoji works under the hood.

A quick note on scope: I'm a UI/UX designer by trade, so I handled the full product — interface design, interaction design, and the engineering. The stack choices below reflect someone who thinks in user flows first and infrastructure second.


The Architecture: Three Layers

I ended up with a three-layer generation system, each progressively more capable (and more expensive):

```
Layer 1 — Static lookup     ← REMOVED (copyright risk)
Layer 2 — Text-to-image     ← Primary path (free, ~30s)
Layer 3 — Image-to-image    ← Upload your photo, fuse it with an emoji (~120s, async)
```

Layer 1 was the original plan — use Google's pre-rendered emoji kitchen images. I killed it on day 5 after reading Google's image attribution policy more carefully. The risk wasn't worth it.


Layer 2: Text-to-Image with a Provider Fallback Chain

The core challenge with T2I emoji generation is: how do you make the model output something that actually looks like an emoji?

The naive approach — "🐱 + 🔥" — produces whatever the model thinks a cat fire is. Usually it's a realistic cat on fire. Not what you want.

Prompt Engineering: Descriptions > Unicode

I built a mapping table (emoji-prompt-map.ts) that translates emoji into concrete visual descriptions:

```typescript
const EMOJI_DESCRIPTIONS: Record<string, string> = {
  '😀': 'round yellow face, wide open grin showing teeth, simple oval dot eyes',
  '🔥': 'bright orange and red flame, cartoon style, rounded base',
  '🐱': 'cute round cat face, large eyes, small pink nose, whiskers',
  // 200+ entries...
}
```

Then the full prompt looks like:

```typescript
function buildLayer2Prompt(emoji1: string, emoji2: string): string {
  const desc1 = EMOJI_DESCRIPTIONS[emoji1] ?? emoji1
  const desc2 = EMOJI_DESCRIPTIONS[emoji2] ?? emoji2
  return [
    `A single emoji character that fuses: [${desc1}] merged with [${desc2}].`,
    'Flat cartoon illustration style. Centered on pure white background.',
    'One single character only, no text, no multiple objects side by side.',
    'Clean vector-like look, bold outlines, vivid saturated colors.',
  ].join(' ')
}
```
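Concretely, calling it with the cat and fire entries from the table above expands the Unicode into full visual descriptions:

```typescript
buildLayer2Prompt('🐱', '🔥')
// => "A single emoji character that fuses: [cute round cat face, large eyes,
//     small pink nose, whiskers] merged with [bright orange and red flame,
//     cartoon style, rounded base]. Flat cartoon illustration style. ..."
```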

The "one single character only" constraint is critical. Without it, models love to render two separate objects next to each other — a cat on the left, a flame on the right — which defeats the whole point.

The Provider Chain

No single free provider is reliable enough to use alone. I set up a priority fallback chain:

```
T2I: Cloudflare Workers AI (flux-1-schnell)
       → ModelScope Z-Image-Turbo
         → MiniMax image-01
```

Cloudflare's Workers AI gives ~230 free generations per day. For a bootstrapped side project, that's more than enough to handle organic traffic without paying anything. When the free quota is exhausted or a call errors out, the request automatically falls through to the next provider.

```typescript
// Simplified provider selection
async function generateImage(opts: GenerateOptions): Promise<string> {
  const providers = isI2I(opts)
    ? ['gemini-proxy', 'gpt-proxy', 'modelscope-i2i', 'minimax-i2i']
    : ['cloudflare', 'modelscope-t2i', 'minimax-t2i']

  for (const provider of providers) {
    try {
      return await callProvider(provider, opts)
    } catch (err) {
      logProviderError(provider, err)
      // try next
    }
  }
  throw new Error('All providers failed')
}
```

Each provider failure fires a Discord webhook alert, so I can see in real time if a provider is down.
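The alert itself is just an HTTP POST to a Discord webhook URL. A minimal sketch of what logProviderError can look like (the env var name and message format here are illustrative, not the exact production code):

```typescript
// Fire-and-forget Discord alert on provider failure.
// DISCORD_WEBHOOK_URL is assumed to hold a standard Discord webhook endpoint.
async function logProviderError(provider: string, err: unknown): Promise<void> {
  const message = err instanceof Error ? err.message : String(err)
  console.error(`[provider:${provider}]`, message)
  try {
    await fetch(process.env.DISCORD_WEBHOOK_URL!, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        content: `⚠️ Provider ${provider} failed: ${message}`,
      }),
    })
  } catch {
    // Alerting must never break the fallback chain itself
  }
}
```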


The Transparent Background Problem

This is where most emoji tools stop. They return a white or colored background, which makes the result useless for Discord stickers, Telegram emoji, or overlay use.

I run every generated image through rembg, hosted on my own server:

```typescript
async function removeBackground(imageBase64: string): Promise<string> {
  const response = await fetch(process.env.REMBG_API_URL!, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'X-API-Key': process.env.REMBG_API_KEY!,
    },
    body: JSON.stringify({ image_base64: imageBase64 }),
  })
  if (!response.ok) {
    throw new Error(`rembg failed: ${response.status}`)
  }
  const { result } = await response.json()
  return result // base64 PNG with alpha channel
}
```

But here's the UX problem: rembg takes 2–4 seconds. If you wait for it before showing anything, the whole generation feels slow.

Solution: Show the image immediately with the background, then swap to the transparent version once rembg finishes. The user gets visual feedback fast, and the "better" version appears a few seconds later without any loading state.

```typescript
// In the API route: fire rembg async, don't await it before responding
const rembgPromise = removeBackground(rawImageBase64).catch(() => null)

// Return raw image first, rembg result comes via a separate /api/rembg call
// triggered client-side after the initial image loads
```
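On the client side, the swap amounts to rendering whatever arrives first and upgrading in place. A rough sketch as a React hook (the hook and endpoint payload names are illustrative, not the actual Forgemoji code):

```typescript
import { useEffect, useState } from 'react'

// Show the raw image immediately, then swap in the transparent
// version once /api/rembg resolves (2–4s later)
function useTransparentSwap(rawImageBase64: string | null) {
  const [src, setSrc] = useState<string | null>(null)

  useEffect(() => {
    if (!rawImageBase64) return
    // 1. Instant feedback: display the image with its background
    setSrc(`data:image/png;base64,${rawImageBase64}`)

    // 2. Upgrade in place when background removal finishes
    fetch('/api/rembg', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ image_base64: rawImageBase64 }),
    })
      .then((res) => (res.ok ? res.json() : null))
      .then((data) => {
        if (data?.result) setSrc(`data:image/png;base64,${data.result}`)
      })
      .catch(() => {
        // Keep the non-transparent version if rembg fails
      })
  }, [rawImageBase64])

  return src
}
```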

Layer 3: Image-to-Image (Upload Your Face)

The I2I mode lets users upload a photo and fuse it with an emoji style. This is the "Genmoji" experience — without requiring an Apple device.

The challenge here is latency. I2I models take 60–120 seconds. A synchronous API call is risky on Vercel: even with the function timeout raised toward its 300s maximum, a 120s generation leaves little headroom for retries or provider fallbacks.

I solved it with a submit/poll architecture:

```
POST /api/generate/submit          → { task_id }
GET  /api/generate/poll?id=xxx     → { status: 'pending' | 'done' | 'error', result? }
```

The client submits the job, gets a task_id, then polls every 3 seconds until done. The UI shows a countdown timer so users know roughly how long to wait.
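A simplified sketch of that client-side loop (the endpoint shapes match the ones above; the types and helper name are illustrative):

```typescript
type PollResult =
  | { status: 'pending' }
  | { status: 'done'; result: string }
  | { status: 'error' }

// Submit the I2I job, then poll every 3 seconds until it resolves
async function generateI2I(payload: unknown): Promise<string> {
  const submitRes = await fetch('/api/generate/submit', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  })
  const { task_id } = await submitRes.json()

  while (true) {
    await new Promise((r) => setTimeout(r, 3000)) // 3s poll interval
    const pollRes = await fetch(`/api/generate/poll?id=${task_id}`)
    const data: PollResult = await pollRes.json()
    if (data.status === 'done') return data.result
    if (data.status === 'error') throw new Error('Generation failed')
  }
}
```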

For I2I, prompt engineering gets trickier. The model needs to clearly understand which elements to keep from the photo (the person's face) and which to replace with emoji style. I use a "dual prompt" strategy:

```typescript
function buildLayer3DualPrompt(userEmoji: string, photoContext: string): string {
  const emojiDesc = EMOJI_DESCRIPTIONS[userEmoji] ?? userEmoji
  return [
    'Transform the person in the photo into a single emoji character.',
    // photoContext describes what the model should preserve from the upload
    `Photo context: ${photoContext}.`,
    `Emoji style: ${emojiDesc}.`,
    "Keep the person's facial features and expression. Apply flat cartoon illustration style.",
    'Result must be ONE centered emoji character. No background, no text.',
  ].join(' ')
}
```

Rate Limiting Without a Database

I didn't want to add a database just for rate limiting. Instead, the limits live in plain in-memory maps inside the serverless function (good enough for the current traffic level; I'll migrate to something like Vercel KV when needed):

  • 5 generations per IP per day
  • 3 generations per IP per minute (burst protection)
  • 500 total generations per day across all users

```typescript
// In-memory store — resets on each Vercel function cold start
// Good enough for <1K daily users, not suitable for horizontal scaling
const ipDayMap = new Map<string, number>()
const ipMinuteMap = new Map<string, number>()
let globalDailyCount = 0
```
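The enforcement itself is only a few lines. Here's a rough sketch of how those counters could be checked in the API route (the key format and limits object are my own illustration, not the exact Forgemoji code):

```typescript
const LIMITS = { perIpDay: 5, perIpMinute: 3, globalDay: 500 }

// Keys embed the current day/minute, so stale entries are simply never read again.
// globalDailyCount resets on cold start, as noted above.
function checkRateLimit(ip: string): boolean {
  const now = new Date()
  const dayKey = `${ip}:${now.toISOString().slice(0, 10)}`
  const minuteKey = `${ip}:${Math.floor(now.getTime() / 60_000)}`

  const dayCount = ipDayMap.get(dayKey) ?? 0
  const minuteCount = ipMinuteMap.get(minuteKey) ?? 0

  if (dayCount >= LIMITS.perIpDay) return false
  if (minuteCount >= LIMITS.perIpMinute) return false
  if (globalDailyCount >= LIMITS.globalDay) return false

  ipDayMap.set(dayKey, dayCount + 1)
  ipMinuteMap.set(minuteKey, minuteCount + 1)
  globalDailyCount++
  return true
}
```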

Is this production-grade? No. But for a side project at <500 daily users, it works perfectly and avoids adding infrastructure complexity before you need it.


The Animated Emoji Feature

Static emoji are fine. Animated emoji are shareable.

I added 6 preset animations (Bounce, Float, Wiggle, Pulse, Rubber, Spin) using Canvas frame-by-frame rendering, then encoding to GIF or WebP.
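Each preset is essentially an easing function applied frame by frame. A toy sketch of how the Bounce frames can be rendered on a canvas before encoding (the frame count, size, and easing curve are illustrative):

```typescript
const FRAME_COUNT = 24
const SIZE = 256

// Render N frames of a bouncing PNG onto an offscreen canvas
function renderBounceFrames(img: HTMLImageElement): ImageData[] {
  const canvas = document.createElement('canvas')
  canvas.width = SIZE
  canvas.height = SIZE
  const ctx = canvas.getContext('2d')!

  const frames: ImageData[] = []
  for (let i = 0; i < FRAME_COUNT; i++) {
    const t = i / FRAME_COUNT
    // Simple bounce: vertical offset follows |sin|, peaking mid-loop
    const offsetY = -Math.abs(Math.sin(t * Math.PI)) * SIZE * 0.12

    ctx.clearRect(0, 0, SIZE, SIZE)
    ctx.drawImage(img, 0, offsetY, SIZE, SIZE)
    frames.push(ctx.getImageData(0, 0, SIZE, SIZE))
  }
  return frames
}
```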

The browser-side approach (gif.js) is the fallback — it works but is slow on low-end devices. The real implementation sends the PNG to my server and uses FFmpeg:

```
Client: PNG + animation_name + size + format
  → POST /api/animate (my rembg server also handles this)
  → FFmpeg palettegen + paletteuse with reserve_transparent=1
  → Return GIF / WebP bytes
```

For transparent GIFs (critical for Discord/Telegram stickers), the FFmpeg command is:

```bash
ffmpeg -i frames_%03d.png \
  -filter_complex "[0:v] palettegen=reserve_transparent=1 [p]; [0:v][p] paletteuse=alpha_threshold=128" \
  output.gif
```

Getting the transparency right took way longer than I expected. The reserve_transparent=1 + alpha_threshold=128 combination is the key — without both flags, you get either a black background or jagged edges.


What's Next

The tool is live at forgemoji.com — free to use, no account required.

Current stats after ~2 weeks:

  • 261 pre-generated combo pages for SEO long-tail
  • 3-provider T2I fallback chain (zero paid cost for up to ~230 daily generations)
  • Full animated export: 6 effects × 3 sizes × 2 formats (GIF/WebP)

Things I'm still working on:

  • AdSense monetization (applied, pending review)
  • Product Hunt launch (waiting for a stable unique-visitor baseline)
  • More emoji in the mapping table (currently ~200 entries, want 500+)

If you're building something similar and ran into the same "how do I make AI output something that actually looks like an emoji" problem, I hope this helps. Happy to answer questions in the comments.
