DEV Community

Cover image for How to Generate AI Images, Videos, Music & Speech Programmatically with JavaScript
sebyx07
sebyx07

Posted on

How to Generate AI Images, Videos, Music & Speech Programmatically with JavaScript

AI-generated media is everywhere — product images, marketing videos, background music, voiceovers. But if you've tried integrating it into a JavaScript project, you know the pain: each provider has its own SDK, its own auth flow, its own async patterns. You end up writing more glue code than actual product logic.

In this post, I'll walk through how to programmatically generate images, videos, music, and speech in JavaScript/TypeScript — with real code you can copy into your project today.


The Problem with Multiple AI Media APIs

Let's say you want your app to generate a product image, a short promo video, and a background track. You'd need:

  • An OpenAI or Google account for image generation
  • A Kling or Veo account for video generation
  • A Suno account for music generation
  • Three different SDKs, three billing dashboards, three sets of API keys
  • Custom polling logic for each (none of them return results synchronously)

That's a lot of overhead before you've written a single line of business logic.


A Simpler Approach: One SDK, All Media Types

You know how OpenRouter gives you a single API to access hundreds of LLMs? The same concept exists for AI media generation.

The ai-media-cli package is essentially an OpenRouter for AI-generated media — a single TypeScript SDK that routes to 90+ models across providers like OpenAI, Google, Black Forest Labs, ByteDance, xAI, and Suno. One API key, one interface, one billing account. Swap models by changing a string, not rewriting your integration.

npm install ai-media-cli
Enter fullscreen mode Exit fullscreen mode
import { AIImageClient } from "ai-media-cli";

const client = new AIImageClient({
  apiKey: process.env.AI_IMAGE_API_KEY,
});
Enter fullscreen mode Exit fullscreen mode

That's your setup. One client, one API key. Let's generate some media.


Generating AI Images in JavaScript

Text-to-Image

const generation = await client.generateMedia({
  prompt: "a product photo of wireless headphones on a marble surface, soft studio lighting",
  model: "nano-banana-2",
  aspect_ratio: "1:1",
});

// Poll until the image is ready
const result = await client.pollForCompletion(generation.generation_id);
const imageUrl = result.outputs?.[0]?.url;
console.log(imageUrl); // https://media.kubeez.com/images/...
Enter fullscreen mode Exit fullscreen mode

The pollForCompletion method handles the async nature of AI generation — it checks status every few seconds and resolves when the image is ready. No webhooks, no manual polling loops.

Image-to-Image

Already have an image and want to transform it? Upload it first, then reference it:

const upload = await client.uploadMedia("./reference-photo.jpg");

const generation = await client.generateMedia({
  prompt: "transform into an oil painting style, rich warm colors",
  model: "nano-banana-2",
  generation_type: "image-to-image",
  source_urls: [upload.url],
});

const result = await client.pollForCompletion(generation.generation_id);
Enter fullscreen mode Exit fullscreen mode

Choosing a Model

Here are some popular image models to get started:

Model ID Provider Credits Best For
nano-banana-2 Google 14 Fast, general purpose
gpt-1.5-image-high OpenAI 32 High quality, prompt adherence
imagen-4 Google 13 Photorealism
imagen-4-fast Google 7 Quick drafts, lowest cost
flux-2-1K Black Forest Labs 10 Artistic, creative styles
seedream-v4-5 ByteDance 9 Detail-rich scenes
Z-image Alibaba 3 Ultra-cheap generations

You can list all available models programmatically:

const models = await client.listModels();
models.forEach((m) => console.log(m.name, m.credits_cost));
Enter fullscreen mode Exit fullscreen mode

Generating AI Videos in JavaScript

Text-to-Video

const generation = await client.generateMedia({
  prompt: "a slow-motion shot of ocean waves crashing on rocks at sunset",
  model: "kling-2-6-text-to-video-5s-audio",
  generation_type: "text-to-video",
  aspect_ratio: "16:9",
});

const result = await client.pollForCompletion(generation.generation_id);
console.log(result.outputs?.[0]?.url); // MP4 URL
Enter fullscreen mode Exit fullscreen mode

Image-to-Video

Turn a static image into a short video — great for product animations or social content:

const generation = await client.generateMedia({
  prompt: "gentle camera zoom in with soft parallax motion",
  model: "kling-2-6-image-to-video-5s",
  generation_type: "image-to-video",
  source_urls: ["https://example.com/my-product-shot.jpg"],
});

const result = await client.pollForCompletion(generation.generation_id);
Enter fullscreen mode Exit fullscreen mode

Available video models include Kling 2.6/3.0 (Kuaishou), Seedance 1.5 (ByteDance), Veo 3.1 (Google), Grok (xAI), and Wan 2.5 (Alibaba) — each with different strengths for motion quality, duration, resolution, and style.


Generating AI Music in JavaScript

Need a background track for a video, podcast intro, or game?

const generation = await client.generateMusic({
  prompt: "upbeat lo-fi hip hop, vinyl crackle, mellow piano chords, rainy day vibes",
  model: "V5",
  instrumental: true,
});

const result = await client.pollForCompletion(generation.generation_id, "music");
console.log(result.outputs?.[0]?.url); // Audio file URL
Enter fullscreen mode Exit fullscreen mode

Set instrumental: false if you want AI-generated vocals. The V5 model (powered by Suno) produces full-length tracks with surprisingly good musical structure.


Generating AI Speech in JavaScript

For voiceovers, narration, or dialogue — you can generate multi-speaker audio:

const generation = await client.generateDialogue({
  dialogue: [
    { text: "Welcome back to the show. Today we're talking about AI in production.", voice: "Adam" },
    { text: "Thanks for having me. Let's dive right in.", voice: "Emily" },
    { text: "So what's the biggest challenge teams face when shipping AI features?", voice: "Adam" },
    { text: "Honestly? It's the integration complexity. The AI part is the easy part.", voice: "Emily" },
  ],
  stability: 0.7,
  language_code: "en",
});
Enter fullscreen mode Exit fullscreen mode

This is particularly useful for generating podcast-style content, app onboarding audio, or dynamic narration.


Practical Example: Automated Social Media Content Pipeline

Here's a real-world use case — a script that generates a complete social media post with image, caption-ready metadata, and background music:

import { AIImageClient } from "ai-media-cli";

const client = new AIImageClient({ apiKey: process.env.AI_IMAGE_API_KEY });

async function generateSocialContent(topic: string) {
  // Generate visual and audio in parallel
  const [imageGen, musicGen] = await Promise.all([
    client.generateMedia({
      prompt: `social media post visual for: ${topic}, modern design, bold colors, 4K`,
      model: "nano-banana-2",
      aspect_ratio: "1:1",
    }),
    client.generateMusic({
      prompt: `short upbeat jingle for social media, 15 seconds, ${topic} vibes`,
      model: "V5",
      instrumental: true,
    }),
  ]);

  // Wait for both to complete in parallel
  const [image, music] = await Promise.all([
    client.pollForCompletion(imageGen.generation_id),
    client.pollForCompletion(musicGen.generation_id, "music"),
  ]);

  return {
    imageUrl: image.outputs?.[0]?.url,
    musicUrl: music.outputs?.[0]?.url,
  };
}

const content = await generateSocialContent("spring product launch");
console.log(content);
Enter fullscreen mode Exit fullscreen mode

Since image and music generation are independent, we fire them off in parallel with Promise.all — cutting total wait time roughly in half.


Writing Better Prompts for Complex Images

The quality of your output depends heavily on your prompt. Here are patterns that produce significantly better results:

Be specific about composition, lighting, and style

// Vague — unpredictable results
await client.generateMedia({
  prompt: "a cat",
  model: "nano-banana-2",
});

// Specific — consistent, high-quality output
await client.generateMedia({
  prompt: "a tabby cat sitting on a windowsill, golden hour sunlight streaming through sheer curtains, shallow depth of field, shot on 85mm lens, warm color palette, photorealistic",
  model: "nano-banana-2",
  aspect_ratio: "16:9",
});
Enter fullscreen mode Exit fullscreen mode

Use negative prompts to remove unwanted elements

await client.generateMedia({
  prompt: "professional headshot of a business executive, studio lighting, neutral gray background, sharp focus, 4K",
  negative_prompt: "blurry, distorted face, extra fingers, low quality, cartoon, illustration, watermark, text",
  model: "nano-banana-2",
});
Enter fullscreen mode Exit fullscreen mode

Layer details: subject, environment, mood, technical

A good prompt formula: [subject] + [setting/environment] + [mood/atmosphere] + [technical/style details]

// Product photography
await client.generateMedia({
  prompt: "minimalist flat lay of a leather watch on a slate surface, single stem of eucalyptus, soft diffused overhead lighting, muted earth tones, editorial product photography, 8K detail",
  model: "gpt-1.5-image-high",
  aspect_ratio: "1:1",
});

// Architectural visualization
await client.generateMedia({
  prompt: "modern Japanese tea house with floor-to-ceiling glass walls, surrounded by a moss garden and maple trees in autumn, morning fog, architectural photography by Hiroshi Sugimoto, ultra wide angle, natural light",
  model: "imagen-4-ultra",
  aspect_ratio: "16:9",
});

// Fantasy / creative
await client.generateMedia({
  prompt: "an ancient library carved inside a giant redwood tree, bioluminescent mushrooms providing soft blue-green light, leather-bound books on spiral shelves, tiny fireflies, magical realism, painted by Studio Ghibli, intricate detail",
  model: "flux-2-2K",
  aspect_ratio: "9:16",
});

// UI/UX mockup
await client.generateMedia({
  prompt: "clean mobile app interface for a meditation app, dark mode, glassmorphism cards, subtle gradient backgrounds in deep purple and navy, minimal icons, SF Pro Display font, Dribbble quality, UI design",
  model: "nano-banana-2",
  aspect_ratio: "9:16",
});
Enter fullscreen mode Exit fullscreen mode

Match the model to the task

  • Photorealism: imagen-4, gpt-1.5-image-high, seedream-v4-5
  • Creative/artistic: flux-2-2K, nano-banana-pro
  • Quick iterations: imagen-4-fast, Z-image, nano-banana
  • High resolution: nano-banana-2-4K, nano-banana-pro-4K, flux-2-2K
  • Image editing: flux-2-edit-1K, seedream-v4-5-edit, nano-banana-edit

Using It from the Command Line

If you just need a quick generation without writing code, the package also works as a CLI:

# Install globally
npm install -g ai-media-cli

# One-time auth
ai-media-cli login YOUR_API_KEY

# Generate an image
ai-media-cli generate -p "product flat lay, minimal, white background" -m nano-banana-2 --wait

# Generate a video
ai-media-cli generate -p "drone flyover mountain valley" -m kling-2-6-text-to-video-5s-audio -t text-to-video --wait

# Generate music
ai-media-cli music -p "ambient electronic, space theme" -m V5 --wait

# Or use npx without installing
npx ai-media-cli generate -p "..." -m nano-banana-2 --wait
Enter fullscreen mode Exit fullscreen mode

Great for prototyping prompts before coding them into your app.


CI/CD Integration

You can automate media generation in your pipelines. Here's a GitHub Actions example:

- name: Generate marketing assets
  env:
    AI_IMAGE_API_KEY: ${{ secrets.AI_IMAGE_API_KEY }}
  run: |
    npx ai-media-cli generate \
      -p "hero banner for product launch, abstract gradient, modern" \
      -m nano-banana-2 \
      -a 16:9 \
      --wait
Enter fullscreen mode Exit fullscreen mode

Error Handling

The library provides typed errors for clean handling in production:

import { AIImageClient, APIError, TimeoutError, ConfigError } from "ai-media-cli";

try {
  const result = await client.generateMedia({ prompt: "...", model: "nano-banana-2" });
} catch (error) {
  if (error instanceof APIError) {
    console.error(`API error ${error.statusCode}: ${error.message}`);
  } else if (error instanceof TimeoutError) {
    console.error("Generation timed out — try a simpler prompt or different model");
  } else if (error instanceof ConfigError) {
    console.error("Missing API key — set AI_IMAGE_API_KEY or call client with { apiKey }");
  }
}
Enter fullscreen mode Exit fullscreen mode

Built-in retry logic handles transient failures and rate limits automatically with exponential backoff — you don't need to add your own.


All 91 Available Models

Here's the full list of models you can access through the SDK, organized by type:

Image Models (27)

Model ID Name Provider Credits
nano-banana Nano Banana Google 7
nano-banana-2 Nano Banana 2 Google 14
nano-banana-2-2K Nano Banana 2 2K Google 17
nano-banana-2-4K Nano Banana 2 4K Google 25
nano-banana-edit Nano Banana Edit Google 7
nano-banana-pro Nano Banana Pro Google 31
nano-banana-pro-2K Nano Banana Pro 2K Google 31
nano-banana-pro-4K Nano Banana Pro 4K Google 41
imagen-4-fast Imagen 4 Fast Google 7
imagen-4 Imagen 4 Google 13
imagen-4-ultra Imagen 4 Ultra Google 19
gpt-1.5-image-medium GPT 1.5 Image Medium OpenAI 10
gpt-1.5-image-high GPT 1.5 Image High OpenAI 32
flux-2-1K Flux 2 1K Black Forest Labs 10
flux-2-2K Flux 2 2K Black Forest Labs 13
flux-2-edit-1K Flux 2 Edit 1K Black Forest Labs 10
flux-2-edit-2K Flux 2 Edit 2K Black Forest Labs 13
grok-text-to-image Grok Text-to-Image xAI 14
seedream-v4 Seedream V4 ByteDance 7
seedream-v4-5 Seedream 4.5 ByteDance 9
seedream-v4-edit Seedream V4 Edit ByteDance 7
seedream-v4-5-edit Seedream V4.5 Edit ByteDance 9
5-lite-text-to-image Seedream 5 Lite ByteDance 12
5-lite-image-to-image Seedream 5 Lite I2I ByteDance 12
Z-image Z-image Alibaba 3
replicate-z-image-hd Z-Image HD Alibaba 6
ad-copy Ad Copy Kubeez 50

Video Models (53)

Model ID Name Provider Credits
kling-2-5-image-to-video-pro Kling 2.5 I2V Pro Kuaishou 65
kling-2-5-image-to-video-pro-10s Kling 2.5 I2V Pro 10s Kuaishou 125
kling-2-6-text-to-video-5s Kling 2.6 (5s) Kuaishou 90
kling-2-6-text-to-video-5s-audio Kling 2.6 (5s, audio) Kuaishou 175
kling-2-6-text-to-video-10s Kling 2.6 (10s) Kuaishou 175
kling-2-6-text-to-video-10s-audio Kling 2.6 (10s, audio) Kuaishou 310
kling-2-6-image-to-video-5s Kling 2.6 I2V (5s) Kuaishou 90
kling-2-6-image-to-video-5s-audio Kling 2.6 I2V (5s, audio) Kuaishou 175
kling-2-6-image-to-video-10s Kling 2.6 I2V (10s) Kuaishou 175
kling-2-6-image-to-video-10s-audio Kling 2.6 I2V (10s, audio) Kuaishou 310
kling-2-6-motion-control-720p Kling 2.6 Motion Kuaishou 270
kling-2-6-motion-control-1080p Kling 2.6 Motion Pro Kuaishou 360
kling-3-0-std Kling 3.0 Kuaishou 105
kling-3-0-pro Kling 3.0 Pro Kuaishou 125
kling-3-0-motion-control-720p Kling 3.0 Motion Kuaishou 720
kling-3-0-motion-control-1080p Kling 3.0 Motion Pro Kuaishou 960
seedance-1-5-pro-480p-4s Seedance 1.5 (480p, 4s) ByteDance 12
seedance-1-5-pro-480p-8s Seedance 1.5 (480p, 8s) ByteDance 24
seedance-1-5-pro-480p-12s Seedance 1.5 (480p, 12s) ByteDance 48
seedance-1-5-pro-720p-4s Seedance 1.5 (720p, 4s) ByteDance 19
seedance-1-5-pro-720p-8s Seedance 1.5 (720p, 8s) ByteDance 38
seedance-1-5-pro-720p-12s Seedance 1.5 (720p, 12s) ByteDance 76
seedance-1-5-pro-1080p-4s Seedance 1.5 (1080p, 4s) ByteDance 39
seedance-1-5-pro-1080p-8s Seedance 1.5 (1080p, 8s) ByteDance 78
seedance-1-5-pro-1080p-12s Seedance 1.5 (1080p, 12s) ByteDance 156
seedance-1-5-pro-480p-4s-audio Seedance 1.5 (480p, 4s, audio) ByteDance 24
seedance-1-5-pro-480p-8s-audio Seedance 1.5 (480p, 8s, audio) ByteDance 48
seedance-1-5-pro-480p-12s-audio Seedance 1.5 (480p, 12s, audio) ByteDance 58
seedance-1-5-pro-720p-4s-audio Seedance 1.5 (720p, 4s, audio) ByteDance 38
seedance-1-5-pro-720p-8s-audio Seedance 1.5 (720p, 8s, audio) ByteDance 76
seedance-1-5-pro-720p-12s-audio Seedance 1.5 (720p, 12s, audio) ByteDance 152
seedance-1-5-pro-1080p-4s-audio Seedance 1.5 (1080p, 4s, audio) ByteDance 78
seedance-1-5-pro-1080p-8s-audio Seedance 1.5 (1080p, 8s, audio) ByteDance 156
seedance-1-5-pro-1080p-12s-audio Seedance 1.5 (1080p, 12s, audio) ByteDance 312
v1-pro-fast-i2v-720p-5s Seedance 1.0 (720p, 5s) ByteDance 40
v1-pro-fast-i2v-720p-10s Seedance 1.0 (720p, 10s) ByteDance 80
v1-pro-fast-i2v-1080p-5s Seedance 1.0 (1080p, 5s) ByteDance 80
v1-pro-fast-i2v-1080p-10s Seedance 1.0 (1080p, 10s) ByteDance 130
veo3-1-fast-text-to-video Veo 3.1 Fast T2V Google 99
veo3-1-fast-reference-to-video Veo 3.1 Fast Ref2V Google 99
veo3-1-fast-first-and-last-frames Veo 3.1 Fast F&L Google 99
veo3-1-text-to-video Veo 3.1 Quality T2V Google 390
veo3-1-first-and-last-frames Veo 3.1 Quality F&L Google 390
grok-text-to-video-6s Grok T2V (6s) xAI 65
grok-image-to-video Grok I2V xAI 65
wan-2-5 Wan 2.5 Alibaba 155
wan-2-5-text-to-video-5s-720p Wan 2.5 T2V (5s, 720p) Alibaba 93
wan-2-5-text-to-video-5s-1080p Wan 2.5 T2V (5s, 1080p) Alibaba 155
wan-2-5-text-to-video-10s-720p Wan 2.5 T2V (10s, 720p) Alibaba 186
wan-2-5-text-to-video-10s-1080p Wan 2.5 T2V (10s, 1080p) Alibaba 310
wan-2-5-image-to-video-5s-720p Wan 2.5 I2V (5s, 720p) Alibaba 93
wan-2-5-image-to-video-5s-1080p Wan 2.5 I2V (5s, 1080p) Alibaba 155
wan-2-5-image-to-video-10s-1080p Wan 2.5 I2V (10s, 1080p) Alibaba 310

Music Models (7)

Model ID Name Provider Credits
V4 Music V4 Suno 19
V4_5 Music V4.5 Suno 19
V4_5PLUS Music V4.5+ Suno 19
V5 Music V5 Suno 19
suno-add-vocals Add Vocals Suno 19
suno-add-instrumental Add Instrumental Suno 19
suno-lyrics-generation Lyrics Generation Suno 2

Speech, Separation & Other (4)

Model ID Name Provider Credits
text-to-dialogue-v3 Text To Dialogue V3 ElevenLabs 21
mvsep-40 Vocal/Instrumental Separation MVSEP 6
auto-caption Auto Caption Kubeez 0.5

Getting Started

  1. Grab a free API key at kubeez.com (comes with trial credits)
  2. Install the package:
npm install ai-media-cli
Enter fullscreen mode Exit fullscreen mode
  1. Start generating:
import { AIImageClient } from "ai-media-cli";
const client = new AIImageClient({ apiKey: "your-key" });
Enter fullscreen mode Exit fullscreen mode

The barrier to generating AI media programmatically has dropped significantly. What used to require stitching together multiple provider SDKs is now a few lines of JavaScript. If you're building anything that touches images, video, audio, or voice — give it a try and let me know what you build.

Top comments (0)