Luke

We Built a Full-Stack AI Music Agent with Next.js — Here's What We Learned

We spent the past few months building Gliss — an AI music agent where you describe the music you want in plain language and an AI agent handles everything: generating songs, extracting stems, creating MIDI, mastering audio, and more.

It's a Next.js app running on Vercel, and the journey taught us things we never found in any tutorial. This post covers the real, hard-won lessons — the stuff that breaks at 2AM in production.

The Stack

Before we dive in:

  • Framework: Next.js 16 (App Router)
  • Auth: Clerk
  • Payments: Stripe
  • Audio: Web Audio API + WaveSurfer.js
  • AI: Custom agent orchestrating multiple music AI providers
  • i18n: next-intl (32 languages)
  • State: Zustand + TanStack Query
  • UI: Radix primitives + Tailwind
  • Hosting: Vercel + S3-compatible object storage

Lesson 1: Streaming AI Responses Requires Rethinking Your Data Flow

When a user says "make me a lo-fi beat with jazz piano," the AI agent doesn't just return text — it generates a song, creates cover art, extracts metadata, and streams progress updates back to the UI. All in a single conversation turn.

The naive approach is to wait for the entire response and then render. But music generation takes 30-120 seconds. You need to stream.

Here's what we learned:

Server-Sent Events (SSE) over fetch. Not WebSockets. For a conversational AI interface, SSE is simpler and works perfectly with Vercel's serverless model. WebSockets would require a persistent connection and a separate infrastructure layer.

// Simplified streaming pattern
const response = await fetch('/api/agent', {
  method: 'POST',
  body: JSON.stringify({ message: userInput }),
});

if (!response.body) throw new Error('Response has no body to stream');
const reader = response.body.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  // { stream: true } handles multi-byte characters split across chunks
  const chunk = decoder.decode(value, { stream: true });
  // Parse SSE events: text deltas, resource creation, progress updates
  processStreamEvents(chunk);
}

The tricky part isn't the streaming itself — it's state management during a stream. When the agent creates a new audio resource mid-stream, you need to:

  1. Update the chat message (append text)
  2. Add the new resource to the resource panel
  3. Trigger a waveform render for the new audio
  4. Update the credit balance

All of this needs to happen smoothly without re-renders that cause audio playback glitches.
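To make the fan-out concrete, here's a minimal, hypothetical sketch (the event names and shapes are ours for illustration, not the real agent protocol): a single parser turns each SSE chunk into typed events, and each event type routes to exactly one store update.

```typescript
// Hypothetical event shapes — the real agent protocol is more involved
type StreamEvent =
  | { type: 'text-delta'; text: string }
  | { type: 'resource-created'; resourceId: string; audioUrl: string }
  | { type: 'progress'; percent: number }
  | { type: 'credits'; balance: number };

// Parse one SSE chunk ("data: {...}" lines) into typed events
function parseStreamEvents(chunk: string): StreamEvent[] {
  const events: StreamEvent[] = [];
  for (const line of chunk.split('\n')) {
    if (!line.startsWith('data: ')) continue; // skip comments and keep-alives
    const payload = line.slice('data: '.length).trim();
    if (!payload || payload === '[DONE]') continue;
    events.push(JSON.parse(payload) as StreamEvent);
  }
  return events;
}
```

Routing each typed event to a single targeted store update (chat text, resource panel, waveform queue, credit balance) keeps re-renders scoped to the component that actually changed.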

What we'd do differently: Design your state management around streaming from day one. We started with simple useState and had to refactor to Zustand stores + refs to avoid cascade re-renders during active streams.


Lesson 2: Browser Audio Processing Is Harder Than You Think

The studio includes a real-time mastering chain — EQ, compression, stereo width, limiter — all running in the browser via Web Audio API. Users can tweak mastering settings and hear changes in real time, then export the mastered MP3.

Here's where it gets interesting: real-time playback and offline rendering must produce identical output.

// The mastering pipeline (simplified)
async function renderMasteredBuffer(
  audioUrl: string,
  settings: MasteringSettings
): Promise<AudioBuffer> {
  // Decode the source audio first so we know its length and sample rate
  const arrayBuffer = await (await fetch(audioUrl)).arrayBuffer();
  const decodeCtx = new AudioContext();
  const inputBuffer = await decodeCtx.decodeAudioData(arrayBuffer);
  await decodeCtx.close();

  const offlineCtx = new OfflineAudioContext(
    2,                      // stereo
    inputBuffer.length,     // total frames
    inputBuffer.sampleRate
  );

  // Build the same effect chain used in real-time playback
  const source = offlineCtx.createBufferSource();
  source.buffer = inputBuffer;
  const eq = createParametricEQ(offlineCtx, settings.eq);
  const compressor = createCompressor(offlineCtx, settings.compression);
  const limiter = createLimiter(offlineCtx, settings.limiter);

  source.connect(eq).connect(compressor).connect(limiter).connect(offlineCtx.destination);
  source.start(0);

  return offlineCtx.startRendering();
}

The gotcha: OfflineAudioContext and a regular AudioContext can produce subtly different results if your filter frequencies or parameter ramps aren't identical. We had to extract all shared constants into a single module to guarantee bit-for-bit parity between preview and export.
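A sketch of what that shared module might look like (the band values here are illustrative, not our production chain): both the live AudioContext graph and the offline render import from one place, so parameters can never drift apart.

```typescript
// mastering-constants.ts — single source of truth for both audio paths.
// Band values are illustrative, not the actual production chain.
export const EQ_BANDS = [
  { type: 'lowshelf' as const,  frequency: 120,  Q: 0.71 },
  { type: 'peaking' as const,   frequency: 1000, Q: 1.0 },
  { type: 'highshelf' as const, frequency: 8000, Q: 0.71 },
];

// Shared dB <-> linear conversions, so gain math is identical everywhere
export const dbToGain = (db: number): number => Math.pow(10, db / 20);
export const gainToDb = (gain: number): number => 20 * Math.log10(gain);
```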

Another painful lesson: MP3 encoding in the browser. We use lamejs (a JavaScript LAME port) to encode AudioBuffers to MP3 client-side. This lets users export mastered audio without re-uploading to a server. But lamejs is CPU-intensive — encoding a 3-minute song can block the main thread for 2-3 seconds.

The fix: process in chunks and yield back to the event loop:

// lamejs exposes Mp3Encoder(channels, sampleRate, kbps)
async function encodeToMp3(audioBuffer: AudioBuffer): Promise<Blob> {
  const mp3encoder = new lamejs.Mp3Encoder(2, audioBuffer.sampleRate, 192);
  const chunks: Int8Array[] = [];
  const blockSize = 1152; // MP3 frame size in samples

  // Hoist channel data out of the loop — getChannelData() isn't free
  const leftChannel = audioBuffer.getChannelData(0);
  const rightChannel = audioBuffer.getChannelData(1);

  for (let i = 0; i < audioBuffer.length; i += blockSize) {
    const left = leftChannel.subarray(i, i + blockSize);
    const right = rightChannel.subarray(i, i + blockSize);

    const mp3buf = mp3encoder.encodeBuffer(
      floatTo16BitPCM(left),
      floatTo16BitPCM(right)
    );

    if (mp3buf.length > 0) chunks.push(mp3buf);

    // Yield to the event loop every 100 blocks to prevent UI freeze
    if (i % (blockSize * 100) === 0) {
      await new Promise(resolve => setTimeout(resolve, 0));
    }
  }

  const end = mp3encoder.flush();
  if (end.length > 0) chunks.push(end);

  return new Blob(chunks, { type: 'audio/mpeg' });
}

// Convert Float32 samples in [-1, 1] to the Int16 range lamejs expects
function floatTo16BitPCM(input: Float32Array): Int16Array {
  const output = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i]));
    output[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return output;
}

Lesson 3: File Uploads on Vercel Have a Hidden Limit

Vercel serverless functions have a 4.5MB body size limit. That sounds fine until you realize a single mastered audio file is easily 5-10MB.

Our first approach was client → Next.js API route → object storage. This broke immediately for any real audio file.

The solution: direct client-to-storage uploads with pre-signed URLs:

1. Client requests a signed upload URL from our API (tiny JSON payload)
2. Client uploads the file directly to object storage (no Vercel size limit)
3. Client sends the resulting URL back to our API (tiny JSON payload)
4. API updates the database metadata

Every step stays well under 4.5MB. The heavy file transfer bypasses Vercel entirely.

// Upload flow that bypasses Vercel's body limit
export async function uploadFileToStorageFromClient({
  file,
  filename,
  key,
}: {
  file: Blob;
  filename: string;
  key: string;
}): Promise<{ url: string }> {
  // Step 1: Get signed URL (tiny request)
  const tokenResp = await fetch('/api/upload/token', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ key, filename, contentType: file.type }),
  });
  const { uploadUrl, publicUrl } = await tokenResp.json();

  // Step 2: Upload directly to object storage (no Vercel in the middle)
  await fetch(uploadUrl, {
    method: 'PUT',
    body: file,
    headers: { 'Content-Type': file.type },
  });

  return { url: publicUrl };
}

This pattern is essential for any media-heavy app on Vercel.


Lesson 4: i18n at Scale Is a Product Decision, Not a Technical One

Gliss supports 32 languages. Not 3, not 5 — thirty-two. Here's the i18n setup:

// routing.ts
import { defineRouting } from 'next-intl/routing';

export const routing = defineRouting({
  locales: SUPPORTED_LOCALE_CODES, // 32 locales
  defaultLocale: 'en',
  localePrefix: 'as-needed', // No /en prefix for English
});

The localePrefix: 'as-needed' setting was a Lighthouse win — it eliminated a ~790ms redirect from / to /en that was killing our performance score.

But the technical setup is the easy part. The real challenge is maintaining 32 translation files (each ~800 keys). Some lessons:

  1. Use AI for initial translations, then have native speakers review. We used LLMs for the initial pass and then manually reviewed each locale. Pure AI translation makes embarrassing mistakes with music terminology.

  2. Keep English terms for industry jargon. Words like "mastering," "stems," "BPM," and "MIDI" should stay in English in most languages. Musicians globally use these terms.

  3. RTL languages (Arabic, Hebrew, Urdu, Persian) need layout testing, not just translation. Your entire flex layout can break. Test thoroughly.

  4. Don't translate dynamically. Load all translations at build time. We use next-intl's server components to avoid shipping translation bundles to the client unnecessarily.
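One cheap guardrail for point 2 (a hypothetical helper of ours, not part of next-intl): lint translation files in CI for glossary terms that should have stayed in English.

```typescript
// Industry terms that should survive translation untouched (sample list)
const GLOSSARY = ['mastering', 'stems', 'BPM', 'MIDI'];

// Return glossary terms present in the English string but lost in translation
function missingJargon(english: string, translated: string): string[] {
  return GLOSSARY.filter(
    term => english.includes(term) && !translated.includes(term)
  );
}
```

Running a check like this across all 32 locale files catches the classic failure mode where an LLM helpfully localizes "stems" into a word no musician actually uses.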


Lesson 5: Content Security Policy Will Break Everything You Love

When you add a proper CSP header, expect a day of whack-a-mole. Every external script, font, analytics pixel, and auth widget needs explicit permission:

value: [
  "default-src 'self'",
  "script-src 'self' 'unsafe-eval' 'unsafe-inline' https://your-auth-provider.com https://*.yourdomain.com",
  "connect-src 'self' https://*.yourdomain.com https: blob: data: wss:",
  "style-src 'self' 'unsafe-inline' https://fonts.googleapis.com",
  "font-src 'self' data: https://fonts.gstatic.com",
  "media-src 'self' https: blob: data:",
  "worker-src 'self' blob:",
].join('; ')

The blob: and data: entries in media-src are crucial for audio apps — client-side rendering and export create blob URLs (and occasionally data URIs) for playback and download.

Do it anyway. CSP is non-negotiable for production apps handling payments and user data.
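For context, that directives array plugs into your Next.js config roughly like this (a sketch — adjust the source matcher to your own routes):

```typescript
// next.config.ts (sketch) — attach the CSP header to every route
const csp = [
  "default-src 'self'",
  // ...remaining directives from the list above...
].join('; ');

const nextConfig = {
  async headers() {
    return [
      {
        source: '/(.*)',
        headers: [{ key: 'Content-Security-Policy', value: csp }],
      },
    ];
  },
};

export default nextConfig;
```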


Lesson 6: Optimizing Bundle Size With Next.js

Our initial bundle included the entirety of react-icons (which is enormous). Next.js's optimizePackageImports saved us:

experimental: {
  optimizePackageImports: [
    'react-icons/si',
    'react-icons/fa6',
    'react-icons/md',
    'react-icons/lu',
    'lucide-react',
    '@clerk/nextjs'
  ],
},

This tells Next.js to tree-shake these packages more aggressively. For react-icons alone it cut ~200KB from our bundle.

Other wins:

  • inlineCss: true — eliminates the CSS file request, reducing time-to-first-paint
  • Lazy loading heavy viewers (MIDI viewer, waveform renderer) with next/dynamic

What We'd Do Differently

  1. Start with streaming architecture. Retrofitting streaming into a request-response mental model is painful.

  2. Use S3-compatible direct uploads from the start. Don't route binary files through your API layer.

  3. Set up CSP on day one. Adding it later means debugging every third-party integration you've already embedded.

  4. Invest in i18n infrastructure early. Adding a 32nd language is easy when your pipeline is automated. Adding a 2nd language when you have hardcoded strings everywhere is a nightmare.

  5. Build your audio pipeline with OfflineAudioContext first, then port to real-time. Getting offline rendering right guarantees your real-time version will be correct.


Try It

If you want to see all of this in action, Gliss is live. You can generate a song from a text description, master it in your browser, and export — no account required for your first few creations.

The music AI space is moving incredibly fast. If you're building anything with audio in the browser, we hope some of these lessons save you the debugging time we spent.


What's the hardest technical challenge you've hit building with audio in the browser? We'd love to hear about it in the comments.
