Sean Zhang

Posted on Feb 22

Auto-Detecting Base64 Media in JSON: Magic Numbers, Performance Tricks, and a State Machine Parser

#webdev #base64 #json #performance

If you work with multimodal AI APIs — image generation, TTS, vision models — you know the pain. The response comes back as JSON, and buried inside it there's a Base64 string that's 200,000 characters long. You know it's an audio clip or an image, but you can't hear it or see it — it's just a wall of characters. To actually preview it, you have to copy the string out, find an online decoder, paste it in, and wait.

I was testing a text-to-speech pipeline that generates audio in multiple voices. Each response contained 4–6 Base64 audio clips. My workflow was: find the string, copy it, open base64.guru, paste, hit play. Repeat for the next voice. Across a dozen prompt variations, that's ~60 round-trips to a decoder. I started thinking — what would it take to auto-detect which strings are media and render them inline?

Turns out the answer involves file format magic numbers, partial Base64 decoding, and a custom string scanner. Here's how it works.

Approach #1: Data URI Prefix Detection

The easy case first. Some APIs return Base64 with a full Data URI prefix:

{
  "avatar": "data:image/png;base64,iVBORw0KGgo..."
}

The MIME type is right there in the prefix. A regex handles this:

const DATA_URI_REGEX =
  /data:(image\/[^;"]+|video\/[^;"]+|audio\/[^;"]+|application\/pdf)[^"]*/g;

Match it, extract the MIME type, done. But this only covers APIs that bother including the prefix — and many don't.
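To make the "extract the MIME type" step concrete, here is a minimal sketch. The helper name parseDataUri and the anchored pattern are my own assumptions for illustration, not the tool's actual code; the pattern is a stricter variant of the regex above that only accepts the ;base64, form and tolerates extra parameters such as ;codecs=opus.

```typescript
// Hypothetical helper: split a Data URI into its MIME type and Base64 payload.
// Anchored variant of the regex above; optional ";param=value" segments are
// skipped lazily before the mandatory ";base64," marker.
const DATA_URI_PARTS =
  /^data:((?:image|video|audio)\/[^;,"]+|application\/pdf)(?:;[^;,"]+)*?;base64,([^"]*)$/;

interface DataUriMatch {
  mime: string; // e.g. "image/png"
  payload: string; // Base64 text after the comma
}

function parseDataUri(value: string): DataUriMatch | null {
  const m = DATA_URI_PARTS.exec(value);
  if (m === null) return null; // not a media Data URI
  return { mime: m[1], payload: m[2] };
}
```

Anything that is not an image, audio, video, or PDF Data URI comes back as null, so the caller can fall through to the next detection approach.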

Approach #2: Magic Number Detection

The harder case. Many APIs just return raw Base64 with no prefix at all:

{
  "photo": "iVBORw0KGgoAAAANSUhEUg..."
}

You might know it's an image because you called a vision API, but the tool doesn't. How does it figure out that iVBORw0KGgo... is a PNG — and specifically a PNG, not a JPEG or a GIF — so it can render the right <img> tag?

Most binary file formats start with a fixed byte sequence called a magic number (or file signature). PNG files always begin with ‰PNG (0x89 0x50 0x4E 0x47), followed by four more fixed bytes (0x0D 0x0A 0x1A 0x0A). JPEG starts with 0xFF 0xD8 0xFF. PDFs start with %PDF.

So the idea is: decode just the first few bytes of the Base64 string, and check them against a lookup table.

const MAGIC_NUMBERS: Record<string, number[]> = {
  "image/png": [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a],
  "image/jpeg": [0xff, 0xd8, 0xff],
  "image/gif": [0x47, 0x49, 0x46, 0x38],
  "image/bmp": [0x42, 0x4d],
  "audio/ogg": [0x4f, 0x67, 0x67, 0x53],
  "audio/flac": [0x66, 0x4c, 0x61, 0x43],
  "video/webm": [0x1a, 0x45, 0xdf, 0xa3],
  "application/pdf": [0x25, 0x50, 0x44, 0x46],
  // ... 14 formats total
};
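Once you have the leading bytes of a payload, the lookup itself is a prefix comparison. The sketch below is one way to write it; the function name matchSignature and the four-entry table are assumptions chosen to keep the example self-contained, not the tool's real code.

```typescript
// Assumed subset of the signature table, repeated here so the sketch runs alone.
const SIGNATURES: Record<string, number[]> = {
  "image/png": [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a],
  "image/jpeg": [0xff, 0xd8, 0xff],
  "image/gif": [0x47, 0x49, 0x46, 0x38],
  "application/pdf": [0x25, 0x50, 0x44, 0x46],
};

// Return the first MIME type whose signature is a prefix of `header`.
function matchSignature(header: Uint8Array): string | null {
  for (const mime of Object.keys(SIGNATURES)) {
    const sig = SIGNATURES[mime];
    if (header.length < sig.length) continue; // header too short for this format
    if (sig.every((byte, i) => header[i] === byte)) return mime;
  }
  return null; // unknown format
}
```

Short signatures such as BMP's two bytes are far more likely to match by accident than PNG's eight, so in a fuller table it probably makes sense to test longer signatures first.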

The critical performance trick: you only need to decode the first 64 bytes, not the entire string. Every 4 Base64 characters decode to 3 bytes, so 64 bytes need 86 characters, which rounds up to 88 to stay on a 4-character boundary. That is enough to identify any format in the table:

function base64ToBytes(base64: string, maxBytes: number): Uint8Array {
  const charsNeeded = Math.ceil((maxBytes * 4) / 3);
  const aligned = Math.ceil(charsNeeded / 4) * 4; // Base64 alignment
  const slice = base64.slice(0, aligned);
  const binary = atob(slice); // throws on characters outside the Base64 alphabet
  // The aligned slice can decode to a few bytes more than requested; cap it.
  const bytes = new Uint8Array(Math.min(binary.length, maxBytes));
  for (let i = 0; i < bytes.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

// Usage: only decode first 64 bytes
const header = base64ToBytes(rawBase64, 64);

This means the signature check on a 100MB Base64 string takes less than 1ms, because its cost depends only on the 88-character prefix and not on the total length. The full string is never decoded for detection purposes.
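The prefix arithmetic is easy to check by hand. The two helpers below are hypothetical names that just restate the calculation: how many characters to slice for a given byte budget, and how many bytes that slice decodes to.

```typescript
// Base64 characters to slice so that at least `maxBytes` bytes decode cleanly.
function prefixChars(maxBytes: number): number {
  const charsNeeded = Math.ceil((maxBytes * 4) / 3);
  return Math.ceil(charsNeeded / 4) * 4; // round up to a 4-character group
}

// Bytes produced by decoding a padding-free prefix of `chars` characters.
function decodedBytes(chars: number): number {
  return (chars / 4) * 3;
}

// prefixChars(64) is 88, and decodedBytes(88) is 66.
// prefixChars(12) is 16, and decodedBytes(16) is 12 (enough for a RIFF subtype).
```

The work is bounded by the byte budget you ask for, so it stays the same whether the payload is a thumbnail or a feature-length video.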

The RIFF Problem

There's a catch. WebP images, WAV audio, and AVI video all start with the same magic number: RIFF (0x52 0x49 0x46 0x46). You have to look deeper — at bytes 8–11 — to tell them apart:

function checkRiffSubtype(bytes: Uint8Array): string | null {
  if (bytes.length < 12) return null;
  const subtype = String.fromCharCode(bytes[8], bytes[9], bytes[10], bytes[11]);
  if (subtype === "WEBP") return "image/webp";
  if (subtype === "WAVE") return "audio/wav";
  if (subtype === "AVI ") return "video/x-msvideo";
  return null;
}

Similar disambiguation is needed for ftyp containers (MP4 vs M4A vs MOV, differentiated by the four-character brand code that follows ftyp, such as isom, "M4A " or "qt  ", padded with spaces) and MP3 (which can start with either an ID3v2 tag or a raw frame sync such as 0xFF 0xFB).
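A rough sketch of those two checks might look like this. The function names, the short brand list, and the MIME strings returned are my assumptions for illustration; real files use many more brands than the handful listed here. The MP3 check tests the 11-bit frame sync plus the Layer III bits rather than the single value 0xFB, so that other MPEG audio headers are not misread.

```typescript
// Read bytes[start..end) as an ASCII string.
function ascii(bytes: Uint8Array, start: number, end: number): string {
  let s = "";
  for (let i = start; i < end; i++) s += String.fromCharCode(bytes[i]);
  return s;
}

// "ftyp" sits at bytes 4-7; the major brand follows at bytes 8-11.
function checkFtypBrand(bytes: Uint8Array): string | null {
  if (bytes.length < 12 || ascii(bytes, 4, 8) !== "ftyp") return null;
  const brand = ascii(bytes, 8, 12);
  if (brand === "M4A ") return "audio/mp4";
  if (brand === "qt  ") return "video/quicktime";
  // Assumed, non-exhaustive list of common MP4 brands.
  if (brand === "isom" || brand === "mp41" || brand === "mp42") return "video/mp4";
  return null;
}

// MP3 data starts with an "ID3" tag or with a frame header:
// 11 sync bits set, then 2 version bits, then layer bits 01 (Layer III).
function looksLikeMp3(bytes: Uint8Array): boolean {
  if (bytes.length >= 3 && ascii(bytes, 0, 3) === "ID3") return true;
  if (bytes.length < 2) return false;
  const sync = bytes[0] === 0xff && (bytes[1] & 0xe0) === 0xe0;
  const layer3 = ((bytes[1] >> 1) & 0x03) === 0x01;
  return sync && layer3;
}
```

A JPEG header (0xFF 0xD8) fails the sync test because its second byte only has the top two bits set, so the two formats do not collide.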

The String Scanner: Why Not Regex?

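Before you can run magic number detection, you have to find all the strings in the JSON. The obvious approach, a regex like /"([^"\\]|\\.)*"/g, has a fatal flaw on huge inputs. Its two alternatives never overlap, so this is not the classic exponential blowup, but a backtracking engine can still record state for every iteration of the group, and that bookkeeping grows with the length of the match.

A single Base64 string can be millions of characters long. Regex patterns that repeat an alternation can exhaust the engine's stack or memory on strings that long, in some cases crashing the browser tab entirely.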

The solution is a simple O(n) state machine that walks the text character by character:

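function findJsonStrings(text: string) {
  const results: { from: number; to: number; value: string }[] = [];
  let i = 0;
  while (i < text.length) {
    if (text[i] === '"') {
      const start = i++; // index of the opening quote
      while (i < text.length) {
        if (text[i] === "\\")
          i += 2; // skip escaped char
        else if (text[i] === '"')
          break; // closing quote
        else i++;
      }
      // Unterminated strings (no closing quote) are dropped.
      if (i < text.length) {
        i++; // step past the closing quote
        results.push({
          from: start,
          to: i,
          // Raw text between the quotes; escape sequences are not decoded.
          value: text.slice(start + 1, i - 1),
        });
      }
    } else {
      i++;
    }
  }
  return results;
}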

No backtracking, no stack overflow, handles arbitrarily large strings in linear time.

Performance: Handling Huge Base64 in a Code Editor

One practical problem: a single high-res image can produce a Base64 string of 10+ MB. If you feed that directly into a code editor like CodeMirror 6, the browser freezes.

The solution is a TruncationManager that intercepts content before it reaches the editor. Any string over 500 characters gets truncated:

Editor sees:  "iVBOR...(500 chars)__TRUNC_a1b2c3d4__"
Memory holds: the full multi-MB string in a Map

The editor only ever renders 500 characters plus a small placeholder token. Performance stays constant regardless of payload size.

But when the user copies text (Ctrl+C), a clipboard interceptor swaps the truncation token back with the full original content. So what you see is truncated, but what you copy is lossless.

This same mechanism also powers the reverse workflow: paste an image file into the editor, and it stores the Base64 internally. When you copy, you get the full Base64 string — ready to drop into your API request payload.
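The article doesn't show the TruncationManager itself, so the following is only a sketch of the idea under my own assumptions: the class name TruncationStore, the counter-based token, and the choice to keep just the hidden tail in the Map (so that restoring is a plain token replacement) are all hypothetical.

```typescript
// Hypothetical sketch of truncate-on-display, restore-on-copy.
class TruncationStore {
  // token -> the part of the string that was hidden from the editor
  private readonly hiddenTails = new Map<string, string>();
  private counter = 0;

  constructor(private readonly limit: number = 500) {}

  // Returns the text the editor should show for one string value.
  truncate(value: string): string {
    if (value.length <= this.limit) return value;
    const id = ("0000000" + (this.counter++).toString(16)).slice(-8);
    const token = "__TRUNC_" + id + "__";
    this.hiddenTails.set(token, value.slice(this.limit));
    return value.slice(0, this.limit) + token;
  }

  // Called on copy: swap every known token back for its hidden tail.
  restore(text: string): string {
    return text.replace(/__TRUNC_[0-9a-f]{8}__/g, (token) => {
      const tail = this.hiddenTails.get(token);
      return tail === undefined ? token : tail; // leave unknown tokens alone
    });
  }
}
```

With a limit of 5, truncate("abcdefghijklmnop") yields "abcde__TRUNC_00000000__", and passing that through restore gives back the original 16 characters.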

The Complete Pipeline

Putting it all together, here's how each JSON string value gets processed:

String Found
  ├─ Has "data:xxx;base64," prefix?
  │   └─ YES → extract MIME from prefix → render
  │
  ├─ Is pure Base64? (charset check + min 100 chars)
  │   └─ YES → decode 64 bytes → match magic number → render
  │
  └─ Is URL? (http:// + media file extension)
      └─ YES → load URL directly → render
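
The routing in that diagram can be sketched as a single classifier. This is an illustration under assumptions, not the tool's code: the names classify and isPureBase64, the extension list, acceptance of https as well as http, and the tolerance for missing padding are all mine. The charset check is a plain loop, in keeping with the no-regex-on-huge-strings rule.

```typescript
type Route = "data-uri" | "raw-base64" | "url" | "none";

const MIN_BASE64_LENGTH = 100; // threshold from the diagram
// Assumed extension list; a query string after the extension is allowed.
const MEDIA_URL =
  /^https?:\/\/[^\s"]+\.(png|jpe?g|gif|webp|mp3|wav|ogg|mp4|webm|pdf)(\?[^\s"]*)?$/i;

function isBase64Char(code: number): boolean {
  return (
    (code >= 65 && code <= 90) || // A-Z
    (code >= 97 && code <= 122) || // a-z
    (code >= 48 && code <= 57) || // 0-9
    code === 43 || // +
    code === 47 // /
  );
}

function isPureBase64(value: string): boolean {
  if (value.length < MIN_BASE64_LENGTH) return false;
  let end = value.length;
  // Allow up to two trailing '=' padding characters.
  for (let pad = 0; pad < 2 && value.charCodeAt(end - 1) === 61; pad++) end--;
  if (end % 4 === 1) return false; // impossible length for Base64 data
  for (let i = 0; i < end; i++) {
    if (!isBase64Char(value.charCodeAt(i))) return false;
  }
  return true;
}

function classify(value: string): Route {
  if (/^data:[^,"]*;base64,/.test(value)) return "data-uri";
  if (isPureBase64(value)) return "raw-base64";
  if (MEDIA_URL.test(value)) return "url";
  return "none";
}
```

Note that the charset check walks the whole string, so unlike the signature check it is linear in the payload size; it only needs to run once per string.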

Each path feeds into the same rendering layer: images get <img> tags, audio gets <audio> players, video gets <video>, and PDFs get embedded viewers.

Results are cached by a content hash to avoid re-detection on subsequent scans.
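
The article doesn't say which hash is used, so the sketch below is purely an assumption: a 32-bit FNV-1a variant computed over UTF-16 code units, combined with the string length to make the key, and a hypothetical DetectionCache class wrapped around whatever detection function you pass in.

```typescript
// FNV-1a (32-bit) over UTF-16 code units; an assumed choice of hash.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // multiply by the FNV prime, mod 2^32
  }
  return hash >>> 0; // force unsigned
}

// Hypothetical cache: remembers the detected MIME type (or null) per content key.
class DetectionCache {
  private readonly entries = new Map<string, string | null>();

  lookup(value: string, detect: (v: string) => string | null): string | null {
    const key = value.length + ":" + fnv1a(value).toString(16);
    if (this.entries.has(key)) return this.entries.get(key) as string | null;
    const mime = detect(value);
    this.entries.set(key, mime);
    return mime;
  }
}
```

A 32-bit hash can collide, which is why the length is folded into the key; a production version would probably want a wider hash or a final equality check.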

Stack

Astro + React (islands architecture for minimal JS)
CodeMirror 6 (editor engine)
Deployed on Cloudflare Pages

If you work with multimodal AI APIs and spend time decoding Base64 strings manually, the tool is at viewjson.net.

DEV Community