DevToolsmith
I shipped a "PDF to JSON" API and forgot to handle PDFs. Here's the 30-min fix.

This morning I ran a real customer test on my own product — uploaded an actual PDF (a signed contract, 86 KB) to the demo endpoint of my invoice-extraction API.

The response came back in 800 ms. Looked clean, until I scrolled to the `rawText` field:

"rawText": "%PDF-1.4\n%����\n1 0 obj\n<</Title (...)\n/Filter /FlateDecode\n/Length 1679>> stream\nx��Y]o�\r}�_..."
Enter fullscreen mode Exit fullscreen mode

That's not text. That's the raw binary stream of the PDF, UTF-8-decoded into garbage.
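That garbage is easy to reproduce: Node replaces every byte sequence that isn't valid UTF-8 with U+FFFD, the replacement character. A quick sketch (the bytes below are illustrative deflate data, not from the actual contract):

```typescript
// Node substitutes U+FFFD for byte sequences that aren't valid UTF-8,
// which is exactly what a FlateDecode stream looks like after toString().
const streamBytes = Buffer.from([0x78, 0x9c, 0xed, 0x59, 0x5d, 0x6f, 0xdb, 0x36]);
const decoded = streamBytes.toString('utf-8');
console.log(decoded.includes('\ufffd')); // true: compressed bytes aren't text
```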

The bug

Looking at the code I had written for the demo route (/api/v1/demo):

```typescript
const buffer = Buffer.from(await file.arrayBuffer());

if (file.type === DOCX_MIME) return extractFromDocx(buffer, ...);
if (file.type === XLSX_MIME) return extractFromXlsx(buffer, ...);

const text = buffer.toString('utf-8', 0, Math.min(buffer.length, 100000));
return extractFromText(text, ...);
```

DOCX ✅. XLSX ✅. PDF: missing branch — falls through to buffer.toString('utf-8').

For DOCX/XLSX it doesn't matter (those branches catch them by MIME). But PDFs are binary with FlateDecode-compressed text streams. UTF-8-decoding them produces unreadable bytes.
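A cheap guard before that fallthrough would have caught this whole class of bug. A sketch (`looksBinary` is a hypothetical helper, not from the post's code):

```typescript
// Sketch of a binary-content guard: count control bytes that never appear
// in plain text before trusting a blind toString('utf-8') fallthrough.
function looksBinary(buffer: Buffer): boolean {
  const sample = buffer.subarray(0, Math.min(buffer.length, 1024));
  let suspicious = 0;
  for (const byte of sample) {
    // Allow tab (0x09) and LF..CR (0x0a-0x0d); flag NUL and other C0 controls.
    if (byte < 0x09 || (byte > 0x0d && byte < 0x20)) suspicious++;
  }
  return sample.length > 0 && suspicious / sample.length > 0.05;
}
```

With something like this in place, the demo route could have returned a 415 instead of serving mojibake.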

The fix

The production endpoint (/api/v1/extract) already had it right — I just hadn't replicated the pattern in the demo route. Two changes:

```typescript
import pdfParse from 'pdf-parse';

export const runtime = 'nodejs';
export const maxDuration = 30;

// Magic bytes detection (defends against MIME spoofing)
const sig = buffer.subarray(0, 4);
const isPDFBytes = sig[0] === 0x25 && sig[1] === 0x50
                && sig[2] === 0x44 && sig[3] === 0x46; // %PDF

if (file.type === 'application/pdf' || file.name.endsWith('.pdf') || isPDFBytes) {
  try {
    const pdfData = await Promise.race([
      pdfParse(buffer),
      new Promise<never>((_, reject) =>
        setTimeout(() => reject(new Error('PDF_PARSE_TIMEOUT')), 25_000)
      ),
    ]);
    const text = pdfData.text || '';
    if (text.length < 10) {
      return NextResponse.json(
        { error: 'Could not extract text from PDF.', code: 'EXTRACTION_FAILED' },
        { status: 422 }
      );
    }
    return NextResponse.json(extractFromText(text, ...));
  } catch (err) {
    if (err instanceof Error && err.message === 'PDF_PARSE_TIMEOUT') {
      return NextResponse.json(
        { error: 'PDF processing timed out.', code: 'PDF_TIMEOUT' },
        { status: 504 }
      );
    }
    return NextResponse.json(
      { error: 'PDF parsing failed.', code: 'PDF_PARSE_ERROR' },
      { status: 422 }
    );
  }
}
```
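The timeout pattern above can be factored into a small reusable helper. A sketch (`withTimeout` is my name for it; the post inlines the pattern instead):

```typescript
// Sketch: the Promise.race timeout pattern as a reusable helper.
// One refinement over the inline version: the timer is cleared once the
// race settles, so a fast parse doesn't leave a 25 s timer pending.
function withTimeout<T>(promise: Promise<T>, ms: number, code: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(code)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}
```

Usage would then be `await withTimeout(pdfParse(buffer), 25_000, 'PDF_PARSE_TIMEOUT')`.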

Three things worth calling out:

  1. **`Promise.race` with a timeout.** `pdfParse` can hang on a malformed or zip-bomb PDF. The 25 s ceiling caps those hangs; it can't preempt fully synchronous CPU-bound work (nothing can, short of a worker thread), but it bounds async stalls so the function isn't stuck forever.
  2. **Magic bytes check** (`%PDF` = `0x25 0x50 0x44 0x46`). Never trust the client-declared MIME type alone; sniffing the bytes catches spoofed `Content-Type` headers.
  3. **Explicit `runtime = 'nodejs'`.** `pdf-parse` relies on Node `Buffer` APIs that don't exist on edge runtimes. On Vercel Fluid Compute Node is the default, but stating it makes the contract loud.
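The magic-bytes idea generalizes: DOCX and XLSX are both ZIP containers, so they share one signature. A sketch (the helper and its return type are mine, not the post's production code):

```typescript
// Sketch: sniff a file's real type from its leading bytes rather than
// trusting the client-declared Content-Type.
type SniffedType = 'pdf' | 'zip' | 'unknown';

function sniffType(buffer: Buffer): SniffedType {
  if (buffer.length >= 4) {
    const [a, b, c, d] = buffer.subarray(0, 4);
    if (a === 0x25 && b === 0x50 && c === 0x44 && d === 0x46) return 'pdf'; // %PDF
    if (a === 0x50 && b === 0x4b && c === 0x03 && d === 0x04) return 'zip'; // PK\x03\x04 (DOCX/XLSX/ZIP)
  }
  return 'unknown';
}
```

A 'zip' result still needs the MIME type (or the ZIP's internal structure) to tell DOCX from XLSX, but it's enough to route away from the plain-text fallthrough.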

After the fix

Re-uploaded the same 86 KB contract:

```json
{
  "documentType": "contract",
  "confidence": 0.85,
  "data": {
    "title": "DealMirror Terms & Conditions — Acknowledged",
    "fields": {
      "acknowledged_by": "DevToolsmith (Antonio Altomonte)",
      "redemption_period": "60 days from purchase",
      "revenue_share_tier_accepted": "...",
      ...
    }
  }
}
```

14 structured fields. Real text. Shipped in 30 minutes from detection to production.

Lesson

Test your own demos with real customer data, not just the sample file in the playground. Sample files drift out of sync with the actual code paths over time. And the customer doesn't open your playground first; they hit the demo endpoint cold.

Try it: https://parseflow.dev (free API key, no card).
