This morning I ran a real customer test on my own product — uploaded an actual PDF (a signed contract, 86 KB) to the demo endpoint of my invoice-extraction API.
The response came back in 800ms. Looked clean. Until I scrolled to rawText:
"rawText": "%PDF-1.4\n%����\n1 0 obj\n<</Title (...)\n/Filter /FlateDecode\n/Length 1679>> stream\nx��Y]o�\r}�_..."
That's not text. That's the raw binary stream of the PDF, UTF-8-decoded into garbage.
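One cheap regression guard for this class of bug: when binary data is UTF-8-decoded, invalid byte sequences become U+FFFD replacement characters, so "extracted text" that is really a raw PDF stream is easy to flag. A minimal sketch (looksLikeBinaryGarbage is a hypothetical helper, not part of the API):

```typescript
// Heuristic: binary data decoded as UTF-8 yields U+FFFD replacement
// characters; genuinely extracted text should contain almost none.
export function looksLikeBinaryGarbage(text: string): boolean {
  if (text.length === 0) return false;
  let bad = 0;
  for (const ch of text) {
    if (ch === '\uFFFD') bad++;
  }
  // >5% replacement characters: almost certainly binary, not text.
  return bad / text.length > 0.05;
}
```

A check like this in the response-shape tests would have caught the garbled rawText before a customer did.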
The bug
Looking at the code I had written for the demo route (/api/v1/demo):
const buffer = Buffer.from(await file.arrayBuffer());
if (file.type === DOCX_MIME) return extractFromDocx(buffer, ...);
if (file.type === XLSX_MIME) return extractFromXlsx(buffer, ...);
const text = buffer.toString('utf-8', 0, Math.min(buffer.length, 100000));
return extractFromText(text, ...);
DOCX ✅. XLSX ✅. PDF: missing branch — falls through to buffer.toString('utf-8').
For DOCX/XLSX it doesn't matter (those branches catch them by MIME). But PDFs are binary with FlateDecode-compressed text streams. UTF-8-decoding them produces unreadable bytes.
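Classifying uploads by their leading bytes up front avoids this kind of fallthrough entirely: PDFs start with %PDF (0x25 0x50 0x44 0x46), and DOCX/XLSX are ZIP containers starting with PK\x03\x04. A sketch of that idea (sniffKind is a hypothetical helper, not the post's actual code):

```typescript
// Hypothetical sniffer: classify an upload by its magic bytes instead of
// trusting the client-declared MIME type.
type Kind = 'pdf' | 'zip-office' | 'unknown';

export function sniffKind(bytes: Uint8Array): Kind {
  if (bytes.length < 4) return 'unknown';
  // %PDF → 0x25 0x50 0x44 0x46
  if (bytes[0] === 0x25 && bytes[1] === 0x50 &&
      bytes[2] === 0x44 && bytes[3] === 0x46) return 'pdf';
  // DOCX/XLSX are ZIP archives → PK\x03\x04
  if (bytes[0] === 0x50 && bytes[1] === 0x4b &&
      bytes[2] === 0x03 && bytes[3] === 0x04) return 'zip-office';
  return 'unknown';
}
```

With a sniffer like this, an unhandled kind can return a 415 instead of silently decoding bytes as text.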
The fix
The production endpoint (/api/v1/extract) already had it right — I just hadn't replicated the pattern in the demo route. Two changes:
import pdfParse from 'pdf-parse';

export const runtime = 'nodejs';
export const maxDuration = 30;

// Magic bytes detection (defends against MIME spoofing)
const sig = buffer.subarray(0, 4);
const isPDFBytes =
  sig[0] === 0x25 && sig[1] === 0x50 &&
  sig[2] === 0x44 && sig[3] === 0x46; // %PDF

if (file.type === 'application/pdf' || file.name.endsWith('.pdf') || isPDFBytes) {
  try {
    const pdfData = await Promise.race([
      pdfParse(buffer),
      new Promise<never>((_, reject) =>
        setTimeout(() => reject(new Error('PDF_PARSE_TIMEOUT')), 25_000)
      ),
    ]);
    const text = pdfData.text || '';
    if (text.length < 10) {
      return NextResponse.json(
        { error: 'Could not extract text from PDF.', code: 'EXTRACTION_FAILED' },
        { status: 422 }
      );
    }
    return NextResponse.json(extractFromText(text, ...));
  } catch (err) {
    if (err instanceof Error && err.message === 'PDF_PARSE_TIMEOUT') {
      return NextResponse.json(
        { error: 'PDF processing timed out.', code: 'PDF_TIMEOUT' },
        { status: 504 }
      );
    }
    return NextResponse.json(
      { error: 'PDF parsing failed.', code: 'PDF_PARSE_ERROR' },
      { status: 422 }
    );
  }
}
Three things worth calling out:

1. Promise.race with a timeout: pdfParse can take unbounded time on a malformed or zip-bomb PDF. The 25s ceiling caps the function's runtime instead of letting it burn the full serverless budget.

2. Magic bytes check (%PDF = 0x25 0x50 0x44 0x46): never trust the client-declared MIME type alone; this catches a spoofed Content-Type.

3. Explicit runtime = 'nodejs': pdf-parse uses Node Buffer APIs that aren't available on edge runtimes. On Vercel Fluid Compute this is the default, but stating it makes the contract loud.
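The Promise.race pattern from the fix generalizes into a small reusable helper. A sketch (withTimeout is a hypothetical name; note the caveat that a race like this only caps work that yields to the event loop, so a fully synchronous parse could still overrun the ceiling):

```typescript
// Hypothetical generic form of the Promise.race-with-timeout pattern.
// Caveat: this caps awaitable work; code that never yields to the event
// loop can still run past the ceiling before the rejection is observed.
export function withTimeout<T>(work: Promise<T>, ms: number, tag: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(tag)), ms);
  });
  // Clear the timer either way so it doesn't hold the process open
  // after the real work wins the race.
  return Promise.race([work, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}
```

Clearing the timer in finally matters on serverless: a stray 25-second timer can keep a function instance alive (and billed) long after the response went out.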
After the fix
Re-uploaded the same 86 KB contract:
{
  "documentType": "contract",
  "confidence": 0.85,
  "data": {
    "title": "DealMirror Terms & Conditions — Acknowledged",
    "fields": {
      "acknowledged_by": "DevToolsmith (Antonio Altomonte)",
      "redemption_period": "60 days from purchase",
      "revenue_share_tier_accepted": "...",
      ...
    }
  }
}
14 structured fields. Real text. Shipped in 30 minutes from detection to production.
Lesson
Test your own demos with real customer data, not just the sample file in the playground. Sample files drift out of sync with the actual code path over time. The customer doesn't open your playground first; they hit the demo endpoint cold.
Try it: https://parseflow.dev (free API key, no card).