
Sriyansh Gupta


5 Obstacles I Hit Building PDF RAG in Next.js 15 — And How I Fixed Each One

Building PDF RAG in NochBot almost broke me.

Here's every obstacle I hit — and exactly
how I got through them.


Obstacle 1 — The PDF Parser Nightmare

Started with pdf-parse. Simple. Popular.
Everyone uses it.

First error:
ReferenceError: DOMMatrix is not defined

Turns out pdf-parse loads browser APIs
at module level — Node.js has no idea
what DOMMatrix is.

Switched to pdf-parse-new.
New error:
Module not found: Can't resolve
'./ROOT/node_modules/pdf-parse-new/lib/pdf-child.js'

Broken internal paths.
Incompatible with Turbopack.

Third attempt — unpdf.
Built specifically for edge/serverless environments.
No browser dependencies.
Dynamic import inside the function body.

export async function parsePDF(buffer: Buffer) {
  const { extractText } = await import('unpdf');
  const uint8 = new Uint8Array(buffer);
  const { text, totalPages } = await extractText(uint8, { 
    mergePages: true 
  });
  return { text, pageCount: totalPages };
}

Finally worked. ✅
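
The dynamic import is what keeps the library out of module-level evaluation. A minimal sketch of the same idea as a reusable helper, assuming you want to load a module on first use and cache it afterwards. The name lazyModule is hypothetical and not part of unpdf or Next.js:

```typescript
// Hypothetical helper (not from any library): defer loading until the
// first call, then reuse the same promise on every later call.
function lazyModule<T>(loader: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (!cached) {
      cached = loader().catch((err: unknown) => {
        cached = undefined; // a failed load should not be cached; allow a retry
        throw err;
      });
    }
    return cached;
  };
}
```

Under these assumptions, something like lazyModule(() => import('unpdf')) at the top of a file would not touch the library until the returned function is called inside a request handler.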

Lesson: Library choice matters more
than your actual logic.
Always check Turbopack compatibility first.


Obstacle 2 — Chunking Was Killing Retrieval

First version chunked on fixed word count.
300 words. Clean. Simple.

But retrieval was terrible.

A heading would end up in one chunk.
Its explanation in the next.
The LLM had zero context to work with.

The fix — paragraph boundary chunking
with overlap:

export function chunkText(markdown: string) {
  const blocks = markdown
    .split(/\n\n+/)
    .map(b => b.trim())
    .filter(Boolean);

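  // Explicit types so TypeScript does not fall back to implicit any[]
  const chunks: string[] = [];
  let currentWords: string[] = [];  // running word budget, overlap included
  let currentBlocks: string[] = []; // paragraphs in the chunk being built
  const wordsPerChunk = 250;
  const overlapWords = 25;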

  for (const block of blocks) {
    const blockWords = block.split(/\s+/).filter(Boolean);

    if (currentWords.length + blockWords.length > wordsPerChunk 
        && currentWords.length > 0) {
      chunks.push(currentBlocks.join('\n\n'));
      const overlapText = currentWords.slice(-overlapWords).join(' ');
      currentWords = overlapText ? overlapText.split(' ') : [];
      currentBlocks = overlapText ? [overlapText] : [];
    }

    currentWords.push(...blockWords);
    currentBlocks.push(block);
  }

  if (currentBlocks.length) chunks.push(currentBlocks.join('\n\n'));
  return chunks;
}

Split on paragraph boundaries, and use
the word count only as a budget.
The 25-word overlap carries context
across chunk boundaries.

Retrieval quality jumped immediately.
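
One case the chunker above does not cover: a single paragraph longer than the word budget is emitted as one oversized chunk, because blocks are never split. A possible pre-processing step, sketched under the assumption that falling back to overlapping word windows is acceptable for such paragraphs. The helper name splitLongBlock is hypothetical:

```typescript
// Hypothetical helper: break one oversized paragraph into overlapping
// word windows. Paragraphs within the budget are returned unchanged.
function splitLongBlock(
  block: string,
  maxWords = 250,
  overlapWords = 25,
): string[] {
  if (overlapWords >= maxWords) {
    throw new RangeError('overlapWords must be smaller than maxWords');
  }
  const words = block.split(/\s+/).filter(Boolean);
  if (words.length === 0) return [];
  if (words.length <= maxWords) return [words.join(' ')];

  const step = maxWords - overlapWords; // how far each window advances
  const pieces: string[] = [];
  for (let start = 0; start < words.length; start += step) {
    pieces.push(words.slice(start, start + maxWords).join(' '));
    if (start + maxWords >= words.length) break; // last window reached the end
  }
  return pieces;
}
```

Running each block through a helper like this before the main loop would keep every chunk near the 250-word budget. Whether windowed splitting hurts retrieval for very long paragraphs is something to measure on your own data.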

Lesson: How you break data apart
matters as much as how you store it.


Obstacle 3 — Blank UI Despite Working Backend

Vectorization was working perfectly.
Data storing in Qdrant — confirmed.

But the Vectorize tab in dashboard?
Completely blank.

Spent 2 hours convinced it was
a backend issue.

It was a useState value that never updated.

The API was returning data correctly.
The component was receiving it.
But the result was assigned to the variable
instead of passed to the setter, so React
never re-rendered.

// Wrong: reassigning the variable bypasses the state setter
const data = await res.json();
vectorizeData = data; // ❌ no re-render

// Correct
const data = await res.json();
setVectorizeData(data); // ✅ triggers re-render

One line fix. 2 hours wasted.
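
Why the setter matters can be shown with a toy model. This is not how React implements useState; createState and its methods are hypothetical names for a minimal store where only the setter notifies subscribers:

```typescript
type Listener<T> = (value: T) => void;

// Toy model only: a value plus a setter that notifies subscribers.
// Reassigning a local copy never reaches the listeners.
function createState<T>(initial: T) {
  let value = initial;
  const listeners = new Set<Listener<T>>();
  return {
    get: (): T => value,
    set(next: T): void {
      if (Object.is(next, value)) return; // unchanged value: nothing to redraw
      value = next;
      listeners.forEach((fn) => fn(value)); // stands in for a re-render
    },
    subscribe(fn: Listener<T>): () => void {
      listeners.add(fn);
      return () => {
        listeners.delete(fn);
      };
    },
  };
}
```

In this model, assigning to a variable that holds the result of get() changes nothing the subscribers can see. Only set() runs them, which is the same reason the UI stayed blank until the setter was called.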

Lesson: When backend works but
UI doesn't — check state management first.
Always.


Obstacle 4 — LLM Overthinking Simple Questions

Asked the bot:

"How many pages are in this PDF?"

Got back:

"Step 1 — Analyze the PDF context...
Step 2 — Identify key concepts...
Step 3 — Determine the answer..."

300 words. For a 2-word answer.

Two fixes:

Fix 1 — Prompt:

For simple factual questions —
answer in 1-2 sentences ONLY.
NEVER show reasoning steps.
Just give the final answer.

Fix 2 — Temperature:

// Before
temperature: 0.6  // too creative for facts

// After  
temperature: 0.2  // precise and direct

Both fixes together eliminated
the overthinking completely.
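
A minimal sketch of how the two fixes could sit together in one request body. It assumes an OpenAI-compatible chat completions payload; the function name buildChatRequest, the first prompt line, the way context is passed, and the model id are all assumptions, so check them against your provider's documentation:

```typescript
interface ChatMessage {
  role: 'system' | 'user';
  content: string;
}

interface ChatRequest {
  model: string;
  temperature: number;
  messages: ChatMessage[];
}

// The first line is an assumed grounding rule; the other two are the
// prompt rules quoted above.
const SYSTEM_PROMPT = [
  'Answer using only the PDF context provided.',
  'For simple factual questions, answer in 1-2 sentences ONLY.',
  'NEVER show reasoning steps. Just give the final answer.',
].join('\n');

// Hypothetical builder: combines the strict prompt with a low temperature.
function buildChatRequest(question: string, contextChunks: string[]): ChatRequest {
  const context = contextChunks.join('\n\n');
  return {
    model: 'llama-3.3-70b-versatile', // assumed model id
    temperature: 0.2,
    messages: [
      { role: 'system', content: SYSTEM_PROMPT },
      { role: 'user', content: `Context:\n${context}\n\nQuestion: ${question}` },
    ],
  };
}
```

Keeping the prompt and the temperature in one place makes it harder for one of them to drift when the other is tuned.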

Lesson: If your LLM output is bad,
check your prompt and sampling settings first.
The model is rarely the problem.


Obstacle 5 — PDFs Disappearing On Navigation

Users navigating away from the PDF page
and coming back — their uploaded PDFs
were visually gone.

But only visually. Data was still in the DB.

Root cause: Next.js App Router was
caching the page. Component wasn't
remounting. The fetch wasn't re-triggering.

Fix — visibilitychange event listener:

useEffect(() => {
  const handleVisibilityChange = () => {
    if (document.visibilityState === 'visible') {
      fetchPdfs(); // re-fetch when tab becomes active
    }
  };

  document.addEventListener(
    'visibilitychange', 
    handleVisibilityChange
  );

  return () => document.removeEventListener(
    'visibilitychange', 
    handleVisibilityChange
  );
}, []);

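Every time the tab becomes visible again,
the PDF list is re-fetched.
One caveat: visibilitychange fires on tab switches
and window minimize/restore, not on in-app
client-side route changes. Keep the fetch on mount too.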
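
If users flip between tabs often, a listener like this re-fetches on every switch. A possible refinement, sketched under the assumption that a slightly stale list is acceptable for a few seconds. The helper createThrottledRefetch and its 5-second default are hypothetical:

```typescript
// Hypothetical guard: run the refetch at most once per interval.
// The clock is injectable so the behaviour can be checked without waiting.
function createThrottledRefetch(
  refetch: () => void,
  minIntervalMs = 5000,
  now: () => number = Date.now,
): () => boolean {
  let lastRun = Number.NEGATIVE_INFINITY; // first call always runs
  return () => {
    const current = now();
    if (current - lastRun < minIntervalMs) return false; // too soon, skip
    lastRun = current;
    refetch();
    return true;
  };
}
```

Inside the visibilitychange handler, the direct fetchPdfs() call could be swapped for the function this helper returns, so rapid tab switches do not hit the API each time.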

Lesson: Next.js App Router caching
will catch you off guard.
For data that needs to stay fresh,
re-fetch when the page becomes visible again
instead of relying on a remount.


The Full Stack

Here's everything I used to build this:

| Layer       | Technology                       |
|-------------|----------------------------------|
| Framework   | Next.js 15 + Turbopack           |
| PDF Parsing | unpdf                            |
| Embeddings  | Google Gemini text-embedding-004 |
| Vector DB   | Qdrant Cloud                     |
| LLM         | Groq LLaMA 3.3 70B               |
| Database    | MongoDB Atlas                    |
| Deployment  | Vercel                           |

What's Live Now

All of this is running in production
inside NochBot, a multi-tenant AI chatbot SaaS.

Features shipped:

  • PDF upload + vectorization
  • Shared chat links per PDF
  • Session analytics — see every conversation
  • Dual mode — study assistant + sales catalog bot
  • Stop generation mid-response
  • Free plan: 1 PDF limit

TL;DR

  1. Use unpdf, not pdf-parse, in Next.js 15
  2. Chunk on paragraph boundaries, not word count
  3. When UI is blank — check useState first
  4. Low temperature + explicit prompts = no overthinking
  5. Use visibilitychange to handle App Router caching

If you're building RAG — save this.
These are the exact walls you're going to hit.


Building NochBot solo — documenting
everything as I go.

Follow for more real-world RAG + Next.js content.

Top comments (1)

Tae Kim

The paragraph boundary fix landed for me on a trade news RAG. Hard word-count splits broke entity resolution because company names and metrics would land in separate chunks. 25-word overlap caught most of those edge cases.