Sriyansh Gupta

Posted on Jun 10

5 Obstacles I Hit Building PDF RAG in Next.js 15 (And How I Fixed Each One)

#ai #webdev #rag #react

Building PDF RAG in NochBot almost broke me.

Here's every obstacle I hit — and exactly
how I got through them.

Obstacle 1 — The PDF Parser Nightmare

Started with pdf-parse. Simple. Popular.
Everyone uses it.

First error:
ReferenceError: DOMMatrix is not defined

Turns out pdf-parse loads browser APIs
at module level — Node.js has no idea
what DOMMatrix is.

Switched to pdf-parse-new.
New error:
Module not found: Can't resolve
'./ROOT/node_modules/pdf-parse-new/lib/pdf-child.js'

Broken internal paths.
Incompatible with Turbopack.

Third attempt — unpdf.
Built specifically for edge/serverless environments.
No browser dependencies.
Dynamic import inside the function body.

export async function parsePDF(buffer: Buffer) {
  const { extractText } = await import('unpdf');
  const uint8 = new Uint8Array(buffer);
  const { text, totalPages } = await extractText(uint8, { 
    mergePages: true 
  });
  return { text, pageCount: totalPages };
}

Finally worked. ✅

Lesson: Library choice matters more
than your actual logic.
Always check Turbopack compatibility first.
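
The "dynamic import inside the function body" point is the part that keeps a parser from being evaluated when the route module loads. Here is a minimal sketch of that idea as a reusable helper. The name lazyOnce is hypothetical (it is not part of unpdf or Next.js), and import() is already cached by the module system, so the helper mainly makes the deferral explicit:

```typescript
// Hypothetical helper: run an async loader on first use only,
// then hand back the same promise on every later call.
function lazyOnce<T>(loader: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (cached === undefined) {
      cached = loader().catch((err: unknown) => {
        cached = undefined; // forget a failed load so the next call can retry
        throw err;
      });
    }
    return cached;
  };
}

// Assumed usage with the parser from this post (not executed here):
// const loadUnpdf = lazyOnce(() => import('unpdf'));
// const { extractText } = await loadUnpdf();
```

Nothing is loaded until the returned function is called, so a library that touches browser globals at module level cannot crash the server at import time.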

Obstacle 2 — Chunking Was Killing Retrieval

First version chunked on fixed word count.
300 words. Clean. Simple.

But retrieval was terrible.

A heading would end up in one chunk.
Its explanation in the next.
The LLM had zero context to work with.

The fix — paragraph boundary chunking
with overlap:

export function chunkText(markdown: string): string[] {
  // Split on blank lines so each block is a paragraph or a heading.
  const blocks = markdown
    .split(/\n\n+/)
    .map(b => b.trim())
    .filter(Boolean);

  const chunks: string[] = [];
  let currentWords: string[] = [];
  let currentBlocks: string[] = [];
  const wordsPerChunk = 250; // soft budget; a block is never cut in half
  const overlapWords = 25;   // words carried over into the next chunk

  for (const block of blocks) {
    const blockWords = block.split(/\s+/).filter(Boolean);

    // Flush before the budget would be exceeded, but never emit an empty chunk.
    if (currentWords.length + blockWords.length > wordsPerChunk
        && currentWords.length > 0) {
      chunks.push(currentBlocks.join('\n\n'));
      // Seed the next chunk with the tail of the previous one.
      const overlap = currentWords.slice(-overlapWords);
      currentWords = [...overlap];
      currentBlocks = overlap.length ? [overlap.join(' ')] : [];
    }

    currentWords.push(...blockWords);
    currentBlocks.push(block);
  }

  if (currentBlocks.length) chunks.push(currentBlocks.join('\n\n'));
  return chunks;
}

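Split on paragraph boundaries, not at a hard word count.
The 250-word budget only decides when to start a new chunk.
It never cuts a paragraph in half.
25-word overlap ensures context isn't lost
at chunk boundaries.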

Retrieval quality jumped immediately.

Lesson: How you break data apart
matters as much as how you store it.
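
One gap in the chunker above: a single paragraph longer than the 250-word budget goes through whole, because blocks are never split. A minimal sketch of one way to handle that, assuming you pre-split oversized blocks into overlapping word windows before chunking. The helper name splitOversizedBlock and its parameters are hypothetical, not something from the original code:

```typescript
// Hypothetical helper: cut one oversized paragraph into word windows
// of at most maxWords, each sharing overlapWords with the previous one.
function splitOversizedBlock(
  block: string,
  maxWords = 250,
  overlapWords = 25,
): string[] {
  if (overlapWords < 0 || overlapWords >= maxWords) {
    throw new Error('overlapWords must be >= 0 and smaller than maxWords');
  }

  const words = block.split(/\s+/).filter(Boolean);
  // Small blocks pass through as one piece (whitespace is normalized).
  if (words.length <= maxWords) {
    return words.length ? [words.join(' ')] : [];
  }

  const pieces: string[] = [];
  const step = maxWords - overlapWords; // how far each window advances
  for (let start = 0; start < words.length; start += step) {
    pieces.push(words.slice(start, start + maxWords).join(' '));
    if (start + maxWords >= words.length) break; // this window reached the end
  }
  return pieces;
}
```

With this in place, something like blocks.flatMap(b => splitOversizedBlock(b)) before the main loop would keep every block within the budget, at the cost of cutting long paragraphs mid-sentence.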

Obstacle 3 — Blank UI Despite Working Backend

Vectorization was working perfectly.
Data storing in Qdrant — confirmed.

But the Vectorize tab in dashboard?
Completely blank.

Spent 2 hours convinced it was
a backend issue.

It was a useState value that never updated.

The API was returning data correctly.
The component was receiving it.
But the result was assigned to the variable
directly instead of going through the setter,
so React never scheduled a re-render.

// Wrong — mutating state directly
const data = await res.json();
vectorizeData = data; // ❌ no re-render

// Correct
const data = await res.json();
setVectorizeData(data); // ✅ triggers re-render

One line fix. 2 hours wasted.

Lesson: When backend works but
UI doesn't — check state management first.
Always.

Obstacle 4 — LLM Overthinking Simple Questions

Asked the bot:

"How many pages are in this PDF?"

Got back:

"Step 1 — Analyze the PDF context...
Step 2 — Identify key concepts...
Step 3 — Determine the answer..."

300 words. For a 2-word answer.

Two fixes:

Fix 1: Explicit instruction in the system prompt:

For simple factual questions,
answer in 1-2 sentences ONLY.
NEVER show reasoning steps.
Just give the final answer.

Fix 2: Temperature:

// Before
temperature: 0.6  // too creative for facts

// After  
temperature: 0.2  // precise and direct

Both fixes together eliminated
the overthinking completely.

Lesson: If your LLM output is bad —
your prompt is the problem.
Not the model.
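
To show how the two fixes sit together, here is a minimal sketch of a request builder. It assumes an OpenAI-style chat completions payload (model, temperature, messages). The function name buildChatRequest, the user message layout, and the model id are assumptions for illustration, so check your provider's model list before copying it:

```typescript
type ChatMessage = { role: 'system' | 'user'; content: string };

interface ChatRequest {
  model: string;
  temperature: number;
  messages: ChatMessage[];
}

// The instruction text from Fix 1, sent as the system message.
const SYSTEM_PROMPT = [
  'For simple factual questions, answer in 1-2 sentences ONLY.',
  'NEVER show reasoning steps.',
  'Just give the final answer.',
].join('\n');

// Hypothetical builder: combines the prompt rule with the low temperature.
function buildChatRequest(question: string, context: string): ChatRequest {
  return {
    model: 'llama-3.3-70b-versatile', // assumed model id, verify before use
    temperature: 0.2, // Fix 2: precise and direct
    messages: [
      { role: 'system', content: SYSTEM_PROMPT },
      // Assumed layout: retrieved chunks first, then the question.
      { role: 'user', content: `Context:\n${context}\n\nQuestion: ${question}` },
    ],
  };
}
```

Keeping the prompt and temperature in one place means a change to either applies to every request.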

Obstacle 5 — PDFs Disappearing On Navigation

Users navigating away from the PDF page
and coming back — their uploaded PDFs
were visually gone.

But only visually. Data was still in the DB.

Root cause: Next.js App Router was
caching the page. Component wasn't
remounting. The fetch wasn't re-triggering.

Fix — visibilitychange event listener:

useEffect(() => {
  const handleVisibilityChange = () => {
    if (document.visibilityState === 'visible') {
      fetchPdfs(); // re-fetch when tab becomes active
    }
  };

  document.addEventListener(
    'visibilitychange', 
    handleVisibilityChange
  );

  return () => document.removeEventListener(
    'visibilitychange', 
    handleVisibilityChange
  );
}, []);

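Every time the tab becomes visible again,
the PDF list is re-fetched.
Note that visibilitychange fires when the tab is hidden or shown,
not on client-side route changes,
so keep the initial fetch on mount as well.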

Lesson: Next.js App Router caching
will catch you off guard.
Always handle the visibilitychange event
for data that needs to stay fresh.
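
The listener pattern can be pulled out of the component so it is easy to test. This is a minimal sketch, assuming you want one fetch on mount plus a re-fetch whenever the tab becomes visible. The names refetchOnVisible and VisibilitySource are hypothetical, and the useEffect line in the comment is assumed usage, not the original code:

```typescript
// Minimal structural type so the helper can run without a browser.
// The real `document` object has these members.
interface VisibilitySource {
  visibilityState: string;
  addEventListener(type: 'visibilitychange', listener: () => void): void;
  removeEventListener(type: 'visibilitychange', listener: () => void): void;
}

// Hypothetical helper: fetch once immediately, then again each time the
// source reports it is visible. Returns a cleanup function.
function refetchOnVisible(
  source: VisibilitySource,
  refetch: () => void,
): () => void {
  refetch(); // initial load, so a fresh mount always shows data

  const onChange = () => {
    if (source.visibilityState === 'visible') refetch();
  };

  source.addEventListener('visibilitychange', onChange);
  return () => source.removeEventListener('visibilitychange', onChange);
}

// Assumed usage inside a client component (not executed here):
// useEffect(() => refetchOnVisible(document, fetchPdfs), []);
```

Returning the cleanup function means the helper can be handed straight to useEffect, which removes the listener on unmount.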

The Full Stack

Here's everything I used to build this:

Layer	Technology
Framework	Next.js 15 + Turbopack
PDF Parsing	unpdf
Embeddings	Google Gemini text-embedding-004
Vector DB	Qdrant Cloud
LLM	Groq LLaMA 3.3 70B
Database	MongoDB Atlas
Deployment	Vercel

What's Live Now

All of this is running in production
inside NochBot —
a multi-tenant AI chatbot SaaS.

Features shipped:

PDF upload + vectorization
Shared chat links per PDF
Session analytics — see every conversation
Dual mode — study assistant + sales catalog bot
Stop generation mid-response
Free plan: 1 PDF limit

TL;DR

Use unpdf not pdf-parse in Next.js 15
Chunk on paragraph boundaries, not word count
When UI is blank — check useState first
Low temperature + explicit prompts = no overthinking
Use visibilitychange to handle App Router caching

If you're building RAG — save this.
These are the exact walls you're going to hit.

Building NochBot solo — documenting
everything as I go.
Follow for more real-world RAG + Next.js content.

Top comments (1)

Tae Kim • Jun 10

The paragraph boundary fix landed for me on a trade news RAG. Hard word-count splits broke entity resolution because company names and metrics would land in separate chunks. 25-word overlap caught most of those edge cases.