Building PDF RAG in NochBot almost broke me.
Here's every obstacle I hit — and exactly
how I got through them.
Obstacle 1 — The PDF Parser Nightmare
Started with pdf-parse. Simple. Popular.
Everyone uses it.
First error:
ReferenceError: DOMMatrix is not defined
Turns out pdf-parse loads browser APIs
at module level — Node.js has no idea
what DOMMatrix is.
Switched to pdf-parse-new.
New error:
Module not found: Can't resolve
'./ROOT/node_modules/pdf-parse-new/lib/pdf-child.js'
Broken internal paths.
Incompatible with Turbopack.
Third attempt — unpdf.
Built specifically for edge/serverless environments.
No browser dependencies.
Dynamic import inside the function body.
export async function parsePDF(buffer: Buffer) {
  // Import lazily so nothing PDF-related loads at module level.
  const { extractText } = await import('unpdf');
  const uint8 = new Uint8Array(buffer);
  const { text, totalPages } = await extractText(uint8, {
    mergePages: true
  });
  return { text, pageCount: totalPages };
}
Finally worked. ✅
Lesson: Library choice matters more
than your actual logic.
Always check Turbopack compatibility first.
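The "dynamic import inside the function body" trick generalizes. Here is a minimal sketch of it as a reusable helper. The names lazyModule and FakeParser are mine, not from NochBot or unpdf, and the fake parser only stands in for a real module so the snippet runs on its own.

```typescript
// Hypothetical helper: defers loading a module until first use, then caches it.
function lazyModule<T>(load: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (cached === undefined) {
      // If loading fails, clear the cache so a later call can retry.
      cached = load().catch((err: unknown) => {
        cached = undefined;
        throw err;
      });
    }
    return cached;
  };
}

// Stand-in for a parser module (an assumption for this sketch).
// In a real app the loader would be: () => import('unpdf')
type FakeParser = {
  extractText: (data: Uint8Array) => Promise<{ text: string }>;
};

let loadCount = 0;
const getParser = lazyModule<FakeParser>(async () => {
  loadCount += 1; // counts how often the "module" is actually loaded
  return {
    extractText: async (data: Uint8Array) => ({ text: `${data.length} bytes` }),
  };
});
```

Nothing loads until getParser() is first called, and repeated calls share one import. That keeps browser-only globals out of the module's top level, which is where the DOMMatrix error came from.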
Obstacle 2 — Chunking Was Killing Retrieval
First version chunked on fixed word count.
300 words. Clean. Simple.
But retrieval was terrible.
A heading would end up in one chunk.
Its explanation in the next.
The LLM had zero context to work with.
The fix — paragraph boundary chunking
with overlap:
export function chunkText(markdown: string): string[] {
  // Paragraphs (blank-line separated) are the smallest unit kept intact.
  const blocks = markdown
    .split(/\n\n+/)
    .map(b => b.trim())
    .filter(Boolean);
  const chunks: string[] = [];
  let currentWords: string[] = [];
  let currentBlocks: string[] = [];
  const wordsPerChunk = 250;
  const overlapWords = 25;
  for (const block of blocks) {
    const blockWords = block.split(/\s+/).filter(Boolean);
    // Flush before this block would push the chunk past the word budget.
    if (currentWords.length + blockWords.length > wordsPerChunk
        && currentWords.length > 0) {
      chunks.push(currentBlocks.join('\n\n'));
      // Seed the next chunk with the tail of the one just emitted.
      const overlap = currentWords.slice(-overlapWords);
      currentWords = [...overlap];
      currentBlocks = overlap.length ? [overlap.join(' ')] : [];
    }
    currentWords.push(...blockWords);
    currentBlocks.push(block);
  }
  if (currentBlocks.length) chunks.push(currentBlocks.join('\n\n'));
  return chunks;
}
Split at paragraph boundaries.
The word count is only a size budget, not a cut point.
25-word overlap carries context across
chunk boundaries.
Retrieval quality jumped immediately.
Lesson: How you break data apart
matters as much as how you store it.
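One gap worth knowing about: a single paragraph longer than the word budget is never split by the function above, so it becomes one oversized chunk. A possible pre-processing step is sketched below. splitLongBlock is a hypothetical helper of mine, not part of the original code.

```typescript
// Hypothetical pre-processing step: break any paragraph longer than
// maxWords into consecutive pieces of at most maxWords words each.
function splitLongBlock(block: string, maxWords: number): string[] {
  if (maxWords <= 0) throw new RangeError('maxWords must be positive');
  const words = block.split(/\s+/).filter(Boolean);
  const pieces: string[] = [];
  for (let i = 0; i < words.length; i += maxWords) {
    pieces.push(words.slice(i, i + maxWords).join(' '));
  }
  return pieces;
}

// Example: a synthetic 600-word paragraph split with a 250-word budget.
const longParagraph = Array.from({ length: 600 }, (_, i) => `w${i}`).join(' ');
const pieces = splitLongBlock(longParagraph, 250);
const pieceSizes = pieces.map(p => p.split(' ').length); // [250, 250, 100]
```

Assuming you want this behavior, it would slot in right after the blocks are built, for example blocks.flatMap(b => splitLongBlock(b, wordsPerChunk)). Short paragraphs pass through unchanged.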
Obstacle 3 — Blank UI Despite Working Backend
Vectorization was working perfectly.
Data storing in Qdrant — confirmed.
But the Vectorize tab in dashboard?
Completely blank.
Spent 2 hours convinced it was
a backend issue.
It was a useState that wasn't updating.
The API was returning data correctly.
The component was receiving it.
But the state update was never
triggering a re-render.
// Wrong: reassigning the variable bypasses React
const data = await res.json();
vectorizeData = data; // ❌ no re-render
// Correct: go through the setter
const data = await res.json();
setVectorizeData(data); // ✅ triggers re-render
One line fix. 2 hours wasted.
Lesson: When backend works but
UI doesn't — check state management first.
Always.
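Why the setter matters, as a toy model. This is not how React is implemented. It is a tiny hypothetical container (createState is my name) that shows why plain assignment can't notify anyone, while going through a setter can.

```typescript
// Toy state container, for illustration only.
function createState<T>(initial: T) {
  let value = initial;
  const listeners = new Set<() => void>();
  return {
    get: (): T => value,
    // Going through set() gives the container a chance to notify.
    set: (next: T): void => {
      value = next;
      listeners.forEach(listener => listener());
    },
    subscribe: (listener: () => void): (() => void) => {
      listeners.add(listener);
      return () => {
        listeners.delete(listener);
      };
    },
  };
}

let renderCount = 0;
const vectorizeState = createState<string[]>([]);
vectorizeState.subscribe(() => {
  renderCount += 1; // stands in for a re-render
});

// Reassigning a local copy changes nothing the container can observe.
let localCopy = vectorizeState.get();
localCopy = ['doc-1'];
const afterAssignment = { local: localCopy.length, renders: renderCount };

// Going through the setter notifies the subscriber.
vectorizeState.set(['doc-1']);
```

After the plain assignment the subscriber has not run. After set() it has run once. Same idea as setVectorizeData versus assigning to the variable.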
Obstacle 4 — LLM Overthinking Simple Questions
Asked the bot:
"How many pages are in this PDF?"
Got back:
"Step 1 — Analyze the PDF context...
Step 2 — Identify key concepts...
Step 3 — Determine the answer..."
300 words. For a 2-word answer.
Two fixes:
Fix 1: Prompt. Added this instruction:
For simple factual questions,
answer in 1-2 sentences ONLY.
NEVER show reasoning steps.
Just give the final answer.
Fix 2: Temperature.
// Before
temperature: 0.6 // too creative for facts
// After
temperature: 0.2 // precise and direct
Both fixes together eliminated
the overthinking completely.
Lesson: If your LLM output is bad,
check your prompt and sampling settings first.
It's usually not the model.
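Both fixes in one place, as a rough sketch. It assumes an OpenAI-style chat payload (model, temperature, messages). buildChatRequest, CONCISE_RULE, the model id, and the sample context are all placeholders of mine, so check them against your provider before copying.

```typescript
type ChatMessage = { role: 'system' | 'user'; content: string };
type ChatRequest = {
  model: string;
  temperature: number;
  messages: ChatMessage[];
};

// Wording adapted from the prompt instruction described above.
const CONCISE_RULE =
  'For simple factual questions, answer in 1-2 sentences ONLY. ' +
  'NEVER show reasoning steps. Just give the final answer.';

// Hypothetical builder: pairs the concise-answer rule with a low temperature.
function buildChatRequest(question: string, context: string): ChatRequest {
  return {
    model: 'llama-3.3-70b-versatile', // assumed model id, verify with your provider
    temperature: 0.2, // low temperature for factual answers
    messages: [
      { role: 'system', content: `${CONCISE_RULE}\n\nContext:\n${context}` },
      { role: 'user', content: question },
    ],
  };
}

// Example with made-up context; a page count would have to be passed in
// as metadata, since it does not appear in the text chunks themselves.
const request = buildChatRequest(
  'How many pages are in this PDF?',
  'pageCount: 12'
);
```

Keeping the rule and the temperature in one builder means every request gets both, instead of relying on each call site to remember them.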
Obstacle 5 — PDFs Disappearing On Navigation
Users navigating away from the PDF page
and coming back — their uploaded PDFs
were visually gone.
But only visually. Data was still in the DB.
Root cause: Next.js App Router was
caching the page. Component wasn't
remounting. The fetch wasn't re-triggering.
Fix — visibilitychange event listener:
useEffect(() => {
const handleVisibilityChange = () => {
if (document.visibilityState === 'visible') {
fetchPdfs(); // re-fetch when tab becomes active
}
};
document.addEventListener(
'visibilitychange',
handleVisibilityChange
);
return () => document.removeEventListener(
'visibilitychange',
handleVisibilityChange
);
}, []);
Every time the tab becomes visible again,
re-fetch the PDF list.
One caveat: visibilitychange fires on tab switches
and minimize/restore, not on in-app route changes.
Keep a fetch on mount as well.
Lesson: Next.js App Router caching
will catch you off guard.
Always handle the visibilitychange event
for data that needs to stay fresh.
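Here's a framework-free sketch of the same idea that also fetches once up front. refetchWhenVisible and VisibilitySource are hypothetical names of mine. The small interface exists so the helper can be tried with a fake object outside the browser.

```typescript
// Minimal slice of the document API this helper needs.
interface VisibilitySource {
  visibilityState: string;
  addEventListener(type: 'visibilitychange', listener: () => void): void;
  removeEventListener(type: 'visibilitychange', listener: () => void): void;
}

// Fetches once immediately, then again whenever the source becomes visible.
// Returns a cleanup function that removes the listener.
function refetchWhenVisible(
  source: VisibilitySource,
  refetch: () => void
): () => void {
  const handler = () => {
    if (source.visibilityState === 'visible') refetch();
  };
  refetch(); // initial load
  source.addEventListener('visibilitychange', handler);
  return () => source.removeEventListener('visibilitychange', handler);
}

// Fake source standing in for `document` in this sketch.
const fakeListeners = new Set<() => void>();
const fakeDocument: VisibilitySource = {
  visibilityState: 'hidden',
  addEventListener: (_type, listener) => {
    fakeListeners.add(listener);
  },
  removeEventListener: (_type, listener) => {
    fakeListeners.delete(listener);
  },
};

let fetchCount = 0;
const cleanup = refetchWhenVisible(fakeDocument, () => {
  fetchCount += 1;
});
```

In a component, the intent is to return the cleanup from useEffect, passing the browser's document and your fetchPdfs function. I have only exercised the helper against the fake above.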
The Full Stack
Here's everything I used to build this:
| Layer | Technology |
|---|---|
| Framework | Next.js 15 + Turbopack |
| PDF Parsing | unpdf |
| Embeddings | Google Gemini text-embedding-004 |
| Vector DB | Qdrant Cloud |
| LLM | Groq LLaMA 3.3 70B |
| Database | MongoDB Atlas |
| Deployment | Vercel |
What's Live Now
All of this is running in production
inside NochBot —
a multi-tenant AI chatbot SaaS.
Features shipped:
- PDF upload + vectorization
- Shared chat links per PDF
- Session analytics — see every conversation
- Dual mode — study assistant + sales catalog bot
- Stop generation mid-response
- Free plan: 1 PDF limit
TL;DR
- Use unpdf, not pdf-parse, in Next.js 15
- Chunk on paragraph boundaries, not word count
- When UI is blank, check useState first
- Low temperature + explicit prompts = no overthinking
- Use visibilitychange to handle App Router caching
If you're building RAG — save this.
These are the exact walls you're going to hit.
Building NochBot solo — documenting
everything as I go.
Follow for more real-world RAG + Next.js content.
Top comments (1)
The paragraph boundary fix landed for me on a trade news RAG. Hard word-count splits broke entity resolution because company names and metrics would land in separate chunks. 25-word overlap caught most of those edge cases.