longtermemory

How I built an AI RAG system to convert PDF to Q&As

Turning a PDF into a set of spaced repetition flashcards sounds like a single step. Upload file, get questions. In practice it is five distinct engineering stages, each with its own failure modes, and at least two of them surprised me in ways I did not anticipate when I first sketched the design.

If you want to see the end result in action, try it at LongTermMemory before reading the rest.

This post walks through the full document processing pipeline behind LongTermMemory: how text gets extracted from different file formats, how it gets split into semantically coherent chunks, how those chunks land in a vector database, and how retrieval augmented generation makes the final questions better than they would be if each chunk were processed in isolation.

The Architecture in Brief

The backend is split across two services. Laravel handles authentication, project management, file uploads to MinIO, and persisting the generated question and answer pairs to MySQL. A separate FastAPI service, written in Python, owns everything that requires AI: document processing, embedding generation, vector storage, and question generation.

The two services communicate asynchronously. Laravel dispatches a job to Celery via Redis. The FastAPI worker picks it up, runs the full pipeline, and POSTs the results back to a Laravel webhook. Laravel stores them and notifies the frontend via Server Sent Events that the job is done.

The pipeline itself has five stages:

  1. Download the document from MinIO
  2. Extract text, format by format
  3. Semantic chunking
  4. Embed and store vectors in Qdrant
  5. Generate question and answer pairs with retrieval context

Stage 1: Getting Text Out of the Document

The service supports PDF, DOCX, and XLSX. Each format has its own extraction path.

DOCX and XLSX are straightforward. python-docx walks paragraphs and tables; openpyxl iterates sheets and rows. The output is joined with double newlines to preserve paragraph structure.

PDF is where the interesting engineering happens. The naive approach, calling page.get_text() directly, produces incoherent output for any document with a two column layout. Academic papers, textbooks, and journal articles all share this format, and naive extraction interleaves the columns instead of reading each one from top to bottom.

The fix is block based extraction with column detection. The algorithm sorts text blocks horizontally, detects gaps wider than 10% of the page width as column boundaries, then sorts each column vertically. Column one top to bottom, then column two top to bottom. Reading order preserved.
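The ordering logic can be sketched as a pure function over text blocks, assuming blocks shaped like the `(x0, y0, x1, y1, text)` tuples that PyMuPDF's `page.get_text("blocks")` returns. Names and the gap handling here are illustrative, not the production code:

```python
def order_blocks(blocks, page_width, gap_ratio=0.10):
    """Sort text blocks into reading order: column by column, top to bottom.
    Blocks are (x0, y0, x1, y1, text) tuples."""
    if not blocks:
        return []
    # Sort horizontally by left edge so we can scan for column gaps.
    by_x = sorted(blocks, key=lambda b: b[0])
    columns = [[by_x[0]]]
    for block in by_x[1:]:
        prev_right = max(b[2] for b in columns[-1])
        # A horizontal gap wider than 10% of the page width starts a new column.
        if block[0] - prev_right > gap_ratio * page_width:
            columns.append([block])
        else:
            columns[-1].append(block)
    # Within each column, read top to bottom.
    ordered = []
    for col in columns:
        ordered.extend(sorted(col, key=lambda b: b[1]))
    return ordered
```

Feeding a shuffled two column page through this returns column one's blocks top to bottom, then column two's.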

There is also a metadata filter that strips running headers, footers, and page numbers without touching content. The heuristic checks for short numeric text, Roman numeral patterns, and short text appearing in the top or bottom 5% of the page. The 5% margin was chosen carefully: a wider margin starts clipping section headings, which are real content.
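The filter reduces to a small predicate. The 60 character cutoff for "short text" and the exact regexes are assumptions for illustration; the 5% margin and the pattern categories come from the description above:

```python
import re

def is_page_furniture(text, y0, y1, page_height, margin=0.05):
    """Heuristic check for running headers, footers, and page numbers.
    Only text inside the top or bottom 5% of the page is ever stripped."""
    stripped = text.strip()
    in_margin = y1 < margin * page_height or y0 > (1 - margin) * page_height
    if not in_margin:
        return False  # never touch content in the body of the page
    if re.fullmatch(r"\d{1,4}", stripped):
        return True   # bare page number
    if re.fullmatch(r"[ivxlcdm]+", stripped.lower()):
        return True   # Roman numeral page number
    # Short text hugging the page edge is likely a running header or footer.
    return len(stripped) < 60
```

Because the margin test runs first, a long section heading near the top of a page sails through untouched.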

I initially used default text extraction and did not notice the problem until I tested with a biology textbook. Half the generated questions were incoherent because the extraction had interleaved columns. I assumed this would be an edge case. It turned out to affect every academic PDF in the test set.

Stage 2: Semantic Chunking

Raw text goes into a two stage chunking pipeline using LlamaIndex.

The first stage uses SentenceSplitter with paragraph boundaries (\n\n) to produce structurally coherent initial chunks. The second stage uses SemanticSplitterNodeParser to merge or split those chunks based on embedding similarity. Semantically related sentences stay together; topic shifts become chunk boundaries.

The breakpoint_percentile_threshold is set to 95. This means the splitter only creates a new boundary when the cosine dissimilarity between adjacent sentence groups exceeds the 95th percentile of all dissimilarities in the document. The setting is conservative on purpose: splitting too aggressively produces small, context poor chunks that generate weak questions.
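To make the percentile mechanic concrete, here is an illustrative reimplementation of the breakpoint idea, not LlamaIndex's internals: compute cosine dissimilarity between each pair of adjacent sentence group embeddings, then split only where the value reaches the 95th percentile of all pairs in the document:

```python
import math

def cosine_dissimilarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def breakpoints(embeddings, percentile=95):
    """Indices where a new chunk starts: adjacent dissimilarity at or
    above the given percentile of all adjacent dissimilarities."""
    dissims = [cosine_dissimilarity(a, b)
               for a, b in zip(embeddings, embeddings[1:])]
    cutoff_index = min(len(dissims) - 1, int(len(dissims) * percentile / 100))
    cutoff = sorted(dissims)[cutoff_index]
    return [i + 1 for i, d in enumerate(dissims) if d >= cutoff]
```

With six sentence groups where only one adjacent pair shifts topic, a 95th percentile cutoff yields exactly one boundary at that shift.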

For long documents, the parameters shift dynamically. A document that exceeds a configurable token threshold gets larger initial chunk sizes and a higher breakpoint threshold. This reduces the number of embedding API calls at the cost of slightly less granular questions. For a 300 page textbook, a small reduction in granularity is a reasonable tradeoff to keep costs under control.

One thing worth noting: semantic chunking is expensive for large documents. The second stage embeds every sentence group to compute similarity scores. For a long document, that is a significant number of API calls before question generation even starts. A simpler paragraph based split with fixed sizes would be faster. The tradeoff is real and scales with document length.

Stage 3: Section Titles and Vector Storage

Each chunk needs a section title so the language model knows what it is reading. The approach has two priority levels.

Priority one is structural headings. The parser looks for headings in node metadata and in the first line of each chunk. Numbered section patterns, common keywords like Introduction or Conclusion, or title case short lines all qualify.

Priority two is LLM generated titles. When no structural heading is found, GPT-3.5-turbo generates a title from the first 500 characters of the chunk. The prompt is tight: 10 words maximum, no preamble, just the title. Temperature is 0.3 to keep output consistent across runs.
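The priority one check can be sketched as follows. The keyword set, the 8 word limit for title case lines, and the exact numbered pattern regex are assumptions standing in for the real parser, which also inspects node metadata:

```python
import re

SECTION_KEYWORDS = {"introduction", "conclusion", "summary", "abstract",
                    "references", "methods", "results", "discussion"}

def structural_heading(first_line):
    """Return the heading if the chunk's first line looks like one, else None."""
    line = first_line.strip()
    if not line or len(line) > 80:
        return None
    if re.match(r"^\d+(\.\d+)*\.?\s+\S", line):   # e.g. "3.2 Membrane Transport"
        return line
    if line.lower().rstrip(":") in SECTION_KEYWORDS:
        return line
    words = line.split()
    # Short title case lines ("Cellular Respiration") qualify as headings.
    if len(words) <= 8 and all(w[0].isupper() for w in words if w[0].isalpha()):
        return line
    return None
```

Only chunks where this returns `None` trigger the priority two LLM call, which keeps title generation costs proportional to how poorly structured the document is.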

At the end of processing, the service logs a breakdown: how many chunks got structural headings, how many got generated titles, and how many ended up with neither. That last number is an error condition, not a warning.

Each chunk is then embedded using text-embedding-3-small in batches of 25. Embeddings land in Qdrant, scoped per project. A collection named project_42 holds all vectors for project 42. The chunk text is stored directly in the vector payload alongside metadata: user ID, project ID, document ID, chunk index, section title, and page number. Storing the full text in the payload avoids a round trip back to the database during retrieval.
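The batching and the payload shape look roughly like this. Field names mirror the metadata listed above; the dict layout itself is an illustrative sketch, not the service's exact schema:

```python
def chunk_batches(chunks, batch_size=25):
    """Yield chunks in batches of 25 for the embedding API."""
    for i in range(0, len(chunks), batch_size):
        yield chunks[i:i + batch_size]

def point_payload(chunk, user_id, project_id, document_id, index):
    """Payload stored alongside each vector in Qdrant. Keeping the full
    chunk text here avoids a database round trip during retrieval."""
    return {
        "text": chunk["text"],
        "user_id": user_id,
        "project_id": project_id,
        "document_id": document_id,
        "chunk_index": index,
        "section_title": chunk.get("section_title"),
        "page_number": chunk.get("page_number"),
    }
```

The per project collection name is then just `f"project_{project_id}"`, so deleting a project is a single collection drop rather than a filtered delete.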

Stage 4: RAG Context Enrichment

This is the part I most underestimated.

The naive approach is to generate one question and answer pair per chunk in isolation. That works, but it produces questions that miss connections across the document. A question about the role of mitochondria is weaker if the model does not know the same document has a section on ATP synthesis two chunks away.

The fix is to retrieve semantically related chunks before generating each question. For every chunk, the service embeds the chunk text, queries Qdrant for the top similar chunks, filters out the chunk itself and any result below a similarity score of 0.7, and passes the remaining related chunks into the generation prompt as supplementary context.

The 0.7 threshold was tuned empirically. Below it, unrelated chunks start appearing in the context. Above 0.8, too many chunks find no related context at all.
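The filtering step is simple once the Qdrant results are in hand. The cap of three related chunks is an assumption for illustration; the 0.7 floor and the self exclusion come straight from the description above:

```python
SIMILARITY_THRESHOLD = 0.7

def related_context(results, current_chunk_id,
                    threshold=SIMILARITY_THRESHOLD, limit=3):
    """Filter vector search results into supplementary context:
    drop the chunk itself and anything below the similarity floor."""
    related = [r for r in results
               if r["id"] != current_chunk_id and r["score"] >= threshold]
    return related[:limit]
```

Since the query vector is the chunk's own embedding, the chunk always comes back as its own top hit with a near perfect score, which is why the identity filter is not optional.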

The generation prompt assembles five sections in order: the user's learning goals if they provided any, document context including title and section, the primary chunk, the related chunks with their similarity scores, and the base system instructions.
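A sketch of that assembly, with illustrative labels and wording rather than the production prompt text:

```python
def build_prompt(chunk, related, doc_title, section, learning_goals=None):
    """Assemble the five prompt sections in order: goals, document
    context, primary chunk, related chunks, base instructions."""
    parts = []
    if learning_goals:
        parts.append(f"Learner's goals:\n{learning_goals}")
    parts.append(f"Document: {doc_title}\nSection: {section}")
    parts.append(f"Primary passage:\n{chunk}")
    if related:
        lines = [f"(similarity {r['score']:.2f}) {r['text']}" for r in related]
        parts.append("Related passages from the same document:\n" + "\n".join(lines))
    parts.append("Generate one question and answer pair grounded in the primary passage.")
    return "\n\n".join(parts)
```

Putting the primary chunk before the related ones matters: the instructions tell the model to ground the question in the primary passage and use the rest only as supporting context.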

User notes are sanitized before insertion. The service runs a validation function that checks for prompt injection patterns, truncates to 2000 characters, and wraps the content in delimiters so the model is told to treat only the content between USER_NOTES_START and USER_NOTES_END as user input. It is a partial mitigation, not a complete solution, but it covers the common cases.
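A minimal version of that validation, assuming rejection on a pattern match as the failure behavior and using a deliberately small illustrative pattern list:

```python
import re

INJECTION_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",
    r"you are now",
    r"system prompt",
]  # illustrative subset; a real list would be longer

def sanitize_user_notes(notes, max_len=2000):
    """Validate and wrap user notes before prompt insertion.
    A partial mitigation: pattern check, truncation, delimiting."""
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, notes, re.IGNORECASE):
            raise ValueError("user notes rejected: possible prompt injection")
    clipped = notes[:max_len]
    return f"USER_NOTES_START\n{clipped}\nUSER_NOTES_END"
```

The delimiters only help if the system prompt also instructs the model to treat everything between them as untrusted data, which is what the base instructions do.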

The model responds with structured JSON: question, answer, key concepts, and difficulty level. Key concepts and difficulty are stored and surfaced in the study UI so users can see what each card tests and filter by difficulty.

Async Orchestration and Progress Tracking

The pipeline is fully async. The Celery worker updates a Redis key at each stage, from 40% at text extraction through to 95% when costs are being calculated. The frontend polls a status endpoint and renders a live progress bar.
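The update itself is a one liner per stage. The key name and the intermediate percentages below are assumptions; only the 40% extraction and 95% cost calculation marks come from the description above:

```python
STAGE_PROGRESS = {
    "downloading": 10,
    "extracting_text": 40,
    "chunking": 55,
    "embedding": 70,
    "generating_questions": 85,
    "calculating_costs": 95,
}

def set_progress(redis_client, job_id, stage):
    """Write the current stage's percentage to Redis; the status
    endpoint the frontend polls reads this same key."""
    percent = STAGE_PROGRESS[stage]
    redis_client.set(f"job:{job_id}:progress", percent)
    return percent
```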

Progress tracking through Redis is worth the complexity. Early versions had no feedback: users saw a spinner and waited. Adding per stage updates and a Server Sent Events endpoint on the Laravel side was about a day of work. Users tolerate a three minute wait much better when they can see the pipeline moving than when the UI is silent.

Worker configuration also matters. worker_prefetch_multiplier=1 ensures one task at a time: document processing is memory intensive and running tasks in parallel on the same worker causes out of memory kills. worker_max_tasks_per_child=100 recycles the worker process after 100 tasks to prevent memory accumulation from PyMuPDF and openpyxl state.
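In the worker's configuration module, those two settings look like this (the app name and broker URL are placeholders):

```python
from celery import Celery

app = Celery("pipeline", broker="redis://localhost:6379/0")

# One task at a time: document processing is memory intensive, and
# prefetching a second task risks out of memory kills.
app.conf.worker_prefetch_multiplier = 1

# Recycle the worker process after 100 tasks to shed memory
# accumulated by PyMuPDF and openpyxl state.
app.conf.worker_max_tasks_per_child = 100
```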

What I Would Do Differently

Track real API costs from the OpenAI response object rather than estimating. The estimate tracks within 10% of actual cost, but reading response.usage.prompt_tokens and response.usage.completion_tokens from every call would remove the approximation entirely. It is about 10 lines of code.
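The replacement is as small as it sounds. The per million token prices below are hypothetical placeholders, not current OpenAI pricing; the point is reading the reported token counts instead of estimating:

```python
# Hypothetical prices per million tokens; look up current values
# rather than trusting these constants.
PRICES = {"gpt-3.5-turbo": {"prompt": 0.50, "completion": 1.50}}

def call_cost(model, prompt_tokens, completion_tokens):
    """Exact cost of one call, computed from the token counts the API
    reports in response.usage rather than a local estimate."""
    p = PRICES[model]
    return (prompt_tokens * p["prompt"]
            + completion_tokens * p["completion"]) / 1_000_000
```

Summing this over every call in a job gives an exact total instead of a figure that tracks within 10%.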

Move section title generation into an async batch after chunking completes. Currently it runs synchronously inside the chunking stage, which serializes title generation across all chunks. Running it as a parallel batch would reduce overall latency.

Populate the page number field consistently across all extraction paths. The data model has the field; it is just not reliably filled in for every format. Surfacing the source page in the UI so users can look up the original passage would be a straightforward quality of life improvement.

Closing Thought

The pipeline from upload to flashcard is less visible than the spaced repetition scheduler, but it is more load bearing. If the questions are poor, no scheduling algorithm fixes them. Extraction quality, chunking strategy, and context enrichment all directly determine what ends up in the study queue.

Try the full pipeline yourself at LongTermMemory and see what it generates from your own materials.
