Building an Offline Document Search Engine for My University

#python #devops #programming #opensource

Hi everyone,

I'm Yiğit, a third-year software engineering student. If you have ever tried to find a specific rule about grading, attendance, or academic calendars in university regulation PDFs, you know how frustrating it can be. Information is scattered across dozens of poorly formatted documents, making it almost impossible to find quick answers.

To solve this, I built FiratUniversityChatbot, an open-source, fully offline Turkish question-answering and document search assistant tailored for Fırat University.

The Core Idea

Instead of relying on heavy cloud-based LLMs that might hallucinate academic rules, I wanted a fast, deterministic, and fully local system. The goal was simple: users ask a question, and the app scans the local PDFs and returns the exact snippet along with its source page number. If the answer isn't in the documents, it politely declines instead of guessing, so users never get an invented rule.
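To make the "decline instead of guessing" behavior concrete, here is a minimal sketch. It assumes each search hit carries a relevance score and that a fixed cutoff decides whether the best hit is trustworthy. The `Hit` class, the `answer` function, and the `MIN_SCORE` value are hypothetical names for illustration, not the project's actual code.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Hit:
    """One retrieved passage (hypothetical structure for illustration)."""
    snippet: str
    source: str
    page: int
    score: float


# Hypothetical cutoff; a real value would have to be tuned on real queries.
MIN_SCORE = 2.0


def answer(hits: list[Hit], min_score: float = MIN_SCORE) -> Optional[Hit]:
    """Return the best hit, or None when no hit reaches the cutoff."""
    if not hits:
        return None
    best = max(hits, key=lambda h: h.score)
    return best if best.score >= min_score else None
```

When `answer` returns `None`, the interface can show a "not found in the documents" message instead of a low-confidence snippet.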

The Tech Stack

To keep the application lightweight, secure, and robust without needing an internet connection, I went with a pure Information Retrieval approach:

Backend & API: Python and FastAPI for high performance.
Document Processing: pdfplumber to handle the nightmare of university PDFs. The app dynamically detects single or dual columns, filters out headers/footers, and accurately assembles text blocks.
Search Engine: A custom-built BM25 index tailored to Turkish. I implemented ASCII normalization, tokenization, synonym expansion (e.g., treating "büt" and "bütünleme" as the same term), and bigram matching.
Frontend: A minimal, dependency-free chat interface using HTML, CSS, and Jinja2 templates.
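The search side of that stack can be sketched in a few dozen lines. This is a simplified illustration, not the project's implementation: the synonym table, the class name `BM25`, and the parameters `k1=1.5` and `b=0.75` (common defaults) are assumptions, and bigram matching is left out to keep it short.

```python
import math
import re
from collections import Counter

# Fold Turkish letters to ASCII so "bütünleme" and "butunleme" match.
TR_TO_ASCII = str.maketrans("çğıöşüÇĞİÖŞÜ", "cgiosuCGIOSU")

# Hypothetical synonym table: short forms map to one canonical token.
SYNONYMS = {"but": "butunleme"}


def tokenize(text):
    """ASCII-fold, lowercase, split into tokens, then apply synonyms."""
    folded = text.translate(TR_TO_ASCII).lower()
    tokens = re.findall(r"[a-z0-9]+", folded)
    return [SYNONYMS.get(t, t) for t in tokens]


class BM25:
    """Tiny in-memory Okapi BM25 index over a non-empty list of passages."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.tfs = [Counter(d) for d in self.docs]
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        n = len(self.docs)
        # Document frequency: in how many passages each token appears.
        df = Counter(t for d in self.docs for t in set(d))
        self.idf = {
            t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()
        }

    def score(self, query, i):
        tf, dl = self.tfs[i], len(self.docs[i])
        total = 0.0
        for t in tokenize(query):
            if t not in tf:
                continue
            norm = tf[t] + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[t] * tf[t] * (self.k1 + 1) / norm
        return total

    def search(self, query):
        """Return (score, passage_index) pairs, best first, zero scores dropped."""
        scored = [(self.score(query, i), i) for i in range(len(self.docs))]
        return sorted(((s, i) for s, i in scored if s > 0), reverse=True)
```

Folding before lowercasing matters here: Python's `"İ".lower()` yields an "i" plus a combining dot, which would break exact token matches.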

The Biggest Challenge

The hardest part was definitely the data extraction and text pipeline. University PDFs are notoriously messy. Building a fallback strategy that can accurately crop dual columns without mixing up paragraphs was a headache. Additionally, tuning the BM25 ranking with domain-aware tweaks (like intent flags for "pass grade" or "appeals") took a lot of trial and error to get right.
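As a rough illustration of the column problem, here is one possible heuristic, not necessarily the one the project uses. It assumes word boxes shaped like the dicts returned by pdfplumber's `page.extract_words()` (with `x0` and `x1` keys) and treats a page as two-column when almost no words touch a narrow band around the vertical center. The function names and both thresholds are hypothetical.

```python
def is_two_column(words, left_edge, right_edge,
                  gutter_ratio=0.04, max_crossing=0.02):
    """Guess whether a page has two columns from its word boxes.

    In single-column text, roughly one word per line overlaps the page
    center; in a two-column layout the center gutter is nearly empty.
    """
    if not words:
        return False
    width = right_edge - left_edge
    mid = left_edge + width / 2
    half_gutter = width * gutter_ratio / 2
    lo, hi = mid - half_gutter, mid + half_gutter
    crossing = sum(1 for w in words if w["x0"] < hi and w["x1"] > lo)
    left = sum(1 for w in words if w["x1"] <= lo)
    right = sum(1 for w in words if w["x0"] >= hi)
    return crossing / len(words) <= max_crossing and left > 0 and right > 0


def extract_page_text(page):
    """Extract text from a pdfplumber page, reading the left column first."""
    x0, top, x1, bottom = page.bbox
    if not is_two_column(page.extract_words(), x0, x1):
        return page.extract_text() or ""
    mid = (x0 + x1) / 2
    halves = (
        page.crop((x0, top, mid, bottom)),
        page.crop((mid, top, x1, bottom)),
    )
    return "\n".join(half.extract_text() or "" for half in halves)
```

With pdfplumber installed, `extract_page_text` could be called on each page from `pdfplumber.open(path).pages`. A real pipeline would still need header and footer filtering, plus a fallback for pages that mix full-width headings with two-column bodies.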

Open for Feedback

The project is completely open-source and ready to be tested. You can run it locally in a virtual environment or deploy it easily via Docker (it's currently live on Hugging Face Spaces too).

If you are interested in search engines, document parsing, or building fast Python applications, I would love for you to check it out.

Repository: FiratUniversityChatbot on GitHub

I'm very open to code reviews, architectural feedback, and pull requests. Have you ever built a local search tool for your school or company? Let's discuss in the comments!
