Need Tokenizer Logic

I’m working on an AI-powered chatbot for a bank that gives employees quick answers to policy-related queries, such as the Hajj Leave Policy or the Special Occasion Leave Policy under the broader HR guidelines. The policies are stored in PDF documents, each containing multiple categories, which brings two challenges:

Dynamic Content: Policy documents are updated every 3–4 months, making it impractical to store policy details in a static database.
No Third-Party Services: Due to security restrictions in banking, we cannot use external NLP or OCR services.

I’m looking for a tokenizer solution that can segment each distinct policy within a PDF. The goal is to train a model that classifies and retrieves policies based on the categories (such as leave types) the chatbot identifies. Ideally, the tokenizer would segment each policy accurately without manual labelling every time a document is updated.
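
To make the segmentation step concrete, here is a minimal sketch of one rule-based approach that runs fully in-house, using only the Python standard library. It rests on assumptions that are not confirmed by the actual documents: that the PDF text has already been extracted to a plain string by a locally run tool, and that each policy starts with a short standalone heading line ending in the word "Policy", optionally preceded by a section number. The names `HEADING_RE` and `segment_policies` are hypothetical, and the sample text is invented for illustration.

```python
import re

# Assumption: a policy heading is a standalone line such as
# "1. Hajj Leave Policy" or "Special Occasion Leave Policy".
HEADING_RE = re.compile(
    r"^\s*(?:\d+(?:\.\d+)*\.?\s+)?([A-Z][A-Za-z&/\- ]{2,80}\bPolicy)\s*$"
)


def segment_policies(text: str) -> dict[str, str]:
    """Split extracted text into {policy title: policy body}."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            # A new heading starts a new segment.
            current = match.group(1).strip()
            sections.setdefault(current, [])
        elif current is not None:
            # Lines before the first heading are ignored.
            sections[current].append(line)
    return {title: "\n".join(body).strip() for title, body in sections.items()}


# Invented sample text standing in for extracted PDF content.
sample = """HR Guidelines
1. Hajj Leave Policy
Employees may take leave once.
Applies after three years.

2. Special Occasion Leave Policy
Up to three days per event.
"""

for title, body in segment_policies(sample).items():
    print(title, "->", len(body.splitlines()), "line(s)")
```

Because the segment titles come from the document itself, nothing has to be relabelled when a policy PDF is updated, as long as the heading convention stays the same. If the real headings follow a different pattern, only the regular expression would need to change.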

Does anyone have experience designing an in-house tokenizer that can handle unstructured PDF text and dynamically extract categorized information in such a setup? Any suggestions, tools, or techniques would be invaluable.

DEV Community
