Ayabonga Qwabi

Posted on Aug 3

Turn History PDF Books into AI-Ready Q&A Datasets with This Python Tool!

#machinelearning #python #ai

Hey guys! I’m thrilled to share a Python tool I’ve built that transforms history books (in PDF format) into structured Q&A datasets, perfect for fine-tuning AI models. Whether you’re an AI researcher, a history enthusiast, or a data scientist, this tool makes it easy to generate high-quality datasets from historical texts, and it’s flexible enough to work with any PDF book! 🚀

What Does It Do?

The History Book to Dataset Generator uses natural language processing and local AI models (via Ollama) to extract text from PDFs, chunk it intelligently, and generate contextual Q&A pairs. The output is a JSONL file ready for fine-tuning models like Llama 3.1 or Mistral. Plus, it’s packed with features to streamline the process and ensure quality.
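
To give a feel for the chunking step, here is a simplified, standalone sketch. It is illustrative rather than the exact code in the repo: the function name `chunk_text` and the `max_chars` limit are made up for this example, and a plain regex stands in for proper NLP sentence detection.

```python
import re


def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
    """Group whole sentences into chunks of roughly max_chars characters."""
    # Collapse the stray newlines and repeated spaces that PDF extraction leaves behind.
    cleaned = re.sub(r"\s+", " ", text).strip()
    # Naive sentence split: break on the space that follows ., ! or ?
    sentences = re.split(r"(?<=[.!?]) ", cleaned)

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    # Note: a single sentence longer than max_chars is kept whole, not cut.
    return chunks
```

Keeping sentences intact means each chunk handed to the model is a readable passage rather than text cut off mid-thought.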

Key Features:

PDF Processing: Extracts and chunks text from PDF files for efficient processing.
AI-Powered Q&A: Generates contextual questions and answers using Ollama’s local AI models.
Historical Focus: Filters content for historical figures, events, and cultural practices using customizable keywords.
Parallel Processing: Speeds up dataset creation with multi-threaded processing.
Resume Capability: Checkpoints let you pick up where you left off if interrupted.
Deduplication: Ensures dataset quality by removing similar Q&A pairs.
Customizable: Adaptable for any PDF book by tweaking keywords in a simple keywords.txt file.
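
As an illustration of the deduplication idea, here is a small sketch that drops Q&A pairs whose questions are nearly identical. It is a stand-in, not the tool's actual method: it uses `difflib` from the standard library, and the `deduplicate` name and the 0.9 threshold are assumptions made for this example.

```python
from difflib import SequenceMatcher


def deduplicate(pairs: list[dict], threshold: float = 0.9) -> list[dict]:
    """Keep the first of any group of pairs whose questions are near-identical."""
    kept: list[dict] = []
    for pair in pairs:
        question = pair["instruction"].lower().strip()
        # Compare against every question already kept (quadratic, fine for small sets).
        is_duplicate = any(
            SequenceMatcher(
                None, question, other["instruction"].lower().strip()
            ).ratio() >= threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(pair)
    return kept
```

For large datasets a vector-based similarity measure would scale better than this pairwise comparison.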

Example Output:

{
"instruction": "What event occurred in 1962 related to south africa?", 
"input": "In 1961 South Africa became an independent republic  and in 1962 the international court \nrecognised South Africa’s control of South-West Afr ica (now Namibia).", 
"output": "the international court recognised south africa's control over south-west africa (now namibia) in 1962, marking a significant recognition of its sovereignty."
}
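
If you want to sanity-check a generated file before fine-tuning, something like the sketch below may help. It assumes each line of the JSONL file holds one JSON object with the three fields shown above; the `parse_jsonl` helper is hypothetical and not part of the tool.

```python
import json

REQUIRED_FIELDS = ("instruction", "input", "output")


def parse_jsonl(lines) -> list[dict]:
    """Parse JSONL lines and check that every record has the expected fields."""
    records = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            raise ValueError(f"line {line_number}: missing fields {missing}")
        records.append(record)
    return records


# Example usage with a file on disk:
# with open("output_dataset.jsonl", encoding="utf-8") as f:
#     records = parse_jsonl(f)
```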

Why It’s Great for Fine-Tuning AI Models

The tool produces clean, structured datasets in JSONL format, making it ideal for fine-tuning language models for tasks like question-answering, historical analysis, or domain-specific NLP. You can customize keywords to focus on specific domains (e.g., African, European, or American history) or adapt it for non-historical PDFs, like technical manuals or literature, by updating the keywords.txt file.
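
To make the keyword idea concrete, here is a rough sketch of how keyword-based filtering could work. It assumes keywords.txt holds one keyword per line, which may not match the real file format; `load_keywords`, `is_relevant`, and `min_hits` are hypothetical names for this example.

```python
def load_keywords(lines) -> set[str]:
    """Read keywords, one per line, ignoring blank lines and # comments."""
    keywords = set()
    for line in lines:
        word = line.strip().lower()
        if word and not word.startswith("#"):
            keywords.add(word)
    return keywords


def is_relevant(chunk: str, keywords: set[str], min_hits: int = 1) -> bool:
    """Return True if the chunk mentions at least min_hits distinct keywords."""
    text = chunk.lower()
    # Simple substring matching: "king" would also match "making".
    hits = sum(1 for keyword in keywords if keyword in text)
    return hits >= min_hits


# Example usage with a file on disk:
# with open("keywords.txt", encoding="utf-8") as f:
#     keywords = load_keywords(f)
```

Swapping in a different keyword list is all it takes to point a filter like this at another domain.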

How to Get Started

Prerequisites:

Install Ollama for local AI inference:

   # macOS/Linux
   curl -fsSL https://ollama.ai/install.sh | sh
   # Windows: Download from https://ollama.ai/download

Start Ollama and pull a model (e.g., Llama 3.1):

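   # Start the Ollama server (keep it running, e.g. in a separate terminal)
   ollama start
   # Download the model
   ollama pull llama3.1
   # Optional: open an interactive session to confirm the model responds
   ollama run llama3.1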
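
Once Ollama is running, it serves a local HTTP API, by default at http://localhost:11434. The sketch below shows roughly how a script can ask a model for a Q&A pair through the `/api/generate` endpoint. The prompt wording and the `build_payload` and `generate` helpers are assumptions for illustration, not the tool's actual prompt or code.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(chunk: str, model: str = "llama3.1") -> dict:
    """Build the request body for a non-streaming generation call."""
    # Hypothetical prompt; the real tool's prompt may look quite different.
    prompt = (
        "Write one question and its answer based only on this passage:\n\n"
        f"{chunk}"
    )
    return {"model": model, "prompt": prompt, "stream": False}


def generate(chunk: str, model: str = "llama3.1", url: str = OLLAMA_URL,
             timeout: int = 120) -> str:
    """Send the chunk to a running Ollama server and return the generated text."""
    data = json.dumps(build_payload(chunk, model)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    # With "stream": False, the full text arrives in the "response" field.
    return body["response"]
```

Calling `generate("some passage")` only works while the Ollama server is up and the model has been pulled.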

Install Python Dependencies:

   python3 -m venv venv
   source venv/bin/activate
   pip install spacy jsonlines requests tqdm PyPDF2 psutil textacy scikit-learn pynvml
   python -m spacy download en_core_web_sm

Basic Usage:

python history_to_dataset.py your_book.pdf output_dataset.jsonl

Advanced Usage:

python history_to_dataset.py your_book.pdf output_dataset.jsonl --model-name mistral --max-workers 8 --start-chunk 50
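
For the curious, here is a hedged sketch of how flags like these could map onto threaded processing. The flag names come from the command above, but the defaults, the positional argument names, and the `process_chunks` helper are assumptions for this example rather than the script's real internals.

```python
import argparse
from concurrent.futures import ThreadPoolExecutor


def build_parser() -> argparse.ArgumentParser:
    """Command-line interface mirroring the flags in the usage example."""
    parser = argparse.ArgumentParser(description="PDF to Q&A dataset (illustrative)")
    parser.add_argument("pdf_path")
    parser.add_argument("output_path")
    parser.add_argument("--model-name", default="llama3.1")   # assumed default
    parser.add_argument("--max-workers", type=int, default=4)  # assumed default
    parser.add_argument("--start-chunk", type=int, default=0)  # assumed default
    return parser


def process_chunks(chunks, worker, max_workers: int = 4, start_chunk: int = 0):
    """Run worker over chunks[start_chunk:] in a thread pool, keeping order."""
    # Skipping the first start_chunk items is what lets a run resume midway.
    remaining = chunks[start_chunk:]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, remaining))
```

Threads suit this workload because most of the time is spent waiting on the model server, not computing in Python.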

Check out the full README for detailed setup, customization, and troubleshooting: GitHub link.

Why I Built This

As someone passionate about South African history and AI, I wanted a tool that bridges the gap between rich historical texts and modern AI applications. I'm also experimenting with model fine-tuning, so I needed a way to extract training data from a couple of PDF documents.

An example dataset I generated with the tool can be found on Hugging Face.

It's based on this book.

This tool not only makes it easy to create datasets for fine-tuning but can also be adapted to other domains like literature, science, or even fiction!

Get Involved!

I’d love for you to try it out, share feedback, or contribute ideas. The project is still fairly new and has room for improvement, so please comment here or open an issue or feature request on GitHub if you notice anything that isn't working.

#AI #NLP #Python #History #DataScience #FineTuning

DEV Community