Mate Technologies

Posted on Feb 13

Building a PDF Structured Data Extractor with Python

#tutorial #beginners #python #pdf

In this tutorial, we’ll build a desktop app in Python that extracts text from PDFs, including scanned documents, and exports a per-file summary (file name, page count, and extracted character count) to a CSV. For scanned pages, the app filters out handwriting by discarding low-confidence OCR words.

GitHub repo: PDFStructuredDataExtractor

We’ll go step by step, explaining every section of the code.

Step 1: Set Up Your Environment

We’ll need some Python libraries. Install them using pip:

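pip install ttkbootstrap pdfplumber pytesseract pdf2image pillow pandas

Note: tkinter is part of the Python standard library and is not installed through pip. You also need Tesseract OCR installed on your system (download it from the Tesseract GitHub page), and pdf2image requires Poppler to render PDF pages as images.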
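Before launching the app, it can help to confirm that the external tools are reachable. The sketch below assumes the executables are named `tesseract` and `pdftoppm` (the Poppler renderer that pdf2image calls) and that both are on your PATH. The helper name `find_external_tools` is made up for illustration.

```python
import shutil

def find_external_tools(names=("tesseract", "pdftoppm")):
    """Return {tool_name: path or None} for each executable looked up on PATH."""
    return {name: shutil.which(name) for name in names}

if __name__ == "__main__":
    for tool, path in find_external_tools().items():
        print(f"{tool}: {path or 'NOT FOUND on PATH'}")
```

If a tool shows up as not found, OCR on scanned PDFs will likely fail even though the Python packages installed correctly.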

Step 2: Import Required Modules

We start by importing all necessary Python modules:

import os
import threading
import tkinter as tk
from tkinter import filedialog, messagebox
import tkinter.ttk as ttk
import ttkbootstrap as tb
import time
import concurrent.futures
import pandas as pd
import pdfplumber
import pytesseract
from pdf2image import convert_from_path
from PIL import Image

Explanation:

tkinter & ttkbootstrap: GUI components and styling.

pdfplumber: Extract text from PDFs.

pytesseract & pdf2image: OCR for scanned PDFs.

concurrent.futures & threading: Process PDFs in parallel.

pandas: Export extracted data to CSV.

Step 3: Define App Info

APP_NAME = "PDF Structured Data Extractor"
APP_VERSION = "1.0"
APP_AUTHOR = "Mate Technologies"
APP_WEBSITE = "https://matetools.gumroad.com"

This is optional, but good for the “About” section in the GUI.

Step 4: Create the Main App Class

The PDFExtractorApp class manages everything: GUI, folder scanning, PDF processing, and CSV export.

class PDFExtractorApp:

    def __init__(self, master):
        self.master = master
        self.master.title(f"{APP_NAME} {APP_VERSION}")
        self.master.geometry("1000x650")
        self.style = tb.Style(theme="superhero")

master: The Tkinter root window.

tb.Style(theme="superhero"): Applies a dark GUI theme.

Step 4a: Variables

We’ll keep track of user choices, progress, and control flags:

        self.folder_var = tk.StringVar()
        self.status_var = tk.StringVar(value="Idle")
        self.progress_val = tk.DoubleVar(value=0)
        self.stop_event = threading.Event()
        self.pause_event = threading.Event()
        self.results = []

        # Build the widgets defined in create_ui (Step 4b)
        self.create_ui()

stop_event and pause_event allow pausing/stopping scans safely.
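To see how two `threading.Event` flags can steer a background loop, here is a minimal sketch outside the GUI. The `worker` function and its sleep-based "work" are stand-ins invented for this example, not part of the app.

```python
import threading
import time

def worker(stop_event, pause_event, log, max_items=1000):
    """Process items until finished or told to stop; idle while paused."""
    for item in range(max_items):
        # While paused, wait in short steps so a stop request is still noticed
        while pause_event.is_set() and not stop_event.is_set():
            time.sleep(0.01)
        if stop_event.is_set():
            return
        log.append(item)
        time.sleep(0.001)  # stand-in for real work

stop_event = threading.Event()
pause_event = threading.Event()
log = []

t = threading.Thread(target=worker, args=(stop_event, pause_event, log), daemon=True)
t.start()
time.sleep(0.05)
stop_event.set()   # ask the worker to finish early
t.join(timeout=2)
print(f"Worker stopped after {len(log)} items")
```

Checking `stop_event` inside the pause loop matters: without it, a paused worker would never notice a stop request.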

Step 4b: Build the GUI

    def create_ui(self):
        tb.Label(self.master, text=f"{APP_NAME}", font=("Segoe UI", 18, "bold")).pack(pady=(10, 2))
        tb.Label(
            self.master,
            text="Extract structured text data from scanned PDFs (ignores handwriting)",
            font=("Segoe UI", 10, "italic"),
            foreground="#9ca3af"
        ).pack(pady=(0, 10))

This adds a title and subtitle.

Step 4c: Add Buttons

        top_frame = tb.Frame(self.master, padding=10)
        top_frame.pack(fill="x")

        tb.Button(top_frame, text="Select Folder", bootstyle="primary", command=self.select_folder).pack(side="left", padx=5)
        tb.Button(top_frame, text="Export to CSV", bootstyle="success", command=self.export_csv).pack(side="left", padx=5)
        tb.Button(top_frame, text="Pause", bootstyle="secondary", command=self.pause_scan).pack(side="left", padx=5)
        tb.Button(top_frame, text="Resume", bootstyle="info", command=self.resume_scan).pack(side="left", padx=5)
        tb.Button(top_frame, text="Stop", bootstyle="danger", command=self.stop_scan).pack(side="left", padx=5)
        tb.Button(top_frame, text="About", bootstyle="warning", command=self.show_about).pack(side="right", padx=5)

Each button calls a function to perform an action like scanning or exporting.

Step 4d: Add Table & Progress Bar

        table_frame = tb.Frame(self.master)
        table_frame.pack(fill="both", expand=True, padx=10, pady=10)

        self.tree = ttk.Treeview(table_frame, columns=("file", "pages", "characters"), show="headings")
        self.tree.heading("file", text="File Name")
        self.tree.heading("pages", text="Pages")
        self.tree.heading("characters", text="Extracted Characters")
        self.tree.pack(side="left", fill="both", expand=True)

        scrollbar = ttk.Scrollbar(table_frame, orient="vertical", command=self.tree.yview)
        self.tree.configure(yscrollcommand=scrollbar.set)
        scrollbar.pack(side="right", fill="y")

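        self.progress_bar = tb.Progressbar(self.master, variable=self.progress_val, maximum=100)
        self.progress_bar.pack(fill="x", padx=10, pady=5)

        # Show the current status text ("Idle", "Scanning PDFs...", and so on)
        tb.Label(self.master, textvariable=self.status_var).pack(anchor="w", padx=10, pady=(0, 10))

Explanation:

Treeview: Displays scanned files and results.

Progressbar: Shows the progress of the scan.

Label: Displays the current status held in status_var.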

Step 5: Button Functions

    def select_folder(self):
        folder = filedialog.askdirectory()
        if folder:
            self.start_scan(folder)

    def pause_scan(self):
        self.pause_event.set()
        self.status_var.set("Paused...")

    def resume_scan(self):
        self.pause_event.clear()
        self.status_var.set("Resuming...")

    def stop_scan(self):
        self.stop_event.set()
        self.status_var.set("Stopping...")

    def show_about(self):
        messagebox.showinfo(
            f"About {APP_NAME}",
            f"{APP_NAME} v{APP_VERSION}\nExtract structured data from PDFs.\n{APP_AUTHOR}\n{APP_WEBSITE}"
        )

pause_event & stop_event are used in the scanning thread.

Step 6: Scanning PDFs

    def start_scan(self, folder):
        self.stop_event.clear()
        self.pause_event.clear()
        self.results.clear()
        self.tree.delete(*self.tree.get_children())
        self.progress_val.set(0)
        threading.Thread(target=self.scan_folder_thread, args=(folder,), daemon=True).start()

Launches a background thread to scan PDFs without freezing the GUI.

Step 6a: Process Each PDF

    def process_pdf(self, path):
        file_name = os.path.basename(path)
        extracted_text = ""
        page_count = 0

        try:
            # First pass: read the embedded (digital) text layer
            with pdfplumber.open(path) as pdf:
                page_count = len(pdf.pages)
                for page in pdf.pages:
                    text = page.extract_text()
                    if text:
                        extracted_text += text + "\n"

            # Fallback: no text layer found, so render the pages and run OCR
            if not extracted_text.strip():
                images = convert_from_path(path, dpi=300)
                for img in images:
                    ocr_data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
                    for i, word in enumerate(ocr_data["text"]):
                        # conf can be an int, a float, or a string such as "-1"
                        conf = float(ocr_data["conf"][i])
                        # Skip empty tokens and low-confidence words
                        if conf > 70 and word.strip():
                            extracted_text += word + " "

        except Exception:
            # Unreadable or corrupt PDF: report it with zero pages and characters
            return (file_name, 0, 0)

        char_count = len(extracted_text)
        return (file_name, page_count, char_count)

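First tries digital text extraction.

If none is found, falls back to OCR and keeps only words with a confidence above 70. Handwriting usually scores low, so most of it is dropped. This is a heuristic, not true handwriting detection, so you may need to tune the threshold for your documents.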
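The confidence filter can be tried in isolation, without running Tesseract. The sketch below assumes a dictionary with parallel "text" and "conf" lists, the two keys the app reads from `image_to_data`. The function name `filter_confident_words` and the sample values are made up for illustration.

```python
def filter_confident_words(ocr_data, min_conf=70.0):
    """Keep non-empty words whose confidence is above min_conf."""
    kept = []
    for word, conf in zip(ocr_data["text"], ocr_data["conf"]):
        try:
            score = float(conf)  # conf may arrive as int, float, or string
        except (TypeError, ValueError):
            continue
        if score > min_conf and word.strip():
            kept.append(word)
    return " ".join(kept)

# Made-up sample shaped like the "text"/"conf" keys of image_to_data output
sample = {
    "text": ["", "Invoice", "No.", "scribble", "12345", " "],
    "conf": ["-1", 96, "91.5", 34, 88.0, 95],
}
print(filter_confident_words(sample))  # Invoice No. 12345
```

Raising `min_conf` drops more noise at the cost of losing some faint printed text, so it is worth testing against a few representative scans.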

Step 6b: Scan Folder with ThreadPool

    def scan_folder_thread(self, folder):
        # Tkinter is not thread-safe, so every GUI update goes through master.after
        self.master.after(0, lambda: self.status_var.set("Scanning PDFs..."))
        pdf_files = [os.path.join(root, f) for root, _, files in os.walk(folder) for f in files if f.lower().endswith(".pdf")]
        total = len(pdf_files)
        processed = 0

        with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
            futures = {executor.submit(self.process_pdf, pdf): pdf for pdf in pdf_files}
            for future in concurrent.futures.as_completed(futures):
                # Wait while paused, but stay responsive to a stop request
                while self.pause_event.is_set() and not self.stop_event.is_set():
                    time.sleep(0.2)
                if self.stop_event.is_set():
                    # Cancel queued work; PDFs already being processed still finish
                    for pending in futures:
                        pending.cancel()
                    self.master.after(0, lambda: self.status_var.set("Stopped."))
                    return

                result = future.result()
                self.results.append(result)
                self.master.after(0, lambda r=result: self.update_tree(r))

                processed += 1
                progress = (processed / total) * 100
                self.master.after(0, lambda p=progress: self.progress_val.set(p))

        self.master.after(0, lambda: self.status_var.set(f"Completed: {total} PDFs processed."))

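Uses ThreadPoolExecutor to process up to four PDFs in parallel.

Updates the GUI using master.after so that widget changes run on the main thread and avoid thread conflicts.

Pausing holds back result handling only; PDFs already submitted to the pool keep processing in the background.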
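The pool-plus-progress pattern can be tried without a GUI or real PDFs. In this sketch, `fake_process` is a hypothetical stand-in for process_pdf and `run_batch` is a made-up helper; only the `concurrent.futures` calls mirror the app.

```python
import concurrent.futures
import time

def fake_process(name):
    """Hypothetical stand-in for process_pdf: returns (name, pages, chars)."""
    time.sleep(0.01)
    return (name, 1, len(name))

def run_batch(names, max_workers=4):
    """Run fake_process over names in a pool; return results and progress steps."""
    results, progress = [], []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(fake_process, n) for n in names]
        # as_completed yields futures in finish order, not submission order
        for done, future in enumerate(concurrent.futures.as_completed(futures), start=1):
            results.append(future.result())
            progress.append(done / len(names) * 100)
    return results, progress

results, progress = run_batch(["a.pdf", "bb.pdf", "ccc.pdf", "dddd.pdf"])
print(sorted(results))
print(progress)  # [25.0, 50.0, 75.0, 100.0]
```

Because results arrive in completion order, the table in the app may list files in a different order on each run; sort the results if a stable order matters.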

Step 7: Update Treeview

    def update_tree(self, result):
        file_name, pages, chars = result
        self.tree.insert("", "end", values=(file_name, pages, chars))

Adds each scanned PDF to the table.

Step 8: Export to CSV

    def export_csv(self):
        if not self.results:
            messagebox.showwarning("No Data", "No extracted data to export.")
            return

        save_path = filedialog.asksaveasfilename(defaultextension=".csv")
        if not save_path:
            return

        df = pd.DataFrame(self.results, columns=["File Name", "Pages", "Extracted Characters"])
        df.to_csv(save_path, index=False)
        messagebox.showinfo("Export Complete", "Data exported successfully.")

Converts results into a Pandas DataFrame and saves it.
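If you would rather not depend on pandas for this one step, the standard library's csv module can write the same three columns. This is an alternative sketch, not what the app does; `results_to_csv_text` is a made-up helper and the rows are invented sample data.

```python
import csv
import io

def results_to_csv_text(results):
    """Serialize (file_name, pages, chars) tuples to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["File Name", "Pages", "Extracted Characters"])
    writer.writerows(results)
    return buffer.getvalue()

# Made-up rows in the same shape as self.results
print(results_to_csv_text([("report.pdf", 3, 1200), ("scan, old.pdf", 1, 0)]))
```

Note that the file name containing a comma is quoted automatically, so the output still parses as three columns.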

Step 9: Run the App

if __name__ == "__main__":
    root = tk.Tk()
    app = PDFExtractorApp(root)
    root.mainloop()

Launches the Tkinter GUI.

✅ Congratulations! You now have a fully functional PDF Structured Data Extractor that:

Scans PDFs in a folder

Extracts digital text or uses OCR for scanned PDFs

Ignores handwriting

Displays results in a table

Exports data to CSV
