DEV Community

Mate Technologies

Building a PDF Structured Data Extractor with Python

In this tutorial, we’ll build a desktop app in Python that extracts text from PDFs, including scanned documents, and exports the results to a CSV file. For scanned pages, the app filters out handwriting by discarding OCR words with low confidence scores.

GitHub repo: PDFStructuredDataExtractor

We’ll go step by step, explaining every section of the code.

Step 1: Setup Your Environment

We’ll need some Python libraries. Install them using pip:

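pip install ttkbootstrap pdfplumber pytesseract pdf2image pillow pandas

Note: tkinter ships with most Python installers, so it is not installed through pip (on some Linux distributions it comes from a system package such as python3-tk). You also need two external tools installed on your system: Tesseract OCR, which pytesseract calls (download it from Tesseract GitHub), and Poppler, which pdf2image uses to render PDF pages as images.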
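
Before running the app, it can help to confirm that the external tools are reachable. The following is a minimal sketch, not part of the original project: the helper name missing_tools is hypothetical, and it assumes the tools are exposed on PATH as "tesseract" and "pdftoppm" (the Poppler renderer). If you point pytesseract at Tesseract through an explicit path instead, this check does not apply.

```python
import shutil

# Assumed executable names: "tesseract" is the Tesseract CLI and
# "pdftoppm" is the Poppler tool that pdf2image shells out to.
REQUIRED_TOOLS = ["tesseract", "pdftoppm"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of executables that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All external tools found.")
```

An empty list means both tools were found, so OCR on scanned PDFs should at least be able to start.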

Step 2: Import Required Modules

We start by importing all necessary Python modules:

import os
import threading
import tkinter as tk
from tkinter import filedialog, messagebox
import tkinter.ttk as ttk
import ttkbootstrap as tb
import time
import concurrent.futures
import pandas as pd
import pdfplumber
import pytesseract
from pdf2image import convert_from_path
from PIL import Image

Explanation:

tkinter & ttkbootstrap: GUI components and styling.

pdfplumber: Extract text from PDFs.

pytesseract & pdf2image: OCR for scanned PDFs.

concurrent.futures & threading: Run the scan in a background thread and process several PDFs in parallel.

pandas: Export extracted data to CSV.

Step 3: Define App Info

APP_NAME = "PDF Structured Data Extractor"
APP_VERSION = "1.0"
APP_AUTHOR = "Mate Technologies"
APP_WEBSITE = "https://matetools.gumroad.com"

These constants are optional, but they are handy for the window title and the “About” dialog in the GUI.

Step 4: Create the Main App Class

The PDFExtractorApp class manages everything: GUI, folder scanning, PDF processing, and CSV export.

class PDFExtractorApp:

    def __init__(self, master):
        self.master = master
        self.master.title(f"{APP_NAME} {APP_VERSION}")
        self.master.geometry("1000x650")
        self.style = tb.Style(theme="superhero")

master: The Tkinter root window.

tb.Style(theme="superhero"): Applies a dark GUI theme.

Step 4a: Variables

We’ll keep track of user choices, progress, and control flags:

        self.folder_var = tk.StringVar()
        self.status_var = tk.StringVar(value="Idle")
        self.progress_val = tk.DoubleVar(value=0)
        self.stop_event = threading.Event()
        self.pause_event = threading.Event()
        self.results = []

        # Build the widgets once the state variables above exist.
        self.create_ui()

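stop_event and pause_event are threading.Event flags: the GUI thread sets them and the scanning thread checks them, so a scan can be paused or stopped without killing the thread.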
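
To see how such flags cooperate with a worker, here is a small standalone sketch using only the standard library. The function names wait_while_paused and worker are hypothetical and not taken from the app; the sketch keeps the same convention as the app, where a set pause_event means "paused".

```python
import threading
import time


def wait_while_paused(pause_event, stop_event, poll_seconds=0.05):
    """Block while pause_event is set. Return False if a stop was requested."""
    while pause_event.is_set() and not stop_event.is_set():
        time.sleep(poll_seconds)
    return not stop_event.is_set()


def worker(items, pause_event, stop_event, done):
    """Append each item to done, honouring pause and stop between items."""
    for item in items:
        if not wait_while_paused(pause_event, stop_event):
            break  # stop requested: leave the remaining items untouched
        done.append(item)


if __name__ == "__main__":
    pause, stop, done = threading.Event(), threading.Event(), []
    pause.set()  # start in the paused state
    thread = threading.Thread(target=worker, args=([1, 2, 3], pause, stop, done))
    thread.start()
    time.sleep(0.2)
    print("while paused:", done)
    pause.clear()  # resume
    thread.join()
    print("after resume:", done)
```

Checking stop_event inside the pause loop matters: without it, pressing Stop while paused would leave the worker waiting forever.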

Step 4b: Build the GUI

    def create_ui(self):
        tb.Label(self.master, text=f"{APP_NAME}", font=("Segoe UI", 18, "bold")).pack(pady=(10, 2))
        tb.Label(
            self.master,
            text="Extract structured text data from scanned PDFs (ignores handwriting)",
            font=("Segoe UI", 10, "italic"),
            foreground="#9ca3af"
        ).pack(pady=(0, 10))

This adds a title and subtitle.

Step 4c: Add Buttons

        top_frame = tb.Frame(self.master, padding=10)
        top_frame.pack(fill="x")

        tb.Button(top_frame, text="Select Folder", bootstyle="primary", command=self.select_folder).pack(side="left", padx=5)
        tb.Button(top_frame, text="Export to CSV", bootstyle="success", command=self.export_csv).pack(side="left", padx=5)
        tb.Button(top_frame, text="Pause", bootstyle="secondary", command=self.pause_scan).pack(side="left", padx=5)
        tb.Button(top_frame, text="Resume", bootstyle="info", command=self.resume_scan).pack(side="left", padx=5)
        tb.Button(top_frame, text="Stop", bootstyle="danger", command=self.stop_scan).pack(side="left", padx=5)
        tb.Button(top_frame, text="About", bootstyle="warning", command=self.show_about).pack(side="right", padx=5)

Each button calls a function to perform an action like scanning or exporting.

Step 4d: Add Table & Progress Bar

        table_frame = tb.Frame(self.master)
        table_frame.pack(fill="both", expand=True, padx=10, pady=10)

        self.tree = ttk.Treeview(table_frame, columns=("file", "pages", "characters"), show="headings")
        self.tree.heading("file", text="File Name")
        self.tree.heading("pages", text="Pages")
        self.tree.heading("characters", text="Extracted Characters")
        self.tree.pack(side="left", fill="both", expand=True)

        scrollbar = ttk.Scrollbar(table_frame, orient="vertical", command=self.tree.yview)
        self.tree.configure(yscrollcommand=scrollbar.set)
        scrollbar.pack(side="right", fill="y")

        self.progress_bar = tb.Progressbar(self.master, variable=self.progress_val, maximum=100)
        self.progress_bar.pack(fill="x", padx=10, pady=5)

Explanation:

Treeview: Displays scanned files and results.

Progressbar: Shows the progress of the scan.

Step 5: Button Functions

    def select_folder(self):
        folder = filedialog.askdirectory()
        if folder:
            self.start_scan(folder)

    def pause_scan(self):
        self.pause_event.set()
        self.status_var.set("Paused...")

    def resume_scan(self):
        self.pause_event.clear()
        self.status_var.set("Resuming...")

    def stop_scan(self):
        self.stop_event.set()
        self.status_var.set("Stopping...")

    def show_about(self):
        messagebox.showinfo(
            f"About {APP_NAME}",
            f"{APP_NAME} v{APP_VERSION}\nExtract structured data from PDFs.\n{APP_AUTHOR}\n{APP_WEBSITE}"
        )

The pause, resume, and stop handlers only set or clear the events; the scanning thread in Step 6b is what checks them.

Step 6: Scanning PDFs

    def start_scan(self, folder):
        self.stop_event.clear()
        self.pause_event.clear()
        self.results.clear()
        self.tree.delete(*self.tree.get_children())
        self.progress_val.set(0)
        threading.Thread(target=self.scan_folder_thread, args=(folder,), daemon=True).start()

Launches a background thread to scan PDFs without freezing the GUI.

Step 6a: Process Each PDF

    def process_pdf(self, path):
        file_name = os.path.basename(path)
        extracted_text = ""
        page_count = 0

        try:
            with pdfplumber.open(path) as pdf:
                page_count = len(pdf.pages)
                for page in pdf.pages:
                    text = page.extract_text()
                    if text:
                        extracted_text += text + "\n"

            if not extracted_text.strip():
                # No text layer found: render the pages and fall back to OCR.
                images = convert_from_path(path, dpi=300)
                for img in images:
                    ocr_data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
                    for i, word in enumerate(ocr_data["text"]):
                        # Depending on the pytesseract version, conf may be an int,
                        # a float, or a numeric string; -1 marks non-word boxes.
                        conf = float(ocr_data["conf"][i])
                        if conf > 70 and word.strip():
                            extracted_text += word + " "

        except Exception:
            # Unreadable or corrupt file: report it with zero pages and characters.
            return (file_name, 0, 0)

        char_count = len(extracted_text)
        return (file_name, page_count, char_count)

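First tries digital text extraction with pdfplumber.

If no text is found, the pages are rendered at 300 DPI and run through OCR, keeping only words with a confidence above 70. Handwriting usually scores low, so it is mostly dropped, but this is a heuristic: poorly scanned print can be filtered out too.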
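
The confidence filter can be tried on its own, without Tesseract installed. This is a minimal sketch under stated assumptions: confident_words is a hypothetical helper, and the sample dictionary is made-up data shaped like the "text" and "conf" lists that pytesseract.image_to_data returns with Output.DICT.

```python
def confident_words(ocr_data, min_conf=70.0):
    """Keep words whose OCR confidence is above min_conf.

    ocr_data is assumed to hold parallel lists under "text" and "conf".
    """
    kept = []
    for word, conf in zip(ocr_data["text"], ocr_data["conf"]):
        try:
            score = float(conf)  # conf may arrive as int, float, or str
        except (TypeError, ValueError):
            continue  # skip entries without a usable confidence
        if score > min_conf and word.strip():
            kept.append(word)
    return kept


# Made-up sample: two low-confidence entries and three confident words.
sample = {
    "text": ["Invoice", "", "Total", "scrawl", "42.00"],
    "conf": ["96", "-1", 91.5, 34, "88.2"],
}
print(" ".join(confident_words(sample)))  # prints "Invoice Total 42.00"
```

Raising min_conf drops more handwriting but also more legitimate text, so the threshold is worth tuning against a few of your own scans.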

Step 6b: Scan Folder with ThreadPool

    def scan_folder_thread(self, folder):
        # This runs in a worker thread, so GUI updates are scheduled with master.after.
        self.master.after(0, lambda: self.status_var.set("Scanning PDFs..."))
        pdf_files = [os.path.join(root, f) for root, _, files in os.walk(folder) for f in files if f.lower().endswith(".pdf")]
        total = len(pdf_files)
        processed = 0

        with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
            futures = {executor.submit(self.process_pdf, pdf): pdf for pdf in pdf_files}
            for future in concurrent.futures.as_completed(futures):
                # Also leave the pause loop on stop, otherwise Stop would hang while paused.
                while self.pause_event.is_set() and not self.stop_event.is_set():
                    time.sleep(0.2)
                if self.stop_event.is_set():
                    # Cancel PDFs still queued; files already being processed finish first.
                    for pending in futures:
                        pending.cancel()
                    self.master.after(0, lambda: self.status_var.set("Stopped."))
                    return

                result = future.result()
                self.results.append(result)
                self.master.after(0, lambda r=result: self.update_tree(r))

                processed += 1
                progress = (processed / total) * 100
                self.master.after(0, lambda p=progress: self.progress_val.set(p))

        self.master.after(0, lambda: self.status_var.set(f"Completed: {total} PDFs processed."))

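Uses ThreadPoolExecutor to process up to four PDFs in parallel.

Updates the GUI through master.after, so widget changes run on the Tkinter main loop rather than in the worker thread. Pause and Stop are checked between results, so files that are already being processed still run to completion.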
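
The same pool-plus-progress pattern can be sketched without any GUI. This is an illustrative standalone version, not the app's code: run_batch and on_progress are hypothetical names, and in the app the progress callback would be wrapped in master.after.

```python
import concurrent.futures
import threading


def run_batch(items, work, stop_event, on_progress, max_workers=4):
    """Run work(item) in a thread pool and report progress as results arrive."""
    results = []
    total = len(items)
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(work, item) for item in items]
        completed = concurrent.futures.as_completed(futures)
        for done, future in enumerate(completed, start=1):
            if stop_event.is_set():
                # Cancel anything still queued; running tasks finish on their own.
                for pending in futures:
                    pending.cancel()
                break
            results.append(future.result())
            on_progress(done / total * 100)
    return results


if __name__ == "__main__":
    stop = threading.Event()
    squares = run_batch([1, 2, 3, 4], lambda n: n * n, stop, print)
    print(sorted(squares))
```

Results arrive in completion order, not submission order, which is why the example sorts them before printing.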

Step 7: Update Treeview

    def update_tree(self, result):
        file_name, pages, chars = result
        self.tree.insert("", "end", values=(file_name, pages, chars))


Adds each scanned PDF to the table.

Step 8: Export to CSV

    def export_csv(self):
        if not self.results:
            messagebox.showwarning("No Data", "No extracted data to export.")
            return

        save_path = filedialog.asksaveasfilename(defaultextension=".csv")
        if not save_path:
            return

        df = pd.DataFrame(self.results, columns=["File Name", "Pages", "Extracted Characters"])
        df.to_csv(save_path, index=False)
        messagebox.showinfo("Export Complete", "Data exported successfully.")

Converts the results into a pandas DataFrame and writes one row per PDF (file name, page count, extracted character count) to the chosen path.
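
To preview the layout of the exported file without running the GUI, here is a rough sketch that swaps pandas for the standard library csv module. The helper name results_to_csv_text and the sample rows are hypothetical; only the column names come from the export code above.

```python
import csv
import io

COLUMNS = ["File Name", "Pages", "Extracted Characters"]


def results_to_csv_text(results):
    """Render (file_name, pages, characters) tuples as CSV text with a header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(results)
    return buffer.getvalue()


# Made-up results in the same tuple shape that process_pdf returns.
sample_results = [("invoice.pdf", 2, 1834), ("scan, page 1.pdf", 1, 412)]
print(results_to_csv_text(sample_results))
```

File names that contain commas are quoted automatically, so the three columns stay aligned when the file is opened in a spreadsheet.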

Step 9: Run the App

if __name__ == "__main__":
    root = tk.Tk()
    app = PDFExtractorApp(root)
    root.mainloop()

Launches the Tkinter GUI.

✅ Congratulations! You now have a fully functional PDF Structured Data Extractor that:

Scans PDFs in a folder

Extracts digital text or uses OCR for scanned PDFs

Filters out handwriting by dropping low-confidence OCR words

Displays results in a table

Exports data to CSV

Top comments (2)

Harsh

This is exactly what I've been looking for! PDF data extraction is such a common pain point, and your tutorial makes it so approachable. The OCR confidence filtering to ignore handwriting is a brilliant touch. Thanks for sharing the GitHub repo too — cloning it right now! 🙌

Mate Technologies

Really appreciate that — thank you! 🙌
PDF extraction can definitely be messy, so I’m glad the structure + OCR confidence filtering helped make it clearer.
Let me know how it works for your use case — always curious to see how others adapt it.