Building a PDF Structured Data Extractor with Python
In this tutorial, we’ll build a desktop app in Python that extracts text from PDFs, including scanned documents, and exports a per-file summary (file name, page count, extracted character count) to a CSV. For scanned pages the app falls back to OCR and drops low-confidence words, which tends to filter out handwriting.
GitHub repo: PDFStructuredDataExtractor
We’ll go step by step, explaining every section of the code.
Step 1: Set Up Your Environment
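We’ll need some Python libraries. Install them using pip:
pip install ttkbootstrap pdfplumber pytesseract pdf2image pillow pandas
Note: tkinter is not installed through pip. It ships with most Python installers, and on some Linux distributions it comes from a separate system package such as python3-tk. You also need Tesseract OCR installed on your system (download it from the Tesseract GitHub page), and pdf2image requires Poppler to convert PDF pages to images.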
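Before moving on, it can help to confirm that the external binaries are reachable from Python. The following is a minimal sketch, assuming the Tesseract executable is named tesseract and that Poppler's pdftoppm (one of the tools pdf2image calls) is on your PATH. The helper name check_binaries is hypothetical and not part of the app.

```python
import shutil


def check_binaries(names=("tesseract", "pdftoppm")):
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in check_binaries().items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

If Tesseract is installed but not on PATH (common on Windows), pytesseract lets you point at it by setting pytesseract.pytesseract.tesseract_cmd to the full path of the executable.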
Step 2: Import Required Modules
We start by importing all necessary Python modules:
import os
import threading
import tkinter as tk
from tkinter import filedialog, messagebox
import tkinter.ttk as ttk
import ttkbootstrap as tb
import time
import concurrent.futures
import pandas as pd
import pdfplumber
import pytesseract
from pdf2image import convert_from_path
from PIL import Image
Explanation:
tkinter & ttkbootstrap: GUI components and styling.
pdfplumber: Extract text from PDFs.
pytesseract & pdf2image: OCR for scanned PDFs.
concurrent.futures & threading: Process PDFs in parallel.
pandas: Export extracted data to CSV.
Step 3: Define App Info
APP_NAME = "PDF Structured Data Extractor"
APP_VERSION = "1.0"
APP_AUTHOR = "Mate Technologies"
APP_WEBSITE = "https://matetools.gumroad.com"
This is optional, but good for the “About” section in the GUI.
Step 4: Create the Main App Class
The PDFExtractorApp class manages everything: GUI, folder scanning, PDF processing, and CSV export.
class PDFExtractorApp:
    def __init__(self, master):
        self.master = master
        self.master.title(f"{APP_NAME} {APP_VERSION}")
        self.master.geometry("1000x650")
        self.style = tb.Style(theme="superhero")
master: The Tkinter root window.
tb.Style(theme="superhero"): Applies a dark GUI theme.
Step 4a: Variables
We’ll keep track of user choices, progress, and control flags:
        self.folder_var = tk.StringVar()
        self.status_var = tk.StringVar(value="Idle")
        self.progress_val = tk.DoubleVar(value=0)
        self.stop_event = threading.Event()
        self.pause_event = threading.Event()
        self.results = []
        # Build the widgets once the state variables exist (create_ui is defined in Step 4b).
        self.create_ui()
stop_event and pause_event allow pausing/stopping scans safely.
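To see how two threading.Event flags can control a worker, here is a small standalone sketch. It is not the app's code: the worker function and the integer items are stand-ins invented for illustration. The loop checks the flags between items, which is the same idea the scanning thread relies on later.

```python
import threading
import time


def worker(items, stop_event, pause_event, done):
    """Process items one by one, honoring pause and stop flags between items."""
    for item in items:
        # While paused, idle in short sleeps, but still react to a stop request.
        while pause_event.is_set() and not stop_event.is_set():
            time.sleep(0.01)
        if stop_event.is_set():
            break
        done.append(item)


stop_event = threading.Event()
pause_event = threading.Event()
done = []

pause_event.set()  # start in the paused state
t = threading.Thread(target=worker, args=(range(5), stop_event, pause_event, done))
t.start()
time.sleep(0.05)
paused_snapshot = list(done)  # nothing has been processed while paused
pause_event.clear()           # resume
t.join()
print(paused_snapshot, done)
```

Checking the stop flag inside the pause loop matters: without it, pressing Stop while paused would leave the worker sleeping until someone presses Resume.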
Step 4b: Build the GUI
    def create_ui(self):
        tb.Label(self.master, text=f"{APP_NAME}", font=("Segoe UI", 18, "bold")).pack(pady=(10, 2))
        tb.Label(
            self.master,
            text="Extract structured text data from scanned PDFs (ignores handwriting)",
            font=("Segoe UI", 10, "italic"),
            foreground="#9ca3af"
        ).pack(pady=(0, 10))
This adds a title and subtitle.
Step 4c: Add Buttons
        top_frame = tb.Frame(self.master, padding=10)
        top_frame.pack(fill="x")
        tb.Button(top_frame, text="Select Folder", bootstyle="primary", command=self.select_folder).pack(side="left", padx=5)
        tb.Button(top_frame, text="Export to CSV", bootstyle="success", command=self.export_csv).pack(side="left", padx=5)
        tb.Button(top_frame, text="Pause", bootstyle="secondary", command=self.pause_scan).pack(side="left", padx=5)
        tb.Button(top_frame, text="Resume", bootstyle="info", command=self.resume_scan).pack(side="left", padx=5)
        tb.Button(top_frame, text="Stop", bootstyle="danger", command=self.stop_scan).pack(side="left", padx=5)
        tb.Button(top_frame, text="About", bootstyle="warning", command=self.show_about).pack(side="right", padx=5)
Each button calls a function to perform an action like scanning or exporting.
Step 4d: Add Table & Progress Bar
        table_frame = tb.Frame(self.master)
        table_frame.pack(fill="both", expand=True, padx=10, pady=10)
        self.tree = ttk.Treeview(table_frame, columns=("file", "pages", "characters"), show="headings")
        self.tree.heading("file", text="File Name")
        self.tree.heading("pages", text="Pages")
        self.tree.heading("characters", text="Extracted Characters")
        self.tree.pack(side="left", fill="both", expand=True)
        scrollbar = ttk.Scrollbar(table_frame, orient="vertical", command=self.tree.yview)
        self.tree.configure(yscrollcommand=scrollbar.set)
        scrollbar.pack(side="right", fill="y")
        self.progress_bar = tb.Progressbar(self.master, variable=self.progress_val, maximum=100)
        self.progress_bar.pack(fill="x", padx=10, pady=5)
        # Show status_var so messages like "Paused..." are actually visible to the user.
        tb.Label(self.master, textvariable=self.status_var).pack(anchor="w", padx=10, pady=(0, 5))
Explanation:
Treeview: Displays scanned files and results.
Progressbar: Shows the progress of the scan.
Step 5: Button Functions
    def select_folder(self):
        folder = filedialog.askdirectory()
        if folder:
            self.start_scan(folder)

    def pause_scan(self):
        self.pause_event.set()
        self.status_var.set("Paused...")

    def resume_scan(self):
        self.pause_event.clear()
        self.status_var.set("Resuming...")

    def stop_scan(self):
        self.stop_event.set()
        self.status_var.set("Stopping...")

    def show_about(self):
        messagebox.showinfo(
            f"About {APP_NAME}",
            f"{APP_NAME} v{APP_VERSION}\nExtract structured data from PDFs.\n{APP_AUTHOR}\n{APP_WEBSITE}"
        )
pause_event & stop_event are used in the scanning thread.
Step 6: Scanning PDFs
    def start_scan(self, folder):
        self.stop_event.clear()
        self.pause_event.clear()
        self.results.clear()
        self.tree.delete(*self.tree.get_children())
        self.progress_val.set(0)
        threading.Thread(target=self.scan_folder_thread, args=(folder,), daemon=True).start()
Launches a background thread to scan PDFs without freezing the GUI.
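A common alternative for getting results from a background thread to the GUI is a queue that the main thread polls. The sketch below shows only that handoff and leaves Tkinter out so it runs anywhere. The names background_job and drain and the file names are hypothetical, and this is not how the tutorial's app is wired.

```python
import queue
import threading


def background_job(out_queue):
    """Stand-in for a scan: push a few rows, then a completion marker."""
    for i in range(3):
        out_queue.put(("row", f"file_{i}.pdf"))
    out_queue.put(("done", None))


def drain(out_queue, handle):
    """Deliver all currently queued messages; return True once 'done' was seen."""
    finished = False
    while True:
        try:
            kind, payload = out_queue.get_nowait()
        except queue.Empty:
            return finished
        if kind == "done":
            finished = True
        else:
            handle(payload)


messages = queue.Queue()
job = threading.Thread(target=background_job, args=(messages,), daemon=True)
job.start()
job.join()  # only for this demo; a GUI would keep polling instead of joining

rows = []
finished = drain(messages, rows.append)
print(rows, finished)
```

In a Tkinter app, a function on the main thread would call drain and then reschedule itself with master.after(100, ...) until drain returns True, so only the main thread ever touches widgets.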
Step 6a: Process Each PDF
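    def process_pdf(self, path):
        file_name = os.path.basename(path)
        extracted_text = ""
        page_count = 0
        try:
            # First pass: read the embedded (digital) text layer.
            with pdfplumber.open(path) as pdf:
                page_count = len(pdf.pages)
                for page in pdf.pages:
                    text = page.extract_text()
                    if text:
                        extracted_text += text + "\n"
            # Fallback: no text layer found, so rasterize the pages and run OCR.
            if not extracted_text.strip():
                images = convert_from_path(path, dpi=300)
                for img in images:
                    ocr_data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
                    for i, word in enumerate(ocr_data["text"]):
                        # conf may arrive as an int, a float, or a numeric string depending on
                        # the Tesseract/pytesseract versions, so parse it as a float.
                        conf = float(ocr_data["conf"][i])
                        if word.strip() and conf > 70:
                            extracted_text += word + " "
        except Exception:
            # Any failure (corrupt file, missing Tesseract or Poppler) is reported as an empty row.
            return (file_name, 0, 0)
        char_count = len(extracted_text)
        return (file_name, page_count, char_count)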
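First tries digital text extraction.
If none is found, falls back to OCR and keeps only words with a confidence above 70. Handwriting usually scores low, so it tends to be dropped, but the threshold is a heuristic rather than a true handwriting detector.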
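The confidence filter is easier to reason about in isolation. Below is a minimal sketch that works on a dictionary shaped like the output of pytesseract.image_to_data with Output.DICT (parallel "text" and "conf" lists). The function name filter_confident_words and the sample values are made up for illustration; no OCR is run here.

```python
def filter_confident_words(ocr_data, min_conf=70.0):
    """Join the words whose OCR confidence is above min_conf."""
    kept = []
    for word, conf in zip(ocr_data["text"], ocr_data["conf"]):
        try:
            score = float(conf)  # accept ints, floats, and numeric strings
        except (TypeError, ValueError):
            continue  # skip rows with an unparseable confidence
        if word.strip() and score > min_conf:
            kept.append(word)
    return " ".join(kept)


# Hypothetical OCR output: blank layout rows carry conf -1, "scrawl" scores low.
sample = {
    "text": ["", "Invoice", "No.", "scrawl", "1042", "  "],
    "conf": ["-1", "96.4", 91, "38.2", 88.0, "95"],
}
print(filter_confident_words(sample))  # Invoice No. 1042
```

Raising min_conf drops more noise but also more faint printed text, so the value is worth tuning against a few of your own scans.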
Step 6b: Scan Folder with ThreadPool
    def scan_folder_thread(self, folder):
        # This runs off the main thread, so every GUI update is routed through master.after.
        self.master.after(0, lambda: self.status_var.set("Scanning PDFs..."))
        pdf_files = [os.path.join(root, f) for root, _, files in os.walk(folder) for f in files if f.lower().endswith(".pdf")]
        total = len(pdf_files)
        processed = 0
        with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
            futures = {executor.submit(self.process_pdf, pdf): pdf for pdf in pdf_files}
            for future in concurrent.futures.as_completed(futures):
                # Wait while paused, but keep checking for a stop request.
                while self.pause_event.is_set() and not self.stop_event.is_set():
                    time.sleep(0.2)
                if self.stop_event.is_set():
                    # Cancel work that has not started yet; PDFs already running still finish.
                    for pending in futures:
                        pending.cancel()
                    self.master.after(0, lambda: self.status_var.set("Stopped."))
                    return
                result = future.result()
                self.results.append(result)
                self.master.after(0, lambda r=result: self.update_tree(r))
                processed += 1
                progress = (processed / total) * 100
                self.master.after(0, lambda p=progress: self.progress_val.set(p))
        self.master.after(0, lambda: self.status_var.set(f"Completed: {total} PDFs processed."))
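Uses ThreadPoolExecutor to scan up to four PDFs in parallel.
Updates the GUI using master.after to avoid thread conflicts.
Note that Pause only holds back the handling of results; PDFs that were already submitted keep processing in the worker threads.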
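The submit-then-as_completed pattern and the progress arithmetic can be tried without any PDFs. This sketch swaps process_pdf for a hypothetical fake_process stand-in and reports progress through a plain callback instead of a Tkinter variable.

```python
import concurrent.futures
import time


def fake_process(name):
    """Stand-in for process_pdf: returns (file_name, pages, characters)."""
    time.sleep(0.01)
    return (name, 1, len(name))


def run_all(names, on_progress, max_workers=4):
    """Run fake_process over names in a thread pool, reporting percent complete."""
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(fake_process, n) for n in names]
        completed = concurrent.futures.as_completed(futures)
        for done_count, future in enumerate(completed, start=1):
            results.append(future.result())
            on_progress(done_count / len(names) * 100)
    return results


progress_log = []
results = run_all(["a.pdf", "bb.pdf", "ccc.pdf", "dddd.pdf"], progress_log.append)
print(progress_log)     # [25.0, 50.0, 75.0, 100.0]
print(sorted(results))
```

Results arrive in completion order, not submission order, so the rows in the table will not necessarily match the order of files in the folder.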
Step 7: Update Treeview
    def update_tree(self, result):
        file_name, pages, chars = result
        self.tree.insert("", "end", values=(file_name, pages, chars))
Adds each scanned PDF to the table.
Step 8: Export to CSV
    def export_csv(self):
        if not self.results:
            messagebox.showwarning("No Data", "No extracted data to export.")
            return
        save_path = filedialog.asksaveasfilename(defaultextension=".csv")
        if not save_path:
            return
        df = pd.DataFrame(self.results, columns=["File Name", "Pages", "Extracted Characters"])
        df.to_csv(save_path, index=False)
        messagebox.showinfo("Export Complete", "Data exported successfully.")
Converts results into a Pandas DataFrame and saves it.
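To preview what the exported file looks like without opening a save dialog, here is a dependency-free sketch that uses Python's built-in csv module as a stand-in for the pandas call. The helper name results_to_csv and the sample rows are hypothetical; the column names match the ones used in export_csv.

```python
import csv
import io


def results_to_csv(results):
    """Render (file_name, pages, characters) tuples as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["File Name", "Pages", "Extracted Characters"])
    writer.writerows(results)
    return buffer.getvalue()


csv_text = results_to_csv([("invoice.pdf", 2, 1534), ("scan, old.pdf", 1, 0)])
print(csv_text)
```

File names that contain commas are quoted automatically, which is also what pandas does when it writes the DataFrame.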
Step 9: Run the App
if __name__ == "__main__":
    root = tk.Tk()
    app = PDFExtractorApp(root)
    root.mainloop()
Launches the Tkinter GUI.
✅ Congratulations! You now have a fully functional PDF Structured Data Extractor that:
Scans PDFs in a folder
Extracts digital text or uses OCR for scanned PDFs
Ignores handwriting
Displays results in a table
Exports data to CSV
