Mate Technologies

Posted on Apr 19

🔗 Build a Link Extractor & Broken Link Checker (Python + PySide6)

#python #beginners #tutorial #opensource

In this tutorial, we’ll build a desktop app that:

✅ Extracts links from files (.txt, .pdf, .html)
✅ Filters links (include/exclude keywords)
✅ Checks if links are broken
✅ Displays results with colors (🟢 working / 🔴 broken)
✅ Uses a modern GUI with PySide6

📦 Step 1: Install Dependencies

First, install the required packages:

pip install PySide6 requests PyPDF2

🧠 Step 2: Import Required Libraries

We start by importing everything we need:

import os
import sys
import re
import requests
import time
import platform
import subprocess

from PySide6.QtWidgets import *
from PySide6.QtCore import Qt, QThread, Signal, QTimer
from PySide6.QtGui import QColor, QIcon, QGuiApplication

import PyPDF2

💡 Explanation:
os, re → file handling + regex
requests → check links
PySide6 → GUI framework
PyPDF2 → extract text from PDFs

🧵 Step 3: Create a Background Worker (QThread)

We use a thread so the UI doesn’t freeze while scanning.

class LinkWorker(QThread):
    found = Signal(str, bool)
    progress = Signal(int)
    finished = Signal()

💡 Why?

GUI apps must stay responsive, so heavy work runs in a thread.

🔍 Step 3.1: Initialize Worker

def __init__(self, folder, file_types, check_broken, include_words=None, exclude_words=None):
    super().__init__()
    self.folder = folder
    self.file_types = file_types
    self.check_broken = check_broken
    self.include_words = include_words or []
    self.exclude_words = exclude_words or []
    self.seen_links = set()
    self._running = True

💡 Features:
Avoid duplicate links
Support include/exclude filters
Allow stopping process
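
The duplicate check and the stop flag can be sketched without any Qt code. The class below is a hypothetical stand-in for LinkWorker, not the app's actual worker: the names DedupingScanner, stop, and emitted are assumptions, and it uses a threading.Event where the worker keeps a plain _running boolean.

```python
import threading


class DedupingScanner:
    """Hypothetical stand-in for LinkWorker: only the dedup + stop logic."""

    def __init__(self, urls):
        self.urls = list(urls)
        self.seen_links = set()   # same role as LinkWorker.seen_links
        self.emitted = []         # stands in for the `found` signal
        self._stop_event = threading.Event()

    def stop(self):
        # Safe to call from another thread, e.g. a "Stop" button handler
        self._stop_event.set()

    def run(self):
        for url in self.urls:
            if self._stop_event.is_set():
                break             # leave the loop early when asked to stop
            if url in self.seen_links:
                continue          # skip links that were already reported
            self.seen_links.add(url)
            self.emitted.append(url)
```

In the real worker, the same two checks would sit at the top of the loop in run(), with self.found.emit(...) in place of the list append.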

📂 Step 3.2: Scan Files

def run(self):
    all_files = []

    for root, _, files in os.walk(self.folder):
        for f in files:
            ext = os.path.splitext(f)[1].lower()

            if (ext == '.txt' and self.file_types['txt']) or \
               (ext == '.pdf' and self.file_types['pdf']) or \
               (ext in ['.html', '.htm'] and self.file_types['html']):
                all_files.append(os.path.join(root, f))

💡 What happens:
Recursively scans folders
Filters only selected file types
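
As a standalone sketch, the same scan can be written as a function you can try outside the GUI. The name collect_files and the extension-to-key mapping are assumptions for illustration; the behavior is meant to match the loop above.

```python
import os


def collect_files(folder, file_types):
    """Return paths under `folder` whose extension is enabled in `file_types`."""
    # Map each extension to the key used in the file_types dict
    ext_to_key = {'.txt': 'txt', '.pdf': 'pdf', '.html': 'html', '.htm': 'html'}
    matches = []
    for root, _, files in os.walk(folder):
        for name in files:
            ext = os.path.splitext(name)[1].lower()
            key = ext_to_key.get(ext)
            if key and file_types.get(key):
                matches.append(os.path.join(root, name))
    return sorted(matches)
```

Calling collect_files(folder, {'txt': True, 'pdf': False, 'html': True}) would return only the .txt, .html, and .htm files, including those in subfolders.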

🔗 Step 3.3: Extract Links

urls = re.findall(r'https?://[^\s"\'>]+', text)

💡 Regex explained:
Matches http:// or https://
Stops at whitespace, quotes, or >
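
Here is a small sketch of the regex in action. The helper name extract_urls and the trailing-punctuation cleanup are assumptions added for illustration; the pattern itself is the one from this step.

```python
import re

URL_PATTERN = re.compile(r'https?://[^\s"\'>]+')


def extract_urls(text):
    """Return unique URLs in first-seen order, minus trailing punctuation."""
    urls = []
    for match in URL_PATTERN.findall(text):
        # Sentence punctuation right after a URL is usually not part of it
        cleaned = match.rstrip('.,;:!?)')
        if cleaned not in urls:
            urls.append(cleaned)
    return urls


sample = 'See https://example.com/docs, or <a href="http://example.org/a?b=1">here</a>.'
print(extract_urls(sample))
# ['https://example.com/docs', 'http://example.org/a?b=1']
```

Stripping a trailing ")" is a trade-off: it cleans up links written inside parentheses but would also clip a URL that legitimately ends with one.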

📄 Handle PDF Files

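reader = PyPDF2.PdfReader(f)
text = ""
for page in reader.pages:
    # Accumulate every page; extract_text() may yield nothing for image-only pages
    text += (page.extract_text() or "") + "\n"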
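
The snippets here don't show how .txt and .html files are read. One plausible approach, assuming the files are mostly UTF-8, is to read them as text and skip undecodable bytes; the helper name read_text_file is hypothetical.

```python
def read_text_file(path):
    """Read a .txt or .html file, ignoring bytes that are not valid UTF-8."""
    with open(path, 'r', encoding='utf-8', errors='ignore') as handle:
        return handle.read()
```

Ignoring bad bytes keeps one oddly encoded file from aborting the whole scan, at the cost of occasionally dropping a character.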

🎯 Step 3.4: Apply Filters

if self.include_words and not any(w in url for w in self.include_words):
    continue

if self.exclude_words and any(w in url for w in self.exclude_words):
    continue

💡 Example:
Include: google
Exclude: facebook
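
The filter logic can be tried on its own with the sketch below. The function names parse_keywords and passes_filters are assumptions; the two checks mirror the worker code above. Note the last case in the usage lines: an empty keyword matches every URL, so blank entries need to be dropped before filtering.

```python
def parse_keywords(raw):
    """Split a comma-separated string into non-empty, trimmed keywords."""
    return [word.strip() for word in raw.split(',') if word.strip()]


def passes_filters(url, include_words, exclude_words):
    """Mirror the worker's checks: keep a URL only if it passes both filters."""
    if include_words and not any(w in url for w in include_words):
        return False
    if exclude_words and any(w in url for w in exclude_words):
        return False
    return True


print(parse_keywords(" google , ,maps"))                                     # ['google', 'maps']
print(passes_filters("https://google.com/maps", ["google"], ["facebook"]))  # True
print(passes_filters("https://example.com", [], [""]))                      # False
```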

🌐 Step 3.5: Check Broken Links

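def check_link(self, url):
    """Return True if the link is broken."""
    try:
        res = requests.get(url, timeout=10)
        return not (200 <= res.status_code < 400)
    except requests.RequestException:
        # Timeouts, DNS failures, and invalid URLs count as broken
        return True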

💡 Logic:
200–399 → OK
400+ → broken
Timeout or connection error → broken
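
The same logic can be sketched with only the standard library. This is a variant, not the app's method: it swaps requests.get for urllib with a HEAD request (which skips downloading the body), and the name is_broken and the opener parameter are assumptions added so the function can be exercised without network access.

```python
import http.client
import urllib.request


def is_broken(url, opener=urllib.request.urlopen, timeout=10):
    """Return True if `url` looks broken, using a HEAD request."""
    try:
        request = urllib.request.Request(url, method='HEAD')
        with opener(request, timeout=timeout) as response:
            return not (200 <= response.status < 400)
    except (OSError, ValueError, http.client.HTTPException):
        # urlopen raises HTTPError (an OSError) for 4xx/5xx responses;
        # ValueError covers strings that are not valid URLs
        return True
```

Some servers reject HEAD requests even for pages that load fine in a browser, so a stricter checker might retry with GET before marking a link as broken.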

🖥️ Step 4: Build the GUI

Create the main window:

class LinkApp(QWidget):
    def __init__(self):
        super().__init__()
        self.setWindowTitle("LinkGuardian")
        self.setMinimumSize(1000, 600)

📁 Step 4.1: Folder Selection

self.path_input = QLineEdit()
self.path_input.setReadOnly(True)

browse_btn = QPushButton("Browse")
browse_btn.clicked.connect(self.browse_folder)

def browse_folder(self):
    folder = QFileDialog.getExistingDirectory(self)
    if folder:
        self.path_input.setText(folder)
        self.folder = folder

⚙️ Step 4.2: Options (Checkboxes)

self.txt_checkbox = QCheckBox(".txt")
self.pdf_checkbox = QCheckBox(".pdf")
self.html_checkbox = QCheckBox(".html")

self.check_broken_checkbox = QCheckBox("Check Broken Links")

🔍 Step 4.3: Filters

self.include_input = QLineEdit()
self.include_input.setPlaceholderText("Include words")

self.exclude_input = QLineEdit()
self.exclude_input.setPlaceholderText("Exclude words")

▶️ Step 4.4: Start Scan

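def start_scan(self):
    # Nothing to scan until a folder has been chosen via Browse
    if not getattr(self, "folder", None):
        return

    # "".split(",") yields [""], which would match every URL, so drop blanks
    include = [w.strip() for w in self.include_input.text().split(",") if w.strip()]
    exclude = [w.strip() for w in self.exclude_input.text().split(",") if w.strip()]

    self.worker = LinkWorker(
        self.folder,
        {
            'txt': self.txt_checkbox.isChecked(),
            'pdf': self.pdf_checkbox.isChecked(),
            'html': self.html_checkbox.isChecked()
        },
        self.check_broken_checkbox.isChecked(),
        include,
        exclude
    )

    self.worker.found.connect(self.add_link)
    self.worker.start()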

🎨 Step 5: Display Results

def add_link(self, link, is_broken):
    item = QListWidgetItem(link)

    color = QColor("red") if is_broken else QColor("green")
    item.setForeground(color)

    self.results_list.addItem(item)

💡 Result:
🟢 Green → Working link
🔴 Red → Broken link

📊 Step 6: Progress Bar

self.progress_bar = QProgressBar()
self.progress_bar.setMaximum(100)

Update it from the worker:

self.worker.progress.connect(self.progress_bar.setValue)
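
The worker code that emits progress isn't shown here. Since the bar's maximum is 100, one plausible approach is to convert "files done out of total files" into a percentage; the helper name progress_percent is hypothetical.

```python
def progress_percent(done, total):
    """Convert 'done out of total' into an integer from 0 to 100."""
    if total <= 0:
        return 100  # nothing to scan, so report the bar as complete
    return min(100, int(done * 100 / total))


print(progress_percent(1, 4))  # 25
print(progress_percent(4, 4))  # 100
```

Inside run(), the worker could then call self.progress.emit(progress_percent(index + 1, len(all_files))) after each file, assuming it loops over all_files with enumerate.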

📋 Step 7: Copy All Links

def copy_all_links(self):
    links = "\n".join(
        self.results_list.item(i).text()
        for i in range(self.results_list.count())
    )

    QGuiApplication.clipboard().setText(links)

🌍 Step 8: Open Links on Double Click

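def open_item(self, item):
    url = item.text()

    system = platform.system()
    if system == "Windows":
        os.startfile(url)
    elif system == "Darwin":
        # macOS has no xdg-open; use the built-in `open` command
        subprocess.Popen(["open", url])
    else:
        subprocess.Popen(["xdg-open", url])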

🚀 Step 9: Run the App

if __name__ == "__main__":
    app = QApplication(sys.argv)
    window = LinkApp()
    window.show()
    sys.exit(app.exec())

🎉 Final Result

You now have a desktop tool that:

✔ Extracts links from files
✔ Filters intelligently
✔ Detects broken links
✔ Displays results beautifully
✔ Runs smoothly with threads

💡 Bonus Ideas

Want to upgrade it further?

Export results to CSV
Add domain grouping
Add link preview
Add multi-threaded link checking (faster 🚀)
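
As a starting point for the last idea, here is a minimal sketch of concurrent checking with a thread pool. The function name check_links_concurrently and the max_workers default are assumptions; check_link is any function that takes a URL and returns True when the link is broken.

```python
from concurrent.futures import ThreadPoolExecutor


def check_links_concurrently(urls, check_link, max_workers=8):
    """Run `check_link(url)` for each URL in a thread pool.

    Returns a dict mapping each URL to the checker's result (True = broken).
    """
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with urls
        results = list(pool.map(check_link, urls))
    return dict(zip(urls, results))
```

Link checks spend most of their time waiting on the network, so threads help even in Python. In the app, this would run inside the worker thread, not on the GUI thread.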
