DEV Community

Mate Technologies

🔗 Build a Link Extractor & Broken Link Checker (Python + PySide6)

In this tutorial, we’ll build a desktop app that:

✅ Extracts links from files (.txt, .pdf, .html)
✅ Filters links (include/exclude keywords)
✅ Checks if links are broken
✅ Displays results with colors (🟢 working / 🔴 broken)
✅ Uses a modern GUI with PySide6

📦 Step 1: Install Dependencies

First, install required packages:

pip install PySide6 requests PyPDF2

🧠 Step 2: Import Required Libraries

We start by importing everything we need:

import os
import sys
import re
import requests
import time
import platform
import subprocess

from PySide6.QtWidgets import *
from PySide6.QtCore import Qt, QThread, Signal, QTimer
from PySide6.QtGui import QColor, QIcon, QGuiApplication

import PyPDF2

💡 Explanation:
os, re → file handling + regex
requests → check links
PySide6 → GUI framework
PyPDF2 → extract text from PDFs

🧵 Step 3: Create a Background Worker (QThread)

We use a thread so the UI doesn’t freeze while scanning.

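class LinkWorker(QThread):
    found = Signal(str, bool)   # (url, is_broken) for each link found
    progress = Signal(int)      # scan progress, 0-100
    finished = Signal()         # note: shadows QThread's built-in finished signal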

💡 Why?

GUI apps must stay responsive, so heavy work runs in a thread.

🔍 Step 3.1: Initialize Worker

def __init__(self, folder, file_types, check_broken, include_words=None, exclude_words=None):
    super().__init__()
    self.folder = folder
    self.file_types = file_types
    self.check_broken = check_broken
    self.include_words = include_words or []
    self.exclude_words = exclude_words or []
    self.seen_links = set()
    self._running = True

💡 Features:
Avoid duplicate links
Support include/exclude filters
Allow stopping process
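
The bookkeeping behind these three features can be tried without Qt. Here is a minimal sketch, assuming a hypothetical `LinkCollector` class that is not part of the app; it only mirrors the `seen_links` set and the `_running` flag from `__init__`:

```python
class LinkCollector:
    """Hypothetical, Qt-free stand-in for the worker's bookkeeping."""

    def __init__(self):
        self.seen_links = set()   # URLs already reported
        self._running = True      # cleared by stop() to end the loop early

    def stop(self):
        self._running = False

    def collect(self, urls):
        """Return URLs not seen before, stopping early if stop() was called."""
        fresh = []
        for url in urls:
            if not self._running:
                break
            if url in self.seen_links:
                continue
            self.seen_links.add(url)
            fresh.append(url)
        return fresh


collector = LinkCollector()
first = collector.collect(["https://a.example", "https://b.example", "https://a.example"])
second = collector.collect(["https://b.example", "https://c.example"])
collector.stop()
third = collector.collect(["https://d.example"])
```

Checking the flag at the top of every iteration is what lets a Stop button take effect between links rather than only at the end of the scan.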

📂 Step 3.2: Scan Files

def run(self):
    all_files = []

    for root, _, files in os.walk(self.folder):
        for f in files:
            ext = os.path.splitext(f)[1].lower()

            if (ext == '.txt' and self.file_types['txt']) or \
               (ext == '.pdf' and self.file_types['pdf']) or \
               (ext in ['.html', '.htm'] and self.file_types['html']):
                all_files.append(os.path.join(root, f))

💡 What happens:
Recursively scans folders
Filters only selected file types
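
To see the scan in isolation, here is a small standalone sketch. The `collect_files` helper and the `EXT_TO_KEY` table are hypothetical names introduced for illustration; the logic follows the same `os.walk` loop and the same `file_types` dictionary shape as the worker:

```python
import os
import tempfile

# Extension -> key in the file_types dict used by the worker
EXT_TO_KEY = {".txt": "txt", ".pdf": "pdf", ".html": "html", ".htm": "html"}


def collect_files(folder, file_types):
    """Return sorted paths under folder whose type is enabled in file_types."""
    matches = []
    for root, _, files in os.walk(folder):
        for name in files:
            ext = os.path.splitext(name)[1].lower()
            key = EXT_TO_KEY.get(ext)
            if key and file_types.get(key):
                matches.append(os.path.join(root, name))
    return sorted(matches)


# Try it on a throwaway directory tree
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for rel in ("a.txt", "b.PDF", os.path.join("sub", "c.htm"), "d.png"):
        open(os.path.join(tmp, rel), "w").close()

    found = collect_files(tmp, {"txt": True, "pdf": False, "html": True})
    names = [os.path.relpath(p, tmp) for p in found]
```

Here `names` holds `a.txt` and `sub/c.htm`: the PDF is skipped because its checkbox value is `False`, and the PNG is never a candidate.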

🔗 Step 3.3: Extract Links

urls = re.findall(r'https?://[^\s"\'>]+', text)

💡 Regex explained:
Matches http:// or https://
Stops at whitespace, quotes, or >
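
A quick way to check the pattern is to run it on a sample string. The sketch below wraps it in a hypothetical `extract_urls` helper and adds one step the tutorial does not include: trimming sentence punctuation that the pattern would otherwise keep at the end of a URL.

```python
import re

URL_PATTERN = re.compile(r'https?://[^\s"\'>]+')


def extract_urls(text):
    """Find URLs with the tutorial's pattern, then trim trailing punctuation."""
    urls = []
    for match in URL_PATTERN.findall(text):
        # Sentence punctuation right after a URL is matched too; strip it.
        urls.append(match.rstrip(".,;:!?)"))
    return urls


sample = 'See https://example.com/docs, or <a href="http://example.org/a?b=1">this</a>.'
urls = extract_urls(sample)
# urls == ["https://example.com/docs", "http://example.org/a?b=1"]
```

The trimming is a trade-off: it also removes a closing parenthesis that legitimately ends some URLs, so adjust the character set to match your files.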

📄 Handle PDF Files

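# f: a path or binary file object for the PDF (assumed from context)
reader = PyPDF2.PdfReader(f)
text = ""
for page in reader.pages:
    # Accumulate every page; fall back to "" when a page yields no text
    text += page.extract_text() or ""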

🎯 Step 3.4: Apply Filters

if self.include_words and not any(w in url for w in self.include_words):
    continue

if self.exclude_words and any(w in url for w in self.exclude_words):
    continue

💡 Example:
Include: google (keep only URLs containing "google")
Exclude: facebook (drop URLs containing "facebook")
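
The same rules can be tested as plain functions. In this sketch, `parse_keywords` and `passes_filters` are hypothetical helpers, not part of the app. The first one guards against a subtle pitfall: an empty keyword is a substring of every URL, so a stray `""` in the exclude list would reject everything.

```python
def parse_keywords(text):
    """Split a comma-separated string into non-empty, trimmed keywords."""
    return [word.strip() for word in text.split(",") if word.strip()]


def passes_filters(url, include_words, exclude_words):
    """Apply the same include/exclude rules as the worker loop."""
    if include_words and not any(w in url for w in include_words):
        return False
    if exclude_words and any(w in url for w in exclude_words):
        return False
    return True


urls = [
    "https://www.google.com/search",
    "https://www.facebook.com/google",
    "https://example.com",
]
include = parse_keywords("google")
exclude = parse_keywords("facebook, ")
kept = [u for u in urls if passes_filters(u, include, exclude)]
# kept == ["https://www.google.com/search"]
```

Note that matching is case-sensitive and works on the whole URL, so "google" also matches a path segment, not only the domain.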

🌐 Step 3.5: Check Broken Links

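def check_link(self, url):
    """Return True if the link is broken, False if it responds normally."""
    try:
        res = requests.get(url, timeout=10)
        # 2xx and 3xx count as working; anything else is broken
        return not (200 <= res.status_code < 400)
    except requests.RequestException:
        # Timeouts, DNS failures, refused connections, invalid URLs
        return True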

💡 Logic:
200–399 → OK
400+ or a failed request → broken
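
The decision rule is easy to check without touching the network. A minimal sketch, assuming a hypothetical `is_broken` helper where `None` stands for a request that failed before any status code came back:

```python
def is_broken(status_code):
    """Return True for a broken link.

    status_code is the HTTP status, or None when the request itself failed
    (timeout, DNS error, refused connection).
    """
    if status_code is None:
        return True
    return not (200 <= status_code < 400)


results = {code: is_broken(code) for code in (200, 301, 399, 400, 404, 500, None)}
# 200, 301 and 399 map to False; 400, 404, 500 and None map to True
```

Keep in mind that a 4xx answer does not always mean a dead page: some servers can reject scripted requests while the link still works in a browser.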

🖥️ Step 4: Build the GUI

Create the main window:

class LinkApp(QWidget):
    def __init__(self):
        super().__init__()
        self.setWindowTitle("LinkGuardian")
        self.setMinimumSize(1000, 600)

📁 Step 4.1: Folder Selection

# Inside LinkApp.__init__:
self.path_input = QLineEdit()
self.path_input.setReadOnly(True)

browse_btn = QPushButton("Browse")
browse_btn.clicked.connect(self.browse_folder)

# Method on LinkApp:
def browse_folder(self):
    folder = QFileDialog.getExistingDirectory(self)
    if folder:  # empty string means the dialog was cancelled
        self.path_input.setText(folder)
        self.folder = folder

⚙️ Step 4.2: Options (Checkboxes)

self.txt_checkbox = QCheckBox(".txt")
self.pdf_checkbox = QCheckBox(".pdf")
self.html_checkbox = QCheckBox(".html")

self.check_broken_checkbox = QCheckBox("Check Broken Links")

🔍 Step 4.3: Filters

self.include_input = QLineEdit()
self.include_input.setPlaceholderText("Include words")

self.exclude_input = QLineEdit()
self.exclude_input.setPlaceholderText("Exclude words")

▢️ Step 4.4: Start Scan

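def start_scan(self):
    # "".split(",") returns [""], and "" is a substring of every URL,
    # so drop empty entries or the exclude filter rejects everything.
    include = [w.strip() for w in self.include_input.text().split(",") if w.strip()]
    exclude = [w.strip() for w in self.exclude_input.text().split(",") if w.strip()]

    self.worker = LinkWorker(
        self.folder,
        {
            'txt': self.txt_checkbox.isChecked(),
            'pdf': self.pdf_checkbox.isChecked(),
            'html': self.html_checkbox.isChecked()
        },
        self.check_broken_checkbox.isChecked(),
        include,
        exclude
    )

    self.worker.found.connect(self.add_link)
    self.worker.start()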

🎨 Step 5: Display Results

def add_link(self, link, is_broken):
    item = QListWidgetItem(link)

    color = QColor("red") if is_broken else QColor("green")
    item.setForeground(color)

    self.results_list.addItem(item)

💡 Result:
🟢 Green → Working link
🔴 Red → Broken link

📊 Step 6: Progress Bar

self.progress_bar = QProgressBar()
self.progress_bar.setMaximum(100)

Update it from the worker:

self.worker.progress.connect(self.progress_bar.setValue)
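
The tutorial does not show how the worker computes the value it emits. One plausible approach, assuming progress is measured in files processed (the `percent_done` helper is hypothetical), is:

```python
def percent_done(done, total):
    """Map 'done out of total' onto the 0-100 range of the progress bar."""
    if total <= 0:
        return 100  # nothing to scan counts as finished
    return min(100, int(done * 100 / total))


steps = [percent_done(i, 4) for i in range(5)]
# steps == [0, 25, 50, 75, 100]
```

Inside `run`, the worker could then call `self.progress.emit(percent_done(index, len(all_files)))` after each file.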

📋 Step 7: Copy All Links

def copy_all_links(self):
    links = "\n".join(
        self.results_list.item(i).text()
        for i in range(self.results_list.count())
    )

    QGuiApplication.clipboard().setText(links)

🌍 Step 8: Open Links on Double Click

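def open_item(self, item):
    url = item.text()

    system = platform.system()
    if system == "Windows":
        os.startfile(url)
    elif system == "Darwin":
        # macOS has no xdg-open; its equivalent is "open"
        subprocess.Popen(["open", url])
    else:
        subprocess.Popen(["xdg-open", url])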

🚀 Step 9: Run the App

if __name__ == "__main__":
    app = QApplication(sys.argv)
    window = LinkApp()
    window.show()
    sys.exit(app.exec())

🎉 Final Result

You now have a professional desktop tool that:

✔ Extracts links from .txt, .pdf, and .html files
✔ Filters by include/exclude keywords
✔ Detects broken links
✔ Color-codes results (green/red)
✔ Keeps the UI responsive with a background thread

💡 Bonus Ideas

Want to upgrade it further?

Export results to CSV
Add domain grouping
Add link preview
Add multi-threaded link checking (faster 🚀)
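
As a starting point for the last idea, here is a rough sketch using the standard library's thread pool. The `check_all` function and the `fake_check` stand-in are hypothetical; in the app you would pass a function that performs the real request and returns `True` for a broken link.

```python
from concurrent.futures import ThreadPoolExecutor


def check_all(urls, check_link, max_workers=8):
    """Run check_link(url) -> bool for every URL in parallel.

    Returns a dict mapping url -> is_broken.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its result
        return dict(zip(urls, pool.map(check_link, urls)))


# Stand-in checker so the sketch runs offline; a real one would make a request
def fake_check(url):
    return "missing" in url


report = check_all(
    ["https://example.com/ok", "https://example.com/missing"],
    fake_check,
)
# report maps the first URL to False and the second to True
```

Link checking spends most of its time waiting on the network, which is why threads help here even in Python.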
