Web Crawling and RSS Reading Made Easy

Piotr — Fri, 31 Jan 2025 15:46:11 +0000

Tired of building yet another RSS client or web crawler?

Don't worry, Crawler Buddy is here to save the day! This project crawls web pages and returns digestible responses in JSON format.

Key Features:

No more reliance on external tools: Forget about yt-dlp or Beautiful Soup for link metadata extraction.
Standardized metadata: Get consistent fields like title, description, date_published, and more.
Bot protection? No problem: Access RSS feeds—even on sites with tricky bot protection—without custom HTTP wrappers.
Automatic feed detection: It can automatically discover RSS feed URLs for websites and YouTube channels in many cases.
Simplified data handling: Skip parsing RSS files. Just consume easy-to-use JSON.
Unified interface: Access all metadata from a single, simple interface.
Containerized Docker environment: Isolate problems from your host OS for seamless operation.
Scalability: Whether you're running a single server or multiple, Crawler Buddy fits your needs.
UTF-8 encoding: Say goodbye to encoding issues, because every response is UTF-8.
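
To make the "standardized metadata" and "easy-to-use JSON" points concrete, here is a minimal sketch of turning RSS items into JSON with fixed keys. It is not Crawler Buddy's actual code or response schema: the function name rss_to_entries, the sample feed, and the exact key set are assumptions for illustration, and only title, description, and date_published are taken from the feature list above.

```python
import json
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Made-up sample feed, used only to demonstrate the conversion.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item>
      <title>First post</title>
      <link>https://example.com/first</link>
      <description>Hello, world</description>
      <pubDate>Fri, 31 Jan 2025 15:46:11 +0000</pubDate>
    </item>
  </channel>
</rss>"""


def rss_to_entries(rss_text):
    """Convert RSS 2.0 items into dicts that all share the same keys.

    Hypothetical helper: the key names are an assumption, not the
    real Crawler Buddy schema.
    """
    root = ET.fromstring(rss_text)
    entries = []
    for item in root.iter("item"):
        pub_date = item.findtext("pubDate")
        entries.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "description": item.findtext("description"),
            # RSS dates follow RFC 2822; emit ISO 8601 so consumers
            # never have to parse the feed's own date format.
            "date_published": (
                parsedate_to_datetime(pub_date).isoformat() if pub_date else None
            ),
        })
    return entries


if __name__ == "__main__":
    print(json.dumps(rss_to_entries(SAMPLE_RSS), indent=2, ensure_ascii=False))
```

A consumer of this kind of output only needs a JSON parser: a missing field shows up as null instead of a missing XML element.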

Available Crawlers:

RequestsCrawler: Python requests
CrawleeScript: Crawlee with BeautifulSoup
PlaywrightScript: Crawlee with Playwright
SeleniumUndetected: Undetected Selenium
SeleniumChromeHeadless: Selenium in headless mode
SeleniumChromeFull: Full Selenium mode
StealthRequestsCrawler: Stealthy requests

Want to learn more?
Check out the official repository: Crawler Buddy GitHub

Django bookmark management software

Piotr — Tue, 29 Oct 2024 13:22:36 +0000

Overview

Two years ago, I started a personal project with a big goal: creating a truly complete RSS client. I know what you're probably thinking—aren't there already thousands of RSS clients out there? It's true, but I believe none of them have yet delivered the ultimate user experience.

Of course, there are some fantastic tools in the realm of bookmark managers and RSS clients, like the impressive Grimoire project. There's also a wealth of other resources on GitHub’s Awesome Selfhosted list.

After much trial and error, I realized what I truly wanted from a bookmark manager:

Self-hostable: No syncing across external platforms. I want my bookmarks secure and fully managed on my own server.
Scalable: It must handle thousands of bookmarks with ease.
Powerful search and tagging: With so many bookmarks, an efficient search and tagging system is essential.
Comment and note support: I need the ability to add detailed notes or context to each bookmark.
File over function: The ability to import/export in multiple formats is a must.
Open Source: I want full transparency, and I aim to prevent the "enshittification" that often creeps into closed systems.
Small footprint: I want it to run on a Raspberry Pi or a small NAS.

Looking at other RSS clients, I found that very few could meet my criteria. Many, in my opinion, fall short in features or flexibility.

Introducing Django-link-archive

I’ve developed most of these features in my project, Django-link-archive, which has become my primary tool for managing bookmarks. It’s transformed how I navigate content online—I control what I want to see and avoid the distractions pushed by social media algorithms.

Take a look if you’re interested:

Django-link-archive GitHub Repository

Seeking Feedback

Now, I'm looking for feedback. Are there other requirements you’d expect from a robust RSS client or bookmark manager? Any features you find especially useful?

I've already received insightful ideas from the Reddit community. For example, I recently added a kiosk-like feature where the list of entries refreshes periodically. I also integrated jQuery, making interactions much more fluid.

Additional Projects

As I continued to work with RSS data, I was able to build out some related repositories, such as:

In some ways, this project has evolved into a simplified web crawler. The "browser" mechanism in the backend is now selectable, with requests, Selenium, and Crawlee as options. Everything is configurable through a GUI, so I can assign specific crawling methods to particular domains. For instance, Spotify might require a full Selenium browser, while Crawlee performs better with other domains.
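
Assigning a crawling method per domain can be sketched roughly as follows. This is only an illustration of the idea, not how Django-link-archive stores or applies its settings: the DOMAIN_CRAWLERS table, the labels "requests", "selenium", and "crawlee", and the pick_crawler function are all hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical rules: domain -> crawling method label.
# In the real project this mapping is edited through the GUI.
DOMAIN_CRAWLERS = {
    "spotify.com": "selenium",
    "example.org": "crawlee",
}
DEFAULT_CRAWLER = "requests"


def pick_crawler(url, rules=DOMAIN_CRAWLERS, default=DEFAULT_CRAWLER):
    """Return the crawling method label configured for the URL's domain."""
    host = (urlsplit(url).hostname or "").lower()
    # Check longer domains first so a more specific rule wins.
    for domain in sorted(rules, key=len, reverse=True):
        # Match the domain itself or any of its subdomains.
        if host == domain or host.endswith("." + domain):
            return rules[domain]
    return default


if __name__ == "__main__":
    for url in ("https://open.spotify.com/track/1", "https://unknown.test/"):
        print(url, "->", pick_crawler(url))
```

Matching on the domain suffix means one rule covers every subdomain, and anything without a rule falls back to the cheapest method.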

Maintaining this ecosystem solo has been a lot of work, and things do occasionally break. Still, I’m excited to share this with the community and hear your thoughts!