I was tired of tech news sites that track everything you click, load slowly, and hide half their content behind paywalls. So I built my own.
The result is PulseTech.news — a lightning-fast, automated tech news aggregator that updates every hour, covers 16 categories, and is fully GDPR/CCPA compliant with zero creepy tracking.
Here's exactly how I built it, the architectural decisions I made, and what I learned along the way.
The Stack (And Why I Chose It)
Before I get into the architecture, here's the full stack:
- PHP 8.x — custom lightweight framework, no Laravel, no Symfony
- MySQL 8.x — via PDO with prepared statements throughout
- Tailwind CSS — standalone binary, no Node build pipeline
- SimplePie — for RSS/Atom feed parsing
- Composer — for dependency management (vlucas/phpdotenv, simplepie)
The biggest decision here was rejecting heavy frameworks. I didn't need the overhead of Laravel for what is essentially a read-heavy content site. A custom lightweight PHP framework gave me sub-100ms page loads and complete control over every byte hitting the wire.
Architecture: The Repository Pattern
The core of PulseTech.news is built around the Repository Pattern. All data access is abstracted away from the page controllers and centralised in repository classes.
// Clean controller code — no SQL in sight
$articles = $articleRepo->getLatest($limit, $offset, $filters);
The main repositories are:
- ArticleRepository — handles all article retrieval, including language filtering (English by default, Spanish available)
- FeedRepository — manages feed sources and their language settings
Each repository receives a PDO instance via constructor injection, keeping the database logic contained and testable.
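To make the pattern concrete, here's a minimal sketch of what such a repository can look like. The class and method names match the post, but the column names, table schema, and filter handling are assumptions for illustration, not the actual source:

```php
<?php
// Hypothetical sketch of a repository with constructor-injected PDO.
// Internals (columns, filters) are illustrative, not the real codebase.
class ArticleRepository
{
    public function __construct(private PDO $pdo) {}

    /** Fetch the latest articles, optionally filtered by language. */
    public function getLatest(int $limit, int $offset, array $filters = []): array
    {
        $sql = 'SELECT * FROM articles';
        $params = [];
        if (!empty($filters['language'])) {
            $sql .= ' WHERE language = :language';
            $params[':language'] = $filters['language'];
        }
        $sql .= ' ORDER BY published_at DESC LIMIT :limit OFFSET :offset';

        $stmt = $this->pdo->prepare($sql);
        foreach ($params as $key => $value) {
            $stmt->bindValue($key, $value);
        }
        // LIMIT/OFFSET must be bound as integers, not quoted strings.
        $stmt->bindValue(':limit', $limit, PDO::PARAM_INT);
        $stmt->bindValue(':offset', $offset, PDO::PARAM_INT);
        $stmt->execute();
        return $stmt->fetchAll(PDO::FETCH_ASSOC);
    }
}
```

Because the PDO instance is injected rather than created internally, a test can hand the repository an in-memory SQLite connection instead of the production MySQL one.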
For database access itself, I used a Singleton pattern:
$pdo = Database::getInstance()->getConnection();
One connection, one point of access, consistent throughout the application.
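A minimal sketch of that Singleton, assuming the shape implied by the call above. The DSN here is a placeholder; in the real app it would presumably be built from environment config rather than hard-coded:

```php
<?php
// Illustrative Singleton sketch. The class name matches the post;
// the internals are assumed. Placeholder DSN used here so the sketch
// runs anywhere; the real app would build a MySQL DSN from Config::Get().
class Database
{
    private static ?Database $instance = null;
    private PDO $connection;

    private function __construct()
    {
        $this->connection = new PDO(
            'sqlite::memory:', // placeholder for the real MySQL DSN
            null,
            null,
            [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]
        );
    }

    public static function getInstance(): Database
    {
        // Lazily create the single instance on first access.
        return self::$instance ??= new Database();
    }

    public function getConnection(): PDO
    {
        return $this->connection;
    }
}
```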
The Scraper Engine
This was the most interesting part to build. The scraper (classes/Scraper.php) runs as a headless background process on an hourly cron cycle. Here's what it does:
1. Feed Management
RSS and Atom feeds from the world's top tech sources are managed via an admin panel. Adding a new source is a one-click operation.
2. Intelligent Categorisation
Rather than relying on the source's own tags (which are inconsistent), I built a weighted keyword detection system. Each article's title and description are scored against keyword sets for each category:
- AI
- Cybersecurity
- Apple / iOS / iPadOS / iPhone / Mac
- Android / Samsung
- Linux
- Windows
- Gaming
- Robots
- Google / Tesla
The weighting system ensures "AI" news stays in AI, "Cybersecurity" stays in security, and articles don't bleed into the wrong categories. This took the most iteration to get right.
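In simplified form, the weighted scoring can be sketched like this. The keyword sets and weights below are invented for illustration; the real sets are larger and were tuned over many iterations:

```php
<?php
// Simplified sketch of weighted keyword categorisation.
// Keywords and weights are illustrative, not the production sets.
function scoreArticle(string $title, string $description): string
{
    $keywords = [
        'AI'            => ['ai' => 3, 'llm' => 3, 'machine learning' => 2, 'openai' => 2],
        'Cybersecurity' => ['breach' => 3, 'ransomware' => 3, 'vulnerability' => 2],
        'Linux'         => ['linux' => 3, 'kernel' => 2, 'ubuntu' => 2],
    ];

    // Title matches count double, since headlines are the strongest signal.
    $sources = [[strtolower($title), 2], [strtolower($description), 1]];

    $scores = [];
    foreach ($keywords as $category => $terms) {
        $scores[$category] = 0;
        foreach ($sources as [$text, $multiplier]) {
            foreach ($terms as $term => $weight) {
                // Word-boundary match so "ai" doesn't fire on "email".
                if (preg_match('/\b' . preg_quote($term, '/') . '\b/', $text)) {
                    $scores[$category] += $weight * $multiplier;
                }
            }
        }
    }

    // Highest score wins; fall back to a general bucket on no match.
    arsort($scores);
    $best = array_key_first($scores);
    return $scores[$best] > 0 ? $best : 'General';
}
```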
3. Deduplication
Articles are deduplicated on source URL before insertion. No duplicate stories, even when multiple feeds cover the same news.
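The check-before-insert logic can be sketched as below. Column names are assumed, and the real scraper may additionally rely on a UNIQUE index on the URL column as a safety net:

```php
<?php
// Sketch of URL-based deduplication before insert.
// Table and column names are assumptions for illustration.
function insertIfNew(PDO $pdo, string $url, string $title): bool
{
    // Has this URL been seen before?
    $check = $pdo->prepare('SELECT 1 FROM articles WHERE url = :url');
    $check->execute([':url' => $url]);
    if ($check->fetchColumn() !== false) {
        return false; // duplicate: skip insertion
    }

    $insert = $pdo->prepare('INSERT INTO articles (url, title) VALUES (:url, :title)');
    $insert->execute([':url' => $url, ':title' => $title]);
    return true;
}
```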
The Shift from Vibe Coding to Agentic Engineering
I want to be honest about the build process here, because I think it matters.
The first version of PulseTech.news was largely vibe coded — prompting AI for code, tweaking until it worked, posting screenshots. The UI looked great. But the underlying system was fragile.
The real work came when I shifted to agentic engineering: designing structured workflows, context documents, validation loops, and a full architecture overview (PROJECT_ARCHITECTURE.md) that AI agents could operate within without breaking the build.
The difference was enormous. Instead of getting code that looked right, I got code that behaved correctly within the system. The Repository Pattern, the static helpers, the testing standards — all of it was designed so that an AI agent could contribute to the codebase following the same rules as a human developer.
Static Helper Classes
Rather than polluting controllers with raw superglobal access, I built a set of static helper classes:
Session::Get('user_id'); // Clean session access
Input::Get('page'); // Sanitised GET/POST input
Config::Get('DB_HOST'); // Environment variable access
UIHelper::ArticleCard($data); // Reusable UI components
Theme::isDark(); // Dark/light mode state
These keep the controllers clean and make the codebase easy for AI agents (and human developers) to navigate consistently.
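As one example, here is a plausible shape for the Input helper shown above. The method name matches the post; the sanitisation details (trimming, tag stripping, POST-over-GET precedence) are assumptions:

```php
<?php
// One possible shape for the Input helper; internals are assumed.
class Input
{
    /** Read a POST/GET value with a fallback default, lightly sanitised. */
    public static function Get(string $key, ?string $default = null): ?string
    {
        $value = $_POST[$key] ?? $_GET[$key] ?? null;
        if ($value === null) {
            return $default;
        }
        // Basic sanitisation: trim whitespace and strip HTML tags.
        return strip_tags(trim((string) $value));
    }
}
```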
SEO & Structured Data
Every listing on PulseTech.news is backed by JSON-LD Schema.org structured data, making the site highly discoverable. The header system manages:
- Page-specific Open Graph and Twitter Card meta tags
- Canonical URLs (auto-calculated pretty URLs)
- JSON-LD Organization and WebSite schemas
- Dynamic $pageTitle, $pageDescription, and $ogImage variables per page
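Emitting one of those schemas is straightforward. The sketch below shows a WebSite block; the function name and field values are placeholders for whatever the header system actually injects:

```php
<?php
// Illustrative sketch of emitting a Schema.org WebSite JSON-LD block.
// Function name and values are placeholders, not the actual codebase.
function websiteJsonLd(string $name, string $url): string
{
    $schema = [
        '@context' => 'https://schema.org',
        '@type'    => 'WebSite',
        'name'     => $name,
        'url'      => $url,
    ];
    // JSON_UNESCAPED_SLASHES keeps the URL readable in page source.
    return '<script type="application/ld+json">'
        . json_encode($schema, JSON_UNESCAPED_SLASHES)
        . '</script>';
}
```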
This was a deliberate investment in long-term organic traffic. SEO is compounding — the work you do today pays off for months.
Privacy First
PulseTech.news implements Google Consent Mode v2 and PII-free click tracking. Here's what that means in practice:
- No personal data is stored on click events
- Full GDPR/CCPA compliance without sacrificing analytics
- Consent banner with genuine reject option (not a dark pattern)
This wasn't just an ethical choice — it's increasingly a legal requirement, and it's a genuine differentiator for privacy-conscious users.
Security Standards
Every database interaction uses PDO prepared statements. No exceptions. All POST forms include CSRF tokens, and admin routes are protected via session-based authorisation checks.
// Always prepared statements — never raw interpolation
$stmt = $pdo->prepare("SELECT * FROM articles WHERE id = :id");
$stmt->execute([':id' => $id]);
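The CSRF side can be sketched in a few lines. The function names here are illustrative, not taken from the actual codebase, but the core ingredients (a random session token and a constant-time comparison) are standard:

```php
<?php
// Hedged sketch of session-based CSRF protection.
// Function names are illustrative, not from the real codebase.
function csrfToken(): string
{
    // Generate one cryptographically random token per session.
    if (empty($_SESSION['csrf_token'])) {
        $_SESSION['csrf_token'] = bin2hex(random_bytes(32));
    }
    return $_SESSION['csrf_token'];
}

function csrfValidate(?string $submitted): bool
{
    // hash_equals prevents timing attacks on the comparison.
    return is_string($submitted)
        && isset($_SESSION['csrf_token'])
        && hash_equals($_SESSION['csrf_token'], $submitted);
}
```

The token goes into a hidden field on every POST form, and the handler rejects the request when csrfValidate() returns false.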
Testing
The project uses PHPUnit for automated testing, located in tests/. Every Repository and Business Logic class has a corresponding test file. The convention is strict: ClassNameTest.php, bootstrapped via tests/bootstrap.php.
./vendor/bin/phpunit
Having a test suite was essential when using AI agents to contribute code — it gave me a fast feedback loop to catch regressions before they hit production.
What's Next
PulseTech.news is live and updating hourly. Here's what's on the roadmap:
- User accounts with a personal 'Read Later' library
- AI-driven personalised feeds — only see the categories and sources you care about
- More sources and languages — currently English and Spanish, expanding soon
Try It
It's completely free. No paywalls. No bloat. 16 categories updated every hour.
I'd love feedback on the speed, the dark mode UI, and any tech sources you think I should add to the scraper. Drop them in the comments below.