DEV Community

Harun Mahmud

I Built a Job Aggregator That Scrapes 10 Sources Every Morning — Here's How It Works

Why I Built It

Job hunting is exhausting — not because jobs don't exist, but because they're scattered everywhere. LinkedIn, Indeed, RemoteOK, We Work Remotely, Remotive… you end up checking six tabs every morning just to see what's new.

I wanted one place that quietly does that work for you. Browse it on the site, or just get a clean email digest every morning with only the roles that match what you're looking for.

That's DailyJobFeed.com — a read-only job aggregator. No job posting, no accounts, no applications. Just a fresh, filtered feed of real jobs updated daily.


How It Works

1. Scraping — every morning at 6 AM

A Python cron job runs daily and pulls listings from 10 sources. The sources fall into three access methods:

  • RSS feeds — We Work Remotely exposes 9 category feeds
  • Public JSON APIs — RemoteOK, Remotive, Arbeitnow, Himalayas, Jobicy
  • Keyed REST APIs — Adzuna (12 countries), USAJobs (US federal), Reed UK, The Muse

Each source is wrapped in its own module. If one fails, it's caught and the rest continue. Nothing blocks the run.
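
The failure isolation described above can be sketched as follows. This is a minimal illustration, not the site's actual code: the `Fetcher` type, the `run_all` name, and the idea that each source module exposes a zero-argument fetch function are all assumptions.

```python
import logging
from typing import Callable, Dict, List

logger = logging.getLogger("scraper")

# Hypothetical interface: each source module exposes a function
# that returns a list of raw job dicts.
Fetcher = Callable[[], List[dict]]


def run_all(sources: Dict[str, Fetcher]) -> Dict[str, List[dict]]:
    """Run every fetcher; a failure in one source never stops the others."""
    results: Dict[str, List[dict]] = {}
    for name, fetch in sources.items():
        try:
            results[name] = fetch()
        except Exception:
            # Log the traceback and carry on with the next source.
            logger.exception("source %s failed; continuing", name)
            results[name] = []
    return results
```

Catching the broad `Exception` is deliberate here: a scraper can fail with network errors, parse errors, or schema changes, and the goal is that none of them aborts the run.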

2. Normalization

Every source speaks a different language. Remotive calls it "Software Development". RemoteOK uses freeform tags. We Work Remotely (WWR) uses RSS feed names. The scraper normalizes all of this into a consistent canonical schema:

{
  "slug":             make_slug(title, company, url),
  "title":            title,
  "company":          company,
  "category":         canonical_category(title, tags),  # Engineering, Design, Marketing…
  "experience_level": parse_experience(title),           # entry, mid, senior, lead
  "remote_type":      remote_type,                       # "remote", "hybrid", or "on-site"
  "job_type":         job_type,                          # "full-time", "part-time", or "contract"
  "country":          ...,
  "salary_min":       ...,
  "salary_max":       ...,
  "source_url":       ...,
  "posted_at":        ...,
}

canonical_category() is a keyword-matching function that maps any free-text label into one of ~20 fixed buckets. Getting this right — especially across sources that use completely different taxonomies — took more iteration than anything else in the project.
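
The schema's other derived field, experience_level, lends itself to the same keyword approach. Here is a rough sketch of what a parse_experience() could look like. The keyword lists, the rule order, and the "mid" default are all assumptions made for illustration; only the function name and the four output levels come from the schema above.

```python
import re

# Hypothetical keyword rules, checked in order; the first match wins.
_LEVEL_PATTERNS = [
    ("lead",   re.compile(r"\b(lead|principal|staff|head of)\b", re.I)),
    ("senior", re.compile(r"\b(senior|sr)\b", re.I)),
    ("entry",  re.compile(r"\b(junior|jr|intern|internship|graduate|entry[- ]level)\b", re.I)),
]


def parse_experience(title: str, default: str = "mid") -> str:
    """Map a job title to entry, mid, senior, or lead."""
    for level, pattern in _LEVEL_PATTERNS:
        if pattern.search(title):
            return level
    # Assumption: titles with no seniority keyword are treated as mid-level.
    return default
```

The word boundaries keep "Internal Tools Engineer" from matching "intern", but a title like "Lead Generation Specialist" would still be misread as a lead role, which is the kind of edge case that makes this work iterative.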

3. Deduplication

Jobs get cross-posted across sources constantly. Deduplication is slug-based: a SHA-256 hash of the source URL is truncated to 12 hex characters and appended to the slug. Twelve hex characters give 16^12 = 2^48 possible values (about 281 trillion), so accidental collisions are extremely unlikely at this scale. The database also has a UNIQUE constraint on source_url as a second safety net.
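
A minimal sketch of a slug function along these lines. The make_slug(title, company, url) signature comes from the schema above; the way the readable part is built (lowercasing, collapsing non-alphanumerics to hyphens) and the "job" fallback are assumptions.

```python
import hashlib
import re


def make_slug(title: str, company: str, url: str) -> str:
    """Readable slug plus a 12-hex-character SHA-256 suffix of the source URL."""
    # Assumed slug style: lowercase, runs of non-alphanumerics become one hyphen.
    base = re.sub(r"[^a-z0-9]+", "-", f"{title} {company}".lower()).strip("-")
    # 12 hex characters = 48 bits of the digest.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]
    # Fallback for titles that contain no ASCII letters or digits.
    return f"{base or 'job'}-{digest}"
```

Because the suffix depends only on the URL, re-scraping the same listing always yields the same slug, while two listings with identical titles at different URLs stay distinct.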

4. Storage

Jobs live in PostgreSQL. Only the last 6 months of listings are kept — a cleanup job runs before each scrape to purge anything older.

Job descriptions are sanitized with bleach before storage: dangerous markup (script and iframe tags, inline event-handler attributes) is stripped and only safe formatting tags (p, ul, li, strong, etc.) are kept. Descriptions are capped at 50 KB.
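
The retention window and the size cap can be sketched with the standard library alone. The bleach step is left out here. Reading "6 months" as 183 days and "50KB" as 50 × 1024 bytes of UTF-8 are both assumptions, as are the function names.

```python
from datetime import datetime, timedelta

MAX_DESCRIPTION_BYTES = 50 * 1024  # assumption: "50KB" means 50 KiB of UTF-8
RETENTION_DAYS = 183               # assumption: "6 months" approximated in days


def retention_cutoff(now: datetime) -> datetime:
    """Listings posted before this moment are eligible for cleanup."""
    return now - timedelta(days=RETENTION_DAYS)


def cap_description(html: str, limit: int = MAX_DESCRIPTION_BYTES) -> str:
    """Truncate to at most `limit` UTF-8 bytes without splitting a character."""
    raw = html.encode("utf-8")
    if len(raw) <= limit:
        return html
    # errors="ignore" drops a multi-byte character cut in half at the boundary.
    return raw[:limit].decode("utf-8", errors="ignore")
```

A byte-level cut can land in the middle of a tag, so the order of capping and sanitizing matters when both are applied to the same description.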

5. The Frontend

The site is built with Nuxt 3 (Vue 3, SSR, TypeScript). Server-side rendering matters here because every job page needs to be crawlable.

The job listing page supports filtering by:

  • Category (Engineering, Design, Marketing, Data, DevOps, etc.)
  • Work mode (Remote, Hybrid, On-site)
  • Experience level (Entry, Mid, Senior, Lead)
  • Country (40+ countries)
  • Job type (Full-time, Part-time, Contract)
  • Date posted (last 24h, 3 days, week, month)
  • Salary range
  • Visa sponsorship

Filters hit a Supabase API endpoint that runs faceted aggregation in a custom PostgreSQL function called via RPC, so the sidebar counts stay accurate for every filter combination without expensive full-table scans.
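
To show what faceted counts mean in practice, here is an in-memory Python illustration of one common convention: each facet is counted with every other active filter applied, but not its own. This is not the site's SQL, and whether the real RPC follows this exact convention is an assumption; the function name and dict-based job records are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def facet_counts(
    jobs: Iterable[dict], filters: Dict[str, str], facets: List[str]
) -> Dict[str, Counter]:
    """Count jobs per facet value under the currently active filters."""
    jobs = list(jobs)
    counts: Dict[str, Counter] = {}
    for facet in facets:
        # Skip the filter on this facet itself, so the unselected
        # sibling options still show how many jobs they would return.
        others = {k: v for k, v in filters.items() if k != facet}
        matching = (j for j in jobs if all(j.get(k) == v for k, v in others.items()))
        counts[facet] = Counter(j[facet] for j in matching if j.get(facet) is not None)
    return counts
```

With category set to "Engineering", the work-mode counts reflect only Engineering jobs, while the category counts still cover every category.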

One small detail: on the default unfiltered view, results are round-robin interleaved across all 10 sources so no single source dominates the top of the feed.
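
Round-robin interleaving is easy to sketch in Python, even though the real feed is presumably assembled elsewhere in the stack. The `interleave` name and the shape of the input (one already-sorted list per source) are assumptions.

```python
from itertools import chain, zip_longest
from typing import Dict, List

_MISSING = object()  # sentinel for sources that have run out of jobs


def interleave(by_source: Dict[str, List[dict]]) -> List[dict]:
    """Take one job from each source in turn until every list is exhausted."""
    rounds = zip_longest(*by_source.values(), fillvalue=_MISSING)
    return [job for job in chain.from_iterable(rounds) if job is not _MISSING]
```

Each round contributes at most one job per source, so a source with 500 listings cannot push a source with 5 off the first page.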

6. The Email Digest

Subscribers enter their email and set preferences — categories, work mode, send time. That's it. No password.

Every morning after the scraper finishes, the email sender:

  1. Queries all active subscribers
  2. Filters the last 24 hours of jobs against each subscriber's preferences
  3. Builds a digest — headline with total match count, 6 preview job cards, a CTA linking to the full filtered feed
  4. Sends via Resend API
  5. Updates last_emailed_at on each subscriber row
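
Steps 2 and 3 above can be sketched as plain functions. The preference keys (`categories`, `remote_types`), the headline wording, and the rule that an empty preference list means "no restriction" are assumptions. Sending through Resend and the database update are left out.

```python
from datetime import datetime, timedelta
from typing import List

PREVIEW_COUNT = 6  # the digest shows six preview cards


def matches(job: dict, prefs: dict) -> bool:
    """Assumption: an empty or missing preference list matches everything."""
    cats = prefs.get("categories") or []
    modes = prefs.get("remote_types") or []
    return (not cats or job["category"] in cats) and (
        not modes or job["remote_type"] in modes
    )


def build_digest(jobs: List[dict], prefs: dict, now: datetime) -> dict:
    """Select the last 24 hours of matching jobs and summarize them."""
    since = now - timedelta(hours=24)
    recent = [j for j in jobs if j["posted_at"] >= since and matches(j, prefs)]
    recent.sort(key=lambda j: j["posted_at"], reverse=True)  # newest first
    return {
        "headline": f"{len(recent)} new jobs match your preferences",
        "total": len(recent),
        "preview": recent[:PREVIEW_COUNT],
    }
```

Keeping the total separate from the preview lets the email show the full match count while rendering only six cards.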

Unsubscribe is a single link click — no confirmation screen, no login. It sets is_active = FALSE. Rows are never deleted.


The Tech Stack

Layer              Technology
Frontend           Nuxt 3 (Vue 3, SSR, TypeScript)
Styling            SCSS + CSS custom properties (no Tailwind)
Database           PostgreSQL
DB hosting         Supabase
Frontend hosting   Vercel
Scraper            Python
Cron runner        Railway
Email              Resend API

The Biggest Technical Challenge

Normalization. Every source has its own category system, experience labels, and remote-type conventions. Building canonical_category() — a reliable function that maps any combination of title, tags, and source category into one of ~20 fixed buckets — required reading through thousands of real job titles and edge cases.

A sample of what the mapping handles:

("data scientist",     "Data"),
("ml engineer",        "Data"),
("platform engineer",  "DevOps"),
("entwickler",         "Engineering"),  # German: developer
("product owner",      "Product"),
("devrel",             "Marketing"),

Without this, filters would be useless — a "show me Engineering jobs" query would miss half the results because some source filed them under "Software Development" or "Tech".
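
For a sense of how rules like the sample above could drive the lookup, here is a minimal first-match sketch. The first six rules are the ones shown above. The broad fallback rules, the "Other" default, and the plain substring matching are assumptions, and the real function handles far more cases.

```python
from typing import Iterable, List, Tuple

# (keyword, bucket) pairs; order matters because the first match wins.
RULES: List[Tuple[str, str]] = [
    ("data scientist",    "Data"),
    ("ml engineer",       "Data"),
    ("platform engineer", "DevOps"),
    ("entwickler",        "Engineering"),  # German: developer
    ("product owner",     "Product"),
    ("devrel",            "Marketing"),
    # Hypothetical broad fallbacks, placed last so specific rules win.
    ("engineer",          "Engineering"),
    ("developer",         "Engineering"),
    ("designer",          "Design"),
]


def canonical_category(
    title: str, tags: Iterable[str] = (), default: str = "Other"
) -> str:
    """Return the bucket of the first keyword found in the title or tags."""
    haystack = " ".join([title, *tags]).lower()
    for keyword, bucket in RULES:
        if keyword in haystack:
            return bucket
    return default
```

Rule order carries the logic: "platform engineer" has to sit above the generic "engineer" rule, or every platform role would land in Engineering instead of DevOps.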


What I Skipped (Intentionally)

  • No Indeed scraping — ToS risk, and the only reliable approach requires a paid proxy that wasn't in scope
  • No user accounts — subscribers only provide email + preferences. No passwords, no sessions, no JWTs
  • No job applications — every listing links directly to the original source. DailyJobFeed is read-only

If you're building something similar or have questions about any part of the architecture, happy to discuss in the comments.

🔗 dailyjobfeed.com
