Ismat-Samadov

Posted on • Originally published at birjob.com

How I Built a Job Aggregator That Scrapes 80+ Sites Daily

Last year, job seekers in Azerbaijan had to check 10+ websites every morning. boss.az, hellojob.az, jobsearch.az, LinkedIn, plus dozens of company career pages. No one aggregated them.

So I built BirJob — a scraper that pulls from 80+ sources into one searchable platform.

Here's how it works under the hood.

The Architecture

GitHub Actions (cron, twice daily)
    ↓
80+ Python scrapers (aiohttp + BeautifulSoup)
    ↓
PostgreSQL on Neon (dedup via md5 hash)
    ↓
Next.js 14 on Vercel (SSR + API routes)
    ↓
Users search / get alerts via Email + Telegram

The Scraper System

Each scraper extends a BaseScraper class:

import aiohttp
import pandas as pd

class BaseScraper:
    async def fetch_url_async(self, url: str, session: aiohttp.ClientSession):
        # aiohttp GET with retry logic and rate limiting;
        # returns an HTML string or a parsed JSON dict
        ...

    def save_to_db(self, df: pd.DataFrame) -> None:
        # pandas DataFrame → PostgreSQL
        # ON CONFLICT (apply_link) DO UPDATE
        # dedup_hash = md5(company + title)
        ...
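Roughly, the upsert boils down to this. The sketch below uses SQLite so it's self-contained (SQLite 3.24+ shares Postgres's `ON CONFLICT ... DO UPDATE` syntax); the helper name and schema are simplified stand-ins, not the production code:

```python
import hashlib
import sqlite3

def save_jobs(rows: list, conn: sqlite3.Connection) -> None:
    """Upsert scraped jobs keyed on apply_link; refresh fields on conflict."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS jobs (
            apply_link TEXT PRIMARY KEY,
            company    TEXT,
            title      TEXT,
            dedup_hash TEXT
        )""")
    for row in rows:
        # dedup_hash = md5(lower(trim(company)) || '::' || lower(trim(title)))
        key = f"{row['company'].strip().lower()}::{row['title'].strip().lower()}"
        dedup = hashlib.md5(key.encode("utf-8")).hexdigest()
        conn.execute("""
            INSERT INTO jobs (apply_link, company, title, dedup_hash)
            VALUES (?, ?, ?, ?)
            ON CONFLICT (apply_link) DO UPDATE SET
                company    = excluded.company,
                title      = excluded.title,
                dedup_hash = excluded.dedup_hash
        """, (row["apply_link"], row["company"], row["title"], dedup))
    conn.commit()
```

Re-scraping the same listing just refreshes the row instead of duplicating it, which is why twice-daily runs don't bloat the table.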

Most sites are simple HTML — BeautifulSoup handles them. A few are SPAs (Next.js, React) that need Playwright. Some expose GraphQL APIs, which are actually easier to scrape than HTML.
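For the simple-HTML case, a scraper is little more than a few CSS selectors. The markup below is made up for illustration — every board has its own structure and class names:

```python
from bs4 import BeautifulSoup

# Hypothetical listing markup; real boards differ per site.
HTML = """
<div class="vacancy"><a href="/jobs/101">Data Analyst</a>
  <span class="company">Acme Bank</span></div>
<div class="vacancy"><a href="/jobs/102">Backend Developer</a>
  <span class="company">PayTech</span></div>
"""

def parse_listings(html: str) -> list:
    """Extract (title, link, company) dicts from one listings page."""
    soup = BeautifulSoup(html, "html.parser")
    jobs = []
    for card in soup.select("div.vacancy"):
        link = card.find("a")
        jobs.append({
            "title": link.get_text(strip=True),
            "apply_link": link["href"],
            "company": card.select_one("span.company").get_text(strip=True),
        })
    return jobs
```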

The hardest ones:

  • Cloudflare-protected sites — GitHub Actions IPs get blocked. Some I had to disable entirely.
  • Sites that change HTML monthly — CSS class hashes change every deploy. Switched to __NEXT_DATA__ JSON extraction for those.
  • Rate limiting — concurrency limited to 2 scrapers at a time for stability.
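The concurrency cap is straightforward with an `asyncio.Semaphore`. A minimal sketch — the `sleep` stands in for real fetch-and-parse work, and the scraper names are placeholders:

```python
import asyncio

async def run_scraper(name: str, sem: asyncio.Semaphore, results: list) -> None:
    async with sem:  # at most 2 scrapers inside this block at once
        await asyncio.sleep(0.01)  # stand-in for fetch + parse + save
        results.append(name)

async def run_all(names: list) -> list:
    sem = asyncio.Semaphore(2)
    results: list = []
    await asyncio.gather(*(run_scraper(n, sem, results) for n in names))
    return results
```

Every scraper is still scheduled up front; the semaphore just ensures only two are doing work at any moment, which keeps memory flat and avoids hammering any single site.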

Deduplication

The same job appears on 3-4 boards with slightly different titles. I compute a dedup hash:

md5(lower(trim(company)) || '::' || lower(trim(title)))

The hash is stored as an indexed column, and the search query uses DISTINCT ON (dedup_hash) so users never see the same job twice.
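The Python equivalent of that SQL expression is a few lines with `hashlib`:

```python
import hashlib

def dedup_hash(company: str, title: str) -> str:
    """md5(lower(trim(company)) || '::' || lower(trim(title)))"""
    key = f"{company.strip().lower()}::{title.strip().lower()}"
    return hashlib.md5(key.encode("utf-8")).hexdigest()
```

Trimming and lowercasing means the same posting survives whitespace and casing differences across boards — `"Acme Bank"` / `" ACME BANK "` hash identically. It won't catch genuinely reworded titles, but it removes the bulk of the duplicates.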

The Numbers

  • 80+ sources scraped daily
  • 10,000+ active jobs
  • 30,000+ candidate profiles (scraped from CV boards)
  • 450+ blog articles
  • ~700 new jobs per weekday

What It Costs

Service          Cost
Vercel Pro       $20/mo
Neon Postgres    $5/mo
Resend (email)   Free tier
GitHub Actions   Free
Cloudflare R2    ~$0.50/mo
Total            ~$25/mo

Revenue

Sponsored job postings — HR departments pay 30-80 AZN ($18-47) to promote their listing. When they pay, we automatically email matching candidates from our database.

First paying customer: an HR manager who posted a Data Analyst role. We emailed 87 matching candidates within minutes.

What I'd Do Differently

  1. Start with fewer sources. I launched with 91 scrapers. Half broke within a month. Should have started with 20 reliable ones.

  2. Dedup earlier. I added deduplication 3 months in. Before that, users saw the same job 4 times.

  3. Don't use Playwright unless you must. It's 10x slower and breaks in CI. Most "dynamic" sites have a JSON endpoint if you look hard enough.

The Stack

  • Frontend: Next.js 14, Tailwind CSS, TypeScript
  • Backend: Next.js API routes, Prisma ORM
  • Database: PostgreSQL on Neon
  • Scrapers: Python, aiohttp, BeautifulSoup, Playwright (few)
  • Hosting: Vercel
  • CI/CD: GitHub Actions
  • Email: Resend
  • Storage: Cloudflare R2 (CVs)
  • Monitoring: Sentry

Check it out: birjob.com


I'm Ismat, a developer in Baku. Happy to answer questions about scraping architecture, running a one-person product, or anything else.
