Ismat-Samadov

Posted on • Originally published at birjob.com

How I Built a Job Aggregator That Scrapes 80+ Sites Daily

Last year, job seekers in Azerbaijan had to check 10+ websites every morning. boss.az, hellojob.az, jobsearch.az, LinkedIn, plus dozens of company career pages. No one aggregated them.

So I built BirJob — a scraper that pulls from 80+ sources into one searchable platform.

Here's how it works under the hood.

The Architecture

GitHub Actions (cron, twice daily)
    ↓
80+ Python scrapers (aiohttp + BeautifulSoup)
    ↓
PostgreSQL on Neon (dedup via md5 hash)
    ↓
Next.js 14 on Vercel (SSR + API routes)
    ↓
Users search / get alerts via Email + Telegram

The Scraper System

Each scraper extends a BaseScraper class:

import aiohttp
import pandas as pd

class BaseScraper:
    async def fetch_url_async(self, url: str, session: aiohttp.ClientSession):
        # aiohttp GET with retry logic and rate limiting;
        # returns an HTML string or a parsed JSON dict
        ...

    def save_to_db(self, df: pd.DataFrame) -> None:
        # pandas DataFrame → PostgreSQL
        # ON CONFLICT (apply_link) DO UPDATE
        # dedup_hash = md5(company + title)
        ...
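Roughly, the upsert boils down to this. The sketch below uses SQLite so it's self-contained (SQLite 3.24+ shares Postgres's `ON CONFLICT ... DO UPDATE` syntax); the helper name and schema are simplified stand-ins, not the production code:

```python
import hashlib
import sqlite3

def save_jobs(rows: list, conn: sqlite3.Connection) -> None:
    """Upsert scraped jobs keyed on apply_link; refresh fields on conflict."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS jobs (
            apply_link TEXT PRIMARY KEY,
            company    TEXT,
            title      TEXT,
            dedup_hash TEXT
        )""")
    for row in rows:
        # dedup_hash = md5(lower(trim(company)) || '::' || lower(trim(title)))
        key = f"{row['company'].strip().lower()}::{row['title'].strip().lower()}"
        dedup = hashlib.md5(key.encode("utf-8")).hexdigest()
        conn.execute("""
            INSERT INTO jobs (apply_link, company, title, dedup_hash)
            VALUES (?, ?, ?, ?)
            ON CONFLICT (apply_link) DO UPDATE SET
                company    = excluded.company,
                title      = excluded.title,
                dedup_hash = excluded.dedup_hash
        """, (row["apply_link"], row["company"], row["title"], dedup))
    conn.commit()
```

Re-scraping the same listing just refreshes the row instead of duplicating it, which is why twice-daily runs don't bloat the table.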

Most sites are simple HTML — BeautifulSoup handles them. A few are SPAs (Next.js, React) that need Playwright. Some expose GraphQL APIs, which are actually easier to scrape than HTML.
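For the simple-HTML case, a scraper is little more than a few CSS selectors. The markup below is made up for illustration — every board has its own structure and class names:

```python
from bs4 import BeautifulSoup

# Hypothetical listing markup; real boards differ per site.
HTML = """
<div class="vacancy"><a href="/jobs/101">Data Analyst</a>
  <span class="company">Acme Bank</span></div>
<div class="vacancy"><a href="/jobs/102">Backend Developer</a>
  <span class="company">PayTech</span></div>
"""

def parse_listings(html: str) -> list:
    """Extract (title, link, company) dicts from one listings page."""
    soup = BeautifulSoup(html, "html.parser")
    jobs = []
    for card in soup.select("div.vacancy"):
        link = card.find("a")
        jobs.append({
            "title": link.get_text(strip=True),
            "apply_link": link["href"],
            "company": card.select_one("span.company").get_text(strip=True),
        })
    return jobs
```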

The hardest ones:

  • Cloudflare-protected sites — GitHub Actions IPs get blocked. Some I had to disable entirely.
  • Sites that change HTML monthly — CSS class hashes change every deploy. Switched to __NEXT_DATA__ JSON extraction for those.
  • Rate limiting — concurrency limited to 2 scrapers at a time for stability.
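The concurrency cap is straightforward with an `asyncio.Semaphore`. A minimal sketch — the `sleep` stands in for real fetch-and-parse work, and the scraper names are placeholders:

```python
import asyncio

async def run_scraper(name: str, sem: asyncio.Semaphore, results: list) -> None:
    async with sem:  # at most 2 scrapers inside this block at once
        await asyncio.sleep(0.01)  # stand-in for fetch + parse + save
        results.append(name)

async def run_all(names: list) -> list:
    sem = asyncio.Semaphore(2)
    results: list = []
    await asyncio.gather(*(run_scraper(n, sem, results) for n in names))
    return results
```

Every scraper is still scheduled up front; the semaphore just ensures only two are doing work at any moment, which keeps memory flat and avoids hammering any single site.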

Deduplication

The same job appears on 3-4 boards with slightly different titles. I compute a dedup hash:

md5(lower(trim(company)) || '::' || lower(trim(title)))

The hash is stored as an indexed column, and the search query uses DISTINCT ON (dedup_hash) so users never see the same job twice.
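The Python equivalent of that SQL expression is a few lines with `hashlib`:

```python
import hashlib

def dedup_hash(company: str, title: str) -> str:
    """md5(lower(trim(company)) || '::' || lower(trim(title)))"""
    key = f"{company.strip().lower()}::{title.strip().lower()}"
    return hashlib.md5(key.encode("utf-8")).hexdigest()
```

Trimming and lowercasing means the same posting survives whitespace and casing differences across boards — `"Acme Bank"` / `" ACME BANK "` hash identically. It won't catch genuinely reworded titles, but it removes the bulk of the duplicates.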

The Numbers

  • 80+ sources scraped daily
  • 10,000+ active jobs
  • 30,000+ candidate profiles (scraped from CV boards)
  • 450+ blog articles
  • ~700 new jobs per weekday

What It Costs

Service          Cost
Vercel Pro       $20/mo
Neon Postgres    $5/mo
Resend (email)   Free tier
GitHub Actions   Free
Cloudflare R2    ~$0.50/mo
Total            ~$25/mo

Revenue

Sponsored job postings — HR departments pay 30-80 AZN ($18-47) to promote their listing. When they pay, we automatically email matching candidates from our database.

First paying customer: an HR manager who posted a Data Analyst role. We emailed 87 matching candidates within minutes.

What I'd Do Differently

  1. Start with fewer sources. I launched with 91 scrapers. Half broke within a month. Should have started with 20 reliable ones.

  2. Dedup earlier. I added deduplication 3 months in. Before that, users saw the same job 4 times.

  3. Don't use Playwright unless you must. It's 10x slower and breaks in CI. Most "dynamic" sites have a JSON endpoint if you look hard enough.

The Stack

  • Frontend: Next.js 14, Tailwind CSS, TypeScript
  • Backend: Next.js API routes, Prisma ORM
  • Database: PostgreSQL on Neon
  • Scrapers: Python, aiohttp, BeautifulSoup, Playwright (few)
  • Hosting: Vercel
  • CI/CD: GitHub Actions
  • Email: Resend
  • Storage: Cloudflare R2 (CVs)
  • Monitoring: Sentry

Check it out: birjob.com


I'm Ismat, a developer in Baku. Happy to answer questions about scraping architecture, running a one-person product, or anything else.
