Last year, job seekers in Azerbaijan had to check 10+ websites every morning. boss.az, hellojob.az, jobsearch.az, LinkedIn, plus dozens of company career pages. No one aggregated them.
So I built BirJob — a scraper that pulls from 80+ sources into one searchable platform.
Here's how it works under the hood.
The Architecture
GitHub Actions (cron, twice daily)
↓
80+ Python scrapers (aiohttp + BeautifulSoup)
↓
PostgreSQL on Neon (dedup via md5 hash)
↓
Next.js 14 on Vercel (SSR + API routes)
↓
Users search / get alerts via Email + Telegram
The Scraper System
Each scraper extends a BaseScraper class:
class BaseScraper:
    async def fetch_url_async(self, url, session):
        # aiohttp with retry logic, rate limiting
        # returns an HTML string or a JSON dict
        ...

    def save_to_db(self, df):
        # pandas DataFrame → PostgreSQL
        # ON CONFLICT (apply_link) DO UPDATE
        # dedup_hash = md5(company + title)
        ...
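The upsert in save_to_db can be sketched with SQLite, whose ON CONFLICT syntax matches PostgreSQL's here (the real system targets Postgres; the table and column set below are illustrative, not the production schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE jobs (
        apply_link TEXT PRIMARY KEY,
        company    TEXT,
        title      TEXT
    )
""")

upsert = """
    INSERT INTO jobs (apply_link, company, title)
    VALUES (?, ?, ?)
    ON CONFLICT (apply_link) DO UPDATE SET
        company = excluded.company,
        title   = excluded.title
"""

# First scrape inserts the row; a later re-scrape of the same
# apply_link with a tweaked title updates it instead of duplicating.
conn.execute(upsert, ("https://example.com/job/1", "Acme", "Engineer"))
conn.execute(upsert, ("https://example.com/job/1", "Acme", "Senior Engineer"))

rows = conn.execute("SELECT title FROM jobs").fetchall()
```

Keying the conflict on apply_link means a scraper can blindly re-insert everything it finds on each run; the database resolves new-vs-updated for free.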
Most sites are simple HTML — BeautifulSoup handles them. A few are SPAs (Next.js, React) that need Playwright. Some expose GraphQL APIs, which are actually easier to scrape than HTML.
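The retry behavior in fetch_url_async can be sketched as a small exponential-backoff wrapper (a minimal version with stdlib asyncio; the function name, retry counts, and delays here are assumptions, not the production code):

```python
import asyncio

async def fetch_with_retry(fetch, url, retries=3, base_delay=1.0):
    """Call an async fetch(url) coroutine, retrying on failure.

    `fetch` stands in for an aiohttp session.get(); the delay
    doubles on each attempt (base_delay, 2x, 4x, ...).
    """
    for attempt in range(retries):
        try:
            return await fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts, surface the error
            await asyncio.sleep(base_delay * 2 ** attempt)

# Demo with a fake fetcher that fails twice, then succeeds.
calls = {"n": 0}

async def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated timeout")
    return "<html>ok</html>"

result = asyncio.run(
    fetch_with_retry(flaky_fetch, "https://example.com", base_delay=0.01)
)
```

Keeping the fetcher injectable like this also makes the retry logic trivially testable without hitting real sites.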
The hardest ones:
- Cloudflare-protected sites — GitHub Actions IPs get blocked; some sites I had to disable entirely.
- Sites that change HTML monthly — hashed CSS class names change on every deploy. I switched to __NEXT_DATA__ JSON extraction for those.
- Rate limiting — concurrency is capped at 2 scrapers at a time for stability.
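The __NEXT_DATA__ trick works because Next.js embeds the page's props as a JSON blob in a script tag, which survives CSS class renames. Extracting it is plain string work (a stdlib-only sketch; the JSON shape varies per site, and the sample HTML below is made up):

```python
import json
import re

def extract_next_data(html: str) -> dict:
    """Pull the __NEXT_DATA__ JSON that Next.js embeds in every SSR page."""
    match = re.search(
        r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
        html,
        re.DOTALL,
    )
    if not match:
        raise ValueError("no __NEXT_DATA__ script tag found")
    return json.loads(match.group(1))

# Simulated page: the hashed class name can change freely,
# but the embedded JSON keeps its structure.
html = '''<div class="css-1x2y3z">...</div>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"jobs": [{"title": "Data Analyst", "company": "Acme"}]}}}
</script>'''

jobs = extract_next_data(html)["props"]["pageProps"]["jobs"]
```

Scraping the JSON instead of the rendered DOM is what makes these scrapers survive redeploys.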
Deduplication
The same job appears on 3-4 boards with slightly different titles. I compute a dedup hash:
md5(lower(trim(company)) || '::' || lower(trim(title)))
It's stored as an indexed column, and the search query uses DISTINCT ON (dedup_hash) so users never see the same job twice.
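The same hash can be computed in Python to match the SQL expression above (Postgres' md5() and Python's hashlib produce identical hex digests for identical bytes; the company/title values below are examples):

```python
import hashlib

def dedup_hash(company: str, title: str) -> str:
    """Python equivalent of md5(lower(trim(company)) || '::' || lower(trim(title)))."""
    key = f"{company.strip().lower()}::{title.strip().lower()}"
    return hashlib.md5(key.encode("utf-8")).hexdigest()

# Slightly different casing/whitespace across job boards → same hash.
a = dedup_hash("  Kapital Bank ", "Data Analyst")
b = dedup_hash("kapital bank", "DATA ANALYST")
```

Normalizing (trim + lowercase) before hashing is what collapses the near-duplicate listings; the '::' separator keeps ("ab", "c") and ("a", "bc") from colliding.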
The Numbers
- 80+ sources scraped daily
- 10,000+ active jobs
- 30,000+ candidate profiles (scraped from CV boards)
- 450+ blog articles
- ~700 new jobs per weekday
What It Costs
| Service | Cost |
|---|---|
| Vercel Pro | $20/mo |
| Neon Postgres | $5/mo |
| Resend (email) | Free tier |
| GitHub Actions | Free |
| Cloudflare R2 | ~$0.50/mo |
| Total | ~$25/mo |
Revenue
Sponsored job postings — HR departments pay 30-80 AZN ($18-47) to promote their listing. When they pay, we automatically email matching candidates from our database.
First paying customer: an HR manager who posted a Data Analyst role. We emailed 87 matching candidates within minutes.
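The candidate matching behind those emails could be as simple as a keyword overlap between the job title and each profile's desired role (a hypothetical sketch — the field names and matching rule are my assumptions, not BirJob's production logic):

```python
def matching_candidates(candidates, job_title):
    """Return candidates whose desired role shares a word with the job title."""
    wanted = set(job_title.lower().split())
    return [
        c for c in candidates
        if wanted & set(c["desired_role"].lower().split())
    ]

# Illustrative profiles; real matching would likely also weigh skills/location.
candidates = [
    {"email": "a@example.com", "desired_role": "Data Analyst"},
    {"email": "b@example.com", "desired_role": "Frontend Developer"},
    {"email": "c@example.com", "desired_role": "Business Analyst"},
]

matches = matching_candidates(candidates, "Data Analyst")
```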
What I'd Do Differently
Start with fewer sources. I launched with 91 scrapers; half broke within a month. I should have started with 20 reliable ones.
Dedup earlier. I added deduplication 3 months in. Before that, users saw the same job 4 times.
Don't use Playwright unless you must. It's 10x slower and breaks in CI. Most "dynamic" sites have a JSON endpoint if you look hard enough.
The Stack
- Frontend: Next.js 14, Tailwind CSS, TypeScript
- Backend: Next.js API routes, Prisma ORM
- Database: PostgreSQL on Neon
- Scrapers: Python, aiohttp, BeautifulSoup, Playwright (a few sites)
- Hosting: Vercel
- CI/CD: GitHub Actions
- Email: Resend
- Storage: Cloudflare R2 (CVs)
- Monitoring: Sentry
Check it out: birjob.com
I'm Ismat, a developer in Baku. Happy to answer questions about scraping architecture, running a one-person product, or anything else.