I needed to scrape pricing data from 5 websites every day. A VPS would cost $5-20/month. GitHub Actions costs $0.
Here's my exact setup — including cron scheduling, data storage, and error alerts.
## Why GitHub Actions for Scraping?
- Free: 2,000 minutes/month on the free tier for private repos (public repos get unlimited minutes on standard runners)
- Scheduled: Cron syntax, runs automatically
- No server: No VPS, no Docker, no deployment
- Built-in storage: Commit results to the repo
- Error alerts: GitHub notifies you if a run fails
## Step 1: The Scraper Script

Create `scraper.py` in your repo:
```python
import httpx
import json
from datetime import datetime, timezone
from pathlib import Path

def scrape_prices():
    targets = [
        {'name': 'Product A', 'url': 'https://api.example.com/product/123'},
        {'name': 'Product B', 'url': 'https://api.example.com/product/456'},
    ]

    results = []
    for target in targets:
        try:
            response = httpx.get(target['url'], timeout=30)
            response.raise_for_status()  # record HTTP errors as errors, not bad JSON
            data = response.json()
            results.append({
                'name': target['name'],
                'price': data.get('price'),
                'timestamp': datetime.now(timezone.utc).isoformat(),
            })
        except Exception as e:
            results.append({
                'name': target['name'],
                'error': str(e),
                'timestamp': datetime.now(timezone.utc).isoformat(),
            })

    # Save to data/YYYY-MM-DD.json, appending to earlier runs from the same day
    Path('data').mkdir(exist_ok=True)
    filename = f"data/{datetime.now(timezone.utc).strftime('%Y-%m-%d')}.json"
    existing = []
    if Path(filename).exists():
        existing = json.loads(Path(filename).read_text())
    existing.extend(results)
    Path(filename).write_text(json.dumps(existing, indent=2))
    print(f"Saved {len(results)} results to {filename}")

if __name__ == '__main__':
    scrape_prices()
```
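Because each run appends to the same daily file, one file holds multiple snapshots. A small helper for reading it back (my own sketch, assuming the record shape produced above):

```python
import json
from pathlib import Path

def latest_prices(day_file):
    """Return the most recent successfully scraped price per product.

    Records are appended in chronological order, so a later entry
    for the same product simply overwrites the earlier one.
    """
    latest = {}
    for record in json.loads(Path(day_file).read_text()):
        if 'price' in record:  # skip records that captured an error
            latest[record['name']] = record['price']
    return latest
```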
## Step 2: GitHub Actions Workflow

Create `.github/workflows/scrape.yml`:
```yaml
name: Daily Scraper

on:
  schedule:
    - cron: '0 8 * * *'   # Every day at 8 AM UTC
    - cron: '0 20 * * *'  # Every day at 8 PM UTC
  workflow_dispatch:      # Manual trigger button

permissions:
  contents: write         # Required so the job can push commits back to the repo

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - name: Install dependencies
        run: pip install httpx
      - name: Run scraper
        run: python scraper.py
      - name: Commit results
        run: |
          git config user.name 'GitHub Actions'
          git config user.email 'actions@github.com'
          git add data/
          git diff --staged --quiet || git commit -m "Data update $(date -u +'%Y-%m-%d %H:%M')"
          git push
```
Key line: `git diff --staged --quiet || git commit` — only commits if there are actual changes.
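The idiom works because `git diff --staged --quiet` exits non-zero when something is staged, so `||` runs the commit only when there are changes. A quick local demo in a throwaway repo (the `/tmp` paths are my own example):

```shell
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.email demo@example.com
git config user.name demo

echo hello > file.txt
git add file.txt
git diff --staged --quiet || echo "stage dirty: would commit"

git commit -qm initial
git diff --staged --quiet && echo "stage clean: would skip commit"
```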
## Step 3: Price Change Alerts

Add a comparison step to detect changes:
```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

def check_price_changes():
    today = datetime.now(timezone.utc).strftime('%Y-%m-%d')
    yesterday = (datetime.now(timezone.utc) - timedelta(days=1)).strftime('%Y-%m-%d')
    today_file = Path(f'data/{today}.json')
    yesterday_file = Path(f'data/{yesterday}.json')
    if not (today_file.exists() and yesterday_file.exists()):
        return

    today_prices = {r['name']: r['price'] for r in json.loads(today_file.read_text()) if 'price' in r}
    yesterday_prices = {r['name']: r['price'] for r in json.loads(yesterday_file.read_text()) if 'price' in r}

    for name, price in today_prices.items():
        old_price = yesterday_prices.get(name)
        # Guard against None prices (failed scrapes) before doing arithmetic
        if old_price and price and price != old_price:
            change = ((price - old_price) / old_price) * 100
            print(f"ALERT: {name} changed {change:+.1f}% (${old_price} → ${price})")
```
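A `print` in the Actions log is easy to miss. One option (my addition, not part of the original setup) is to exit non-zero when a change is detected, so the run is marked failed and GitHub's normal failure notification fires:

```python
import sys

def alert_and_fail(changes):
    """Report each (name, old_price, new_price) change, then fail the process.

    A non-zero exit makes the workflow run fail, which triggers
    GitHub's built-in run-failure notification.
    """
    for name, old, new in changes:
        pct = ((new - old) / old) * 100
        print(f"ALERT: {name} changed {pct:+.1f}% (${old} -> ${new})")
    if changes:
        sys.exit(1)
```

The trade-off: every price change shows up as a red ✗ in the Actions tab, so reserve this for changes you actually want to be interrupted about.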
## Step 4: Add Browser Scraping (Optional)

For JavaScript-heavy sites, add Playwright:
```yaml
      - name: Install Playwright
        run: |
          pip install playwright
          playwright install chromium
      - name: Run browser scraper
        run: python browser_scraper.py
```
```python
from playwright.sync_api import sync_playwright

def scrape_with_browser(url):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until='networkidle')
        price = page.locator('.price').text_content()
        browser.close()
        return price
```
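`text_content()` returns raw text like `'$1,299.99'`, which won't compare numerically against earlier values. A small normalizer (my own helper, assuming US-style price formatting):

```python
import re

def parse_price(text):
    """Extract a float from scraped price text such as '$1,299.99'.

    Assumes US-style formatting: comma thousands separator, dot decimal.
    Returns None when no number is found.
    """
    match = re.search(r'[\d,]+(?:\.\d+)?', text or '')
    if not match:
        return None
    return float(match.group().replace(',', ''))
```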
## Cost Breakdown
| What | GitHub Actions | VPS Alternative |
|---|---|---|
| Monthly cost | $0 | $5-20 |
| Runs per day | 2 (cron) + manual | Unlimited |
| Minutes per run | ~1-2 min | N/A |
| Monthly usage | ~60 min / 2000 free | 720 hrs |
| Setup time | 5 minutes | 30+ minutes |
For most scraping projects, GitHub Actions is the better choice. You only need a VPS when you're scraping >100 pages per run or need sub-hourly scheduling.
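The usage figure is easy to sanity-check (the numbers below are the table's assumptions, not measurements):

```python
runs_per_day = 2        # two cron triggers
minutes_per_run = 2     # worst case of the ~1-2 min range
days_per_month = 30

monthly_minutes = runs_per_day * minutes_per_run * days_per_month
free_tier = 2000

print(monthly_minutes)                       # 120
print(f"{monthly_minutes / free_tier:.0%}")  # 6% of the free tier, worst case
```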
## Common Pitfalls
- Cron timezone: GitHub Actions cron uses UTC. Convert your local time.
- Rate limits: Don't run more than every 15 minutes — GitHub may throttle.
- Secrets: Never hardcode API keys. Use Settings → Secrets → Actions.
- Large files: Don't commit 100MB+ JSON files. Use Git LFS or external storage.
- Failing silently: Add `continue-on-error: false` and check notification settings.
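On the secrets point: the workflow passes the secret to the step as an environment variable, and the script reads it from there. A minimal sketch (the `API_KEY` name is illustrative):

```python
import os

def require_secret(name):
    """Fetch a required secret from the environment, failing loudly if missing.

    The workflow step exposes it like:
        env:
          API_KEY: ${{ secrets.API_KEY }}
    """
    value = os.environ.get(name)
    if not value:
        raise SystemExit(f"{name} is not set - add it under Settings -> Secrets -> Actions")
    return value
```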
More scraping tools and infrastructure: awesome-web-scraping-2026 — 130+ tools organized by category.
Are you running scrapers on GitHub Actions? What's your setup? Comments below 👇