After publishing 600+ articles and building 77 web scrapers, I have a clear picture of which tools and APIs actually get used day after day. Here are the ones I keep coming back to.
1. httpx (Python HTTP Client)
Forget requests. httpx keeps a requests-style API but adds async support and HTTP/2:
import httpx
# Sync
resp = httpx.get("https://api.github.com/repos/encode/httpx")
print(resp.json()["stargazers_count"])
# Async (inside a coroutine, driven by asyncio.run)
import asyncio

async def main():
    async with httpx.AsyncClient() as client:
        resp = await client.get("https://api.github.com/repos/encode/httpx")
        print(resp.json()["stargazers_count"])

asyncio.run(main())
Why I use it: Every scraper I build starts with httpx. Unlike requests, it applies a timeout by default, so one hung server can't stall a whole scraping run.
2. jq (Command-Line JSON Processor)
The single most useful tool for working with API responses:
# Pretty print
curl -s https://api.github.com/users/torvalds | jq .
# Extract specific fields
curl -s https://api.github.com/users/torvalds | jq '{name, followers, repos: .public_repos}'
# Filter arrays
curl -s https://api.github.com/users/torvalds/repos | jq '[.[] | select(.stargazers_count > 1000)] | length'
Why I use it: I pipe every API response through jq first. It saves me from writing Python scripts for simple data exploration.
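For comparison, the last jq filter above translates to a few lines of stdlib Python. The sample data here is made up to mimic the shape of GitHub's /repos response:

```python
import json

# Hypothetical sample mimicking the GitHub /repos response shape
repos_json = json.dumps([
    {"name": "linux", "stargazers_count": 150000},
    {"name": "subsurface", "stargazers_count": 2500},
    {"name": "test-tlb", "stargazers_count": 400},
])

# Equivalent of: jq '[.[] | select(.stargazers_count > 1000)] | length'
popular = [r for r in json.loads(repos_json) if r["stargazers_count"] > 1000]
print(len(popular))  # 2
```

The one-liner in the shell is still faster to type, which is the whole point of jq.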
3. DuckDB (In-Process Analytics Database)
SQLite for analytics. Reads CSV, Parquet, and JSON directly:
-- Query a CSV file without importing
SELECT country, COUNT(*) as users
FROM read_csv_auto('users.csv')
GROUP BY country
ORDER BY users DESC
LIMIT 10;
-- Query JSON API response saved to file
SELECT title, score
FROM read_json_auto('hn_stories.json')
WHERE score > 100
ORDER BY score DESC;
Why I use it: When a scraper outputs 100K rows, I analyze them with DuckDB instead of loading everything into pandas.
4. GitHub Actions (Free CI/CD)
I run 8 scrapers on GitHub Actions for $0/month:
on:
  schedule:
    - cron: "0 */6 * * *"
jobs:
  scrape:
    runs-on: ubuntu-latest
    permissions:
      contents: write  # needed so the GITHUB_TOKEN can push
    steps:
      - uses: actions/checkout@v4
      - run: pip install httpx
      - run: python scraper.py
      - run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add data/
          git diff --cached --quiet || git commit -m "data update"
          git push
Why I use it: 2,000 free minutes per month. No server to maintain.
5. SQLite + FTS5 (Full-Text Search)
Built-in full-text search that handles millions of documents:
import sqlite3

conn = sqlite3.connect("articles.db")
conn.execute("CREATE VIRTUAL TABLE articles USING fts5(title, body)")
conn.execute("INSERT INTO articles VALUES (?, ?)",
             ("How to scrape", "Tutorial about web scraping..."))
conn.commit()
# Search
results = conn.execute(
    "SELECT * FROM articles WHERE articles MATCH 'scraping'"
).fetchall()
Why I use it: For any project that needs search, I start with SQLite FTS5 before considering Elasticsearch.
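FTS5 also ships relevance ranking via bm25() out of the box, which is often the feature people reach for Elasticsearch to get. A quick sketch (in-memory database, made-up documents):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE articles USING fts5(title, body)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [
        ("How to scrape", "Tutorial about web scraping basics"),
        ("Scraping at scale", "Scraping, scraping, scraping: advanced web scraping"),
    ],
)
conn.commit()

# bm25() returns a relevance score; in SQLite's convention, lower = better match
rows = conn.execute(
    "SELECT title, bm25(articles) FROM articles "
    "WHERE articles MATCH 'scraping' ORDER BY bm25(articles)"
).fetchall()
for title, score in rows:
    print(title, score)
```

Note that the default tokenizer does no stemming, so 'scraping' will not match "scrape"; add the porter tokenizer if you need that.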
6. Hacker News Firebase API
Real-time access to every HN story, comment, and user — no API key needed:
# Top stories
curl -s https://hacker-news.firebaseio.com/v0/topstories.json | jq '.[0:5]'
# Get a story
curl -s https://hacker-news.firebaseio.com/v0/item/1.json | jq .
Why I use it: I monitor HN for trending topics in my niche. When a relevant post hits the front page, I comment with useful context.
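The two endpoints compose into a simple poller. A stdlib-only sketch; the 100-point threshold and the helper names are my own choices, not part of the API:

```python
import json
import urllib.request

BASE = "https://hacker-news.firebaseio.com/v0"

def fetch_json(path: str):
    # Fetch and decode one endpoint, e.g. fetch_json("/topstories.json")
    with urllib.request.urlopen(f"{BASE}{path}") as resp:
        return json.load(resp)

def hot_stories(items: list, min_score: int = 100) -> list:
    # Pure filter, so it can be tested without touching the network
    return [i for i in items
            if i.get("type") == "story" and i.get("score", 0) >= min_score]

# Network usage (uncomment to run):
# ids = fetch_json("/topstories.json")[:5]
# stories = [fetch_json(f"/item/{i}.json") for i in ids]
# for s in hot_stories(stories):
#     print(s["score"], s["title"])
```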
7. Telegram Bot API
The simplest notification system for any automated workflow:
import httpx

BOT_TOKEN = "..."  # token from @BotFather
CHAT_ID = "..."    # id of the chat to notify

def notify(message: str):
    httpx.post(
        f"https://api.telegram.org/bot{BOT_TOKEN}/sendMessage",
        json={"chat_id": CHAT_ID, "text": message},
    )
# Use in any scraper
notify("Scraper completed: 1,234 items collected")
Why I use it: Every scraper, every cron job, every GitHub Action sends me a Telegram message on completion or failure.
8. Open-Meteo API
Weather data for any location, no API key:
curl -s "https://api.open-meteo.com/v1/forecast?latitude=55.75&longitude=37.62&current_weather=true" | jq .current_weather
Why I use it: Free, fast, no authentication. Perfect for any project that needs weather data.
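The same call from Python, with parsing split into its own function so the field names (taken from Open-Meteo's documented current_weather shape) are easy to adjust:

```python
import json
import urllib.request

def current_weather(lat: float, lon: float) -> dict:
    url = (
        "https://api.open-meteo.com/v1/forecast"
        f"?latitude={lat}&longitude={lon}&current_weather=true"
    )
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["current_weather"]

def describe(cw: dict) -> str:
    # current_weather carries temperature (°C) and windspeed (km/h)
    return f"{cw['temperature']}°C, wind {cw['windspeed']} km/h"

# describe(current_weather(55.75, 37.62))  # network call; uncomment to run
```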
9. ripgrep (rg)
grep but 10x faster. Essential for searching through scraped data:
# Search through all JSON files
rg "error" data/ --type json
# Count matches
rg -c "404" logs/
# Search with context
rg -C 2 "timeout" scraper_*.py
10. Makefiles
I put a Makefile in every project:
.PHONY: scrape test deploy

scrape:
	python scraper.py

test:
	python -m pytest tests/ -v

deploy:
	git push origin main
Why I use it: make scrape is easier to remember than python -m scrapers.main --config prod.yaml --output data/.
The Pattern
All 10 tools share the same traits:
- Free (open source or generous free tier)
- Single-purpose (do one thing well)
- Composable (work together via stdin/stdout/files)
- No vendor lock-in (can switch anytime)
The best developer tools are boring. They just work.
📧 spinov001@gmail.com — I build custom scrapers and data tools. Tell me what you need.