DEV Community: ZyVOP

Scraping Social Media Data with Python: X (Twitter), Reddit & Instagram (2026)

ZyVOP — Fri, 24 Jul 2026 11:17:10 +0000

Over 500 million posts are published on X (formerly Twitter) every single day. Reddit hosts more than 100,000 active communities discussing everything from machine learning to local weather. Instagram's 2 billion users generate a continuous stream of product opinions, lifestyle signals, and trend indicators.

Social media data is the closest thing the internet has to real-time public opinion. It powers:

Sentiment analysis — What do people think about your product, brand, or competitor right now?
Trend detection — What topics are accelerating before they go mainstream?
Academic research — Studying misinformation, political discourse, crisis communication
Market intelligence — Tracking competitor mentions, industry conversations
AI training datasets — Social text is gold for fine-tuning conversational models

The challenge in 2026: platforms have made scraping harder than ever. The official X API now starts at $100/month and the free tier is almost useless for research. Instagram aggressively blocks headless browsers. Reddit has rate-limited its API severely after the 2023 controversy.

This guide gives you working approaches for all three platforms — from DIY Python scrapers to managed tools — with honest assessments of what each method can and cannot do.

The 2026 Reality Check: What's Still Scrapable

Before writing a single line of code, here's the honest state of social media scraping in 2026:

Platform	Public Posts	Profile Data	Followers	DMs	Difficulty
X / Twitter	✅ (with effort)	✅	✅	❌	Hard
Reddit	✅ (via API/PRAW)	✅	❌	❌	Easy–Medium
Instagram	⚠️ Public only	⚠️ Public only	❌	❌	Very Hard
TikTok	✅ (via mobile API)	✅	❌	❌	Hard
LinkedIn	⚠️ (see Blog 02)	⚠️ (see Blog 02)	❌	❌	Hard

Key principles:

Never scrape private data, DMs, or anything behind a login you don't own
Always respect robots.txt and Terms of Service for your use case
For academic or commercial research, official APIs or licensed providers are the safer path
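
As a minimal sketch of the robots.txt principle above, Python's standard library can answer "may this user agent fetch this URL?" before a scraper sends anything. The robots.txt body, the `ResearchBot` user agent, and the example.com URLs are hypothetical and inlined so the example runs offline. Note that robots.txt does not replace reading the platform's Terms of Service.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt body lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, inlined so nothing is fetched over the network.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Allow: /
"""

for candidate in (
    "https://example.com/public/page",
    "https://example.com/private/data",
):
    print(candidate, is_allowed(SAMPLE_ROBOTS, "ResearchBot", candidate))
```

Against a live site you would call `parser.set_url(".../robots.txt")` followed by `parser.read()` instead of `parse()`, and cache the result rather than re-fetching it on every request.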

Part 1: Scraping X (Twitter) in 2026

The Landscape

X's public guest API was effectively removed in 2023. Almost every endpoint now requires a logged-in session and a CSRF token. The internal GraphQL API that underpins X's own web interface is the main DIY scraping target — but its endpoint identifiers change regularly.

Three working approaches exist in 2026:

Method	Cost	Volume	Reliability	Best for
X API v2 (official)	$100+/month	Limited	High	Authorised developers
Playwright + session	Free	Low-medium	Medium	Research, personal use
ScrapeGraphAI	Paid per req	High	High	Production

Method 1: X API v2 (Official)

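For authorised projects, the official API is the safest choice. As noted above, the free tier is too limited for research, so confirm on the X developer portal which tier covers the recent-search endpoint and what its current monthly quota is before you budget. The example below uses Tweepy: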

pip install tweepy

import tweepy
import pandas as pd
from datetime import datetime, timezone, timedelta

# Get credentials at developer.twitter.com
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

def search_recent_tweets(
    query: str,
    max_results: int = 100,
    days_back: int = 7
) -> pd.DataFrame:
    """
    Search tweets from the last `days_back` days (at most 7) using X API v2.

    Args:
        query: Search query. Supports operators like:
               - AND/OR: "python scraping OR web crawling"
               - Exact: '"machine learning" tutorial'
               - Exclude: "python -is:retweet"
               - Language: "python lang:en"
               - Has media: "python has:images"
    """
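    # Recent search only reaches back 7 days. Clamp the window and leave a
    # one-minute margin so the timestamp is still valid when the request lands.
    days_back = min(days_back, 7)
    start_time = datetime.now(timezone.utc) - timedelta(days=days_back) + timedelta(minutes=1)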

    # Append filters to reduce noise
    full_query = f"{query} -is:retweet lang:en"

    tweets_data = []
    paginator = tweepy.Paginator(
        client.search_recent_tweets,
        query=full_query,
        tweet_fields=[
            "created_at", "public_metrics", "author_id",
            "lang", "context_annotations", "entities"
        ],
        user_fields=["name", "username", "public_metrics", "verified"],
        expansions=["author_id"],
        start_time=start_time,
        max_results=max(10, min(100, max_results)),  # recent search accepts 10 to 100 per page
    )

    users_map = {}

    for response in paginator:
        if not response.data:
            break

        # Build user lookup from includes
        if response.includes and "users" in response.includes:
            for user in response.includes["users"]:
                users_map[user.id] = user

        for tweet in response.data:
            user = users_map.get(tweet.author_id)
            metrics = tweet.public_metrics or {}

            tweets_data.append({
                "tweet_id":       str(tweet.id),
                "text":           tweet.text,
                "created_at":     tweet.created_at,
                "author_id":      str(tweet.author_id),
                "username":       user.username if user else None,
                "name":           user.name if user else None,
                "followers":      user.public_metrics.get("followers_count") if user else None,
                "retweets":       metrics.get("retweet_count", 0),
                "likes":          metrics.get("like_count", 0),
                "replies":        metrics.get("reply_count", 0),
                "quotes":         metrics.get("quote_count", 0),
                "impressions":    metrics.get("impression_count", 0),
                "url":            f"https://x.com/{user.username if user else 'i'}/status/{tweet.id}",
            })

        if len(tweets_data) >= max_results:
            break

    # The last page can overshoot the requested total, so trim before building the frame
    df = pd.DataFrame(tweets_data[:max_results])
    print(f"Collected {len(df)} tweets for query: '{query}'")
    return df

# Example: track brand mentions
df = search_recent_tweets(
    query="python web scraping 2026",
    max_results=200,
    days_back=7
)
df.to_csv("tweets.csv", index=False)
print(df[["username", "text", "likes", "retweets"]].head(10))

Method 2: Playwright-Based X Scraper (No API Key)

For personal research without API access, Playwright can scrape public search results and profiles using a saved session — exactly as described for LinkedIn in Blog 02:

import asyncio
import random
import json
import pandas as pd  # needed for the DataFrame export at the end
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async

async def save_x_session():
    """Log in to X manually once and save cookies. Run this first."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120"
        )
        page = await context.new_page()
        await page.goto("https://x.com/login")

        print("Log in manually, then press Enter...")
        input()

        cookies = await context.cookies()
        with open("x_cookies.json", "w") as f:
            json.dump(cookies, f)
        print(f"Saved {len(cookies)} cookies.")
        await browser.close()

async def scrape_x_search(query: str, scroll_times: int = 10) -> list[dict]:
    """
    Scrape X search results using a saved session.
    Returns a list of tweet dicts.
    """
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            args=["--disable-blink-features=AutomationControlled"]
        )
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120",
            viewport={"width": 1280, "height": 900}
        )

        # Load saved session
        with open("x_cookies.json") as f:
            await context.add_cookies(json.load(f))

        page = await context.new_page()
        await stealth_async(page)

        # Navigate to search
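        # Percent-encode the whole query (spaces, '#', '&', quotes), not just spaces
        from urllib.parse import quote
        encoded = quote(query, safe="")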
        await page.goto(
            f"https://x.com/search?q={encoded}&src=typed_query&f=top",
            wait_until="domcontentloaded"
        )
        await asyncio.sleep(random.uniform(2, 4))

        tweets = []
        seen_ids = set()

        for scroll in range(scroll_times):
            # Extract tweet data from the current viewport
            tweet_articles = await page.query_selector_all("article[data-testid='tweet']")

            for article in tweet_articles:
                try:
                    # Get tweet text
                    text_el = await article.query_selector("[data-testid='tweetText']")
                    text = await text_el.inner_text() if text_el else None

                    # Username
                    user_el = await article.query_selector("[data-testid='User-Name'] a")
                    href = await user_el.get_attribute("href") if user_el else ""
                    username = href.strip("/").split("/")[-1] if href else None

                    # Display name
                    name_els = await article.query_selector_all(
                        "[data-testid='User-Name'] span"
                    )
                    display_name = None
                    for el in name_els:
                        txt = await el.inner_text()
                        if txt and not txt.startswith("@"):
                            display_name = txt
                            break

                    # Engagement stats
                    async def get_stat(testid):
                        el = await article.query_selector(f"[data-testid='{testid}']")
                        if el:
                            txt = await el.inner_text()
                            return txt.strip() or "0"
                        return "0"

                    likes    = await get_stat("like")
                    replies  = await get_stat("reply")
                    retweets = await get_stat("retweet")

                    # Time
                    time_el = await article.query_selector("time")
                    posted_at = await time_el.get_attribute("datetime") if time_el else None

                    # Unique ID to deduplicate
                    tweet_id = f"{username}_{posted_at}"
                    if tweet_id in seen_ids or not text:
                        continue
                    seen_ids.add(tweet_id)

                    tweets.append({
                        "username":     username,
                        "display_name": display_name,
                        "text":         text,
                        "likes":        likes,
                        "replies":      replies,
                        "retweets":     retweets,
                        "posted_at":    posted_at,
                    })
                except Exception:
                    continue

            # Scroll down for more tweets
            await page.evaluate("window.scrollBy(0, window.innerHeight * 1.5)")
            await asyncio.sleep(random.uniform(1.5, 3.0))
            print(f"  Scroll {scroll+1}/{scroll_times} — {len(tweets)} tweets collected")

        await browser.close()

    print(f"\nTotal unique tweets: {len(tweets)}")
    return tweets

# Run
tweets = asyncio.run(scrape_x_search("python scraping 2026", scroll_times=8))
df = pd.DataFrame(tweets)
df.to_csv("x_search_results.csv", index=False)
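
The Playwright scraper above stores likes, replies, and retweets as the display strings shown on the page, which makes them awkward to sort or aggregate. The helper below is a minimal sketch for converting them to integers. It assumes the counts look like `"1.2K"`, `"3,456"`, or `"12.5M"`; the `parse_count` name and the set of suffixes are illustrative, and anything it cannot parse becomes 0.

```python
import re
from typing import Optional

# Multipliers for the abbreviated suffixes this sketch recognises.
_SUFFIXES = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_count(text: Optional[str]) -> int:
    """Convert a display count such as '1.2K', '3,456' or '' to an int."""
    if not text:
        return 0
    cleaned = text.strip().replace(",", "").upper()
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KMB]?)", cleaned)
    if not match:
        return 0  # unknown format: treat as zero rather than guess
    number, suffix = match.groups()
    return int(round(float(number) * _SUFFIXES.get(suffix, 1)))


for raw in ("1.2K", "3,456", "12.5M", ""):
    print(repr(raw), parse_count(raw))
```

With a DataFrame in hand, something like `df["likes"] = df["likes"].map(parse_count)` applies it column by column. Abbreviated counts are rounded by the site, so the result is an approximation of the true figure.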

Part 2: Reddit Scraping with PRAW

Reddit is the most developer-friendly major social platform for data collection. Its official Python wrapper PRAW (Python Reddit API Wrapper) is free, well-documented, and gives access to posts, comments, user profiles, and subreddit data at no cost.

pip install praw pandas

Scraping subreddit posts

import praw
import pandas as pd
from datetime import datetime, timezone

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="PythonResearchBot/1.0 by u/your_username",
)

def scrape_subreddit(
    subreddit_name: str,
    sort: str = "hot",         # "hot", "new", "top", "rising"
    time_filter: str = "week", # "hour", "day", "week", "month", "year", "all"
    limit: int = 500
) -> pd.DataFrame:
    """
    Scrape posts from a subreddit.

    Best for: topic research, trend monitoring, content analysis.
    """
    sub = reddit.subreddit(subreddit_name)

    # Choose sort method
    if sort == "hot":
        posts_gen = sub.hot(limit=limit)
    elif sort == "new":
        posts_gen = sub.new(limit=limit)
    elif sort == "top":
        posts_gen = sub.top(time_filter=time_filter, limit=limit)
    elif sort == "rising":
        posts_gen = sub.rising(limit=limit)
    else:
        posts_gen = sub.hot(limit=limit)

    records = []
    for post in posts_gen:
        records.append({
            "post_id":       post.id,
            "title":         post.title,
            "text":          post.selftext[:500] if post.selftext else None,
            "author":        str(post.author) if post.author else "[deleted]",
            "score":         post.score,
            "upvote_ratio":  post.upvote_ratio,
            "num_comments":  post.num_comments,
            "url":           post.url,
            "permalink":     f"https://reddit.com{post.permalink}",
            "flair":         post.link_flair_text,
            "is_self":       post.is_self,
            "created_utc":   datetime.fromtimestamp(post.created_utc, tz=timezone.utc).isoformat(),
            "subreddit":     subreddit_name,
        })

    df = pd.DataFrame(records)
    print(f"Scraped {len(df)} posts from r/{subreddit_name}")
    return df

# Example: research Python discussions
df = scrape_subreddit("learnpython", sort="top", time_filter="month", limit=500)
df.to_csv("reddit_learnpython.csv", index=False)
print(df[["title", "score", "num_comments"]].head(10))

Scraping post comments (deep dive)

def scrape_post_comments(
    post_url: str,
    max_comments: int = 200,
    include_replies: bool = False
) -> pd.DataFrame:
    """
    Scrape comments from a Reddit post (top-level only unless include_replies=True).
    Useful for sentiment analysis, topic deep-dives, building datasets.
    """
    submission = reddit.submission(url=post_url)

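    # limit=0 drops the "load more comments" stubs without fetching them (fast).
    # Use limit=None to expand every stub and get the full thread (slow on big posts).
    submission.comments.replace_more(limit=0)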

    all_comments = submission.comments.list() if include_replies else submission.comments

    records = []
    for comment in list(all_comments)[:max_comments]:
        if not hasattr(comment, "body"):
            continue
        records.append({
            "comment_id":    comment.id,
            "author":        str(comment.author) if comment.author else "[deleted]",
            "body":          comment.body,
            "score":         comment.score,
            "depth":         comment.depth,
            "created_utc":   datetime.fromtimestamp(
                                 comment.created_utc, tz=timezone.utc
                             ).isoformat(),
            "is_op":         comment.is_submitter,
            "awards":        comment.total_awards_received,
        })

    return pd.DataFrame(records)

# Example: deep dive on a specific post
comments_df = scrape_post_comments(
    "https://www.reddit.com/r/MachineLearning/comments/example/",
    max_comments=300
)
print(f"Top comment: {comments_df.sort_values('score', ascending=False).iloc[0]['body'][:200]}")

Multi-subreddit keyword monitoring

def monitor_keyword_across_subreddits(
    keyword: str,
    subreddits: list[str],
    limit_per_sub: int = 100
) -> pd.DataFrame:
    """
    Search for a keyword across multiple subreddits.
    Great for competitive intelligence and brand monitoring.
    """
    all_posts = []

    for sub_name in subreddits:
        print(f"Searching r/{sub_name} for '{keyword}'...")
        try:
            sub = reddit.subreddit(sub_name)
            for post in sub.search(keyword, sort="new", limit=limit_per_sub):
                all_posts.append({
                    "subreddit":    sub_name,
                    "title":        post.title,
                    "text":         post.selftext[:300] if post.selftext else None,
                    "author":       str(post.author) if post.author else "[deleted]",
                    "score":        post.score,
                    "comments":     post.num_comments,
                    "permalink":    f"https://reddit.com{post.permalink}",
                    "created_utc":  datetime.fromtimestamp(
                                        post.created_utc, tz=timezone.utc
                                    ).isoformat(),
                })
        except Exception as e:
            print(f"  Error on r/{sub_name}: {e}")

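    df = pd.DataFrame(all_posts)
    if df.empty:
        # groupby("subreddit") would raise KeyError on a frame with no columns
        print(f"\nNo results for '{keyword}'.")
        return df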

    # Summary by subreddit
    summary = df.groupby("subreddit").agg(
        posts=("title", "count"),
        avg_score=("score", "mean"),
        total_comments=("comments", "sum")
    ).sort_values("posts", ascending=False)

    print(f"\n── Results for '{keyword}' ──")
    print(summary.to_string())

    return df

# Track "Python scraping" across tech subreddits
df = monitor_keyword_across_subreddits(
    keyword="python scraping",
    subreddits=["Python", "learnpython", "webdev", "datascience", "MachineLearning"],
    limit_per_sub=50
)
df.to_csv("reddit_keyword_monitor.csv", index=False)

Part 3: Instagram Scraping (Public Profiles Only)

Instagram is the most aggressively protected major platform in 2026. The public guest API was removed years ago. Almost all data requires a logged-in session. The GraphQL API underlying Instagram's web interface changes identifiers constantly.

This section covers only public profile and hashtag data — the bare minimum available without login — and the Playwright approach for logged-in access to your own account's data.

Scraping public profile metadata

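import httpx
import json
import asyncio
import random            # used for the randomised delay between profiles
import pandas as pd      # used by batch_profile_scrape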

async def get_instagram_profile(username: str) -> dict | None:
    """
    Fetch public profile data for an Instagram username.
    Uses the semi-public shared data endpoint.
    Only works for public accounts.
    """
    url = f"https://www.instagram.com/{username}/?__a=1&__d=dis"

    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 Chrome/120 Safari/537.36"
        ),
        "Accept": "*/*",
        "Referer": "https://www.instagram.com/",
        "X-IG-App-ID": "936619743392459",  # Instagram Web app ID
    }

    async with httpx.AsyncClient(headers=headers, follow_redirects=True) as client:
        try:
            r = await client.get(url, timeout=15)
            if r.status_code == 200:
                data = r.json()
                user = data.get("graphql", {}).get("user", {})
                return {
                    "username":       user.get("username"),
                    "full_name":      user.get("full_name"),
                    "biography":      user.get("biography"),
                    "followers":      user.get("edge_followed_by", {}).get("count"),
                    "following":      user.get("edge_follow", {}).get("count"),
                    "posts":          user.get("edge_owner_to_timeline_media", {}).get("count"),
                    "is_verified":    user.get("is_verified"),
                    "is_business":    user.get("is_business_account"),
                    "profile_url":    f"https://www.instagram.com/{username}/",
                }
        except Exception as e:
            print(f"Error fetching @{username}: {e}")
    return None

# Batch scrape public profiles
async def batch_profile_scrape(usernames: list[str]) -> pd.DataFrame:
    results = []
    for username in usernames:
        print(f"Fetching @{username}...")
        profile = await get_instagram_profile(username)
        if profile:
            results.append(profile)
        await asyncio.sleep(random.uniform(2, 5))
    return pd.DataFrame(results)

profiles_df = asyncio.run(batch_profile_scrape([
    "natgeo", "nasa", "python.learning"
]))
print(profiles_df[["username", "followers", "posts", "is_verified"]])

Playwright-based Instagram scraper (logged-in session)

For your own account's data or influencer research on public accounts:

import asyncio, json, random
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async

async def save_instagram_session():
    """Log in manually and save session cookies. Run once."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) "
                       "AppleWebKit/605.1.15 Mobile/15E148 Safari/604.1",
            viewport={"width": 390, "height": 844},
            is_mobile=True,
        )
        page = await context.new_page()
        await page.goto("https://www.instagram.com/accounts/login/")
        print("Log in manually, then press Enter...")
        input()
        cookies = await context.cookies()
        with open("ig_cookies.json", "w") as f:
            json.dump(cookies, f)
        print("Session saved.")
        await browser.close()

async def scrape_instagram_posts(username: str, max_posts: int = 30) -> list[dict]:
    """
    Scrape recent posts from a public Instagram profile.
    Requires a saved session from save_instagram_session().
    """
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            args=["--no-sandbox", "--disable-blink-features=AutomationControlled"]
        )
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120",
            viewport={"width": 1280, "height": 900}
        )
        with open("ig_cookies.json") as f:
            await context.add_cookies(json.load(f))

        page = await context.new_page()
        await stealth_async(page)

        await page.goto(
            f"https://www.instagram.com/{username}/",
            wait_until="domcontentloaded"
        )
        await asyncio.sleep(random.uniform(2, 4))

        posts = []
        seen = set()  # hrefs already collected, for constant-time de-duplication
        # Scroll to load posts (the loop assumes about 12 new posts per scroll)
        for _ in range(max_posts // 12 + 1):
            post_links = await page.query_selector_all("article a[href*='/p/']")
            for link in post_links:
                href = await link.get_attribute("href")
                if href and href not in seen:
                    seen.add(href)
                    posts.append({"href": href, "username": username})
            if len(posts) >= max_posts:
                break
            await page.evaluate("window.scrollBy(0, window.innerHeight)")
            await asyncio.sleep(random.uniform(1, 2.5))

        # Limit to the requested number of posts
        posts = posts[:max_posts]

        await browser.close()

    print(f"Found {len(posts)} posts for @{username}")
    return posts
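
The function above returns only post links, so a natural next step is to turn each href into a stable identifier. The sketch below pulls the shortcode out of a `/p/<shortcode>/` path. The `extract_shortcode` name is illustrative, and also accepting `/reel/` paths is an assumption that goes beyond the selector used above.

```python
from typing import Optional
from urllib.parse import urlparse


def extract_shortcode(href: str) -> Optional[str]:
    """Return the post shortcode from an Instagram post href, or None."""
    # Works for relative hrefs ("/p/abc/") and absolute URLs alike.
    parts = [segment for segment in urlparse(href).path.split("/") if segment]
    for marker in ("p", "reel"):  # "reel" is an assumed extra path type
        if marker in parts:
            idx = parts.index(marker)
            if idx + 1 < len(parts):
                return parts[idx + 1]
    return None


for sample in ("/p/Cxyz123/", "/natgeo/p/Cxyz123/", "/natgeo/"):
    print(sample, extract_shortcode(sample))
```

Keying a dataset on the shortcode rather than the full href avoids duplicates when the same post appears under both `/p/...` and `/<username>/p/...` paths.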

Part 4: Sentiment Analysis on Collected Data

Once you've collected social media data, sentiment analysis turns raw text into actionable signals. The transformers library from Hugging Face provides pre-trained sentiment models, including ones trained on tweets:

pip install transformers torch

from transformers import pipeline
import pandas as pd

# Load a pre-trained sentiment model (a RoBERTa-base checkpoint, roughly 500 MB on first download)
# "cardiffnlp/twitter-roberta-base-sentiment-latest" is trained on tweets
sentiment_pipeline = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
    tokenizer="cardiffnlp/twitter-roberta-base-sentiment-latest",
    max_length=512,
    truncation=True,
)

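# The "-latest" checkpoint already returns "negative"/"neutral"/"positive";
# this map is a fallback for older checkpoints that emit LABEL_0..LABEL_2.
LABEL_MAP = {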
    "LABEL_0": "negative",
    "LABEL_1": "neutral",
    "LABEL_2": "positive",
}

def analyse_sentiment_batch(texts: list[str], batch_size: int = 32) -> list[dict]:
    """
    Run sentiment analysis on a list of social media texts.
    Returns list of {label, score} dicts.
    """
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        # Clean texts — remove URLs and excessive whitespace
        cleaned = [
            " ".join(word for word in t.split() if not word.startswith("http"))[:512]
            for t in batch
        ]
        preds = sentiment_pipeline(cleaned)
        for pred in preds:
            results.append({
                "sentiment": LABEL_MAP.get(pred["label"], pred["label"]),
                "confidence": round(pred["score"], 3),
            })
        print(f"  Processed {min(i + batch_size, len(texts))}/{len(texts)}")

    return results

def sentiment_report(df: pd.DataFrame, text_col: str = "text") -> pd.DataFrame:
    """
    Enrich a DataFrame with sentiment scores and produce a summary report.
    """
    texts = df[text_col].fillna("").tolist()
    print(f"Analysing sentiment for {len(texts)} posts...")
    sentiments = analyse_sentiment_batch(texts)

    df["sentiment"]   = [s["sentiment"]  for s in sentiments]
    df["confidence"]  = [s["confidence"] for s in sentiments]

    # Summary
    summary = df["sentiment"].value_counts(normalize=True).mul(100).round(1)
    print("\n── Sentiment Distribution ──")
    for label, pct in summary.items():
        bar = "█" * int(pct / 2)
        print(f"  {label:10s}: {bar} {pct}%")

    return df

# Apply to Reddit data
reddit_df = pd.read_csv("reddit_learnpython.csv")
reddit_df = sentiment_report(reddit_df, text_col="title")
reddit_df.to_csv("reddit_with_sentiment.csv", index=False)

# Apply to X data
tweets_df = pd.read_csv("tweets.csv")
tweets_df = sentiment_report(tweets_df, text_col="text")
tweets_df.to_csv("tweets_with_sentiment.csv", index=False)

Part 5: Building a Brand Monitoring Dashboard

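The pipeline below combines the Reddit collector and the sentiment step into one brand report. The X section of the report is left empty; fill it from search_recent_tweets() if you have API access: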

import asyncio
import pandas as pd
from datetime import datetime, timezone

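async def run_brand_monitor(brand_name: str, competitor: str | None = None) -> dict:
    """
    Collect brand mentions from Reddit, analyse sentiment,
    and produce a brand health report.

    This version only fills the Reddit and sentiment sections;
    `competitor` and report["twitter"] are placeholders for extension.
    """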
    report = {
        "brand":       brand_name,
        "run_at":      datetime.now(timezone.utc).isoformat(),
        "reddit":      {},
        "twitter":     {},
        "sentiment":   {},
    }

    # ── Reddit ────────────────────────────────────────────────
    print(f"\nScraping Reddit for '{brand_name}'...")
    reddit_df = monitor_keyword_across_subreddits(
        keyword=brand_name,
        subreddits=["technology", "Python", "webdev", "datascience", "programming"],
        limit_per_sub=50
    )
    if reddit_df.empty:
        # Nothing to score; return the skeleton rather than fail on .iloc[0] below
        print("No Reddit mentions found.")
        return report
    reddit_df = sentiment_report(reddit_df, text_col="title")

    report["reddit"] = {
        "total_posts":     len(reddit_df),
        "avg_score":       round(reddit_df["score"].mean(), 1),
        "total_comments":  int(reddit_df["comments"].sum()),
        "top_post":        reddit_df.sort_values("score", ascending=False).iloc[0]["title"],
        "top_subreddit":   reddit_df["subreddit"].value_counts().index[0],
    }

    # ── Sentiment ─────────────────────────────────────────────
    all_sentiments = pd.concat([
        reddit_df[["sentiment", "confidence"]],
    ])
    sentiment_counts = all_sentiments["sentiment"].value_counts(normalize=True).mul(100)
    report["sentiment"] = {
        "positive_pct": round(sentiment_counts.get("positive", 0), 1),
        "neutral_pct":  round(sentiment_counts.get("neutral", 0), 1),
        "negative_pct": round(sentiment_counts.get("negative", 0), 1),
        "overall":      sentiment_counts.idxmax(),
    }

    # ── Print Report ──────────────────────────────────────────
    print(f"\n{'═'*50}")
    print(f"  BRAND MONITOR: {brand_name.upper()}")
    print(f"  Run at: {report['run_at']}")
    print(f"{'═'*50}")
    print(f"\n  Reddit mentions:   {report['reddit']['total_posts']}")
    print(f"  Avg post score:    {report['reddit']['avg_score']}")
    print(f"  Most active sub:   r/{report['reddit']['top_subreddit']}")
    print(f"\n  Sentiment:")
    print(f"    ✅ Positive:  {report['sentiment']['positive_pct']}%")
    print(f"    😐 Neutral:   {report['sentiment']['neutral_pct']}%")
    print(f"    ❌ Negative:  {report['sentiment']['negative_pct']}%")
    print(f"    Overall:      {report['sentiment']['overall'].upper()}")
    print(f"\n  Top post: \"{report['reddit']['top_post'][:80]}...\"")

    return report

# Run
report = asyncio.run(run_brand_monitor("python scraping"))

Rate Limits and Platform Rules Reference

Platform	Free/plan limit	Rate limit reset	Key restriction
X API v2 (Free)	Very small read quota (check the developer portal)	15 min windows	Too limited for research
X API v2 (Basic, $100/mo)	Monthly read cap (check the developer portal)	15 min windows	Recent search covers 7 days
Reddit PRAW	~100 req/min per OAuth client (1,000 per 10 min)	Rolling	Public posts only
Instagram (no auth)	~200 req/hour	Hourly	Public profiles only
Instagram (with session)	200–500 actions/day	Daily	Avoid aggressive scraping
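
To stay under limits like those in the table, a scraper can pace itself on the client side instead of waiting for the platform to block it. Below is a minimal sliding-window limiter sketch. The `SlidingWindowLimiter` class is hypothetical, and the 100-calls-per-60-seconds setting is only an illustration, not an official quota. The clock and sleep functions are injectable so the logic can be tested without real waiting.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_calls within any span of window_seconds."""

    def __init__(self, max_calls, window_seconds,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.window = window_seconds
        self._clock = clock
        self._sleep = sleep
        self._calls = deque()  # timestamps of calls still inside the window

    def _drop_expired(self, now):
        while self._calls and now - self._calls[0] >= self.window:
            self._calls.popleft()

    def wait_time(self) -> float:
        """Seconds to wait before the next call is allowed (0.0 if none)."""
        now = self._clock()
        self._drop_expired(now)
        if len(self._calls) < self.max_calls:
            return 0.0
        return self.window - (now - self._calls[0])

    def acquire(self) -> None:
        """Block until a call is allowed, then record it."""
        delay = self.wait_time()
        if delay > 0:
            self._sleep(delay)
        now = self._clock()
        self._drop_expired(now)
        self._calls.append(now)


# Illustrative setting: 100 calls per 60 seconds. Three calls never block.
limiter = SlidingWindowLimiter(max_calls=100, window_seconds=60)
for _ in range(3):
    limiter.acquire()  # call this right before each HTTP request
print("next call must wait:", limiter.wait_time(), "seconds")
```

Client-side pacing complements, rather than replaces, honouring the server's own signals such as HTTP 429 responses and rate-limit headers.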

FAQ

Q: snscrape is broken in 2026 — what should I use instead? snscrape's X scraper broke in 2023 when X removed the guest API and has not been reliably fixed. Use Tweepy with X API v2 for authorised projects. For DIY scraping, use the Playwright approach above.

Q: Can I scrape private Instagram posts? No — and you shouldn't. Scraping private account data without consent violates Instagram's Terms of Service, GDPR, and potentially criminal hacking laws in many jurisdictions. Only collect publicly available data.

Q: What's the best model for social media sentiment analysis in 2026? cardiffnlp/twitter-roberta-base-sentiment-latest is trained specifically on tweets and outperforms generic models on that kind of text. For Reddit, which is longer-form, distilbert-base-uncased-finetuned-sst-2-english works well. For multilingual content, use nlptown/bert-base-multilingual-uncased-sentiment.
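
The three models named above do not share a label scheme: the Cardiff model reports negative/neutral/positive (or LABEL_0 to LABEL_2 on older checkpoints), the SST-2 model reports only POSITIVE or NEGATIVE, and the nlptown model reports star ratings such as "4 stars". If you mix them in one pipeline, a small normaliser keeps the output column consistent. This is a sketch; the `normalise_label` name and the choice to treat 3 stars as neutral are assumptions.

```python
# Older Cardiff checkpoints emit generic label ids instead of names.
_GENERIC = {"label_0": "negative", "label_1": "neutral", "label_2": "positive"}


def normalise_label(raw_label: str) -> str:
    """Map a model-specific label to 'negative', 'neutral' or 'positive'."""
    label = raw_label.strip().lower()
    if label in ("negative", "neutral", "positive"):
        return label
    if label in _GENERIC:
        return _GENERIC[label]
    if label.endswith(("star", "stars")):
        stars = int(label.split()[0])
        # Assumed cut-offs: 1-2 negative, 3 neutral, 4-5 positive.
        if stars <= 2:
            return "negative"
        return "neutral" if stars == 3 else "positive"
    raise ValueError(f"Unrecognised sentiment label: {raw_label!r}")


for raw in ("POSITIVE", "LABEL_1", "1 star", "4 stars"):
    print(raw, "->", normalise_label(raw))
```

Because the SST-2 model has no neutral class, its share of neutral posts will always be zero; keep that in mind when comparing distributions across platforms.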

Q: Reddit changed its API — is PRAW still free? Yes. PRAW is only a client library; the quota comes from Reddit's free API tier, which allows roughly 100 queries per minute per OAuth client (about 1,000 per 10 minutes). The 2023 changes mainly affected third-party apps making very high request volumes. Standard research use via PRAW remains free.

Q: How do I handle deleted posts in my Reddit dataset? Posts with [deleted] author or [removed] body are common. Filter them before analysis: df = df[df['author'] != '[deleted]'] and df = df[df['text'] != '[removed]'].
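
If you prefer to filter before building the DataFrame, the same rule can be applied to the raw records. The sketch below assumes each record is a dict with `author` and `text` keys, as produced by the Reddit functions earlier; the `drop_deleted` helper is illustrative. Records whose text is None (link posts with no body) are kept on purpose.

```python
PLACEHOLDERS = {"[deleted]", "[removed]"}


def drop_deleted(records, author_key="author", text_key="text"):
    """Return only the records whose author and text are not placeholders."""
    kept = []
    for record in records:
        if record.get(author_key) in PLACEHOLDERS:
            continue
        text = (record.get(text_key) or "").strip()
        if text in PLACEHOLDERS:
            continue
        kept.append(record)
    return kept


sample = [
    {"author": "alice", "text": "How do I parse JSON?"},
    {"author": "[deleted]", "text": "Some old question"},
    {"author": "bob", "text": "[removed]"},
    {"author": "carol", "text": None},  # link post without a body
]
print([r["author"] for r in drop_deleted(sample)])
```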

Summary

Platform	Best Method	Key Library	What You Can Get
X / Twitter	Tweepy (API v2)	tweepy	Tweets, metrics, author data
X / Twitter	Playwright + session	playwright	Search results, profiles
Reddit	PRAW	praw	Posts, comments, subreddits
Instagram	httpx (no login)	httpx	Public profiles, post counts
All platforms	HuggingFace	transformers	Sentiment, emotion, topics

Originally published on ZyVOP


Sign in With Google in Node.js — Without Passport

ZyVOP — Tue, 21 Jul 2026 06:14:19 +0000

Most "Sign in with Google" tutorials hand you Passport.js and a dozen middleware functions, then stop before explaining what any of them do. When the OAuth dance fails in production — wrong redirect URI, missing refresh token, expired state — you're left debugging a black box.

This post builds the whole flow from scratch: PKCE code generation, the authorization redirect, the callback handler that validates everything before touching the database, token exchange, and refresh token handling. No Passport, no auth library. Just four endpoints and the Google API.

This is the third post in ZyVOP's auth series. If you haven't read the 2FA implementation or the magic link post yet, the session module here is the same one used in those: a JWT in an httpOnly cookie. The new parts are everything that happens before you issue that cookie.

Source: https://github.com/zyvop27-cmyk/zyvop-blogs/tree/master/oauth-google

Setup

git clone https://github.com/zyvop27-cmyk/zyvop-blogs.git
cd zyvop-blogs/oauth-google
npm install
cp .env.example .env

Then in Google Cloud Console:

APIs & Services → Credentials → Create Credentials → OAuth 2.0 Client ID
Application type: Web application
Authorized Redirect URIs: http://localhost:3000/auth/google/callback
Copy the Client ID and Secret into .env

Generate a JWT_SECRET:

node -e "console.log(require('crypto').randomBytes(32).toString('hex'))"

npm start
# open http://localhost:3000

Why PKCE, and why does it matter for server-side apps

PKCE (Proof Key for Code Exchange) was designed for mobile and single-page apps that can't keep a client secret truly secret. For server-side apps you already have a secret, so adding PKCE is belt-and-suspenders. Google now requires it for all new OAuth clients regardless of app type, so it's no longer optional — it's just how OAuth with Google works in 2026.

The mechanism: before sending the user to Google, you generate a random code_verifier, hash it to produce a code_challenge, and send the challenge to Google.

When the user comes back with an authorization code, you exchange the code by also sending the original verifier. Google hashes it and checks it matches the challenge it stored. Anyone who intercepts the authorization code without the original verifier can't use it.

// src/lib/pkce.js
import { randomBytes, createHash } from "node:crypto";

export function generateVerifier() {
  // 32 bytes → 43-char URL-safe base64, no padding (RFC 7636)
  return randomBytes(32).toString("base64url");
}

export function deriveChallenge(verifier) {
  return createHash("sha256").update(verifier).digest("base64url");
}

export function verifyChallenge(verifier, expectedChallenge) {
  if (typeof verifier !== "string" || verifier.length < 43) return false;
  return deriveChallenge(verifier) === expectedChallenge;
}

base64url is the base64 alphabet with + → -, / → _, and all = padding stripped. Node has supported it as a built-in Buffer encoding since v15.7 (backported to v14.18), so no library is needed.
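As a quick sanity check, the challenge derivation can be compared against the worked example in RFC 7636, Appendix B. The sketch below is standalone: it inlines deriveChallenge and uses require instead of the module's import so it runs as a plain script, and it assumes a Node version with base64url support.

```javascript
const { createHash, randomBytes } = require("node:crypto");

// Same derivation as src/lib/pkce.js, inlined so the script runs by itself.
function deriveChallenge(verifier) {
  return createHash("sha256").update(verifier).digest("base64url");
}

// Verifier/challenge pair from the worked example in RFC 7636, Appendix B.
const rfcVerifier = "dBjftJeZ4CVP-mB92K27uhbUJU1p1r_wW1gFWFOEjXk";
const rfcChallenge = "E9Melhoa2OwvFrEMTJguCHaoeK1t8URWbuGJSstw-cM";
const matchesRfc = deriveChallenge(rfcVerifier) === rfcChallenge;

// A freshly generated verifier: 32 random bytes encode to 43 URL-safe characters.
const freshVerifier = randomBytes(32).toString("base64url");
const looksValid = /^[A-Za-z0-9_-]{43}$/.test(freshVerifier);

console.log({ matchesRfc, looksValid });
```

If matchesRfc ever comes back false, the usual suspects are hashing the wrong string or using standard base64 instead of base64url.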

CSRF protection: the state parameter

The state parameter is how you prevent cross-site request forgery on the callback. You generate a random value, send it to Google, and Google sends it back in the callback. You verify it matches what you sent before touching anything else.

The state also doubles as the lookup key for the PKCE verifier. When the user lands on the callback, you need the verifier to complete the token exchange, but you can't put the verifier itself in the URL, since that exposes it to server logs and referrer headers.

You could store it in a signed httpOnly cookie, but that means setting an additional cookie before the redirect and cleaning it up on callback. The state parameter is a neater single-point solution: one random value carries both the CSRF token and the verifier retrieval key.

// src/lib/state.js
import { randomBytes } from "node:crypto";

const _store = new Map();
const STATE_TTL_MS = 10 * 60 * 1000;

export function createState(verifier) {
  const state = randomBytes(24).toString("hex");
  _store.set(state, { verifier, expiresAt: Date.now() + STATE_TTL_MS });
  return state;
}

export function consumeState(state) {
  if (typeof state !== "string" || !state) return null;

  const entry = _store.get(state);
  _store.delete(state); // delete before returning — no replay window

  if (!entry) return null;
  if (Date.now() > entry.expiresAt) return null;
  return entry.verifier;
}

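consumeState deletes the entry before checking it, not after. With the synchronous in-memory Map nothing can run between the get and the delete, so either order would be safe here. The habit pays off once the store is asynchronous (Redis, for example), where a separate check and delete would let two simultaneous requests with the same state both pass the check before either deletes. Reading and deleting in one step means only one request gets the verifier back; the other gets null.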

The in-memory Map works fine for a single server process. For multiple instances behind a load balancer, move it to Redis with a 10-minute TTL.
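A Redis-backed version of the same two functions might look roughly like this. It is a sketch, not the repo's code: the client is assumed to expose set(key, value, { EX }) and getDel(key) the way node-redis v4 does, GETDEL needs Redis 6.2 or newer, and the oauth:state: key prefix and the fakeClient stand-in are made up for illustration.

```javascript
const { randomBytes } = require("node:crypto");

const STATE_TTL_SECONDS = 10 * 60;

// `client` is assumed to offer set(key, value, { EX }) and getDel(key).
async function createState(client, verifier) {
  const state = randomBytes(24).toString("hex");
  await client.set(`oauth:state:${state}`, verifier, { EX: STATE_TTL_SECONDS });
  return state;
}

async function consumeState(client, state) {
  if (typeof state !== "string" || !state) return null;
  // GETDEL reads and deletes atomically, so a replayed state gets null.
  // Expiry is handled by the key's TTL, so there is no timestamp to check.
  return (await client.getDel(`oauth:state:${state}`)) ?? null;
}

// In-memory stand-in for local experiments (it ignores the TTL option).
function fakeClient() {
  const data = new Map();
  return {
    async set(key, value) { data.set(key, value); },
    async getDel(key) {
      const value = data.get(key) ?? null;
      data.delete(key);
      return value;
    },
  };
}
```

Both functions become async, so the callback handler would need to await consumeState before checking the result.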

Building the authorization URL

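// src/lib/google.js
import { generateVerifier, deriveChallenge } from "./pkce.js";
import { createState } from "./state.js";

export function buildAuthUrl() {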
  const verifier = generateVerifier();
  const challenge = deriveChallenge(verifier);
  const state = createState(verifier);

  const params = new URLSearchParams({
    client_id: process.env.GOOGLE_CLIENT_ID,
    redirect_uri: process.env.GOOGLE_REDIRECT_URI,
    response_type: "code",
    scope: "openid email profile",
    state,
    code_challenge: challenge,
    code_challenge_method: "S256",
    access_type: "offline",   // ask for a refresh token
    prompt: "consent",        // force consent screen so refresh token is always issued
  });

  return { url: `https://accounts.google.com/o/oauth2/v2/auth?${params}` };
}

access_type: "offline" tells Google you want a refresh token. Without it, you only get an access token that expires in an hour, and the user has to sign in again. prompt: "consent" forces the consent screen on every sign-in, which is the only reliable way to get a refresh token every time — Google silently skips it on returning users unless you force it.

The callback: three validations before anything else

When Google redirects the user back, the callback handler does three things before touching the database:

// src/routes/auth.js
router.get("/google/callback", async (req, res) => {
  const { code, state, error } = req.query;

  if (error) return res.redirect("/?error=access_denied");
  if (!code || !state) return res.redirect("/?error=invalid_callback");

  const verifier = consumeState(state);  // validates + consumes in one step
  if (!verifier) return res.redirect("/?error=invalid_state");

  try {
    const { accessToken, refreshToken } = await exchangeCode(code, verifier);
    const userInfo = await fetchUserInfo(accessToken);
    const user = upsertUser(db, userInfo, refreshToken);

    issueSession(res, { sub: user.google_sub, email: user.email,
                        name: user.name, picture: user.picture });
    res.redirect("/dashboard");
  } catch (err) {
    console.error("[oauth] callback failed:", err.message);
    res.redirect("/?error=auth_failed");
  }
});

Check error first — this is what Google sends when the user clicks "Cancel". Then check code and state are present.

consumeState in one call validates the state is real, not expired, and not already used, and returns the PKCE verifier. Only if all three pass do you make any network requests.

Token exchange and userinfo

The token exchange sends the authorization code plus the PKCE verifier to Google's token endpoint:

export async function exchangeCode(code, verifier, fetchFn = fetch) {
  const res = await fetchFn("https://oauth2.googleapis.com/token", {
    method: "POST",
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
    body: new URLSearchParams({
      code,
      client_id: process.env.GOOGLE_CLIENT_ID,
      client_secret: process.env.GOOGLE_CLIENT_SECRET,
      redirect_uri: process.env.GOOGLE_REDIRECT_URI,
      grant_type: "authorization_code",
      code_verifier: verifier,
    }),
  });

  if (!res.ok) {
    const err = await res.text();
    throw new Error(`Token exchange failed (${res.status}): ${err}`);
  }
  const data = await res.json();
  return {
    accessToken: data.access_token,
    refreshToken: data.refresh_token ?? null,
    idToken: data.id_token,
    expiresIn: data.expires_in,
  };
}

Google's response includes an id_token, a JWT containing the user's identity claims. You could decode and verify it locally using Google's published JSON Web Keys (JWKs), which avoids a second network request on every sign-in.

This post uses the /userinfo endpoint instead, because it always returns current data and sidesteps implementing RS256 signature verification. For high-throughput production code, verify the id_token locally.
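If you do go the local route, signature verification is only half the job: the claims need checking too. The sketch below covers just that second half, under stated assumptions. decodeJwtPayload and checkIdTokenClaims are hypothetical helper names, and decodeJwtPayload does not verify the RS256 signature, so it must never be trusted on its own.

```javascript
// Decode the payload segment of a JWT. This does NOT verify the signature.
function decodeJwtPayload(token) {
  const parts = String(token).split(".");
  if (parts.length !== 3) throw new Error("Not a JWT");
  return JSON.parse(Buffer.from(parts[1], "base64url").toString("utf8"));
}

// Issuer values Google documents for its ID tokens.
const GOOGLE_ISSUERS = ["https://accounts.google.com", "accounts.google.com"];

// Claim checks to run after the signature has been verified elsewhere.
function checkIdTokenClaims(claims, clientId, nowSeconds = Math.floor(Date.now() / 1000)) {
  if (!GOOGLE_ISSUERS.includes(claims.iss)) throw new Error("Unexpected issuer");
  if (claims.aud !== clientId) throw new Error("Token was issued for another client");
  if (typeof claims.exp !== "number" || claims.exp <= nowSeconds) {
    throw new Error("Token has expired");
  }
  return claims;
}
```

In a real handler the clientId argument would be process.env.GOOGLE_CLIENT_ID, and the signature check would come from a maintained JWT library rather than hand-rolled crypto.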

The userinfo call is straightforward:

export async function fetchUserInfo(accessToken, fetchFn = fetch) {
  const res = await fetchFn("https://www.googleapis.com/oauth2/v3/userinfo", {
    headers: { Authorization: `Bearer ${accessToken}` },
  });
  if (!res.ok) throw new Error(`Userinfo fetch failed (${res.status})`);

  const data = await res.json();
  if (!data.email_verified) throw new Error("Google account email is not verified");

  return { sub: data.sub, email: data.email, name: data.name ?? null,
           picture: data.picture ?? null, emailVerified: data.email_verified };
}

The email_verified check matters. Google allows accounts with unverified email addresses. Silently accepting one would let someone claim ownership of an email they don't control.

Storing users and handling refresh tokens

Users are keyed by Google's sub (subject), not email. Email addresses can change; sub is permanent for a given Google account. The upsert uses COALESCE to preserve the existing refresh token when the new one is null:

db.prepare(`
  INSERT INTO users (google_sub, email, name, picture, refresh_token)
  VALUES (@sub, @email, @name, @picture, @refreshToken)
  ON CONFLICT (google_sub) DO UPDATE SET
    email         = excluded.email,
    name          = excluded.name,
    picture       = excluded.picture,
    refresh_token = COALESCE(excluded.refresh_token, users.refresh_token),
    last_login    = strftime('%Y-%m-%dT%H:%M:%SZ', 'now')
  RETURNING *
`).get({ sub, email, name, picture, refreshToken: refreshToken ?? null });

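By default, Google only sends a refresh token the first time a user grants consent (or again after they revoke access). Forcing prompt: "consent" is meant to get one on every sign-in, but exchangeCode still treats refresh_token as optional, so the upsert has to cope with null. If you overwrote the stored refresh token with null, you would lose the ability to refresh the user's access token without asking them to sign in again. COALESCE keeps the old one when the new one is absent.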

When the access token expires and you need a new one:

export async function refreshAccessToken(refreshToken, fetchFn = fetch) {
  const res = await fetchFn("https://oauth2.googleapis.com/token", {
    method: "POST",
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
    body: new URLSearchParams({
      client_id: process.env.GOOGLE_CLIENT_ID,
      client_secret: process.env.GOOGLE_CLIENT_SECRET,
      refresh_token: refreshToken,
      grant_type: "refresh_token",
    }),
  });
  if (!res.ok) {
    const err = await res.text();
    throw new Error(`Token refresh failed (${res.status}): ${err}`);
  }
  const data = await res.json();
  return { accessToken: data.access_token, expiresIn: data.expires_in };
}

A 400 response from this endpoint usually means the refresh token was revoked — the user removed your app from their Google account. Handle it by clearing the session and prompting a fresh sign-in.
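One way to separate "the token is dead" from "Google had a bad moment" is a small wrapper like the one below. It is a sketch: refreshOrReauthenticate is a hypothetical name, refreshFn stands in for refreshAccessToken, and the check relies on the error message format used above plus Google's invalid_grant error code.

```javascript
// Hypothetical wrapper around refreshAccessToken (passed in as refreshFn).
async function refreshOrReauthenticate(refreshToken, refreshFn) {
  try {
    const { accessToken, expiresIn } = await refreshFn(refreshToken);
    return { status: "ok", accessToken, expiresIn };
  } catch (err) {
    // refreshAccessToken puts the HTTP status and response body in the message,
    // e.g. 'Token refresh failed (400): {"error":"invalid_grant"}'.
    const message = String(err && err.message);
    if (message.includes("(400)") && message.includes("invalid_grant")) {
      return { status: "reauthenticate" };
    }
    throw err; // network errors and 5xx responses are not a reason to sign the user out
  }
}
```

On "reauthenticate", the route would delete the stored refresh token, clear the session cookie, and send the user back to the sign-in page. Anything else is rethrown so a transient outage does not log people out.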

The sign-in page

public/index.html is a single static page with a "Continue with Google" button that links to /auth/google. The interesting part is the error handling — when the OAuth flow fails for any reason, the server redirects back to /?error=<reason>. The page reads that query param and shows a human-readable message:

const ERRORS = {
  access_denied:    "You cancelled the sign-in. Try again when you're ready.",
  invalid_state:    "Something went wrong with the sign-in flow. Please try again.",
  invalid_callback: "The callback from Google was malformed. Please try again.",
  auth_failed:      "Sign-in failed. Please try again or contact support.",
  config:           "The server is misconfigured. Contact the site owner.",
  default:          "Something went wrong. Please try again."
};

const params = new URLSearchParams(location.search);
const errKey = params.get("error");
if (errKey) {
  const box = document.getElementById("error-box");
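  // Own keys only: ?error=constructor must not resolve to an inherited Object property.
  box.textContent = Object.hasOwn(ERRORS, errKey) ? ERRORS[errKey] : ERRORS.default;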
  box.classList.add("visible");
}

Each error key maps to a specific failure point in the callback handler — access_denied when the user clicks Cancel, invalid_state when the state check fails, auth_failed when the token exchange or userinfo call throws. This matters: a generic "something went wrong" on a sign-in page sends users nowhere. Mapping errors to actionable messages costs five lines.

Testing without a Google account

Every external call takes an injectable fetchFn parameter. Tests pass a mock instead:

function mockFetch(status, body) {
  return async () => ({
    ok: status >= 200 && status < 300,
    status,
    json: async () => body,
    text: async () => JSON.stringify(body),
  });
}

test("exchangeCode throws on non-200 response", async () => {
  await assert.rejects(
    () => exchangeCode("code", "verifier", mockFetch(400, { error: "invalid_grant" })),
    /Token exchange failed/
  );
});

test("fetchUserInfo throws if email is not verified", async () => {
  await assert.rejects(
    () => fetchUserInfo("token", mockFetch(200, {
      sub: "123", email: "unverified@example.com", email_verified: false
    })),
    /email is not verified/
  );
});

The state store test exercises expiry by monkey-patching Date.now — no timers, no waiting:

test("expired state entries are not returned", () => {
  const realNow = Date.now;
  const state = createState("verifier");

  Date.now = () => realNow() + 11 * 60 * 1000; // 11 minutes later
  try {
    assert.equal(consumeState(state), null);
  } finally {
    Date.now = realNow;
  }
});

npm test
# tests 31 · pass 31 · fail 0

Trying it with curl

Once the server is running with real Google credentials:

Start the sign-in flow:

curl -I http://localhost:3000/auth/google

HTTP/1.1 302 Found
Location: https://accounts.google.com/o/oauth2/v2/auth?client_id=...&state=...&code_challenge=...

The OAuth flow goes through a browser — there's no way to automate it with curl end-to-end. To check the API after signing in, copy your session cookie from browser DevTools (Application → Cookies) and pass it directly:

curl -H "Cookie: session=<paste-your-jwt-here>" http://localhost:3000/api/me

{
  "sub": "1234567890",
  "email": "you@gmail.com",
  "name": "Your Name",
  "picture": "https://lh3.googleusercontent.com/..."
}

Refresh the access token (stored server-side):

curl -H "Cookie: session=<paste-your-jwt-here>" \
  -X POST http://localhost:3000/auth/refresh

{ "accessToken": "ya29.new-token", "expiresIn": 3599 }

No active session:

curl http://localhost:3000/api/me

{ "error": "Not authenticated." }

Before going to production

Move the state store to Redis — the in-memory Map disappears on restart and won't work across multiple server instances. The interface is the same, just backed by Redis keys with a 10-minute TTL.

Set NODE_ENV=production to enable the Secure flag on the session cookie, so browsers only send it over HTTPS. Add your production callback URL to Authorized Redirect URIs in Google Cloud Console. That list is an exact-match allowlist: if the URI isn't on it, Google rejects the sign-in with a redirect_uri_mismatch error and the user never reaches your callback.

If your threat model requires it, encrypt the refresh_token column at rest before writing to SQLite. A leaked database shouldn't hand an attacker long-lived access to user Google accounts.

Get the code: https://github.com/zyvop27-cmyk/zyvop-blogs/tree/master/oauth-google

Originally published on ZyVOP


Build a URL Shortener With Click Analytics in Node.js

ZyVOP — Fri, 17 Jul 2026 11:18:46 +0000

Bit.ly and short.io are fine until you want to know which country your clicks are coming from, which referrer is driving traffic, and whether that campaign from last Tuesday is still converting. At that point, you're either paying for a premium plan or wishing you'd just built it yourself.

This post builds a complete URL shortener with analytics — short code generation, SQLite storage, IP geolocation, referrer tracking, and a dashboard showing daily click trends, browser breakdown, and country distribution.

No external analytics service, no database server to spin up. The whole thing runs on SQLite and ships as a single Node.js process.

Source code: https://github.com/zyvop27-cmyk/zyvop-blogs/tree/main/url-shortener

Stack and setup

Express 5 · SQLite (better-sqlite3) · geoip-lite · ua-parser-js · nanoid

git clone [YOUR_GITHUB_REPO_URL]
cd url-shortener
npm install
npm start

Open http://localhost:3000. The database creates itself in data/links.db on first run.

One environment variable is required before clicks will be tracked — HASH_SALT, used to hash IP addresses before storing them. Generate one:

node -e "console.log(require('crypto').randomBytes(32).toString('hex'))"

Add it to .env. Without it, the server starts but logs a warning and skips click recording on every redirect.

Short code generation

The first decision: how to make the codes. Random is the right choice over sequential integers — sequential IDs are trivially enumerable, someone can increment from 1 and hit every link ever created.

Random codes with a good alphabet make that infeasible.

The alphabet deliberately excludes characters that look alike:

// src/lib/shortcode.js
import { customAlphabet } from "nanoid";

const ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZabcdefghjkmnpqrstuvwxyz23456789";
const CODE_LENGTH = 7;
const generate = customAlphabet(ALPHABET, CODE_LENGTH);

export function generateCode(db, existsFn, maxAttempts = 5) {
  for (let i = 0; i < maxAttempts; i++) {
    const code = generate();
    if (!existsFn(db, code)) return code;
  }
  throw new Error("Failed to generate a unique code after multiple attempts");
}

No 0, O, I, i, l, o, or 1: seven characters that are easy to confuse on printed paper or low-res screens. With 55 characters and 7 positions that's 55^7 ≈ 1.52 trillion possible codes. At that size a new code rarely collides with an existing one (roughly a one-in-a-million chance per insert once the table holds about 1.5 million links), so the retry loop (up to 5 attempts) will almost never run, but it keeps the rare collision from turning into a failed insert.
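The collision odds can be estimated with the birthday bound. The sketch below is illustrative only: collisionProbability is a made-up helper, and it uses the standard approximation 1 - exp(-n(n-1)/2N) for n random codes drawn from a space of N.

```javascript
const ALPHABET_SIZE = 55;
const CODE_LENGTH = 7;
const codeSpace = ALPHABET_SIZE ** CODE_LENGTH; // 1,522,435,234,375

// Birthday-bound approximation: chance that at least two of `n` random codes collide.
function collisionProbability(n, space = codeSpace) {
  // expm1 keeps precision when the exponent is tiny.
  return -Math.expm1(-(n * (n - 1)) / (2 * space));
}

for (const n of [1_000, 100_000, 1_000_000]) {
  console.log(n, collisionProbability(n).toExponential(2));
}
```

The chance of any collision somewhere in the table grows with the square of the link count, which is why the retry loop is cheap insurance even with a 1.5-trillion code space.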

The database

SQLite handles everything: URL storage, click records, and all the analytics queries. No Postgres, no Redis, no infrastructure to run. better-sqlite3 is synchronous, which means no connection pool to manage and no async/await ceremony around queries.

Two tables:

CREATE TABLE links (
  id         INTEGER PRIMARY KEY AUTOINCREMENT,
  code       TEXT NOT NULL UNIQUE,
  url        TEXT NOT NULL,
  created_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ', 'now'))
);

CREATE TABLE clicks (
  id         INTEGER PRIMARY KEY AUTOINCREMENT,
  link_id    INTEGER NOT NULL REFERENCES links(id),
  clicked_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ', 'now')),
  country    TEXT,
  referrer   TEXT,
  browser    TEXT,
  os         TEXT,
  ip_hash    TEXT
);

The analytics queries (top countries, referrers, daily trend) all run against the clicks table with a JOIN on links. The most complex one is the daily trend:

db.prepare(`
  SELECT
    strftime('%Y-%m-%d', clicked_at) AS date,
    COUNT(*) AS clicks
  FROM clicks WHERE link_id = ?
  GROUP BY date ORDER BY date DESC LIMIT 30
`).all(link.id);

SQLite's strftime handles the date bucketing natively. No date library needed.

One gotcha with the database module: better-sqlite3 is synchronous and stateful, so the module caches the connection after the first call. Tests need isolated in-memory databases, not the cached production one. The fix is simple — never cache :memory: connections:

export function getDb(dbPath = DB_PATH) {
  if (dbPath !== ":memory:" && _db) return _db;
  const db = new Database(dbPath);
  // ... schema setup ...
  if (dbPath !== ":memory:") _db = db;
  return db;
}

Each test gets its own fresh in-memory database. Nothing bleeds between test cases.

Recording clicks without slowing down redirects

The redirect handler records the click and then sends the 301. better-sqlite3 is synchronous — it blocks the Node.js event loop during the write — but WAL mode means it's appending to the write-ahead log file rather than syncing the main database.

A single INSERT completes in microseconds in practice. The try/catch ensures analytics failures never break a redirect:

// src/routes/redirect.js
const CODE_RE = /^[A-Za-z0-9]{4,10}$/;

router.get("/:code", (req, res) => {
  const { code } = req.params;
  if (!CODE_RE.test(code)) return res.status(404).send("Link not found.");

  const link = getLinkByCode(db, code);
  if (!link) return res.status(404).send("Link not found.");

  try {
    const clickData = parseClickData(req);
    insertClick(db, link.id, clickData);
  } catch (err) {
    console.error("[redirect] analytics error:", err.message);
  }

  res.redirect(301, link.url);
});

The /:code route must be registered last in server.js. It matches any single-segment path (including /healthz), so any named routes registered after it will never be reached. Express matches in registration order, not specificity order.

One other decision worth noting: 301 vs 302. A 301 (permanent redirect) tells browsers and search engines to cache the destination — follow the link once and the browser may never hit the shortener again.

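That means if you later change the destination URL, existing users won't see the change. It also means repeat visits answered from the browser's cache never reach the server, so they are never recorded as clicks. For a personal shortener that's usually fine; use 302 if you need updatable destinations or a count of every click.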

The POST /api/shorten endpoint is rate-limited to 20 requests per hour per IP. Redirect and stats endpoints are not limited — a popular link shouldn't throttle its own clicks.
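A limit like that can be sketched as a fixed-window counter keyed by IP. This is an illustration of the idea, not the repo's limiter: createRateLimiter and its options are hypothetical names, and the injectable now function exists only to make the behavior easy to test.

```javascript
// Fixed-window limiter: at most `limit` hits per `windowMs` for each key (e.g. an IP).
function createRateLimiter({ limit = 20, windowMs = 60 * 60 * 1000, now = Date.now } = {}) {
  const windows = new Map(); // key -> { count, resetAt }

  return function isAllowed(key) {
    const t = now();
    const entry = windows.get(key);
    if (!entry || t >= entry.resetAt) {
      // First hit, or the previous window has ended: start a new window.
      windows.set(key, { count: 1, resetAt: t + windowMs });
      return true;
    }
    if (entry.count >= limit) return false;
    entry.count += 1;
    return true;
  };
}
```

In the shorten route this would be called with the client IP, answering 429 when it returns false. Note that the Map never evicts idle keys, so a long-running process would want a periodic sweep.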

What gets tracked and how

Three pieces of data come in on every click: the IP address, the User-Agent header, and the Referer header.

// src/lib/analytics.js
export function parseClickData(req) {
  const ip = getClientIp(req);
  const ua = req.headers["user-agent"] ?? "";
  const rawReferrer = req.headers["referer"] || req.headers["referrer"] || null;

  const salt = process.env.HASH_SALT;
  if (!salt) throw new Error("HASH_SALT is not set. Add it to your .env file.");
  const ipHash = crypto.createHash("sha256").update(ip + salt).digest("hex").slice(0, 16);
  const geo = ip ? geoip.lookup(ip) : null;
  const country = geo?.country ?? null;

  const parsed = UAParser(ua);
  const browser = parsed.browser?.name ?? null;
  const os = parsed.os?.name ?? null;

  let referrer = null;
  if (rawReferrer) {
    try {
      referrer = new URL(rawReferrer).hostname;
    } catch {
      referrer = rawReferrer.slice(0, 100);
    }
  }

  return { country, referrer, browser, os, ipHash };
}

The IP gets hashed before storage. Raw IPs are PII — storing them without justification is a compliance headache in most jurisdictions.

A SHA-256 hash of the IP concatenated with a secret salt from your environment, truncated to 16 hex characters, lets you count unique visitors without keeping the raw address. geoip-lite does the country lookup from a bundled local database, so there's no external API call.

Referrers get normalized to hostname only. https://google.com/search?q=example becomes google.com. The full URL isn't useful for the dashboard, and storing full search queries you didn't ask for is the kind of thing that causes problems later.
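The hash in parseClickData concatenates the IP and the salt before hashing. A keyed alternative is HMAC-SHA256, which uses the secret as a proper key. The sketch below swaps that in (so this is a variation, not what the repo does) and pulls the referrer normalization into its own function; hashIp and normalizeReferrer are illustrative names.

```javascript
const { createHmac } = require("node:crypto");

// Keyed variant of the IP hash: HMAC-SHA256 with the secret as the key,
// instead of SHA-256 over the concatenation ip + salt.
function hashIp(ip, secret) {
  if (!secret) throw new Error("HASH_SALT is not set.");
  return createHmac("sha256", secret).update(String(ip)).digest("hex").slice(0, 16);
}

// Reduce a Referer header to its hostname, as parseClickData does.
function normalizeReferrer(rawReferrer) {
  if (!rawReferrer) return null;
  try {
    return new URL(rawReferrer).hostname;
  } catch {
    return rawReferrer.slice(0, 100); // not a parseable URL: keep a bounded raw value
  }
}

console.log(normalizeReferrer("https://google.com/search?q=example")); // google.com
```

Switching hash constructions changes every stored ip_hash, so unique-visitor counts would not line up across the change.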

The dashboard

The dashboard is a single HTML file that calls the API. No framework:

GET /api/links — all shortened links with total click counts
GET /api/stats/:code — full breakdown for one link (by country, referrer, browser, daily trend)

The daily trend renders as a bar chart built from raw div elements. Bar height is a percentage of the maximum day's count, so no chart library dependency.
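The height calculation is small enough to show in full. This is a sketch of the idea rather than the dashboard's actual code; toBarHeights is a hypothetical name, and the input shape follows the dailyTrend rows returned by the stats endpoint.

```javascript
// Convert daily click counts into bar heights (percent of the busiest day).
function toBarHeights(dailyTrend) {
  const max = Math.max(0, ...dailyTrend.map((d) => d.clicks));
  return dailyTrend.map((d) => ({
    date: d.date,
    clicks: d.clicks,
    heightPct: max === 0 ? 0 : Math.round((d.clicks / max) * 100),
  }));
}

const bars = toBarHeights([
  { date: "2026-07-14", clicks: 3 },
  { date: "2026-07-13", clicks: 12 },
  { date: "2026-07-12", clicks: 6 },
]);
console.log(bars.map((b) => `${b.date} ${b.heightPct}%`));
// [ '2026-07-14 25%', '2026-07-13 100%', '2026-07-12 50%' ]
```

Each heightPct then becomes the CSS height of one div inside a fixed-height container.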

Clicking "Stats" on any row loads that link's analytics inline below the table. The stat panel shows total clicks, unique visitors (by hashed IP), and three bar charts. Country shows the top 10 countries, referrer shows where traffic came from, browser breaks down the client distribution.

Trying it with curl

Make sure HASH_SALT is set in .env before running these — without it the server starts and redirects work, but clicks won't be recorded and stats will show zero.

Shorten a URL:

curl -X POST http://localhost:3000/api/shorten \
  -H "Content-Type: application/json" \
  -d '{"url": "https://zyvop.com/building-a-production-ai-agent-in-nodejs"}'

{
  "code": "P8WbGd7",
  "shortUrl": "http://localhost:3000/P8WbGd7",
  "url": "https://zyvop.com/building-a-production-ai-agent-in-nodejs"
}

Follow the redirect:

curl -I http://localhost:3000/P8WbGd7

HTTP/1.1 301 Moved Permanently
Location: https://zyvop.com/building-a-production-ai-agent-in-nodejs

Pull the stats:

curl http://localhost:3000/api/stats/P8WbGd7

{
  "link": { "code": "P8WbGd7", "url": "https://zyvop.com/...", "created_at": "2026-07-14T06:00:00Z" },
  "totalClicks": 3,
  "uniqueVisitors": 1,
  "byCountry": [{ "country": "US", "clicks": 3 }],
  "byReferrer": [{ "referrer": "google.com", "clicks": 2 }, { "referrer": "Direct", "clicks": 1 }],
  "byBrowser": [{ "browser": "Chrome", "clicks": 3 }],
  "dailyTrend": [{ "date": "2026-07-14", "clicks": 3 }]
}

"Direct" in byReferrer is what COALESCE(referrer, 'Direct') returns for clicks that arrived without a Referer header — typed directly into the browser bar, opened from a native app, or clicked from an email client that strips referrers.

Invalid URL:

curl -s -w " [%{http_code}]" -X POST http://localhost:3000/api/shorten \
  -H "Content-Type: application/json" \
  -d '{"url": "ftp://notallowed"}'

{"error":"url must be a valid http or https URL."} [400]

Unknown code:

curl -I http://localhost:3000/ZZZZZZZ

HTTP/1.1 404 Not Found

Tests

npm test

# tests 24
# pass  24
# fail   0

24 tests across three suites. The shortcode suite verifies the alphabet, the collision retry logic, and that it throws correctly when retries are exhausted.

The analytics suite tests IP hashing, country lookup, referrer normalization, and user-agent parsing — including the privacy property that raw IPs don't appear in the stored hash. The database suite runs every query against an in-memory SQLite instance: inserts, retrieval, unique visitor counting, daily trend grouping, and the unique constraint on codes.

Taking it further

The setup above works for personal use or a small team. A few things to add before opening it to the public:

Auth on the shorten endpoint — right now anyone who can reach the server can create links. A simple API key check is enough for most cases.
Custom slugs — let users specify POST /api/shorten with an optional customCode field; validate it against the same alphabet and check for conflicts before inserting.
Click buffering — if redirect volume gets high, the synchronous SQLite write on every click becomes the bottleneck. Buffer clicks in memory and flush in batches of 100 or every 5 seconds.
Link expiry — add an expires_at column to links and check it in the redirect handler.
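The click-buffering item is the least obvious of the four, so here is one possible shape for it. It is a sketch under assumptions: createClickBuffer is a hypothetical name, and flushFn stands in for whatever writes a batch to SQLite (for example, a loop of inserts wrapped in a single transaction).

```javascript
// Buffer click rows in memory; flush when the batch is full or the timer fires.
function createClickBuffer(flushFn, { maxSize = 100, intervalMs = 5000 } = {}) {
  let pending = [];

  function flush() {
    if (pending.length === 0) return;
    const batch = pending;
    pending = []; // swap first so each batch is handed off exactly once
    try {
      flushFn(batch);
    } catch (err) {
      // Analytics failures are logged and dropped, never rethrown.
      console.error("[clicks] flush failed:", err.message);
    }
  }

  const timer = setInterval(flush, intervalMs);
  timer.unref(); // do not keep the process alive just for this timer

  return {
    add(click) {
      pending.push(click);
      if (pending.length >= maxSize) flush();
    },
    flush,
    stop() {
      clearInterval(timer);
      flush();
    },
  };
}
```

The trade-off is durability: a crash can lose up to one batch of clicks (at most 100 rows or 5 seconds' worth), so stop() should be called from the shutdown handler.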

Get the code: https://github.com/zyvop27-cmyk/zyvop-blogs/tree/main/url-shortener

Originally published on ZyVOP


Four Ways a Refresh Token Request Fails — Only One Means Trouble

ZyVOP — Fri, 10 Jul 2026 07:35:46 +0000

A refresh token exists for one reason: exchange itself for a new access token, once, and then stop being useful. Everything about a good implementation follows from taking "once" literally. This one does — every refresh token is single-use, and grouped with every other token descended from the same login into a family. Calling POST /auth/refresh with a given token can fail four different ways, and three of them are just bookkeeping. The fourth is the one this post is actually about.

One row per token, one family per login

import { Entity, PrimaryGeneratedColumn, Column, Index, CreateDateColumn } from 'typeorm';

@Entity()
export class RefreshToken {
  @PrimaryGeneratedColumn('uuid')
  id: string;

  @Column()
  userId: string;

  @Index()
  @Column()
  familyId: string;

  @Index({ unique: true })
  @Column()
  tokenHash: string;

  @Column({ default: false })
  used: boolean;

  @Column({ default: false })
  revoked: boolean;

  @Column({ type: 'timestamp' })
  expiresAt: Date;

  @CreateDateColumn()
  createdAt: Date;
}

familyId is what makes "family" more than a metaphor — it's set once, at login, and then carried forward unchanged into every row that session's rotation produces afterward:

async login(@Body() dto: LoginDto) {
  const user = await this.authService.validatePassword(dto.email, dto.password);
  const accessToken = this.authService.issueAccessToken(user.id, user.email);
  const refresh = await this.refreshTokenService.issue(user.id); // no familyId = fresh session
  return { accessToken, refreshToken: refresh.rawToken };
}

Omitting familyId here is the signal that this is a brand new session rather than a rotation — issue() generates a fresh one. Every later call to issue() for this same chain, from inside rotate(), passes that same value back in instead. Two logins from the same user, at the same time, produce two completely independent families — there's nothing linking them beyond sharing a userId, which matters later.

Three ordinary rejections

A token that was never real. Someone sends a string that doesn't hash to anything in the table.

if (!stored) {
  throw new UnauthorizedException('Invalid refresh token');
}

curl -X POST http://localhost:3000/auth/refresh -H "Content-Type: application/json" \
  -d '{"refreshToken":"not-a-real-token-at-all"}'
# -> 401 {"message":"Invalid refresh token", ...}

A token that aged out. Refresh tokens in this implementation live 7 days by default. Past that, they're just gone.

if (stored.expiresAt < new Date()) {
  throw new UnauthorizedException('Refresh token has expired');
}

Confirmed directly rather than waiting a week: backdate a token's expiresAt in Postgres by a day, then try to use it —

{"message":"Refresh token has expired","error":"Unauthorized","statusCode":401}

— exactly the branch that's supposed to fire.

A token whose session already ended. Logout, or (as covered below) a reuse event elsewhere in the same family, flips a revoked flag.

if (stored.revoked) {
  throw new UnauthorizedException('Refresh token has been revoked');
}

None of these three are interesting on their own — they're the normal outcomes of a token being wrong, old, or deliberately ended. The fourth rejection reason looks identical from the outside (same 401, same shape) but means something completely different underneath.

The fourth: a token that's already been spent

if (stored.used) {
  await this.revokeFamily(stored.familyId);
  throw new UnauthorizedException('Refresh token reuse detected; session revoked');
}

stored.used = true;
await this.refreshTokenRepository.save(stored);

return this.issue(stored.userId, stored.familyId);

Every successful refresh marks the token it consumed as used before issuing its replacement. A legitimate client only ever moves forward through that chain — it gets a new token and uses that one next time, never the old one again.

So if an already-used token shows up in a request, the client presenting it isn't following the chain. Either it's a genuine client retrying a request whose response got lost on the way back (a real possibility, not a hypothetical), or it's a second party holding a copy of a token the real owner has already moved past.

There's no way to tell those apart from inside this one request — both look exactly like "this exact token, again" — so both get treated as the same signal: something's wrong with this session, not just this token.

That's why the response isn't "reject this token and move on." It's revokeFamily, which doesn't touch just the token that got reused — it kills every unrevoked row sharing that familyId, including whichever token the legitimate client is currently holding as its "real" one.

Proving that actually happens, not just reading the code and assuming: rotate once (RT1 → RT2), then try RT1 again.

curl -X POST http://localhost:3000/auth/refresh -H "Content-Type: application/json" -d '{"refreshToken":"<RT1>"}'
# -> 401 {"message":"Refresh token reuse detected; session revoked", ...}

RT1 failing is expected — it's already used. The real test is what happens to RT2, which was never itself reused, was legitimately issued, and — if the family weren't revoked — should still work fine:

curl -X POST http://localhost:3000/auth/refresh -H "Content-Type: application/json" -d '{"refreshToken":"<RT2>"}'
# -> 401 {"message":"Refresh token has been revoked", ...}

RT2 dies too, and the error message even confirms why — not "reuse detected" (that already happened, on RT1) but plain "revoked," which is exactly what a downstream consequence of someone else's reuse should look like from RT2's own perspective. This is the actual security property, not the headline description of it: a compromise anywhere in the chain takes down the entire chain, including the part that was never touched.

The blast radius runs the other direction too, and it's worth confirming it actually stops where it should: a second user, bob@example.com, with his own completely separate family, refreshes without incident while all of the above is happening to alice's account. Nothing about bob's session is touched — revokeFamily only ever operates on rows sharing the one familyId it was given, and bob's family was never that one. The failure is total within a compromised family and entirely absent outside it.
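The whole family mechanism fits in a small in-memory model, which makes the RT1/RT2 behavior easy to poke at without a database. This is a simplified sketch in plain JavaScript, not the service's TypeORM code: the Map stands in for the table, and the order of the four checks is assumed from the order they are presented in above.

```javascript
const { randomBytes, createHash, randomUUID } = require("node:crypto");

const hash = (raw) => createHash("sha256").update(raw).digest("hex");
const rows = new Map(); // tokenHash -> { userId, familyId, used, revoked, expiresAt }

// No familyId means a fresh login; rotation passes the existing one back in.
function issue(userId, familyId = randomUUID()) {
  const rawToken = randomBytes(32).toString("hex");
  rows.set(hash(rawToken), {
    userId, familyId, used: false, revoked: false,
    expiresAt: Date.now() + 7 * 24 * 60 * 60 * 1000,
  });
  return { rawToken, familyId };
}

function revokeFamily(familyId) {
  for (const row of rows.values()) {
    if (row.familyId === familyId) row.revoked = true;
  }
}

function rotate(rawToken) {
  const stored = rows.get(hash(rawToken));
  if (!stored) throw new Error("Invalid refresh token");
  if (stored.expiresAt < Date.now()) throw new Error("Refresh token has expired");
  if (stored.revoked) throw new Error("Refresh token has been revoked");
  if (stored.used) {
    revokeFamily(stored.familyId); // reuse takes down the whole chain
    throw new Error("Refresh token reuse detected; session revoked");
  }
  stored.used = true;
  return issue(stored.userId, stored.familyId);
}
```

Issuing RT1 for alice, rotating it to RT2, and then calling rotate with RT1 again throws the reuse error; a following rotate with RT2 throws "revoked", while a token issued separately for bob keeps rotating normally.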

Logging out is the same mechanism, aimed on purpose

Every revocation above happened as a side effect of something going wrong. Logout is the identical underlying operation — revokeFamily — called directly, when nothing is wrong at all:

async revokeFamilyByToken(rawToken: string): Promise<void> {
  const stored = await this.refreshTokenRepository.findOne({
    where: { tokenHash: this.hash(rawToken) },
  });
  if (stored) {
    await this.revokeFamily(stored.familyId);
  }
}

curl -X POST http://localhost:3000/auth/logout -H "Content-Type: application/json" \
  -d '{"refreshToken":"<RT>"}'
# -> { "message": "Logged out" }

curl -X POST http://localhost:3000/auth/refresh -H "Content-Type: application/json" \
  -d '{"refreshToken":"<RT>"}'
# -> 401 {"message":"Refresh token has been revoked", ...}

One question this raises immediately: does logging out on one device end every session, or just the one that called it? Worth an actual answer instead of a guess — log in twice for the same user (call them a phone session and a laptop session, two independent families), log out using only the phone's token, then try refreshing each:

phone session, after logout: {"message":"Refresh token has been revoked", ...}
laptop session, untouched:   { "accessToken": "...", "refreshToken": "..." }   (succeeds normally)

Logging out ends the one family attached to the token that was actually presented. It has no way to reach a different family for the same user, because nothing about revokeFamily looks at userId at all — only familyId. Two logins are two unrelated chains that happen to belong to the same person, and ending one says nothing about the other.

What doesn't die with the family

Everything above is about the refresh token. The access token issued alongside RT2 — call it AT2 — is a separate, stateless JWT with its own 15-minute clock, and revoking a refresh family doesn't reach into that clock:

curl http://localhost:3000/auth/me -H "Authorization: Bearer <AT2>"
# -> 200 { "id": "...", "email": "alice@example.com" }

Issued before the reuse event above, and it still works after the whole family got revoked. That's not a gap in the implementation — it's the actual reason access tokens are kept short-lived in the first place. There's no database row for an access token to flip a revoked bit on; the only thing bounding how long a leaked one stays useful is its own expiry. Fifteen minutes is the real ceiling on "how bad is it if this specific token leaks," independent of anything the refresh layer detects or reacts to.

Why the token hash is SHA-256, not bcrypt

The password column in this same project uses bcrypt, and it should — passwords are short, human-chosen, and drawn from a relatively small effective space, so a hash that's deliberately slow is what makes brute-forcing a stolen hash impractical.

A refresh token is a different kind of secret: 40 bytes of crypto.randomBytes, 320 bits of pure randomness, with no human-guessable structure to exploit. Brute-forcing a hash of that is infeasible no matter how fast the hash function is, so bcrypt's deliberate slowness buys nothing here — it just adds cost to every single refresh call for no security benefit. A plain, fast sha256 is the correct tool once the secret's strength comes from entropy rather than from making guessing expensive.
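
As a rough illustration of that trade-off, here is a Python sketch (the project itself is TypeScript, so the function names below are hypothetical) that generates a 40-byte random token and stores only its SHA-256 digest.

```python
import hashlib
import hmac
import secrets


def hash_token(raw: str) -> str:
    # Fast, unsalted, deterministic: the same token always maps to the same digest.
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def new_refresh_token() -> tuple[str, str]:
    # 40 random bytes = 320 bits of entropy, hex-encoded for transport.
    raw = secrets.token_bytes(40).hex()
    return raw, hash_token(raw)


def matches(raw: str, stored_hash: str) -> bool:
    # Constant-time comparison of the recomputed digest with the stored one.
    return hmac.compare_digest(hash_token(raw), stored_hash)


raw_token, stored = new_refresh_token()
```

One side effect worth noting: because the digest is deterministic, a row can be looked up directly by its hash, which is what the `where: { tokenHash: this.hash(rawToken) }` query earlier relies on. A salted bcrypt hash produces a different output on every call, so it could not be used as a lookup key that way.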

Trying this out, and what it doesn't cover

npm install && cp .env.example .env
docker run --name refresh-postgres -e POSTGRES_PASSWORD=postgres -e POSTGRES_DB=refresh_demo -p 5432:5432 -d postgres:16
npm run start:dev

synchronize: true handles table creation for this demo; swap it for real migrations before this touches production. Register and log in to get real values for the <RT1>/<RT2>/<AT2> placeholders used in the examples above:

curl -X POST http://localhost:3000/auth/register -H "Content-Type: application/json" \
  -d '{"email":"alice@example.com","password":"correct-horse-battery"}'

curl -X POST http://localhost:3000/auth/login -H "Content-Type: application/json" \
  -d '{"email":"alice@example.com","password":"correct-horse-battery"}'
# -> { "accessToken": "<AT1>", "refreshToken": "<RT1>" }

Left out on purpose:

A cleanup job for expired and revoked rows. They'll otherwise just accumulate — the BullMQ patterns from elsewhere in this series are a natural fit for sweeping them out.
Rate limiting on /auth/login and /auth/refresh. Both are obvious targets, and the earlier rate-limiting post in this series covers them directly.
A grace period for the "legitimate retry" half of the reuse ambiguity. Some production systems let a just-rotated token keep working for a few seconds, specifically to absorb a lost-response retry before treating reuse as a hard signal. This implementation takes the strict stance instead — simpler to reason about, simpler to verify — which is a deliberate tradeoff, not an oversight.

The full project, including the parts of this walkthrough not reproduced above, is in the refresh-token-rotation-demo repository next to this post.

Originally published on ZyVOP

Brick: The LLM Router That Skips the Cascade and Still Cuts Your Bill

ZyVOP — Wed, 08 Jul 2026 05:53:41 +0000

Most teams solve "which model should answer this prompt" one of two ways.

Pick one strong model and pay frontier prices for every request, including the ones a 9B open-weight model could handle blind. Or build a cascade: try the cheap model first, escalate on a low-confidence signal, and pay for every miss in tokens and latency.

Brick, from Italian inference provider Regolo.ai, is a third option. It reads a prompt once, scores it against a pool of models across six capability dimensions, and picks a single backend in one shot: no retries, no escalation ladder. Just model: "brick" dropped into an OpenAI-compatible call.

It's a young project, though one with a real published paper behind it. As of this writing it has a single-digit GitHub star count, one CLI release tag, and a couple of distribution channels still marked "pending" in its own README. Here's what's actually going on under the hood, plus a benchmark nuance worth understanding before you cite the headline number.

What Brick Actually Routes To

Brick sits in front of a pool of models you define, then dispatches each incoming query to whichever one is the cheapest that can still get the answer right. The project frames three situations where that's worth doing:

You already run several models and want each query landing on the right one instead of a manual pick.
You're paying for Claude Code or Codex at frontier rates and want easy turns to land on a cheaper model automatically.
You want one endpoint in front of a mixed pool of OpenAI, GLM, DeepSeek, Kimi, and Qwen models instead of wiring each one up separately.

The default shipped config points every backend at Regolo's own API, but the pool is swappable. Anything that speaks the OpenAI chat-completions format can join.

How the Routing Decision Actually Gets Made

This is the part worth reading the source for, because the README summary undersells how mechanical it is. Every text query goes through two classifiers before a single token gets forwarded anywhere.

Capability vector. A ModernBERT classifier scores the query across six dimensions: coding, creative synthesis, instruction following, math reasoning, planning/agentic behavior, and world knowledge. The output is a soft probability distribution over those six, not a hard label.

Complexity score. A separate Qwen3.5-0.8B model with a LoRA adapter buckets the query into easy, medium, or hard. In the shipped config this runs as a remote call against Regolo's brick-complexity-pro endpoint rather than a local model.

The objective. Every model in the pool also has a fixed six-dimensional skill vector plus a cost weight. Brick computes J_m = D_m + β·a_m for each candidate, where D_m is the distance between the query's capability vector and that model's skill vector, and a_m is a normalized cost penalty.

It picks the model with the smallest J_m. That's the whole decision: one geometric comparison, argmin over the pool.

A separate r knob, ranging from -1 to 1, slides the whole pool between two poles: cost and quality. Toward -1 favors the cheapest capable model; toward 1 favors the strongest one.

The paper's routing math behind that knob has more moving parts than a single weight, with separate calibration branches for the cost and quality directions. At the API level, though, it's just one number you set at deploy time. The router's own config file, pulled straight from the repo, shows what the pool actually looks like:

models:
  - model: "qwen3.5-9b"
    skill_vector: [0.714788, 0.511538, 0.810109, 0.912146, 0.577072, 0.179876]
    use_reasoning: false
    cost_weight: 0.10
  - model: "deepseek-v4-flash"
    skill_vector: [0.820939, 0.657845, 0.863112, 0.934963, 0.62055, 0.488518]
    use_reasoning: false
    cost_weight: 0.40
  - model: "kimi2.6"
    skill_vector: [0.904272, 0.751595, 0.87018, 0.943892, 0.641863, 0.344074]
    use_reasoning: true
    reasoning_effort: "medium"
    cost_weight: 0.60
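
To make the argmin concrete, here is a minimal Python sketch of the J_m = D_m + β·a_m decision over the three models in that config. Several details are assumptions rather than anything confirmed by the paper or the repo: Euclidean distance for D_m, cost weights scaled by the pool maximum for a_m, a free β parameter, and a made-up query vector. Brick's real router is written in Go and Rust and may define these terms differently.

```python
import math

# Skill vectors and cost weights copied from the config above.
POOL = {
    "qwen3.5-9b": ([0.714788, 0.511538, 0.810109, 0.912146, 0.577072, 0.179876], 0.10),
    "deepseek-v4-flash": ([0.820939, 0.657845, 0.863112, 0.934963, 0.62055, 0.488518], 0.40),
    "kimi2.6": ([0.904272, 0.751595, 0.87018, 0.943892, 0.641863, 0.344074], 0.60),
}


def route(query_vec, beta, pool=POOL):
    """Return (chosen model, per-model scores) for one query."""
    max_cost = max(cost for _, cost in pool.values())
    scores = {}
    for name, (skill, cost) in pool.items():
        d_m = math.dist(query_vec, skill)  # assumption: Euclidean distance
        a_m = cost / max_cost              # assumption: cost scaled into (0, 1]
        scores[name] = d_m + beta * a_m
    return min(scores, key=scores.get), scores


# Hypothetical capability vector for a query, in the same six-dimension order.
query = [0.90, 0.75, 0.87, 0.94, 0.64, 0.34]
choice_quality, _ = route(query, beta=0.0)   # distance only
choice_cheap, _ = route(query, beta=10.0)    # cost penalty dominates
```

With β at zero only the distance term matters, so the pick is whichever skill vector sits closest to the query; a large β lets the cost penalty dominate and pulls the pick toward the cheapest model.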

Two more blocks sit on top of the math for manual overrides, both documented in apps/router/README.md. keyword_rules in override mode hard-forces specific queries to a named model regardless of what the classifiers say.

The shipped default sends anything containing "debug," "refactor," or "write a function" straight to kimi2.6. A bias mode nudges one capability dimension up without forcing a decision, useful for tilting language-specific queries toward the coding axis.

Multimodal input skips the whole pipeline. Images and audio get preprocessed through OCR or Whisper-compatible speech-to-text first, then either routed as extracted text or forwarded straight to a vision model.

Claude Code and Codex Get a Dedicated Integration

This is where Brick is clearly aiming most of its early adoption. Running brick claude on wires an ANTHROPIC_BASE_URL override into ~/.claude/settings.json and starts a local router.

A new brick-claude option shows up in Claude Code's /model picker, sitting next to the built-in opus/sonnet/haiku aliases rather than replacing them. Five modes control the cost/quality trade-off, and they map directly onto Claude Code's existing thinking-effort slider:

Effort slider	Brick mode	Behavior
low	eco	always haiku
medium	lite	graduated tier, cheaper end
high	mid (default)	graduated tier, balanced
xhigh	pro	graduated tier, stronger end
max	max	always opus

The README confirms the two end tiers precisely: eco always picks haiku, max always picks opus. It doesn't spell out the exact easy/medium/hard-to-model mapping for the three tiers in between, only that they sit on a cost/quality gradient.

Selecting a native model name in the picker bypasses Brick entirely and forwards the request unchanged. That gives you an easy escape hatch if the router ever picks wrong.

A brick claude status command opens a live dashboard showing routed-by-model counts, per-model reasoning-effort distribution, the classifier's easy/medium/hard mix, and an estimated savings percentage against an all-opus baseline.

Codex support exists too, gated as beta, sharing the same five modes and status view. The two share host port 8000, though, so only one can serve at a time.

The Numbers, and What They're Actually Measuring

The project has a real published paper behind it: Brick: Spatial Capability Routing for the Mixture-of-Models (MoM) Paradigm, by Francesco Massa (Regolo.ai) and Marco Cristofanilli (Seeweb, where Regolo.ai is built), posted to arXiv in June 2026.

It reports results on a 5,504-query benchmark called Dataset A, spanning six capability dimensions. Each query is graded by protocol-specific checkers, an LLM judge, or both depending on the task type.

Setting	Accuracy	Cost per call	Latency
Always Qwen3.5-9b	63.17%	$0.0014 (1.0×)	8.1s*
Always DeepSeek-v4-flash	73.69%	$0.0029 (2.1×)	14.7s*
Always Kimi2.6	75.02%	$0.0307 (22.15×)	51.2s
Brick, min-cost profile (r=-1)	63.17%	$0.0014 (1.0×)	9.4s*
Brick, neutral profile (r=0)	74.11%	≈$0.0065 (4.7×)	—
Brick, max-quality profile (r=1)	76.98%	unconfirmed†	22.8s
Oracle bound (3-model pool)	83.25%	n/a	n/a

Accuracy and per-call dollar cost are from the paper's own cost audit, priced off OpenRouter listings at evaluation time.

*Latency figures marked with an asterisk are from the project's README summary table rather than the paper itself. The paper's own headline latency claim: median end-to-end latency drops from 51.2s (always-Kimi) to 22.8s (Brick at max-quality).

†The max-quality profile's cost multiple sits further down the same results table in the paper, past the section I could extract. Treat that cell as unverified rather than a claim either way.

Note also that the project's README reports its own, different, more rounded cost multiples for these same rows (1.0× / 4.0× / 6.0× / 1.5×). This table uses the paper's precise per-call dollar figures instead, since they're traceable to an actual pricing snapshot rather than a rounded illustration.

Against external baselines: Brick's low-cost profile beats FrugalGPT's 69.42% by 2.2 points at comparable cost, and its neutral profile beats Cascade Routing's 73.40% by 0.71 points.

That's the deployment-facing number: did the model Brick picked actually answer correctly. The paper is explicit that this isn't the only question worth asking.

It also defines a second metric, route-exact accuracy: does the router's pick match the literal cheapest model that could have solved the query. Queries none of the three models can solve still count against every router here, oracle included.

That makes it a stricter, cost-efficiency-facing bar than the headline number. Credit where it's due: the paper reports both metrics side by side in its main results table, not buried somewhere else.

On that stricter metric, a trivial "always pick the cheapest model" baseline scores 63.17% by definition, since the cheapest model is the correct pick on 63% of the queries. The interesting comparison is the other routers against that same bar.
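
The difference between the two metrics is easier to see on a toy example. The Python sketch below uses five invented queries and three placeholder model names (none of this data comes from the paper); it only illustrates how the definitions above behave.

```python
# Placeholder models, ordered cheapest first. Toy data, not from the paper.
MODELS = ["small", "medium", "large"]

# For each query: the set of models that answer it correctly.
SOLVERS = [
    {"small", "medium", "large"},
    {"medium", "large"},
    {"large"},
    set(),                  # no model solves this one
    {"small", "large"},
]


def cheapest_solver(solvers):
    """Cheapest model that gets the query right, or None if nothing does."""
    for model in MODELS:
        if model in solvers:
            return model
    return None


def answer_accuracy(picks):
    # Did the picked model answer correctly?
    return sum(p in s for p, s in zip(picks, SOLVERS)) / len(SOLVERS)


def route_exact_accuracy(picks):
    # Did the router pick exactly the cheapest model that could answer?
    # Unsolvable queries can never match, so they count as misses.
    return sum(p == cheapest_solver(s) for p, s in zip(picks, SOLVERS)) / len(SOLVERS)


always_small = ["small"] * len(SOLVERS)
always_large = ["large"] * len(SOLVERS)
```

On this toy set, `always_small` scores the same on both metrics, because whenever the cheapest model is right it is also the cheapest correct pick. `always_large` answers more queries correctly but matches the cheapest solver far less often, which is the same shape as the always-Kimi result discussed next.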

Always-Kimi and RouteLLM both dispatch nearly everything to Kimi, and score only 21.28% and 21.31% route-exact accuracy despite matching Kimi's headline number. That's the paper's own evidence that neither is doing real query-level judgment.

FrugalGPT and Cascade Routing land at 31.03% and 28.96%. Brick clears 40.35% at a neutral setting and 63.17% at its cheapest profile, beating every external baseline on the metric that's supposed to be hardest on it.

One loose end worth flagging for anyone who goes digging in the repo themselves: a separate, Italian-language working file in packages/evals/baselines/RESULTS.md reports a route-exact figure for Brick of 46.37%. That number doesn't line up exactly with any of the r-settings in the paper's published table (63.17% / 48.66% / 40.35% at min/low/neutral).

That file reads like an earlier or differently-configured pipeline run than what shipped in the final paper, not a contradiction of it. If you're citing Brick's numbers for something that matters, pull them from the arXiv table rather than that file.

Trying It

The fastest path is the CLI, which self-hosts the router and wires it into Claude Code in two commands. It needs Node 18+ and Docker:

git clone https://github.com/regolo-ai/brick-SR1.git
cd brick-SR1/apps/cli && npm install && npm run build && npm link

brick claude on

For a raw OpenAI-compatible gateway without the CLI, the project's Docker image is documented but, as of this writing, still pending its first push to GHCR under the v2.1.0 tag. Once published, the documented usage looks like this:

docker run --rm -p 18000:18000 \
  -e REGOLO_API_KEY=$REGOLO_API_KEY \
  ghcr.io/regolo-ai/brick:latest

curl http://localhost:18000/v1/chat/completions \
  -H "Authorization: Bearer $REGOLO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"brick","messages":[{"role":"user","content":"Prove that sqrt(2) is irrational"}]}'

The response carries an x-selected-model header telling you which backend actually answered. A math-proof prompt should land on a reasoning-capable model; a bare "hello" should land on the cheapest one in the pool.

I'm quoting these commands straight from the project's own quickstart rather than claiming to have run them myself. With the Docker image and npm package both still pending publication, and no Regolo API key available in a sandboxed environment, there was nothing live to hit.

Where This Project Actually Sits Right Now

A few things worth knowing before you build on this. The router itself descends from vllm-project/spatial-router, an Apache-2.0 project.

Regolo's team added the six-dimension capability classifier, the complexity-score integration, the skill-distance objective, multimodal preprocessing, and the Claude Code passthrough on top. That lineage is disclosed plainly in the repo's NOTICE file, which is exactly how attribution should work on a fork.

As of this writing the repo has zero forks and a GitHub star count still in the single digits. That's normal for a project a few weeks old and not itself a signal of quality either way.

The codebase is a genuine multi-language build. Go and Rust make up the router (Rust handles ML embeddings via candle and classical ML via Linfa, compiled to shared libraries and linked through CGO), Python runs the training and eval pipelines, and TypeScript powers the CLI.

License is Apache-2.0 throughout. Whether it earns a permanent spot in your stack comes down to which problem you actually have.

If you're routing across a real multi-model pool, the paper's numbers are worth your own re-run before you trust them at production scale. If you just want Claude Code to stop burning Opus on "fix this typo" requests, the Claude-specific integration is the more immediately useful half of the project — and it works whether or not you buy the Dataset A accuracy claims at all.

Originally published on ZyVOP

Python Scraping at Scale: Distributed Crawling Across Multiple Machines (2026)

ZyVOP — Tue, 07 Jul 2026 06:04:28 +0000

Introduction: When One Machine Is No Longer Enough

Most Python scrapers start small: requests, BeautifulSoup, a loop, and a CSV writer, run once or twice to grab some data. Once a scraper has to scale or run for months, the real challenge is building a robust, scheduled system, not the parsing logic. The extraction code is tiny; the critical components are the queue, cache, storage, block detection, recovery loop, and exporters that keep the scraper alive against real-world internet conditions.

At small scale, a single async Python process can handle thousands of pages per hour. But when you need to crawl millions of pages per day — competitor catalogues, job markets, news archives, e-commerce databases — a single machine hits hard limits: one IP address, one CPU, one point of failure.

Distributed scraping involves spreading tasks across multiple machines to increase speed and volume. Throttling and introducing random delays between requests can help to prevent IP bans, while rotating proxies help distribute requests and avoid detection. Managing sessions and leveraging parallel processing can further enhance efficiency.
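
As a small illustration of the throttling idea, the sketch below computes a randomized wait between requests, with exponential backoff on retries. The function name, the backoff factor, and the cap are assumptions made for this example; the 0.5x to 1.5x jitter range mirrors what Scrapy's RANDOMIZE_DOWNLOAD_DELAY applies to DOWNLOAD_DELAY.

```python
import random


def polite_delay(base: float, attempt: int = 0, cap: float = 60.0) -> float:
    """Seconds to wait before the next request.

    attempt=0 is a normal request; each retry doubles the base delay,
    up to `cap`, and the result is jittered to 0.5x-1.5x of that value
    so workers do not fall into a fixed, detectable rhythm.
    """
    backoff = min(cap, base * (2 ** attempt))
    return random.uniform(0.5 * backoff, 1.5 * backoff)


# Typical use inside a plain (non-Scrapy) fetch loop:
#     time.sleep(polite_delay(0.5))             # between ordinary requests
#     time.sleep(polite_delay(0.5, attempt=2))  # before a second retry
```

Inside Scrapy itself, the DOWNLOAD_DELAY, RANDOMIZE_DOWNLOAD_DELAY, and AutoThrottle settings used later in this guide cover the same ground without custom code.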

This guide shows you exactly how to build a distributed scraping system that runs across multiple machines — using scrapy-redis for shared request queues, Docker for containerisation, and a production proxy management layer that survives real-world conditions.

The Architecture: One Queue, Many Workers

The core insight behind distributed scraping is simple: replace Scrapy's in-memory request queue with a shared Redis queue that every worker machine can read from.

┌─────────────────────────────────────────────────────────────┐
│                    DISTRIBUTED SCRAPER                      │
│                                                             │
│  Master                                                     │
│  ┌──────────┐    Seeds requests    ┌─────────┐             │
│  │  Spider  │──────────────────────▶│  Redis  │             │
│  │ (seed)   │                      │  Queue  │             │
│  └──────────┘                      └────┬────┘             │
│                                         │                   │
│  Workers (any number of machines)       │                   │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐                  │
│  │ Worker 1 │  │ Worker 2 │  │ Worker N │  ← pull jobs     │
│  │ (spider) │  │ (spider) │  │ (spider) │                  │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘                  │
│       │              │              │                        │
│       └──────────────┴──────────────┘                       │
│                          │                                   │
│                    ┌─────▼──────┐                           │
│                    │  MongoDB   │  ← all workers write here │
│                    │ (results)  │                           │
│                    └────────────┘                           │
└─────────────────────────────────────────────────────────────┘

scrapy-redis lets multiple spider instances across machines pull from the same request queue, with shared deduplication and scheduling. It is a common approach for large-scale Scrapy deployments in production.
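
Before wiring up real infrastructure, the idea can be modelled in one process. The sketch below is a toy stand-in, not scrapy-redis itself: `SharedQueue` plays the role of the Redis queue plus dupefilter, `run_workers` takes turns between hypothetical workers, and deduplication is by raw URL (scrapy-redis dedups on request fingerprints).

```python
from collections import deque


class SharedQueue:
    """In-process stand-in for the shared Redis queue and dupefilter set."""

    def __init__(self):
        self._pending = deque()
        self._seen = set()

    def push(self, url: str) -> bool:
        # Reject anything already scheduled by any worker.
        if url in self._seen:
            return False
        self._seen.add(url)
        self._pending.append(url)
        return True

    def pop(self):
        return self._pending.popleft() if self._pending else None


def run_workers(queue, worker_ids, discover):
    """Let workers take turns pulling URLs until the queue drains."""
    processed = {w: [] for w in worker_ids}
    while True:
        idle = True
        for w in worker_ids:
            url = queue.pop()
            if url is None:
                continue
            idle = False
            processed[w].append(url)
            for link in discover(url):  # newly found links go back on the queue
                queue.push(link)
        if idle:
            return processed


# Hypothetical link graph: listing pages pointing at product pages.
LINKS = {"/p1": ["/a", "/b", "/p2"], "/p2": ["/b", "/c"]}

queue = SharedQueue()
queue.push("/p1")
result = run_workers(queue, ["worker-1", "worker-2"], lambda u: LINKS.get(u, []))
```

Both workers end up with a share of the URLs, and `/b`, which is linked from two pages, is only ever handed out once.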

Part 1: Setting Up scrapy-redis

pip install scrapy scrapy-redis redis pymongo

The distributed spider

# spiders/distributed_product_spider.py
import scrapy
from scrapy_redis.spiders import RedisSpider
from urllib.parse import urljoin
from datetime import datetime, timezone

class DistributedProductSpider(RedisSpider):
    """
    A Scrapy spider that pulls start URLs from a Redis list
    instead of a hardcoded start_urls list.

    To start crawling, push seed URLs to Redis:
        redis-cli lpush products:start_urls "https://example-store.com/products/"

    Any number of workers running this spider will cooperatively
    process the shared queue; the shared dupefilter keeps a URL that
    has already been scheduled from being queued again.
    """

    name         = "distributed_products"
    redis_key    = "products:start_urls"   # Redis list to pop URLs from

    # How many URLs to pop from Redis at once per worker
    redis_batch_size = 16

    # Spider-level settings — override settings.py per spider
    custom_settings = {
        "CONCURRENT_REQUESTS":            32,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 8,
        "DOWNLOAD_DELAY":                 0.5,
        "RANDOMIZE_DOWNLOAD_DELAY":       True,
        "AUTOTHROTTLE_ENABLED":           True,
        "AUTOTHROTTLE_TARGET_CONCURRENCY": 16.0,
        "AUTOTHROTTLE_MAX_DELAY":         5.0,
        "RETRY_TIMES":                    3,
        "RETRY_HTTP_CODES":              [429, 500, 502, 503, 504],
    }

    def parse(self, response):
        """
        Parse a product listing page.
        Yields product detail requests AND discovers pagination links.
        """
        # Follow product links to detail pages
        for href in response.css("a.product-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_product)

        # Auto-discover pagination — push next page back to Redis queue
        next_page = response.css("a[rel='next']::attr(href)").get()
        if next_page:
            # Use follow() to handle relative URLs
            yield response.follow(next_page, callback=self.parse)

    def parse_product(self, response):
        """Extract product data from a detail page."""
        yield {
            "url":        response.url,
            "title":      response.css("h1::text").get("").strip(),
            "price":      response.css(".price::text").get("").strip(),
            "sku":        response.css("[data-sku]::attr(data-sku)").get(),
            "in_stock":   bool(response.css(".in-stock")),
            "description": response.css(".product-description::text").get("").strip()[:500],
            "scraped_at": datetime.now(timezone.utc).isoformat(),
            "worker_id":  self.settings.get("WORKER_ID", "unknown"),
        }

settings.py for distributed mode

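# settings.py
import os

BOT_NAME = "distributed_scraper"
SPIDER_MODULES = ["spiders"]

# ── scrapy-redis settings ─────────────────────────────────────
SCHEDULER            = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS     = "scrapy_redis.dupefilter.RFPDupeFilter"
# Read from the environment so the values set in docker-compose.yml take effect
REDIS_URL            = os.environ.get("REDIS_URL", "redis://redis-host:6379")

# Keep crawl state across restarts (resumable crawls)
SCHEDULER_PERSIST    = True

# How long to wait for new URLs before worker shuts down.
# Per the scrapy-redis README, this only applies when the queue class
# is SpiderQueue or SpiderStack.
SCHEDULER_IDLE_BEFORE_CLOSE = 30

# ── Proxy pool ────────────────────────────────────────────────
# Comma-separated proxy URLs, e.g. PROXY_LIST="http://p1:8080,http://p2:8080"
PROXY_LIST = [p.strip() for p in os.environ.get("PROXY_LIST", "").split(",") if p.strip()]

# ── MongoDB pipeline ──────────────────────────────────────────
# Lower numbers run first: the duplicate filter (50) runs before the Mongo write (100)
ITEM_PIPELINES = {
    "pipelines.DuplicateFilterPipeline": 50,
    "pipelines.MongoPipeline":           100,
}
MONGO_URI      = os.environ.get("MONGO_URI", "mongodb://mongo-host:27017/")
MONGO_DATABASE = "distributed_scrape"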

# ── Concurrency ───────────────────────────────────────────────
CONCURRENT_REQUESTS              = 32
CONCURRENT_REQUESTS_PER_DOMAIN   = 8
DOWNLOAD_DELAY                   = 0.5
RANDOMIZE_DOWNLOAD_DELAY         = True

# ── Retry ─────────────────────────────────────────────────────
RETRY_ENABLED    = True
RETRY_TIMES      = 3
RETRY_HTTP_CODES = [429, 500, 502, 503, 504, 522, 524]

# ── Logging ───────────────────────────────────────────────────
LOG_LEVEL = "INFO"

# ── Downloader middlewares ────────────────────────────────────
DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    "middlewares.RotatingProxyMiddleware": 350,
    "middlewares.UserAgentRotationMiddleware": 400,
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": 550,
}

# ── User agent pool ───────────────────────────────────────────
# Full browser strings: Chrome user agents include the AppleWebKit token
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

Part 2: Production Middlewares

# middlewares.py
import random
import logging
from scrapy import signals
from scrapy.exceptions import NotConfigured

logger = logging.getLogger(__name__)

class UserAgentRotationMiddleware:
    """Rotate User-Agent on every request."""

    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        agents = crawler.settings.getlist("USER_AGENTS")
        if not agents:
            raise NotConfigured("USER_AGENTS not set")
        return cls(agents)

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.user_agents)

class RotatingProxyMiddleware:
    """
    Rotate through a proxy pool.
    Tracks failures per proxy and removes bad proxies from the pool.
    """

    def __init__(self, proxies, max_failures=5):
        self.proxies      = list(proxies)
        self.failures     = {}
        self.max_failures = max_failures

    @classmethod
    def from_crawler(cls, crawler):
        proxies = crawler.settings.getlist("PROXY_LIST", [])
        if not proxies:
            raise NotConfigured("PROXY_LIST is empty")
        return cls(proxies)

    def process_request(self, request, spider):
        if not self.proxies:
            return  # No proxies left — run without
        proxy = random.choice(self.proxies)
        request.meta["proxy"] = proxy

    def process_response(self, request, response, spider):
        if response.status in (403, 407, 429):
            proxy = request.meta.get("proxy")
            self._mark_failure(proxy)
        return response

    def process_exception(self, request, exception, spider):
        proxy = request.meta.get("proxy")
        self._mark_failure(proxy)

    def _mark_failure(self, proxy):
        if not proxy:
            return
        self.failures[proxy] = self.failures.get(proxy, 0) + 1
        if self.failures[proxy] >= self.max_failures:
            if proxy in self.proxies:
                self.proxies.remove(proxy)
                logger.warning(f"Removed bad proxy: {proxy} ({self.max_failures} failures)")

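class BlockDetectionMiddleware:
    """
    Detect common block patterns and trigger retries with a different proxy.
    Retries are capped so a permanently blocked URL cannot loop forever.
    Like the other middlewares, it only runs once it is listed in
    DOWNLOADER_MIDDLEWARES.
    """

    MAX_BLOCK_RETRIES = 3

    # Keep patterns specific: a bare "robot" would also match the
    # <meta name="robots"> tag that most ordinary pages carry.
    BLOCK_PATTERNS = [
        "access denied", "captcha", "blocked", "forbidden",
        "unusual traffic", "are you a robot", "automated queries",
        "cf-browser-verification", "ddos-guard",
    ]

    def process_response(self, request, response, spider):
        # Non-text responses (images, PDFs) have no .text; check status only
        text = getattr(response, "text", None)
        body_lower = (text or "")[:2000].lower()

        is_blocked = (
            response.status in (403, 429, 503) or
            any(p in body_lower for p in self.BLOCK_PATTERNS) or
            (text is not None and len(text) < 300)
        )
        if not is_blocked:
            return response

        retries = request.meta.get("block_retries", 0)
        if retries >= self.MAX_BLOCK_RETRIES:
            logger.error(f"Still blocked on {request.url} after {retries} retries; giving up")
            return response

        logger.warning(f"Block detected on {request.url}; retrying")
        retry = request.replace(dont_filter=True)   # copy that bypasses the dupefilter
        retry.meta["block_retries"] = retries + 1
        retry.meta.pop("proxy", None)               # let the proxy middleware pick again
        return retry                                # Re-schedule the request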

Part 3: MongoDB Pipeline with Bulk Writes

# pipelines.py
import logging
from pymongo import MongoClient, UpdateOne
from pymongo.errors import BulkWriteError
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

logger = logging.getLogger(__name__)

class DuplicateFilterPipeline:
    """
    Track seen URLs in memory to drop duplicates before DB write.
    The set is local to one worker process; duplicates that reach
    different workers are absorbed by MongoPipeline's upsert on URL.
    """

    def open_spider(self, spider):
        self.seen_urls = set()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        url = adapter.get("url", "")
        if url in self.seen_urls:
            raise DropItem(f"Duplicate URL: {url}")
        self.seen_urls.add(url)
        return item

class MongoPipeline:
    """
    Write scraped items to MongoDB using bulk operations.
    Upserts on URL — safe to re-run without creating duplicates.
    """

    BULK_SIZE  = 200   # Flush every 200 items
    COLLECTION = "products"

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db  = mongo_db
        self._buffer   = []

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://localhost:27017/"),
            mongo_db =crawler.settings.get("MONGO_DATABASE", "scrapy_data"),
        )

    def open_spider(self, spider):
        self.client     = MongoClient(self.mongo_uri)
        self.col        = self.client[self.mongo_db][self.COLLECTION]
        self.col.create_index("url", unique=True)
        self.items_written = 0
        logger.info(f"MongoDB connected: {self.mongo_db}.{self.COLLECTION}")

    def close_spider(self, spider):
        if self._buffer:
            self._flush()
        self.client.close()
        logger.info(f"MongoDB closed. Total items written: {self.items_written}")

    def process_item(self, item, spider):
        self._buffer.append(dict(item))
        if len(self._buffer) >= self.BULK_SIZE:
            self._flush()
        return item

    def _flush(self):
        ops = [
            UpdateOne({"url": doc["url"]}, {"$set": doc}, upsert=True)
            for doc in self._buffer
        ]
        try:
            result = self.col.bulk_write(ops, ordered=False)
            count  = result.upserted_count + result.modified_count
            self.items_written += count
            logger.info(f"Flushed {len(self._buffer)} items → MongoDB")
        except BulkWriteError as e:
            logger.error(f"Bulk write error: {e.details.get('writeErrors', [])[:2]}")
        finally:
            self._buffer.clear()

Part 4: Docker Compose — Multi-Worker Setup

# docker-compose.yml
version: "3.9"

x-worker-base: &worker-base
  build: .
  volumes:
    - .:/app
  environment:
    - REDIS_URL=redis://redis:6379
    - MONGO_URI=mongodb://mongo:27017/
    - PROXY_LIST=${PROXY_LIST}
  depends_on:
    - redis
    - mongo
  restart: unless-stopped

services:

  # ── Infrastructure ──────────────────────────────────────────
  redis:
    image: redis:7-alpine
    ports: ["6379:6379"]
    # noeviction: an LRU policy could silently evict the queue or dupefilter keys
    command: redis-server --maxmemory 2gb --maxmemory-policy noeviction
    volumes: ["redis_data:/data"]

  mongo:
    image: mongo:7
    ports: ["27017:27017"]
    volumes: ["mongo_data:/data/db"]

  # ── Scrapy Workers ──────────────────────────────────────────
  # Run as many of these as you have cores / IPs.
  # YAML merge (<<) is shallow: a service-level `environment` replaces the
  # anchor's list entirely, so every variable is repeated here.
  worker-1:
    <<: *worker-base
    command: >
      scrapy crawl distributed_products
      -s WORKER_ID=worker-1
      -s CONCURRENT_REQUESTS=16
    environment:
      - REDIS_URL=redis://redis:6379
      - MONGO_URI=mongodb://mongo:27017/
      - PROXY_LIST=${PROXY_LIST}
      - WORKER_ID=worker-1

  worker-2:
    <<: *worker-base
    command: >
      scrapy crawl distributed_products
      -s WORKER_ID=worker-2
      -s CONCURRENT_REQUESTS=16
    environment:
      - REDIS_URL=redis://redis:6379
      - MONGO_URI=mongodb://mongo:27017/
      - PROXY_LIST=${PROXY_LIST}
      - WORKER_ID=worker-2

  worker-3:
    <<: *worker-base
    command: >
      scrapy crawl distributed_products
      -s WORKER_ID=worker-3
      -s CONCURRENT_REQUESTS=16
    environment:
      - REDIS_URL=redis://redis:6379
      - MONGO_URI=mongodb://mongo:27017/
      - PROXY_LIST=${PROXY_LIST}
      - WORKER_ID=worker-3

  # ── Seed Service — pushes start URLs into Redis ─────────────
  seeder:
    <<: *worker-base
    command: python seeder.py
    restart: "no"   # Run once then exit

volumes:
  redis_data:
  mongo_data:

Part 5: The Seeder — Feeding URLs Into the Queue

# seeder.py
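import os
import redis
import time
import sys
from urllib.parse import urlencode

# Inside Docker Compose this resolves to redis://redis:6379; localhost is the local-dev fallback
REDIS_URL   = os.environ.get("REDIS_URL", "redis://localhost:6379")
REDIS_KEY   = "products:start_urls"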

def seed_from_list(urls: list[str], batch_size: int = 500):
    """Push a list of start URLs into the Redis queue."""
    r = redis.from_url(REDIS_URL)

    # Existing entries are kept; new URLs are appended to the queue
    existing = r.llen(REDIS_KEY)
    if existing > 0:
        print(f"Queue already has {existing:,} URLs. Adding to it.")

    pushed = 0
    for i in range(0, len(urls), batch_size):
        batch = urls[i:i + batch_size]
        r.rpush(REDIS_KEY, *batch)
        pushed += len(batch)
        print(f"Seeded {pushed:,}/{len(urls):,} URLs")

    print(f"\nDone. Redis queue '{REDIS_KEY}' has {r.llen(REDIS_KEY):,} URLs.")

def seed_paginated_site(
    base_url: str,
    start_page: int = 1,
    end_page: int = 500,
    page_param: str = "page"
):
    """Generate paginated URLs and push to Redis."""
    urls = []
    for page in range(start_page, end_page + 1):
        params = {page_param: page}
        urls.append(f"{base_url}?{urlencode(params)}")

    seed_from_list(urls)

def monitor_queue():
    """Monitor queue depth and worker progress in real time."""
    r = redis.from_url(REDIS_URL)
    print("Monitoring queue depth (Ctrl+C to stop)...")
    try:
        while True:
            depth   = r.llen(REDIS_KEY)
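            # Assumes the dupefilter key is configured as below; scrapy-redis
            # defaults to "<spider name>:dupefilter"
            seen    = r.scard(f"{REDIS_KEY}:dupefilter") or 0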
            print(f"  Queue: {depth:,} pending | Seen: {seen:,} processed", end="\r")
            time.sleep(2)
    except KeyboardInterrupt:
        print("\nMonitor stopped.")

if __name__ == "__main__":
    mode = sys.argv[1] if len(sys.argv) > 1 else "seed"

    if mode == "seed":
        seed_paginated_site(
            base_url  = "https://example-store.com/products",
            start_page = 1,
            end_page   = 1000,
        )
    elif mode == "monitor":
        monitor_queue()
    elif mode == "clear":
        r = redis.from_url(REDIS_URL)
        r.delete(REDIS_KEY)
        print(f"Queue '{REDIS_KEY}' cleared.")
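
seed_paginated_site appends ?page=N directly, which assumes base_url carries no query string of its own. A minimal sketch of a variant that merges the page parameter into any existing query, using only urllib.parse. The function name build_page_urls is hypothetical and not part of the original seeder:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def build_page_urls(base_url: str, start_page: int = 1, end_page: int = 3,
                    page_param: str = "page") -> list[str]:
    """Return one URL per page, preserving any query string already on base_url."""
    parts = urlsplit(base_url)
    # Keep existing parameters, but drop a stale copy of the page parameter
    existing = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
                if k != page_param]
    urls = []
    for page in range(start_page, end_page + 1):
        query = urlencode(existing + [(page_param, page)])
        urls.append(urlunsplit((parts.scheme, parts.netloc, parts.path,
                                query, parts.fragment)))
    return urls


if __name__ == "__main__":
    for url in build_page_urls("https://example-store.com/products?sort=new", 1, 2):
        print(url)
```

The resulting list can be passed straight to seed_from_list.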

Part 6: Scaling to Kubernetes

For truly large-scale crawling across dozens of machines, Kubernetes is the standard deployment target. The key insight: each Scrapy worker is a stateless pod that reads from the shared Redis queue.

# k8s/scrapy-worker-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scrapy-workers
  labels:
    app: scrapy-worker
spec:
  replicas: 10   # Start with 10 workers; scale up/down with kubectl
  selector:
    matchLabels:
      app: scrapy-worker
  template:
    metadata:
      labels:
        app: scrapy-worker
    spec:
      containers:
        - name: scrapy-worker
          image: your-registry/scrapy-worker:latest
          command:
            - scrapy
            - crawl
            - distributed_products
          env:
            - name: REDIS_URL
              valueFrom:
                secretKeyRef:
                  name: scraper-secrets
                  key: redis-url
            - name: MONGO_URI
              valueFrom:
                secretKeyRef:
                  name: scraper-secrets
                  key: mongo-uri
            - name: WORKER_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name  # Pod name as worker ID
          resources:
            requests:
              memory: "256Mi"
              cpu:    "250m"
            limits:
              memory: "512Mi"
              cpu:    "500m"

Scale workers up or down instantly:

# Scale to 20 workers
kubectl scale deployment scrapy-workers --replicas=20

# Check worker status
kubectl get pods -l app=scrapy-worker

# View logs from all workers
kubectl logs -l app=scrapy-worker --tail=50

# Auto-scale on CPU utilisation. Scaling on Redis queue depth instead
# needs an HPA fed by external/custom metrics (or a tool such as KEDA).
kubectl autoscale deployment scrapy-workers --min=2 --max=50
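
Scaling on queue depth ultimately comes down to turning a pending-URL count into a replica count. A minimal sketch of that calculation, assuming each worker should own roughly 5,000 pending URLs (an illustrative figure, not a measured one) and the same 2 to 50 bounds used above. The function name is hypothetical:

```python
import math


def desired_replicas(queue_depth: int, urls_per_worker: int = 5000,
                     min_replicas: int = 2, max_replicas: int = 50) -> int:
    """Map a pending-URL count to a worker count, clamped to [min, max]."""
    if urls_per_worker <= 0:
        raise ValueError("urls_per_worker must be positive")
    wanted = math.ceil(queue_depth / urls_per_worker)
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    for depth in (0, 42_000, 10_000_000):
        print(f"{depth:>12,} pending -> {desired_replicas(depth)} workers")
```

A small controller could read the depth with llen on the start-URL key and feed the result into kubectl scale deployment scrapy-workers --replicas=N.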

Part 7: Proxy Management at Scale

For serious scraping in 2026, residential proxies are almost always the safer option. Proxy quality matters far more than proxy quantity. A smaller pool of clean residential IPs usually performs much better than massive low-quality networks.

Here's a production proxy manager with health checking:

# proxy_manager.py
import asyncio
import httpx
import random
import time
import json
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ProxyHealth:
    url:              str
    success_count:    int   = 0
    failure_count:    int   = 0
    last_used:        float = 0.0
    last_success:     float = 0.0
    avg_response_ms:  float = 0.0
    is_banned:        bool  = False

    @property
    def success_rate(self) -> float:
        total = self.success_count + self.failure_count
        return self.success_count / total if total > 0 else 0.0

    @property
    def score(self) -> float:
        """Composite score: higher = better proxy to use."""
        if self.is_banned:
            return 0.0
        recency_bonus = max(0, 1 - (time.time() - self.last_success) / 3600)
        speed_score   = max(0, 1 - self.avg_response_ms / 5000)
        return self.success_rate * 0.6 + recency_bonus * 0.2 + speed_score * 0.2

class ProxyPool:
    """
    Intelligent proxy pool with health tracking and weighted selection.
    Workers register success/failure, pool learns which proxies are best.
    """

    BAN_THRESHOLD = 0.2   # Mark as banned if success rate drops below 20%

    def __init__(self, proxy_urls: list[str]):
        self.proxies = {url: ProxyHealth(url=url) for url in proxy_urls}

    def get_proxy(self, strategy: str = "weighted") -> Optional[str]:
        """Select a proxy using the specified strategy."""
        available = [
            p for p in self.proxies.values()
            if not p.is_banned
        ]

        if not available:
            return None

        if strategy == "random":
            return random.choice(available).url

        elif strategy == "weighted":
            # Weight by score — best proxies get used more often
            scores = [max(p.score, 0.01) for p in available]
            total  = sum(scores)
            weights = [s / total for s in scores]
            return random.choices(available, weights=weights)[0].url

        elif strategy == "round_robin":
            # Sort by last_used timestamp — use least recently used
            available.sort(key=lambda p: p.last_used)
            return available[0].url

        return available[0].url

    def report_success(self, proxy_url: str, response_ms: float):
        if proxy_url in self.proxies:
            p = self.proxies[proxy_url]
            p.success_count  += 1
            p.last_used       = time.time()
            p.last_success    = time.time()
            # Rolling average response time
            p.avg_response_ms = (p.avg_response_ms * 0.8 + response_ms * 0.2)

    def report_failure(self, proxy_url: str):
        if proxy_url in self.proxies:
            p = self.proxies[proxy_url]
            p.failure_count += 1
            p.last_used      = time.time()
            # Auto-ban proxies with very low success rate
            if p.failure_count > 10 and p.success_rate < self.BAN_THRESHOLD:
                p.is_banned = True
                print(f"  Proxy banned (success rate {p.success_rate:.0%}): {proxy_url}")

    def get_stats(self) -> dict:
        active  = [p for p in self.proxies.values() if not p.is_banned]
        banned  = [p for p in self.proxies.values() if p.is_banned]
        avg_sr  = sum(p.success_rate for p in active) / len(active) if active else 0

        return {
            "total":    len(self.proxies),
            "active":   len(active),
            "banned":   len(banned),
            "avg_success_rate": f"{avg_sr:.1%}",
            "best_proxy": max(active, key=lambda p: p.score).url if active else None,
        }

async def health_check_proxies(pool: ProxyPool, test_url: str = "https://httpbin.org/ip"):
    """
    Periodically check all proxies and un-ban those that have recovered.
    Run this as a background task.
    """
    for url, proxy in list(pool.proxies.items()):
        if not proxy.is_banned:
            continue
        try:
            # httpx configures proxies per client, not per request.
            # `proxy=` needs httpx >= 0.26; older releases use `proxies=`.
            async with httpx.AsyncClient(proxy=url, timeout=10) as client:
                start = time.time()
                r = await client.get(test_url)
            if r.status_code == 200:
                elapsed_ms      = (time.time() - start) * 1000
                proxy.is_banned = False
                pool.report_success(url, elapsed_ms)
                print(f"  Proxy recovered: {url}")
        except Exception:
            pass   # Still banned

    stats = pool.get_stats()
    print(f"Proxy pool: {stats['active']} active, {stats['banned']} banned, "
          f"avg success rate: {stats['avg_success_rate']}")
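
To see how the three terms in ProxyHealth.score trade off, here is the same weighting as a standalone function with the clock passed in explicitly. This is a worked illustration only; the name proxy_score and the sample numbers are hypothetical:

```python
def proxy_score(success_rate: float, seconds_since_success: float,
                avg_response_ms: float, is_banned: bool = False) -> float:
    """Same weighting as ProxyHealth.score: 60% success, 20% recency, 20% speed."""
    if is_banned:
        return 0.0
    recency_bonus = max(0.0, 1 - seconds_since_success / 3600)
    speed_score = max(0.0, 1 - avg_response_ms / 5000)
    return success_rate * 0.6 + recency_bonus * 0.2 + speed_score * 0.2


if __name__ == "__main__":
    # Two proxies with the same 90% success rate
    fast_fresh = proxy_score(0.9, seconds_since_success=600, avg_response_ms=1000)
    slow_stale = proxy_score(0.9, seconds_since_success=7200, avg_response_ms=6000)
    print(f"fast, used 10 min ago: {fast_fresh:.3f}")
    print(f"slow, idle for 2h:     {slow_stale:.3f}")
```

With equal success rates, the recency and speed terms can add at most 0.4 to the score, so a reliable but slow proxy still outranks a fast one that fails often.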

Part 8: Monitoring Your Distributed Crawl

# monitor.py — real-time crawl progress dashboard
import redis
import time
import json
from pymongo import MongoClient
from datetime import datetime

def live_dashboard(redis_url: str, mongo_uri: str, refresh_seconds: int = 5):
    """Print a live crawl progress dashboard to the terminal."""
    r   = redis.from_url(redis_url)
    # Must match MONGO_DATABASE in your Scrapy settings (MongoPipeline defaults to "scrapy_data")
    db  = MongoClient(mongo_uri)["distributed_scrape"]

    start_time   = time.time()
    prev_count   = db["products"].count_documents({})   # baseline, so the first rate reading isn't inflated

    try:
        while True:
            # Queue stats
            queue_depth  = r.llen("products:start_urls")
            seen_count   = r.scard("products:start_urls:dupefilter") or 0

            # DB stats
            items_stored = db["products"].count_documents({})
            items_delta  = items_stored - prev_count
            rate_per_min = items_delta * (60 / refresh_seconds)
            prev_count   = items_stored

            # Worker stats: assumes each worker adds its ID to a "scrapy:workers" set
            # (scrapy-redis does not maintain one on its own)
            workers = r.smembers("scrapy:workers") or set()

            # Elapsed
            elapsed = time.time() - start_time
            h, m    = divmod(int(elapsed), 3600)
            m, s    = divmod(m, 60)

            print(f"\033[2J\033[H")   # Clear screen
            print(f"{'═'*55}")
            print(f"  DISTRIBUTED SCRAPE MONITOR — {datetime.now().strftime('%H:%M:%S')}")
            print(f"{'═'*55}")
            print(f"  Runtime:       {h:02d}h {m:02d}m {s:02d}s")
            print(f"  Active workers:{len(workers)}")
            print(f"{'─'*55}")
            print(f"  Queue depth:   {queue_depth:>10,}  (URLs remaining)")
            print(f"  URLs seen:     {seen_count:>10,}  (deduplicated total)")
            print(f"  Items stored:  {items_stored:>10,}  (in MongoDB)")
            print(f"  Rate:          {rate_per_min:>10.0f}  items/minute")
            print(f"{'─'*55}")

            if rate_per_min > 0 and queue_depth > 0:
                eta_mins = queue_depth / rate_per_min
                h2, m2   = divmod(int(eta_mins * 60), 3600)
                m2, s2   = divmod(m2, 60)
                print(f"  ETA:           {h2:02d}h {m2:02d}m  (estimated)")

            print(f"{'═'*55}")
            time.sleep(refresh_seconds)

    except KeyboardInterrupt:
        print("\nMonitor stopped.")

if __name__ == "__main__":
    live_dashboard(
        redis_url="redis://localhost:6379",
        mongo_uri="mongodb://localhost:27017/",
    )
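
The dashboard counts members of a scrapy:workers set. If nothing in your setup populates that set yet, workers could register themselves along these lines. This is a minimal sketch: the function names are hypothetical, and r is any redis-py client (sadd and srem are standard redis-py calls):

```python
WORKERS_KEY = "scrapy:workers"   # the set name the dashboard reads


def register_worker(r, worker_id: str) -> None:
    """Call when a worker starts, e.g. from a spider_opened signal handler."""
    r.sadd(WORKERS_KEY, worker_id)


def unregister_worker(r, worker_id: str) -> None:
    """Call on clean shutdown; a crashed worker stays listed until removed."""
    r.srem(WORKERS_KEY, worker_id)
```

A plain set never expires entries, so a worker that dies without calling unregister_worker keeps inflating the count. Per-worker keys with a TTL would be the sturdier design if that matters for your crawl.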

Part 9: Resumable Crawls

One of the biggest advantages of scrapy-redis is that crawls are inherently resumable. If a worker crashes or you need to add more workers mid-crawl, simply restart:

# Start a fresh crawl
python seeder.py seed

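# Launch workers (they'll pick up from where they left off if SCHEDULER_PERSIST=True)
docker-compose up -d worker-1 worker-2 worker-3

# Pause all workers
docker-compose stop worker-1 worker-2 worker-3

# Resume later: queue state is preserved in Redis
docker-compose start worker-1 worker-2 worker-3

# To add workers mid-crawl, define more worker-N services. `--scale` only works
# if you replace them with a single generic `worker` service.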

To completely reset a crawl:

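# Clear the queue and deduplication filter
redis-cli del products:start_urls
redis-cli del products:start_urls:dupefilter
# If you kept scrapy-redis's default key names, the scheduler state lives under
# <spider name>:requests and <spider name>:dupefilter; clear those as well
python seeder.py seed   # Re-seed with fresh URLs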

Performance Numbers: What to Expect

Start on one machine, then scale out. Tune Scrapy's performance settings before throwing hardware at the problem: raise CONCURRENT_REQUESTS, turn on AUTOTHROTTLE, and enable HTTP caching. If a single server still can't handle millions of pages, set up a shared queue and distribute the crawl across multiple workers.
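
As a starting point, the single-machine tuning might look like the fragment below. The setting names are standard Scrapy settings; the values are illustrative assumptions to adjust per target site, not recommendations from a benchmark:

```python
# settings.py (illustrative values; tune per target site)
CONCURRENT_REQUESTS = 32              # Scrapy's default is 16
CONCURRENT_REQUESTS_PER_DOMAIN = 16   # default is 8

AUTOTHROTTLE_ENABLED = True           # adapt delays to observed latency
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 8.0

HTTPCACHE_ENABLED = True              # mainly useful for re-runs during development
HTTPCACHE_EXPIRATION_SECS = 3600
```

Note that the HTTP cache is local to each worker's filesystem, so it helps most while developing on one machine.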

Real-world throughput benchmarks:

Setup	Pages/hour	Cost estimate
1 worker, no proxy	~8,000	Free
1 worker + 10 proxies	~25,000	~$5/day
5 workers + 50 proxies	~120,000	~$20/day
20 workers + 200 proxies	~500,000	~$80/day
100 workers (Kubernetes)	~2,500,000	~$350/day
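
To compare the rows above on equal terms, it helps to convert them into cost per million pages and time to finish a fixed job. A small sketch of that arithmetic, using the table's figures as-is and assuming each rate holds for a full 24 hours (the 10M-page job size is an arbitrary example):

```python
# (setup, pages per hour, USD per day) copied from the table above
BENCHMARKS = [
    ("1 worker + 10 proxies", 25_000, 5),
    ("5 workers + 50 proxies", 120_000, 20),
    ("20 workers + 200 proxies", 500_000, 80),
    ("100 workers (Kubernetes)", 2_500_000, 350),
]


def cost_per_million(pages_per_hour: int, usd_per_day: float) -> float:
    """USD per one million pages, assuming the rate holds for 24 hours."""
    pages_per_day = pages_per_hour * 24
    return usd_per_day / pages_per_day * 1_000_000


def hours_to_crawl(total_pages: int, pages_per_hour: int) -> float:
    """Wall-clock hours needed for total_pages at a steady rate."""
    return total_pages / pages_per_hour


if __name__ == "__main__":
    for setup, pph, usd in BENCHMARKS:
        print(f"{setup:<26} ${cost_per_million(pph, usd):5.2f}/M pages, "
              f"{hours_to_crawl(10_000_000, pph):6.1f} h for 10M pages")
```

On these figures the cost per page changes far less between tiers than the wall-clock time does, so adding workers mostly buys speed.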

Summary

Component	Tool	Role
Spider	Scrapy + scrapy-redis	Crawl logic + distributed request handling
Queue	Redis	Shared URL queue with built-in deduplication
Worker deployment	Docker Compose / Kubernetes	Horizontal scale, stateless workers
Proxy management	Custom ProxyPool	Health-tracked, weighted proxy selection
Storage	MongoDB (bulk upserts)	Centralised, deduplicated results
Monitoring	Custom dashboard + Flower	Real-time progress and worker health
Resumability	SCHEDULER_PERSIST=True	Crash-safe, restartable crawls

Originally published on ZyVOP

AI Hype Meets Reality: Security Risks, Thin ROI, and the Rise of Skepticism | The AI Daily Roundup

ZyVOP — Mon, 06 Jul 2026 03:00:08 +0000

Trend Overview: From Hype to Hard Evidence

Across today’s headlines a single narrative emerges – the AI boom is colliding with hard‑won lessons about security, economics, and user agency. Companies are pulling back on unvetted models, researchers are exposing the limits of productivity gains, and regulators and users alike are demanding concrete safeguards.

Why the Shift Matters

When AI tools promise “instant expertise” or “autonomous vulnerability discovery,” the cost of failure is no longer abstract. A data breach, a wasted budget, or a legal liability can cripple a firm faster than a missed hype cycle. Simultaneously, the promised “AI‑powered productivity” is proving to be a marginal gain that evaporates before it reaches a paycheck. The convergence of security scares and thin ROI forces senior leaders to re‑evaluate AI adoption strategies, shifting capital toward proven, auditable solutions.

Evidence from the Day’s Stories

Security Backlash

Alibaba bans Claude Code after internal alarms about a potential backdoor. The move signals that even tech giants will enforce “zero‑trust” policies when model provenance is uncertain.
Claude Mythos preview triggers a 3.5× spike in high‑severity CVEs. Anthropic’s own showcase of autonomous vulnerability discovery has unintentionally amplified the attack surface, prompting a wave of disclosures from Microsoft, Google, Apple, and AWS partners.
Claude’s “memory” feature under fire. Engineers report zero performance benefit from retaining session transcripts, highlighting a broader pattern: added complexity without measurable security or efficiency gains.

Economic Reality Check

A Danish study finds AI saves only ~3% of work hours, and less than 5% of that time translates into higher pay. The gap between lab‑controlled speedups (15‑55%) and real‑world payroll data underscores a “leaky bucket” problem.
Yann LeCun admits current AI isn’t “smart” and bets on next‑generation systems that can handle real‑world data. His candid assessment validates the productivity findings: today’s models excel at narrow tasks but falter on general, embodied intelligence.

User Control & Transparency

Kagi adds an AI toggle, letting users disable AI‑driven search features. The option reflects growing demand for “opt‑out” mechanisms when AI adds noise rather than value.
Google’s Gemini Code Assist is being retired after less than a year, suggesting that even well‑funded services struggle to sustain adoption when the perceived ROI is low.

Who Wins, Who Loses

Beneficiaries: Security‑focused vendors (vulnerability‑management platforms, zero‑trust providers), audit‑ready AI platforms that expose provenance, and enterprises that adopt a measured, task‑specific AI strategy.

Losers: Companies banking on blanket AI adoption without clear use‑case validation, hype‑driven product launches, and consultants who sell “AI transformation” without quantifiable outcomes.

What Changes Next?

Stricter governance: Expect more corporate bans similar to Alibaba’s, and internal policies that require independent model audits before deployment.
Metrics‑first adoption: Teams will benchmark AI impact against payroll and P&L data before scaling, mirroring the methodology of the Danish study.
Feature pruning: Products that add “bells and whistles” (e.g., session‑transcript memory, generic code assistants) will be trimmed or sunset unless they demonstrate clear ROI.
Regulatory focus on AI‑driven security: As vulnerability spikes become visible, regulators may mandate disclosure of AI‑generated exploits and require companies to certify model safety.

Bottom Line

The AI industry is entering a phase of “skeptical scaling.” Security incidents, modest productivity gains, and user‑driven demand for control are forcing a recalibration. Leaders who embed rigorous measurement, enforce zero‑trust model policies, and prioritize real‑world value will capture the next wave of AI‑enabled advantage.

Originally published on ZyVOP

Passwordless Login With Magic Links in Node.js

ZyVOP — Sun, 05 Jul 2026 03:28:14 +0000

I've seen developers implement magic link auth four different ways and get it wrong in three of them. The broken versions all share the same flaw: they verify tokens with a separate get and del, which means two simultaneous requests with the same token can both pass.

User A clicks the link in Gmail. The prefetch scanner in their email client hit it half a second earlier. The server issues two sessions. The user never notices, but one of those sessions now belongs to a client they don't control.

This post builds it right. By the end you'll have a working Node.js implementation with proper atomic token verification, email enumeration protection, an Ethereal fallback so you can see the full flow without any SMTP configuration, and 19 tests that run without Redis or a real mail server.

Source code: http://github.com/zyvop27-cmyk/zyvop-blogs/tree/main/magic-link-auth

The one decision that matters most

Before touching any code: the difference between a safe magic link implementation and a broken one is a single Redis command.

The broken pattern:

const email = await redis.get(key);
if (email) await redis.del(key);
return email;

Between the get and the del, another request can read the same key. Both get a valid email back, both get sessions issued.

getDel collapses that into one atomic operation:

const email = await redis.getDel(key);
return email ?? null;

Redis executes this as a single command (GETDEL, available since Redis 6.2). There is no window: one request gets the email, any concurrent request gets null.

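That atomicity, in one round trip with no transaction to manage, is why Redis fits this problem so well. A relational database can match it (in Postgres, a single DELETE ... RETURNING statement is atomic too), but a separate SELECT followed by a DELETE has exactly the race described above.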

The full verify function adds a length check before the Redis call. A token that's too short or too long gets rejected immediately, no round-trip needed:

// src/lib/token.js
import crypto from "node:crypto";

const TOKEN_PREFIX = "magic:";
const TOKEN_TTL_SECONDS = 15 * 60;
const TOKEN_BYTES = 32;

export async function createToken(redis, email) {
  const token = crypto.randomBytes(TOKEN_BYTES).toString("hex");
  await redis.set(TOKEN_PREFIX + token, email.toLowerCase(), { EX: TOKEN_TTL_SECONDS });
  return token;
}

export async function verifyToken(redis, token) {
  if (!token || typeof token !== "string" || token.length !== TOKEN_BYTES * 2) {
    return null;
  }
  return (await redis.getDel(TOKEN_PREFIX + token)) ?? null;
}

crypto.randomBytes(32) gives 256 bits of entropy, encoded as 64 hex characters. Brute-forcing that against a 15-minute window isn't happening. The email gets lowercased on creation so User@Example.COM and user@example.com resolve to the same identity when the session is issued.
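
To put a number on the brute-force claim, here is the arithmetic as a quick sketch, written in Python only because it handles big integers natively. The guess rate of one million verify requests per second is a deliberately generous, hypothetical assumption:

```python
TOKEN_BITS = 32 * 8              # crypto.randomBytes(32)
WINDOW_SECONDS = 15 * 60         # token TTL
GUESSES_PER_SECOND = 1_000_000   # assumed attacker throughput (hypothetical)

keyspace = 2 ** TOKEN_BITS
guesses = GUESSES_PER_SECOND * WINDOW_SECONDS
probability = guesses / keyspace   # chance of hitting one specific live token

print(f"keyspace: 2^{TOKEN_BITS}")
print(f"guesses in one TTL window: {guesses:,}")
print(f"success probability: {probability:.1e}")
```

Even at that rate, the attacker covers a vanishingly small fraction of the keyspace before the token expires.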

Redis also handles expiry natively via the EX option. No cron job to clean up stale tokens, no WHERE expires_at < NOW() queries — they disappear on their own.

What the user actually sees

The sign-in page is minimal HTML with one real piece of JavaScript: it handles the ?error=link_invalid query param that the server redirects to when a token has expired or already been used.

<!-- public/index.html (abbreviated) -->
<form id="form">
  <input type="email" id="email" placeholder="you@example.com" required>
  <button type="submit">Send sign-in link</button>
</form>
<div id="msg" class="message"></div>

<script>
  const ERRORS = {
    link_invalid: "That link has expired or already been used. Please request a new one.",
    default: "Something went wrong. Please try again."
  };

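  // Minimal helper for this abbreviated listing; the repo's version may differ
  function showMsg(text, type) {
    const el = document.getElementById("msg");
    el.textContent = text;
    el.className = `message ${type}`;
  }

  // Show error from redirect after a failed verify attempt
  const errKey = new URLSearchParams(location.search).get("error");
  if (errKey) showMsg(ERRORS[errKey] ?? ERRORS.default, "error");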

  document.getElementById("form").addEventListener("submit", async (e) => {
    e.preventDefault();
    const email = document.getElementById("email").value.trim();
    const btn = e.target.querySelector("button");

    btn.disabled = true;
    btn.textContent = "Sending…";

    const res = await fetch("/auth/request", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ email })
    });

    const data = await res.json();

    if (res.ok) {
      showMsg("Check your inbox — a sign-in link is on its way.", "success");
      document.getElementById("form").style.display = "none";
    } else {
      showMsg(data.error ?? ERRORS.default, "error");
      btn.disabled = false;
      btn.textContent = "Send sign-in link";
    }
  });
</script>

The ?error=link_invalid flow matters more than it looks. When verification fails — expired token, already used, someone typed a URL wrong — it redirects to /?error=link_invalid instead of returning a JSON 400.

Users arrive at /auth/verify by clicking a link in their email client, not via a fetch call. A JSON error on a blank page is a dead end; a redirect back to the sign-in form with a clear message is something a person can act on.

Sending the email without configuring SMTP

Production email uses SMTP_HOST, SMTP_PORT, SMTP_USER, and SMTP_PASS. Without those, the mailer automatically creates an Ethereal test account and logs a preview URL. Ethereal is a real catch-all SMTP service: emails aren't delivered anywhere, but you can open the URL and see the full rendered email, including the magic link button, right in your browser.

// src/lib/mailer.js
import nodemailer from "nodemailer";

let transporter = null;

async function getTransporter() {
  if (transporter) return transporter;

  if (process.env.SMTP_HOST) {
    transporter = nodemailer.createTransport({
      host: process.env.SMTP_HOST,
      port: Number(process.env.SMTP_PORT) || 587,
      secure: process.env.SMTP_SECURE === "true",
      auth: { user: process.env.SMTP_USER, pass: process.env.SMTP_PASS }
    });
  } else {
    const testAccount = await nodemailer.createTestAccount();
    transporter = nodemailer.createTransport({
      host: "smtp.ethereal.email",
      port: 587,
      auth: { user: testAccount.user, pass: testAccount.pass }
    });
    console.log(`[mailer] Ethereal test account: ${testAccount.user}`);
  }

  return transporter;
}

export async function sendMagicLink(to, magicUrl) {
  const transport = await getTransporter();

  const info = await transport.sendMail({
    from: process.env.EMAIL_FROM || '"Magic Link Auth" <no-reply@example.com>',
    to,
    subject: "Your sign-in link",
    text: `Sign-in link (expires in 15 minutes, single use):\n\n${magicUrl}`,
    html: `
      <p>Click below to sign in. Expires in <strong>15 minutes</strong>, single use.</p>
      <p style="margin:24px 0">
        <a href="${magicUrl}" style="background:#111;color:#fff;padding:12px 24px;border-radius:6px;text-decoration:none">
          Sign in
        </a>
      </p>
      <p style="color:#999;font-size:12px">Didn't request this? Safe to ignore.</p>
    `
  });

  if (!process.env.SMTP_HOST) {
    console.log(`[mailer] Preview URL: ${nodemailer.getTestMessageUrl(info)}`);
  }
}

The transporter is cached in a module-level variable. nodemailer.createTestAccount() makes a real HTTP call to Ethereal — you don't want that per-request.

For production, Resend and Postmark are cleaner options than raw SMTP. They handle deliverability, bounce handling, and most of the SPF/DKIM work for you. Hook them in via SMTP_HOST and the related variables in .env.

Sessions, cookies, and why SameSite: lax is correct here

After verification, the token is gone and the user needs something that persists across requests. A JWT in an httpOnly cookie is the right shape: stateless (no server-side session table), survives page refreshes, inaccessible to JavaScript running on the page.

// src/lib/session.js
import jwt from "jsonwebtoken";

const COOKIE_NAME = "session";
const SESSION_TTL_SECONDS = 7 * 24 * 60 * 60;

function getSecret() {
  const secret = process.env.JWT_SECRET;
  if (!secret || secret.length < 32) {
    throw new Error("JWT_SECRET must be set and at least 32 characters long");
  }
  return secret;
}

export function issueSession(res, email) {
  const token = jwt.sign({ email }, getSecret(), { expiresIn: SESSION_TTL_SECONDS });
  res.cookie(COOKIE_NAME, token, {
    httpOnly: true,
    secure: process.env.NODE_ENV === "production",
    sameSite: "lax",
    maxAge: SESSION_TTL_SECONDS * 1000
  });
}

export function requireAuth(req, res, next) {
  const token = req.cookies?.[COOKIE_NAME];
  if (!token) return res.status(401).json({ error: "Not authenticated." });

  try {
    const payload = jwt.verify(token, getSecret());
    req.user = { email: payload.email };
    next();
  } catch {
    res.clearCookie(COOKIE_NAME);
    return res.status(401).json({ error: "Session expired. Please sign in again." });
  }
}

sameSite: "lax" needs explaining here because it interacts directly with how magic links work. When a user clicks a link in their email client, that's a top-level navigation — the browser follows it like a normal page load. Lax allows the cookie to be sent on those top-level navigations from external origins.

Strict would block that, requiring another sign-in if a user arrives from any external link. None requires HTTPS everywhere and opens CSRF exposure. Lax is the right call for this pattern.

secure only goes on in production. Without that conditional, local development over HTTP would silently fail to set the cookie and you'd spend an hour wondering why sessions don't persist.

The two routes that do the work

// src/routes/auth.js
import { Router } from "express";
import { createToken, verifyToken } from "../lib/token.js";
import { sendMagicLink as defaultSendMagicLink } from "../lib/mailer.js";
import { issueSession } from "../lib/session.js";

const EMAIL_RE = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

export function createAuthRouter(redis, { sendMagicLink = defaultSendMagicLink } = {}) {
  const router = new Router();

  router.post("/request", async (req, res) => {
    const email = (req.body?.email ?? "").trim().toLowerCase();

    if (!EMAIL_RE.test(email)) {
      return res.status(400).json({ error: "A valid email address is required." });
    }

    try {
      const token = await createToken(redis, email);
      const appUrl = process.env.APP_URL || `http://localhost:${process.env.PORT || 3000}`;
      await sendMagicLink(email, `${appUrl}/auth/verify?token=${token}`);
      res.json({ ok: true, message: "Check your inbox for a sign-in link." });
    } catch (err) {
      console.error("[auth] request failed:", err);
      res.status(500).json({ error: "Failed to send the sign-in link. Please try again." });
    }
  });

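  router.get("/verify", async (req, res) => {
    try {
      const email = await verifyToken(redis, req.query.token);
      if (!email) return res.redirect("/?error=link_invalid");
      issueSession(res, email);
      res.redirect("/dashboard");
    } catch (err) {
      // Express 4 does not catch rejected promises from async handlers.
      // Any unknown error key falls back to the page's default message.
      console.error("[auth] verify failed:", err);
      res.redirect("/?error=server");
    }
  });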

  router.post("/logout", (req, res) => {
    res.clearCookie("session");
    res.redirect("/");
  });

  return router;
}

/request returns 200 whether the email exists in your system or not. A 404 for unknown emails would tell anyone who tries that an address isn't registered — a user enumeration leak. The response is always "check your inbox," whether you're a real user or someone probing your database.

The sendMagicLink function is injected through the router's options argument, with the real mailer as the default. That's the dependency injection hook tests use: swap it for a no-op and no SMTP connection is required.

Running the whole thing

# Start Redis (Docker is fastest)
docker run -d -p 6379:6379 redis:alpine

# Install deps and start
npm install && cp .env.example .env && npm start

The server needs at minimum a JWT_SECRET in .env. Generate one:

node -e "console.log(require('crypto').randomBytes(32).toString('hex'))"

Request a link:

curl -X POST http://localhost:3000/auth/request \
  -H "Content-Type: application/json" \
  -d '{"email": "you@example.com"}'

{ "ok": true, "message": "Check your inbox for a sign-in link." }

Your server console will print something like:

[mailer] Preview URL: https://ethereal.email/message/WaQKMgKddxQDoou

Open that URL, click the "Sign in" button in the rendered email. You'll land on the dashboard. The token is gone — clicking the link again redirects to /?error=link_invalid.

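Check your session from the command line. The browser click stored the cookie in the browser, so request a fresh link and verify it with curl to save the cookie to cookies.txt:

TOKEN="paste-an-unused-token-from-a-fresh-preview-email"
curl -s -o /dev/null -c cookies.txt "http://localhost:3000/auth/verify?token=$TOKEN"
curl -b cookies.txt http://localhost:3000/api/me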

{ "email": "you@example.com" }

Invalid email format:

curl -s -X POST http://localhost:3000/auth/request \
  -H "Content-Type: application/json" \
  -d '{"email": "notanemail"}'

{ "error": "A valid email address is required." }

Rate limiter — 5 requests per 15 minutes per IP, sixth gets a 429:

for i in $(seq 1 6); do
  curl -s -o /dev/null -w "request $i -> %{http_code}\n" \
    -X POST http://localhost:3000/auth/request \
    -H "Content-Type: application/json" \
    -d '{"email":"test@example.com"}'
done

request 1 -> 200
request 2 -> 200
request 3 -> 200
request 4 -> 200
request 5 -> 200
request 6 -> 429

Five is tight enough to stop inbox flooding and loose enough that a real user who typo'd their address gets a few retries.

Expired or already-used token:

curl -v "http://localhost:3000/auth/verify?token=$( python3 -c 'print("a"*64)')" 2>&1 | grep "Location:"

< Location: /?error=link_invalid

Tests

19 tests, no Redis or SMTP connection needed. The token suite runs against a plain in-memory Map mimicking the Redis interface; the session suite sets JWT_SECRET in before() and cleans up after; the route suite injects the no-op mailer:

npm test

# tests 19
# pass  19
# fail   0

If you want to verify the atomic single-use behavior beyond the unit test, run the server with a real Redis instance and hit /auth/verify with the same token from two curl commands fired in parallel:

TOKEN="paste-a-real-token-from-console-here"
curl "http://localhost:3000/auth/verify?token=$TOKEN" &
curl "http://localhost:3000/auth/verify?token=$TOKEN" &
wait

One will redirect to /dashboard. The other will redirect to /?error=link_invalid. That's getDel doing its job.
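
The single-use guarantee comes down to reading and deleting the token in one step. A sketch of that shape, using an in-memory Map in place of Redis (MemoryTokenStore and consumeToken are illustrative names, not taken from the repo):

```typescript
// In-memory stand-in for the one Redis command the verify route depends on.
class MemoryTokenStore {
  private readonly tokens = new Map<string, string>();

  set(token: string, email: string): void {
    this.tokens.set(token, email);
  }

  // Mirrors GETDEL: return the value and remove the key in a single step,
  // so there is no window between "read" and "delete" for a second caller.
  getDel(token: string): string | null {
    const email = this.tokens.get(token) ?? null;
    this.tokens.delete(token);
    return email;
  }
}

// Returns the email the first time a token is presented, null every time after.
function consumeToken(store: MemoryTokenStore, token: string): string | null {
  return store.getDel(token);
}

const demoStore = new MemoryTokenStore();
demoStore.set("abc123", "you@example.com");
console.log(consumeToken(demoStore, "abc123")); // you@example.com
console.log(consumeToken(demoStore, "abc123")); // null
```

A separate GET followed by DEL would leave a gap in which two parallel requests could both read the token before either one deletes it.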

Before going live

Set NODE_ENV=production — the Secure cookie flag only activates in production, and without HTTPS the cookie won't be sent by browsers at all. Point REDIS_URL at a managed instance (Upstash has a free tier and works well with this setup). Swap Ethereal for a transactional email provider; Resend has a generous free tier and a clean Node.js SDK. Put the whole thing behind nginx or Caddy for TLS termination.

One thing the repo doesn't include but you'll want eventually: a users table or equivalent. Right now any email can request a link and get a session — there's no concept of "registered users." Adding a check in /request that verifies the email exists in your database before sending the link is one line, but the shape of that check depends on your stack.

Get the code: http://github.com/zyvop27-cmyk/zyvop-blogs/tree/main/magic-link-auth

Originally published on ZyVOP

💡 For more articles like this, subscribe to the ZyVOP newsletter!

Rate Limiting Alone Won't Stop a Patient Attacker

ZyVOP — Sat, 04 Jul 2026 02:29:26 +0000

@nestjs/throttler counts requests per IP address within a time window. That's the entire mechanism — no concept of accounts, passwords, or "this one person is being targeted." Point it at a login endpoint and it will correctly stop a script hammering /login a hundred times a second. It will do nothing against someone guessing a password once every few minutes, or spreading attempts across a dozen IPs, because that was never the problem it's built to solve.

Below is both halves of a real defense: the IP throttle for volume, and a small Redis-backed service that locks by email for everything the throttle misses. Every response shape, status code, and timing claim here came from actually running the requests, not from the docs — including one interaction between two timers that will quietly undo a lockout's own "fresh start" if you're not paying attention to how they relate.

The throttle, and what it actually returns

A baseline policy on every route, plus a stricter override on login specifically:

// app.module.ts — global baseline: 20 requests/60s per IP, everywhere
ThrottlerModule.forRoot([{ name: 'default', ttl: 60000, limit: 20 }]),
providers: [{ provide: APP_GUARD, useClass: ThrottlerGuard }],

// auth.controller.ts — login gets a stricter limit than the rest of the API
@Throttle({ default: { limit: 10, ttl: 60000 } })
@Post('login')
async login(@Body() dto: LoginDto) { /* ... */ }

Whether that override replaces the global 20/60s or stacks on top of it isn't obvious from reading the decorator, so it's worth confirming rather than guessing: it replaces. A route-level @Throttle with the same policy name (default) fully takes over for that route. Eleven rapid requests to /auth/register (no override, so it rides the global policy) all came back 201; the same eleven against /auth/login would have used up its 10-request limit and tripped on the eleventh.

The response headers on a request that's still under the limit:

X-RateLimit-Limit: 10
X-RateLimit-Remaining: 9
X-RateLimit-Reset: 60

And once it's crossed:

// HTTP 429
{ "statusCode": 429, "message": "ThrottlerException: Too Many Requests" }

That's the library's real, unmodified default, not a placeholder for something cleaner. It reads like a stringified exception because that's more or less what it is. If this needs to match the rest of an API's error shape, it's worth reshaping, for example in a ThrottlerGuard subclass that overrides throwThrottlingException, or in a Nest exception filter.

Where the throttle's job ends

None of that knows or cares who's logging in. Ten requests a minute from one IP against alice@example.com and ten requests a minute from ten different IPs, one attempt each, all against alice@example.com, look completely different to a per-IP counter — the second pattern never trips it, no matter how long it continues. That's the gap a second, IP-independent mechanism has to cover.
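
The blind spot is easy to see in a toy per-IP counter. This is an illustration of the idea only, not @nestjs/throttler's implementation; PerIpCounter and the addresses are made up for the example, and the window never resets because both patterns fit inside one window.

```typescript
// Minimal fixed-window counter keyed by IP address.
class PerIpCounter {
  private readonly hits = new Map<string, number>();
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  // Returns true if this request is still within the limit for its IP.
  allow(ip: string): boolean {
    const count = (this.hits.get(ip) ?? 0) + 1;
    this.hits.set(ip, count);
    return count <= this.limit;
  }
}

// Pattern 1: one IP sends 11 login attempts in the same window.
const singleIp = new PerIpCounter(10);
let blockedSingle = 0;
for (let i = 0; i < 11; i++) {
  if (!singleIp.allow("203.0.113.7")) blockedSingle++;
}

// Pattern 2: eleven IPs send one attempt each, all aimed at the same account.
const manyIps = new PerIpCounter(10);
let blockedSpread = 0;
for (let i = 0; i < 11; i++) {
  if (!manyIps.allow(`198.51.100.${i}`)) blockedSpread++;
}

console.log({ blockedSingle, blockedSpread }); // { blockedSingle: 1, blockedSpread: 0 }
```

The same eleven guesses reach the account either way; only the first pattern is ever visible to a counter that keys on the address.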

Locking the account, not the address

async recordFailure(email: string): Promise<FailureResult> {
  const key = this.attemptsKey(email);
  const attempts = await this.redis.incr(key);
  if (attempts === 1) {
    await this.redis.expire(key, this.windowSeconds);
  }

  if (attempts >= this.maxAttempts) {
    await this.redis.set(this.lockKey(email), '1', 'EX', this.lockoutSeconds);
    return { attempts, locked: true, retryAfterSeconds: this.lockoutSeconds };
  }

  return { attempts, locked: false };
}

async recordSuccess(email: string): Promise<void> {
  await this.redis.del(this.attemptsKey(email), this.lockKey(email));
}

INCR against a key that doesn't exist yet starts it at 1, so there's no separate setup step for a first failure. The counter's expiry is set once — only when attempts === 1 — so a streak of failures within one window doesn't keep pushing its own deadline out; it's one streak, with one expiry, not a self-renewing one.

Both key-building helpers lowercase the email first:

private attemptsKey(email: string): string {
  return `login-attempts:${email.toLowerCase()}`;
}

Skipping that normalization would open a real gap, not just a style nitpick — so rather than assume it works, this was tested directly: three failures as dana@example.com, then two more as Dana@Example.COM, a different casing entirely. The Redis counter read 5 afterward, under a single key, and a sixth attempt in yet another casing (DANA@EXAMPLE.COM) — this time with the correct password — still came back locked. Without the .toLowerCase(), those three casings would have been three separate five-strike budgets instead of one.

Two timers that look independent and aren't

The service has two separate durations: how long the failure counter itself lives (windowSeconds), and how long an actual lock lasts once triggered (lockoutSeconds). Treating them as unrelated knobs is the natural first instinct, and it's wrong — tested directly, with a 10-second window and a 6-second lock:

lock status right after the 6s lock expires: { locked: false }
attempts count at that same moment:          5

The lock is gone, but the counter — on its own, longer, 10-second clock — hasn't caught up yet. It's still sitting at 5. One more failed attempt right then doesn't start over at 1; it pushes the existing counter to 6, still over the 5-attempt threshold, and the account locks again immediately:

result of one failure right after unlock: { attempts: 6, locked: true }

So "unlocked" didn't mean "clean slate" here — it meant "one more mistake and you're back in." That might be exactly the behavior you want (harsher consequences for someone who fails again right after a lockout), but it should be a choice, not a side effect of two numbers that happened to get picked independently. Defaulting LOGIN_LOCKOUT_WINDOW_SECONDS and LOGIN_LOCKOUT_DURATION_SECONDS to the same value is what makes the out-of-the-box behavior a genuine reset: by the time the lock is gone, so is the counter it was based on. Tested with matched timers instead of mismatched ones, the positive case is exactly as boring as it should be — wait out the lock, and a correct password just works:

curl -X POST http://localhost:3000/auth/login -H "Content-Type: application/json" \
  -d '{"email":"alice@example.com","password":"correct-horse-battery"}'
# -> 201 { "accessToken": "..." }   (no failures logged anywhere afterward)
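
The interaction between the two timers can be reproduced without Redis. The sketch below is a simulation under simplifying assumptions (a manual clock in seconds, all five failures landing at t=0), not the service's real code; LockoutSim and failAfterUnlock are hypothetical names.

```typescript
// In-memory model of the attempts key and the lock key, driven by a manual clock.
class LockoutSim {
  now = 0;
  private attempts = 0;
  private attemptsExpireAt = 0;
  private lockExpireAt = 0;
  private readonly maxAttempts: number;
  private readonly windowSeconds: number;
  private readonly lockoutSeconds: number;

  constructor(maxAttempts: number, windowSeconds: number, lockoutSeconds: number) {
    this.maxAttempts = maxAttempts;
    this.windowSeconds = windowSeconds;
    this.lockoutSeconds = lockoutSeconds;
  }

  isLocked(): boolean {
    return this.now < this.lockExpireAt;
  }

  recordFailure(): { attempts: number; locked: boolean } {
    // INCR on an expired or missing key starts again from 1.
    if (this.now >= this.attemptsExpireAt) this.attempts = 0;
    this.attempts += 1;
    // The expiry is set only on the first failure of a streak.
    if (this.attempts === 1) this.attemptsExpireAt = this.now + this.windowSeconds;
    const locked = this.attempts >= this.maxAttempts;
    if (locked) this.lockExpireAt = this.now + this.lockoutSeconds;
    return { attempts: this.attempts, locked };
  }
}

// Five failures at t=0, then one more at the moment the lock expires.
function failAfterUnlock(windowSeconds: number, lockoutSeconds: number) {
  const sim = new LockoutSim(5, windowSeconds, lockoutSeconds);
  for (let i = 0; i < 5; i++) sim.recordFailure();
  sim.now = lockoutSeconds;
  return { lockedBefore: sim.isLocked(), ...sim.recordFailure() };
}

console.log(failAfterUnlock(10, 6));  // { lockedBefore: false, attempts: 6, locked: true }
console.log(failAfterUnlock(10, 10)); // { lockedBefore: false, attempts: 1, locked: false }
```

With the mismatched pair the first failure after unlock re-locks immediately; with matched values it starts a fresh streak at 1.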

The failure that triggers a lock doesn't announce it

try {
  const user = await this.authService.validatePassword(dto.email, dto.password);
  await this.loginAttemptService.recordSuccess(dto.email);
  return { accessToken: this.authService.issueToken(user) };
} catch (err) {
  await this.loginAttemptService.recordFailure(dto.email);
  throw err; // same error whether or not this failure just triggered a lock
}

Four wrong-password attempts return four identical 401s. The fifth — the one that actually crosses the threshold — returns that same 401, not a warning. The lock only becomes visible on whatever comes next, correct password included. Confirmed end to end: four 401s, a fifth 401 that locked the account behind the scenes, and only the sixth request got the 429. Nothing in that fifth response tells anyone it was the last try, which is one less signal an attacker gets to calibrate around.

Both mechanisms, same request

# alice fails 5 times, then even the right password is rejected
curl -X POST http://localhost:3000/auth/login -H "Content-Type: application/json" \
  -d '{"email":"alice@example.com","password":"wrong-password"}'
# -> 401  (x5)

curl -X POST http://localhost:3000/auth/login -H "Content-Type: application/json" \
  -d '{"email":"alice@example.com","password":"correct-horse-battery"}'
# -> 429 {"statusCode":429,"message":"Account temporarily locked...","reason":"ACCOUNT_LOCKED"}

# bob, meanwhile, is entirely unaffected
curl -X POST http://localhost:3000/auth/login -H "Content-Type: application/json" \
  -d '{"email":"bob@example.com","password":"correct-horse-battery"}'
# -> 201 { "accessToken": "..." }

The lockout 429 and the throttler's 429 share a status code but not a body — "reason":"ACCOUNT_LOCKED" versus the generic ThrottlerException message — so nothing downstream has to guess which mechanism fired.
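
On the client side, telling the two apart can be a one-line check on the body. A small sketch, assuming the two response shapes shown in this post; classify429 and the Cause type are hypothetical helpers, not part of the demo project.

```typescript
// Body of a 429 as shown above; `reason` is present only on the lockout response.
interface TooManyRequestsBody {
  statusCode: number;
  message: string;
  reason?: string;
}

type Cause = "account-locked" | "ip-throttled";

// Decide which of the two mechanisms produced a 429.
function classify429(body: TooManyRequestsBody): Cause {
  return body.reason === "ACCOUNT_LOCKED" ? "account-locked" : "ip-throttled";
}

console.log(
  classify429({ statusCode: 429, message: "ThrottlerException: Too Many Requests" }),
); // ip-throttled
```

A UI could use that distinction to show "try again in a minute" for the throttle and a recovery path for the lock.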

One sequencing detail if you're replicating this yourself: the IP throttle counts every call to /login in its window, not just the ones in whatever you'd mentally group as "the throttle test." Running the lockout sequence above first, then immediately sending a batch of requests to check the throttle, tripped the 429 on the third request of that batch rather than somewhere near the tenth — because the eight requests just spent on the lockout scenario were already sitting in the same 60-second window. Not a bug, just a shared counter that doesn't know your test plan has phases.

Running it

Needs Postgres and Redis; Docker is the fastest path to both:

git clone <your-repo-url> && cd rate-limiting-nestjs-demo
npm install && cp .env.example .env
docker run --name rl-postgres -e POSTGRES_PASSWORD=postgres -e POSTGRES_DB=ratelimit_demo -p 5432:5432 -d postgres:16
docker run --name rl-redis -p 6379:6379 -d redis:7
npm run start:dev

The shipped defaults (5 attempts, a 15-minute window and lock) are real production numbers, not demo placeholders, but they're still a starting point rather than a universal answer — how forgiving to be with a legitimate user who fumbles their password a few times is a judgment call specific to what's being protected. Lower LOGIN_LOCKOUT_WINDOW_SECONDS/LOGIN_LOCKOUT_DURATION_SECONDS in your own .env if you want to watch a lock expire without an actual 15-minute wait.

What's deliberately not here: the in-memory throttle storage this demo uses doesn't share counters across multiple app instances, so a horizontally-scaled deployment needs a shared store (@nestjs/throttler supports pluggable backends, Redis included) or the limit quietly stops meaning what it says. There's also no lockout notification to the account owner, and no CAPTCHA as a third layer for public-facing forms — both reasonable additions, both left out here to keep the two mechanisms that are built easy to see clearly.

The full project — this module, the lockout service, and a README with the complete setup and curl walkthrough — is in the rate-limiting-nestjs-demo repository alongside this post.

Originally published on ZyVOP


Background Jobs in NestJS with BullMQ: A Complete Walkthrough

ZyVOP — Fri, 03 Jul 2026 04:10:06 +0000

Calling a slow, occasionally-unreliable upstream API — an LLM provider, a payment processor, a third-party webhook — directly inside a request handler ties that request's fate to the upstream call's fate. If it's slow, the client waits on the full round trip. If it fails, the handler has to decide on the spot whether to fail the whole request or quietly fall back to something else, which is exactly the kind of decision that's easy to get wrong under pressure and hard to notice when it goes wrong silently.

This post builds a background job pipeline in NestJS using BullMQ: a request that enqueues work and returns immediately, a worker that processes it with automatic retries, and a durable record a client can poll for the outcome. It's demonstrated on an async draft-generation endpoint — the same shape as a real LLM-backed content pipeline — with a couple of details about BullMQ's actual runtime behavior that are easy to get wrong if you're going off the docs alone. The full project, tested end-to-end against real Postgres and Redis, is linked at the end.

The shape of the problem

A synchronous version of this endpoint looks simple:

@Post('drafts')
async create(@Body() dto: CreateDraftDto) {
  const content = await this.llmProvider.generate(dto.topic); // could take 5-30s, could fail
  return this.draftsRepository.save({ topic: dto.topic, content });
}

The request now blocks for as long as the LLM call takes, and a single transient failure (a timeout, a rate limit, a brief provider outage) becomes a failed request with nothing to show for it — no record that it was attempted, nothing to retry, nothing to inspect afterward.

The fix is to decouple "accept the request" from "do the work":

@Post('drafts')
create(@Body() dto: CreateDraftDto) {
  return this.draftsService.enqueue(dto); // returns in milliseconds
}

enqueue() writes a pending row and hands the job to BullMQ. A separate worker process picks it up, retries it automatically on failure, and updates that row when it's done — successfully or not.

Two sources of truth, on purpose

This implementation keeps two records of a job's state, deliberately:

BullMQ's own state, in Redis — which attempt it's on, when it'll retry next, its position in the queue. This is operational state, and it's normal for it to get cleaned up after a job finishes (removeOnComplete/removeOnFail).
A Postgres row, written by the application — pending → processing → completed/failed, plus the result or failure reason. This is what survives a Redis flush, what a client actually polls, and what you'd query for "show me every failed draft from last week."

export enum DraftJobStatus {
  PENDING = 'pending',
  PROCESSING = 'processing',
  COMPLETED = 'completed',
  FAILED = 'failed',
}

@Entity()
export class DraftJob {
  @PrimaryGeneratedColumn('uuid')
  id: string;

  @Column()
  topic: string;

  @Column({ type: 'varchar', default: DraftJobStatus.PENDING })
  status: DraftJobStatus;

  @Column({ type: 'text', nullable: true })
  result: string | null;

  @Column({ type: 'text', nullable: true })
  failureReason: string | null;

  @Column({ default: 0 })
  attemptsMade: number;

  @CreateDateColumn()
  createdAt: Date;

  @Column({ type: 'timestamp', nullable: true })
  completedAt: Date | null;
}

Conflating these two — treating BullMQ's Redis-backed job as the only record — means losing history the moment a job gets cleaned up, and gives clients nothing stable to poll against.

Wiring BullMQ into the Nest app

Before any of that, BullMQ needs a Redis connection — configured once, at the root module:

// app.module.ts
BullModule.forRootAsync({
  inject: [ConfigService],
  useFactory: (configService: ConfigService) => ({
    connection: {
      host: configService.get<string>('REDIS_HOST'),
      port: configService.get<number>('REDIS_PORT'),
    },
  }),
}),

Then, inside whichever feature module actually uses a queue, that queue gets declared by name:

// drafts.module.ts
BullModule.registerQueue({ name: 'draft-generation' }),

That string, 'draft-generation', is the thread tying three separate places together: it's what registerQueue declares here, what @InjectQueue('draft-generation') asks for in the service below, and what @Processor('draft-generation') listens on in the worker. All three have to use the exact same name — there's no compiler check enforcing that, since it's just a string. The two ways this can go wrong behave very differently, and it's worth knowing which is which rather than assuming:

If @InjectQueue doesn't match anything registerQueue declared, Nest's dependency injection fails loudly at startup with an UnknownDependenciesException — I confirmed this directly, and the error message names the exact missing provider token and suggests the fix. The app won't boot at all, so this mistake gets caught immediately.
If @Processor doesn't match — while @InjectQueue/registerQueue still agree with each other — the app boots without any error, jobs enqueue successfully, and then just sit in the queue forever. I verified this too: enqueued a job against a correctly-wired producer, waited, and checked the queue directly — the job's state was still waiting, indefinitely, with nothing logged anywhere to indicate why. No worker was ever listening on that name. This is the genuinely silent failure mode worth watching for, since nothing about it looks broken until someone notices a backlog that never drains.
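
One way to shrink the room for the silent mismatch is to define the name once and import it in all three places, so a typo becomes a compile error on an identifier instead of a different string. A minimal sketch; the constant name and file name are hypothetical, and the demo repo may simply repeat the literal.

```typescript
// queue-names.ts (hypothetical file): export this and import it wherever
// the queue is declared, injected, or processed.
const DRAFT_GENERATION_QUEUE = "draft-generation";

// Intended usage, shown as comments because the decorators need a Nest app:
//   BullModule.registerQueue({ name: DRAFT_GENERATION_QUEUE })
//   @InjectQueue(DRAFT_GENERATION_QUEUE)
//   @Processor(DRAFT_GENERATION_QUEUE, { concurrency: 5 })

console.log(DRAFT_GENERATION_QUEUE); // draft-generation
```

This does not detect a worker that was never registered at all, so watching for jobs stuck in the waiting state is still worthwhile.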

Enqueueing a job

The job's payload type is defined right alongside the service that creates it — enqueue() and the worker (below) both depend on this shape:

export interface DraftJobData {
  draftJobId: string;
  topic: string;
  simulateFailures: number;
}

async enqueue(dto: CreateDraftDto) {
  const draftJob = await this.draftJobsRepository.save(
    this.draftJobsRepository.create({ topic: dto.topic, status: DraftJobStatus.PENDING }),
  );

  await this.draftQueue.add(
    'generate',
    { draftJobId: draftJob.id, topic: dto.topic, simulateFailures: dto.simulateFailures ?? 0 },
    {
      jobId: draftJob.id,
      attempts: 3,
      backoff: { type: 'exponential', delay: 2000 },
      removeOnComplete: { age: 3600 },
      removeOnFail: { age: 86400 },
    },
  );

  return { id: draftJob.id, status: draftJob.status };
}

A few specific choices worth calling out:

jobId: draftJob.id — using the Postgres row's own id as the BullMQ job id, rather than letting BullMQ generate one, is what makes re-enqueueing idempotent (more on this below).

backoff: { type: 'exponential', delay: 2000 } — each retry waits roughly double the previous gap, rather than hammering a struggling upstream API at a fixed interval. With attempts: 3 there are only ever two such gaps to observe, not three — I measured them directly against this exact config rather than trusting the formula from memory: ~2035ms after the first failure, ~4009ms after the second, and no fourth attempt ever fires once those three tries are exhausted. The delay keeps doubling if you raise attempts higher, but at attempts: 3 specifically, a third gap is never reachable — that ceiling comes from attempts, not from backoff alone.

removeOnComplete/removeOnFail — without these, BullMQ keeps every job's data in Redis indefinitely. These ages keep Redis from growing unbounded while still leaving failed jobs around longer (a day, vs. an hour for successes) since they're more likely to need debugging.
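
The backoff arithmetic above can be written out directly. This sketch computes nominal gaps only, assuming the delay doubles from the base value on each retry as the measurements suggest; backoffGaps is an illustrative helper, not a BullMQ API, and real gaps run slightly longer because of scheduling overhead.

```typescript
// Nominal wait before retry n is delayMs * 2^(n - 1).
function backoffGaps(attempts: number, delayMs: number): number[] {
  const gaps: number[] = [];
  // `attempts` tries leave only `attempts - 1` gaps between them.
  for (let retry = 1; retry < attempts; retry++) {
    gaps.push(delayMs * Math.pow(2, retry - 1));
  }
  return gaps;
}

console.log(backoffGaps(3, 2000)); // [ 2000, 4000 ]
console.log(backoffGaps(5, 2000)); // [ 2000, 4000, 8000, 16000 ]
```

The first line matches the two measured gaps (~2035ms and ~4009ms); the second shows how quickly the worst-case wait grows if attempts is raised.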

The worker

@Processor('draft-generation', { concurrency: 5 })
export class DraftGenerationProcessor extends WorkerHost {
  constructor(private readonly draftsService: DraftsService) {
    super();
  }

  async process(job: Job<DraftJobData>): Promise<string> {
    const { draftJobId, topic, simulateFailures } = job.data;
    await this.draftsService.markProcessing(draftJobId);

    // Throwing here is what tells BullMQ to retry, subject to `attempts`/`backoff`.
    return generateDraftContent(topic, job.attemptsMade, simulateFailures);
  }

  @OnWorkerEvent('completed')
  async onCompleted(job: Job<DraftJobData>, result: string) {
    await this.draftsService.markCompleted(job.data.draftJobId, result, job.attemptsMade);
  }

  @OnWorkerEvent('failed')
  async onFailed(job: Job<DraftJobData> | undefined, error: Error) {
    if (!job) return;
    const maxAttempts = job.opts.attempts ?? 1;
    if (job.attemptsMade >= maxAttempts) {
      await this.draftsService.markFailed(job.data.draftJobId, job.attemptsMade, error.message);
    }
    // else: still has retries left, BullMQ will reschedule it automatically
  }
}

@Processor(...) plus extending WorkerHost is the @nestjs/bullmq pattern for defining a worker — process() is the actual job handler, and @OnWorkerEvent hooks into BullMQ's lifecycle events. (The actual source file also logs each attempt via Nest's Logger, trimmed from the snippet above for readability — the logic shown is otherwise unchanged from what's in the repo.)

The detail that's easy to get backwards: `attemptsMade`

BullMQ's 'failed' event fires after every failed attempt, not just the last one. If you write the failure-handling logic without checking attempt count, a job that's about to succeed on its third try gets incorrectly marked failed after its first.

The fix is the job.attemptsMade >= maxAttempts check above — but getting that comparison right depends on knowing exactly what attemptsMade contains at each point, which isn't obvious from the type signature alone. I checked this directly against a running BullMQ instance rather than assuming:

attemptLog (job.attemptsMade as seen INSIDE the processor on each run):
[ { attemptsMade: 0, opts: 3 },
  { attemptsMade: 1, opts: 3 },
  { attemptsMade: 2, opts: 3 } ]
final 'completed' event attemptsMade: 3

attemptsMade is 0 on the very first execution, not 1: it counts completed prior attempts, not the current attempt number. Inside process(), a job configured with attempts: 3 sees 0, 1, 2 across its three tries; the 'completed'/'failed' event handlers see it afterward, already incremented, so the final event reads 3. Getting the comparison wrong breaks the terminal-failure check in one of two ways. Checking attemptsMade <= maxAttempts is true after every failed attempt, so the row is marked failed while retries remain. Checking attemptsMade > maxAttempts is never true, so the row is never marked failed at all. Assuming the first attempt reads as 1 inside the processor shifts any in-processor check by exactly one.
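
The counting can be traced with a toy retry loop. This is not BullMQ, only enough bookkeeping to mirror the log above; runToyJob and its field names are hypothetical.

```typescript
// Simulates a job that fails `failuresBeforeSuccess` times, then succeeds,
// with at most `maxAttempts` tries in total.
function runToyJob(maxAttempts: number, failuresBeforeSuccess: number) {
  let attemptsMade = 0;
  const seenInProcessor: number[] = [];
  const seenInFailedEvent: number[] = [];
  let markedFailed = 0;
  let completed = false;

  while (attemptsMade < maxAttempts && !completed) {
    seenInProcessor.push(attemptsMade); // 0 on the first run
    const failed = seenInProcessor.length <= failuresBeforeSuccess;
    attemptsMade += 1; // incremented once the attempt has finished
    if (failed) {
      seenInFailedEvent.push(attemptsMade);
      // Same terminal-failure check as onFailed() in the worker.
      if (attemptsMade >= maxAttempts) markedFailed += 1;
    } else {
      completed = true;
    }
  }
  return { seenInProcessor, seenInFailedEvent, markedFailed, completed, attemptsMade };
}

console.log(runToyJob(3, 2));
// seenInProcessor [0, 1, 2], seenInFailedEvent [1, 2], markedFailed 0, completed true, attemptsMade 3
console.log(runToyJob(3, 5));
// seenInProcessor [0, 1, 2], seenInFailedEvent [1, 2, 3], markedFailed 1, completed false, attemptsMade 3
```

In both runs the row is marked failed at most once, and only after the last allowed attempt.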

What "idempotent" actually means here

Using the Postgres row's id as the BullMQ jobId means re-enqueueing the same id is a no-op — but it's worth being precise about what that actually does, rather than taking it on faith. I tested this directly:

await queue.add('task', { n: 1 }, { jobId: 'fixed-id' }); // creates the job
await queue.add('task', { n: 2 }, { jobId: 'fixed-id' }); // returns a Job object, but...

const stored = await queue.getJob('fixed-id');
console.log(stored.data); // { n: 1 } — the SECOND add() never took effect

The second add() call doesn't throw, and it doesn't error — it just silently does nothing. The job already in Redis under that id keeps its original data. This holds whether the original job is still waiting, actively processing, or has already completed (as long as it hasn't been cleaned up by removeOnComplete).

That's exactly the property you want for a reconciliation or retry path elsewhere in your own backend: if something upstream of enqueue() ever calls it twice for the same logical request — a retried HTTP call, a duplicate webhook delivery, a race in a distributed system — the second call doesn't create a second job, doesn't reprocess, and doesn't overwrite the first job's data with whatever the second call happened to pass. It's a much cheaper idempotency guarantee than building your own deduplication table, but it only works because the jobId is something stable and meaningful (the Postgres row's id) rather than an auto-generated one.

Setting up and running the project

Clone the repo, install dependencies, and copy the environment template:

git clone <your-repo-url>
cd bullmq-nestjs-demo
npm install
cp .env.example .env

This project needs both Postgres and Redis running locally — Postgres for the durable job records, Redis for BullMQ's own queue state. The fastest path for both is Docker:

docker run --name jobs-postgres \
  -e POSTGRES_PASSWORD=postgres -e POSTGRES_DB=jobs_demo \
  -p 5432:5432 -d postgres:16

docker run --name jobs-redis -p 6379:6379 -d redis:7

The defaults in .env.example already line up with those two containers, so you shouldn't need to edit .env if you used them as-is. Then start the API:

npm run start:dev

synchronize: true is enabled in app.module.ts for this demo, so the draft_job table is created automatically on first boot — no manual migration step needed to follow along. (Turn that off and switch to real migrations before deploying anywhere real.) The worker runs inside this same process for simplicity here; see the production notes below on why you'd typically split it out.

With the server up on http://localhost:3000, you're ready to run through the flow below.

Testing it end-to-end

This is the actual sequence run against a live server, real Postgres, and real Redis before publishing — happy path, retry-then-succeed, permanent failure, and the validation/not-found edge cases:

# Happy path — no simulated failures
curl -X POST http://localhost:3000/drafts \
  -H "Content-Type: application/json" \
  -d '{"topic":"NestJS background jobs"}'
# -> { "id": "...", "status": "pending" }

curl http://localhost:3000/drafts/<id>
# -> { "status": "completed", "attemptsMade": 1, "result": "..." }

# Forces 2 failures before success — watch attemptsMade end at 3,
# with ~2s then ~4s of exponential backoff between attempts
curl -X POST http://localhost:3000/drafts \
  -H "Content-Type: application/json" \
  -d '{"topic":"Retry demo","simulateFailures":2}'

curl http://localhost:3000/drafts/<id>
# (poll a few times over ~6-8s)
# -> { "status": "completed", "attemptsMade": 3, "result": "..." }

# More failures than the configured 3 attempts allow — exercises the
# permanent-failure path instead
curl -X POST http://localhost:3000/drafts \
  -H "Content-Type: application/json" \
  -d '{"topic":"Permanent failure demo","simulateFailures":5}'

curl http://localhost:3000/drafts/<id>
# -> { "status": "failed", "attemptsMade": 3, "failureReason": "Simulated upstream failure..." }

# Validation and not-found paths, checked the same way
curl -X POST http://localhost:3000/drafts \
  -H "Content-Type: application/json" \
  -d '{"topic":"a"}'
# -> 400 Bad Request — topic is shorter than the DTO's @MinLength(3)

curl http://localhost:3000/drafts/00000000-0000-0000-0000-000000000000
# -> 404 { "message": "Draft job not found" }

Every path above — immediate success, eventual success after retries, permanent failure after exhausting retries, a rejected validation error, and an unknown id — was checked against the actual status, attemptsMade, and HTTP status code returned by a live server, not just "the request didn't error." The duplicate-jobId idempotency behavior from the previous section was verified the same way, against this same running instance, using a small script that called queue.add() directly rather than going through the HTTP API (since triggering a raw duplicate enqueue isn't something a normal client request can do on its own).

Production notes

A few things were deliberately simplified for this demo and are worth tightening before shipping:

Run the worker as a separate process from the API. Here, the processor lives inside the same NestJS app as the controller — fine for a demo, but in production a traffic spike on the HTTP side will compete for CPU with job processing unless they're split into independent, independently-scalable deployments.
Distinguish retryable from non-retryable errors. A timeout or a 429 should retry. A 401 from a bad API key will fail identically on every attempt; throwing UnrecoverableError (exported by BullMQ) skips the remaining retries instead of wasting them on a guaranteed failure.
Size concurrency to what the upstream and your database can actually sustain. concurrency: 5 here is a demo default, not a number derived from real capacity.
Monitor queue depth, not just individual job outcomes. A queue that's silently backing up faster than it's draining is a different failure mode than any single job failing, and won't show up by looking at one job's status at a time.
There's no authentication on these endpoints. This demo is scoped to the queueing mechanics, not access control — anyone who can reach POST /drafts can enqueue work, and anyone who knows (or guesses) a job id can read its result. If this sits behind a real API, put it behind the same kind of auth boundary as anything else you wouldn't want publicly writable (a prior post on this blog covers building TOTP-based 2FA in NestJS, if that's useful context for the kind of guard logic involved).
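
For the retryable versus non-retryable split in the notes above, one possible policy looks like the sketch below. Everything here is an assumption for illustration: UpstreamFailure is a made-up error shape, the status rules are one reasonable choice rather than the demo's behavior, and PermanentError is a local stand-in so the example runs without dependencies (in the real worker it would be BullMQ's UnrecoverableError).

```typescript
// Local stand-in for BullMQ's UnrecoverableError.
class PermanentError extends Error {}

// Hypothetical description of a failed upstream call.
interface UpstreamFailure {
  status?: number; // HTTP status, if a response came back at all
  timedOut?: boolean;
}

// Retry timeouts, network errors, 429s and 5xx; give up on any other status.
function isRetryable(failure: UpstreamFailure): boolean {
  if (failure.timedOut) return true;
  if (failure.status === undefined) return true; // no response received
  if (failure.status === 429) return true;
  return failure.status >= 500;
}

// Build the error the processor should throw for this failure.
function toJobError(failure: UpstreamFailure): Error {
  const label = failure.timedOut ? "timeout" : `status ${failure.status ?? "unknown"}`;
  return isRetryable(failure)
    ? new Error(`Retryable upstream failure: ${label}`)
    : new PermanentError(`Non-retryable upstream failure: ${label}`);
}

console.log(isRetryable({ status: 429 })); // true
console.log(isRetryable({ status: 401 })); // false
```

If you adopt this, the onFailed handler's attemptsMade >= maxAttempts check probably needs a second condition, since a job that stops early on an unrecoverable error will not have used all of its attempts. That is worth verifying against a running instance before relying on it.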

Source code

The complete, tested implementation — NestJS module, the worker, entity, DTOs, and a README with the full setup and curl walkthrough — is available as a standalone repository: bullmq-nestjs-demo. Clone it, point it at your own Postgres and Redis, and the enqueue → retry → status-poll flow above works out of the box.

Originally published on ZyVOP


AI’s Growing Pains: Compute Caps, Security Harnesses, and the Human Touch | The AI Daily Roundup

ZyVOP — Thu, 02 Jul 2026 11:18:59 +0000

Connecting the Dots: AI Is Hitting Its Operational Limits

Across the headlines, a single narrative emerges: the AI boom is moving from raw model hype to the gritty reality of deployment. Companies are now wrestling with three intertwined constraints—compute scarcity, security/privacy tooling, and the need for human expertise. The stories below illustrate how each pressure point is reshaping the ecosystem.

1. Compute Scarcity Becomes a Competitive Weapon

Google limits Meta’s use of Gemini shows that even the largest cloud providers cannot guarantee unlimited GPU capacity. Meta’s request for additional compute was denied, forcing the social‑media giant to tighten token usage and delay internal projects. This is a concrete reminder that AI scaling is bounded by physical hardware, not just capital.

In response, Austria is lobbying the EU to host Anthropic. By relocating critical AI workloads to Europe, Anthropic hopes to sidestep US export curbs and secure a more predictable compute pipeline. The geopolitical maneuver underscores that access to compute is becoming a strategic asset for AI firms.

Beneficiaries: Cloud providers that can guarantee capacity, regions investing in AI‑focused data centers.
Losers: Companies that depend on a single provider’s surplus capacity (e.g., Meta’s delayed projects).

2. Security Harnesses Trump Raw Model Power

Semgrep’s benchmark shows GLM‑5.2 beating Claude when only the model is considered, but the report also highlights that a purpose‑built harness can lift performance from 39% to over 50% F1. The takeaway for CTOs is clear: the surrounding pipeline—code ingestion, output parsing, feedback loops—often determines real‑world security outcomes more than the model itself.

Parallel concerns appear in the OpenAI Codex ignore‑file request. Developers demand deterministic mechanisms (.codexignore) to keep sensitive files out of model prompts, a feature that is essentially a security harness at the data‑access layer. As AI agents become more autonomous, guardrails built into the tooling stack become non‑negotiable.

Beneficiaries: Vendors offering end‑to‑end AI security platforms (e.g., Semgrep, CodeQL, specialized harness frameworks).
Losers: Teams that rely solely on “plug‑and‑play” models without investing in integration engineering.

3. Human Expertise Remains the Safety Net

Ford’s decision to re‑hire veteran engineers after AI‑driven quality systems fell short is a cautionary tale. The automaker discovered that AI alone could not guarantee the precision required on the assembly line, prompting a hybrid model where seasoned engineers train and audit the AI tools. This mirrors the broader industry realization that AI augments, not replaces, domain experts.

In academia, Professor Roberto Serrano’s exposure of a massive cheating scandal at Brown (reported by El País) illustrates the opposite side: unchecked AI access can erode trust in institutions. The incident forces universities to rethink assessment design, detection tools, and policy. That, again, is a human‑centric response to AI misuse.

Beneficiaries: Companies that blend AI with skilled personnel (e.g., Ford, security firms with expert‑in‑the‑loop models).
Losers: Organizations that attempted to replace human oversight entirely, risking quality or credibility lapses.

4. Macro‑Level Risks and Market Signals

Central bankers warning of an AI‑driven financial crash adds a macroeconomic dimension. When compute scarcity drives up token prices, and when security incidents force costly mitigations, the sector’s cash burn can outpace revenue, threatening broader financial stability.

Investors should watch for signs of “AI‑infrastructure debt” – companies that have over‑promised AI capabilities without securing the underlying compute, security, or talent foundations.

5. Grassroots Tooling and the DIY Ethos

On the developer front, projects like Bash4LLM+ show a push for lightweight, language‑agnostic interfaces to LLMs. While these tools democratize access, they also amplify the earlier themes: without proper harnesses and security policies, even a single‑line Bash script can inadvertently leak proprietary code or sensitive data.

Similarly, personal experiments such as using Claude Code for a second‑opinion MRI (Antoine’s blog) highlight the allure of AI in niche domains, yet they also expose liability gaps that regulators will soon address.

6. The Emerging Playbook for Leaders

For senior engineers, CTOs, and investors, the actionable takeaways are:

Secure compute pipelines. Diversify providers, explore regional data‑center partnerships, and budget for premium capacity.
Invest in harness engineering. Build or adopt frameworks that handle data sanitization, prompt engineering, and result validation.
Maintain human‑in‑the‑loop checkpoints. Especially for safety‑critical or high‑trust applications (automotive, healthcare, finance).
Monitor regulatory and geopolitical shifts. US export controls, EU hosting incentives, and central‑banker warnings will shape market dynamics.

Companies that internalize these constraints will turn today’s growing‑pains into a competitive moat; those that ignore them risk costly rollbacks, compliance penalties, or outright project failure.

Originally published on ZyVOP


Building a Production AI Agent in Node.js: Tool Calling, the ReAct Loop, and Error Handling

ZyVOP — Wed, 01 Jul 2026 14:52:26 +0000

Most agent tutorials stop at a toy. A bot that checks the weather, a script that answers one question, then a victory lap in the README.

None of that prepares you for what happens when a tool throws an error, the model calls a function ten times in a row, or you blow past your rate limit mid-conversation. This post builds the other kind of agent, running on Groq.

Groq is worth a look for this specifically: an OpenAI-compatible API, no credit card to start, and inference fast enough that the agent loop feels instant instead of laggy. No framework either — just the SDK and a loop you can read top to bottom. By the end you'll have a customer-support agent that looks up orders, does math, and searches a FAQ, plus the iteration caps, retries, and error handling that keep it from falling over once real traffic hits it.

Full source, including the test suite, is linked at the bottom.

What "production" actually means here

An agent isn't a chatbot with extra steps. It reasons about a task, picks a tool, looks at the result, and decides what to do next, looping until it has an answer. That loop is usually called ReAct: reason, act, observe, repeat.

The demo version of this loop assumes everything goes right. The production version has to handle:

A tool that throws, times out, or returns garbage
A model that keeps calling tools and never wraps up
A 429 or a 500 from the API itself, which is different from a tool failure
A model that sends back malformed arguments for a tool call
Needing to know what happened after the fact, not just whether it worked

Each of those gets its own piece of the code below.

Why Groq, and why no framework

Groq doesn't train its own models. It runs open-source ones (Llama, GPT-OSS, Qwen, and others) on custom LPU hardware built for inference speed.

The free tier needs no credit card and gives you every model, gated only by rate limits: roughly 30 requests a minute and a few thousand tokens a minute per model, enforced per account rather than per key. That's plenty for a demo and for prototyping. It's not enough for real user traffic without upgrading.

Because Groq's API mirrors OpenAI's chat completions format, you don't need LangChain.js or any other framework to get tool calling working. The groq-sdk package gives you the same loop with nothing hidden, which matters when something breaks and you need to know exactly where.

The loop itself is under 100 lines. You're about to read all of them.

Defining a tool

Groq's tool format wants a type: "function" wrapper, a name, a description the model reads to decide when to use the tool, and a JSON schema for the arguments. Here's the calculator tool in full:

// src/tools/calculator.js
import { safeCalculate } from "../lib/safeCalculate.js";

export const calculatorTool = {
  definition: {
    type: "function",
    function: {
      name: "calculate",
      description:
        "Evaluate a basic arithmetic expression. Supports +, -, *, /, ^, parentheses, and decimals. " +
        "Always use this instead of doing math yourself, including for things like order totals or discounts.",
      parameters: {
        type: "object",
        properties: {
          expression: {
            type: "string",
            description: "A math expression, e.g. '(42 + 8) * 3' or '199.99 * 0.85'"
          }
        },
        required: ["expression"]
      }
    }
  },

  async handler({ expression }) {
    const result = safeCalculate(expression);
    return String(result);
  }
};

That description line is doing real work. It tells the model to reach for this tool instead of computing totals itself, which matters because language models are unreliable at arithmetic. The tool use overview covers schema design in more depth if you're writing tools with nested objects or several optional fields.

One thing worth flagging: safeCalculate is a small hand-written parser, not eval(). The expression comes from the model, and the model's input ultimately traces back to whatever the user typed.

Piping that into eval() or new Function() means a sufficiently creative prompt is now running arbitrary code on your server. The repo's calculator does the math with a real tokenizer and recursive-descent parser instead: more code, but no path from model output to code execution.
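To make the tokenizer-plus-parser idea concrete, here is a minimal sketch of what such a calculator could look like. It is not the repo's safeCalculate: the grammar, the error messages, and the tokenize helper are assumptions made for illustration. It handles +, -, *, /, ^ (right-associative), unary minus, parentheses, and decimals, and rejects everything else.

```javascript
// Hypothetical sketch of a safe arithmetic evaluator; not the repo's implementation.
function tokenize(expression) {
  const src = expression.trim();
  const tokens = [];
  // Sticky regex: a number (group 1) or a single operator/paren (group 2).
  const pattern = /\s*(?:(\d+(?:\.\d+)?|\.\d+)|([-+*\/^()]))/y;
  let pos = 0;
  while (pos < src.length) {
    pattern.lastIndex = pos;
    const match = pattern.exec(src);
    if (!match) throw new Error(`Unexpected character at position ${pos}`);
    tokens.push(
      match[1] !== undefined
        ? { type: "num", value: Number(match[1]) }
        : { type: "op", value: match[2] }
    );
    pos = pattern.lastIndex;
  }
  return tokens;
}

function safeCalculate(expression) {
  const tokens = tokenize(String(expression));
  let i = 0;
  const isOp = (...ops) => tokens[i]?.type === "op" && ops.includes(tokens[i].value);

  // expression := term (("+" | "-") term)*
  function parseExpression() {
    let value = parseTerm();
    while (isOp("+", "-")) {
      const op = tokens[i++].value;
      const right = parseTerm();
      value = op === "+" ? value + right : value - right;
    }
    return value;
  }
  // term := factor (("*" | "/") factor)*
  function parseTerm() {
    let value = parseFactor();
    while (isOp("*", "/")) {
      const op = tokens[i++].value;
      const right = parseFactor();
      if (op === "/" && right === 0) throw new Error("Division by zero");
      value = op === "*" ? value * right : value / right;
    }
    return value;
  }
  // factor := ("-" | "+") factor | power
  function parseFactor() {
    if (isOp("-")) { i++; return -parseFactor(); }
    if (isOp("+")) { i++; return parseFactor(); }
    return parsePower();
  }
  // power := primary ("^" factor)?   (right-associative)
  function parsePower() {
    const base = parsePrimary();
    if (isOp("^")) { i++; return base ** parseFactor(); }
    return base;
  }
  // primary := number | "(" expression ")"
  function parsePrimary() {
    const token = tokens[i++];
    if (!token) throw new Error("Unexpected end of expression");
    if (token.type === "num") return token.value;
    if (token.value === "(") {
      const value = parseExpression();
      if (!isOp(")")) throw new Error("Missing closing parenthesis");
      i++;
      return value;
    }
    throw new Error(`Unexpected token "${token.value}"`);
  }

  const result = parseExpression();
  if (i < tokens.length) throw new Error(`Unexpected token "${tokens[i].value}"`);
  if (!Number.isFinite(result)) throw new Error("Result is not a finite number");
  return result;
}
```

Because the tokenizer only recognizes digits, a decimal point, and seven symbols, an input like "process.exit()" fails on its first character and never reaches anything that could execute it.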

The other two tools follow the same shape: get_order_status looks up a mock order by ID (with a deliberately broken ORD-FAIL entry for testing failure handling), and search_knowledge_base does keyword search over a small FAQ array. Full source for both is in the repo.
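For reference, a sketch of how get_order_status could follow that shape. The mock orders, the orderId parameter name, and the response fields are all assumptions for illustration; the repo's actual data and field names may differ.

```javascript
// Hypothetical mock data; a Map avoids accidental hits on Object.prototype keys.
const MOCK_ORDERS = new Map([
  ["ORD-1001", { status: "shipped", carrier: "UPS", eta: "2026-07-02" }],
  ["ORD-1002", { status: "processing", carrier: null, eta: null }]
]);

const orderStatusTool = {
  definition: {
    type: "function",
    function: {
      name: "get_order_status",
      description: "Look up the current status of an order by its ID, e.g. 'ORD-1001'.",
      parameters: {
        type: "object",
        properties: {
          orderId: { type: "string", description: "The order ID, e.g. 'ORD-1001'" }
        },
        required: ["orderId"]
      }
    }
  },

  async handler({ orderId }) {
    // Deliberately broken entry, used to exercise the tool-failure path.
    if (orderId === "ORD-FAIL") throw new Error("Order service timed out");

    const order = MOCK_ORDERS.get(orderId);
    // An unknown ID is a normal result, not an exception: the model can
    // tell the user the order wasn't found.
    if (!order) return JSON.stringify({ found: false, orderId });
    return JSON.stringify({ found: true, orderId, ...order });
  }
};
```

The distinction between "not found" (a returned value) and "service failed" (a thrown error) is a design choice in this sketch: it lets the model respond differently to a typo in the order ID and to an outage.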

The loop itself

This is the part that drives the agent. It lives in src/agent.js:

async run(userMessage, { history = [], onStep } = {}) {
  const messages = [...history, { role: "user", content: userMessage }];
  const trace = [];
  const usage = { inputTokens: 0, outputTokens: 0 };

  for (let step = 0; step < this.maxIterations; step++) {
    const response = await this.callModel(messages);
    usage.inputTokens += response.usage?.prompt_tokens ?? 0;
    usage.outputTokens += response.usage?.completion_tokens ?? 0;

    const choice = response.choices[0];
    const message = choice.message;

    if (choice.finish_reason !== "tool_calls" || !message.tool_calls?.length) {
      const text = (message.content ?? "").trim();
      messages.push({ role: "assistant", content: message.content ?? "" });
      const finalStep = { step, type: "final", text };
      trace.push(finalStep);
      onStep?.(finalStep);
      return { text, steps: step + 1, usage, trace, history: messages, truncated: false };
    }

    // Groq requires the assistant's tool_calls message echoed back verbatim
    // before the matching tool results — this is the OpenAI-style contract.
    messages.push({ role: "assistant", content: message.content, tool_calls: message.tool_calls });

    for (const toolCall of message.tool_calls) {
      const result = await this.executeTool(toolCall, step);
      trace.push(result.traceEntry);
      onStep?.(result.traceEntry);
      messages.push(result.toolMessage);
    }
  }

  const truncatedStep = { step: this.maxIterations, type: "max_iterations_reached" };
  trace.push(truncatedStep);
  onStep?.(truncatedStep);

  return {
    text: "I wasn't able to finish this within the step limit. Here's what I found before stopping.",
    steps: this.maxIterations,
    usage,
    trace,
    history: messages,
    truncated: true
  };
}

Five things worth calling out:

The iteration cap is the for loop's bound, not an afterthought. Without it, a model stuck in a reasoning loop (or a tool that always returns "try again") burns through your rate limit until something else stops it. That matters more on a free tier than a metered one: you don't lose money, you lose your remaining requests for the minute. maxIterations defaults to 8 here to keep that from happening.

finish_reason is checked, not just whether tool_calls exists. Groq's response can technically include a tool_calls array while finish_reason says something else (truncated output, for instance), so the loop checks both before deciding the model wants a tool run.

Tool failures don't throw past this function. executeTool (below) catches whatever the tool does and turns it into a normal tool role message. The model sees the failure as part of the conversation and can retry with different input, try another tool, or tell the user it couldn't complete the request.

history comes in and goes out, and it never contains the system message. The system prompt gets injected fresh inside callModel on every call instead of being stored in history. Otherwise a long conversation would accumulate a duplicate system message every turn.

Every step gets logged through onStep, whether it's a tool call or the final answer. The CLI uses this to print tool calls in verbose mode; a real deployment would pipe it to whatever you use for logging.
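As an illustration of that last point, here is a small sketch of an onStep callback that turns trace entries into one-line log messages. The formatStep name and the output format are assumptions; the entry fields (step, type, tool, input, isError, durationMs, text) are the ones the loop and executeTool produce in this post.

```javascript
// Hypothetical formatter for the trace entries passed to onStep.
function formatStep(entry) {
  switch (entry.type) {
    case "tool_call": {
      const status = entry.isError ? "ERROR" : "ok";
      return `[step ${entry.step}] tool ${entry.tool}(${entry.input}) -> ${status} in ${entry.durationMs}ms`;
    }
    case "final":
      return `[step ${entry.step}] final answer (${(entry.text ?? "").length} chars)`;
    case "max_iterations_reached":
      return `[step ${entry.step}] stopped: iteration cap reached`;
    default:
      return `[step ${entry.step}] ${entry.type}`;
  }
}

// Usage, assuming an agent instance like the one in this post:
// await agent.run("Where is ORD-1001?", {
//   onStep: (entry) => console.log(formatStep(entry))
// });
```

Logging the raw arguments string rather than the parsed object means malformed tool-call arguments still show up in the log exactly as the model sent them.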

Tool execution looks like this:

async executeTool(toolCall, step) {
  const name = toolCall.function.name;
  const handler = this.findHandler(name);
  const startedAt = Date.now();
  let content;
  let isError = false;

  try {
    if (!handler) throw new Error(`Unknown tool: ${name}`);

    let input;
    try {
      input = JSON.parse(toolCall.function.arguments || "{}");
    } catch {
      throw new Error(`Model sent invalid JSON arguments for "${name}"`);
    }

    content = await this.withTimeout(handler(input), this.toolTimeoutMs);
  } catch (err) {
    isError = true;
    content = `Tool error: ${err.message}`;
    this.logger.warn?.(`[agent] tool "${name}" failed: ${err.message}`);
  }

  const durationMs = Date.now() - startedAt;

  return {
    traceEntry: { step, type: "tool_call", tool: name, input: toolCall.function.arguments, result: content, isError, durationMs },
    toolMessage: { role: "tool", tool_call_id: toolCall.id, content: String(content) }
  };
}

Two details here are specific to Groq's OpenAI-style format and don't show up if you're used to Anthropic's tool-calling shape. First, toolCall.function.arguments arrives as a JSON string, not an already-parsed object. The model can and occasionally will send back something that doesn't parse, so that JSON.parse is wrapped in its own try/catch with a clear error message rather than letting it throw a raw SyntaxError up the stack.

Second, there's no is_error flag on the result message the way Anthropic's tool_result blocks have one. A failed tool just returns its error as text inside a normal tool role message, and the model reads it like any other result.

The timeout matters as much as the try/catch. A tool that calls a flaky downstream API can hang for a long time if you let it; withTimeout races the handler against a timer and turns a hang into a clean error after 10 seconds by default.
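The racing pattern itself is short. Below is a minimal standalone sketch of it, assuming a plain function rather than the class method used in the agent; the error message is also an assumption.

```javascript
// Sketch of a promise timeout: whichever settles first wins the race.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms);
  });
  // Clear the timer either way so a fast handler doesn't leave it pending.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Usage sketch:
// const result = await withTimeout(handler(input), 10_000);
```

One caveat with this approach: the race only stops waiting, it doesn't cancel the underlying work. A handler that makes an HTTP request keeps that request open until it finishes on its own, unless the handler also accepts something like an AbortSignal.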

API-level failures get separate handling from tool failures, because they mean different things. A 429 or a 500 from Groq's API is transient: retrying with backoff usually fixes it. A 401 means your API key is wrong, and retrying does nothing but waste a request:

async callModel(messages, attempt = 0) {
  const maxRetries = 3; // total attempts, including the first call (so at most two retries)
  try {
    return await this.client.chat.completions.create({
      model: this.model,
      max_completion_tokens: this.maxTokens,
      messages: [{ role: "system", content: this.systemPrompt }, ...messages],
      tools: this.toolDefinitions(),
      tool_choice: "auto"
    });
  } catch (err) {
    const status = err?.status;
    const retryable = RETRYABLE_STATUS.has(status); // 408, 409, 429, 500, 502, 503, 504
    if (!retryable || attempt >= maxRetries - 1) throw err;
    const delayMs = 500 * 2 ** attempt;
    this.logger.warn?.(`[agent] API call failed (status ${status}), retrying in ${delayMs}ms`);
    await sleep(delayMs);
    return this.callModel(messages, attempt + 1);
  }
}

One more Groq-specific detail: the groq-sdk client retries some of these statuses on its own, twice by default. The client here is constructed with maxRetries: 0 so the SDK's built-in retry doesn't stack on top of this one. Without that, a single rate-limited call could silently balloon to nine real HTTP attempts (three of mine, each attempted up to three times by the SDK: one initial try plus two retries) instead of three.
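callModel leans on two helpers that aren't shown above, RETRYABLE_STATUS and sleep. Here is a sketch of what they could look like, along with the retry arithmetic spelled out. The helper bodies and the backoffDelays and totalHttpAttempts functions are assumptions for illustration, not code from the repo.

```javascript
// Statuses treated as transient, matching the comment in callModel.
const RETRYABLE_STATUS = new Set([408, 409, 429, 500, 502, 503, 504]);

// Promise-based pause used between attempts.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Waits produced by the 500 * 2 ** attempt formula for a given attempt budget.
function backoffDelays(maxAttempts, baseMs = 500) {
  const delays = [];
  for (let attempt = 0; attempt < maxAttempts - 1; attempt++) {
    delays.push(baseMs * 2 ** attempt);
  }
  return delays;
}

// Worst-case HTTP requests when an outer retry loop wraps an SDK that
// also retries: each outer attempt costs one try plus the SDK's retries.
function totalHttpAttempts(outerAttempts, sdkRetries) {
  return outerAttempts * (1 + sdkRetries);
}

// Assumed client construction (groq-sdk accepts a maxRetries option):
// const client = new Groq({ apiKey: process.env.GROQ_API_KEY, maxRetries: 0 });
```

With three attempts the loop waits 500ms and then 1000ms, so a call that fails every time adds about 1.5 seconds before the error surfaces. totalHttpAttempts(3, 2) gives the nine requests described above, and totalHttpAttempts(3, 0) gives three.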

Wiring it up

src/server.js exposes the agent as POST /api/agent/chat, rate-limited with express-rate-limit at a level that sits comfortably under Groq's free-tier cap even with a few concurrent users. Sessions live in an in-memory Map for the demo — fine for trying this out, not fine once you have more than one server process, at which point Redis is the obvious swap.
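A sketch of what an in-memory session store along these lines might look like. Everything here is an assumption rather than the repo's code: the createSessionStore name, the idle-timeout (ttlMs) behavior, and the injectable newId and now options, which exist mainly to make the store easy to test.

```javascript
// Hypothetical in-memory session store keyed by sessionId.
function createSessionStore({
  ttlMs = 30 * 60 * 1000,                         // assumed idle timeout
  newId = () => globalThis.crypto.randomUUID(),   // global crypto in recent Node versions
  now = Date.now
} = {}) {
  const sessions = new Map(); // sessionId -> { history, lastSeen }

  return {
    // Returns the existing session, or starts a fresh one if the ID is
    // missing, unknown, or has been idle longer than ttlMs.
    getOrCreate(sessionId) {
      const existing = sessionId ? sessions.get(sessionId) : undefined;
      if (existing && now() - existing.lastSeen <= ttlMs) {
        existing.lastSeen = now();
        return { sessionId, history: existing.history };
      }
      if (existing) sessions.delete(sessionId); // drop the expired entry
      const id = newId();
      sessions.set(id, { history: [], lastSeen: now() });
      return { sessionId: id, history: [] };
    },

    // Stores the history array returned by agent.run() for the next turn.
    save(sessionId, history) {
      sessions.set(sessionId, { history, lastSeen: now() });
    },

    size: () => sessions.size
  };
}
```

Keeping the store behind getOrCreate and save is what makes the later Redis swap small: the route handler only sees those two calls, not the Map.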

src/cli.js is the same agent wired to a terminal readline loop instead, useful for testing prompts and tool behavior interactively without standing up a server.

Both read a GROQ_API_KEY from the environment and default to openai/gpt-oss-120b as the model, overridable with GROQ_MODEL. Get a key at console.groq.com/keys — no credit card needed.

Trying it over HTTP

Here's what hitting the running server looks like. The JSON envelope below is exact: those are the literal field names server.js returns, but the reply text, steps count, and usage numbers will vary somewhat each time you run this, since it's a live model and not a fixture.

A normal order lookup:

curl -X POST http://localhost:3000/api/agent/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What is the status of ORD-1001?"}'

{
  "sessionId": "a1b2c3d4-5678-90ab-cdef-1234567890ab",
  "reply": "Order ORD-1001 has shipped via UPS, tracking number 1Z999AA10123456784. It's estimated to arrive July 2, 2026.",
  "steps": 2,
  "truncated": false,
  "usage": { "inputTokens": 612, "outputTokens": 47 }
}

steps: 2 means the model called a tool once, then answered — exactly the loop from earlier. Now the same question about the order rigged to fail, reusing the sessionId from above to stay in the same conversation:

curl -X POST http://localhost:3000/api/agent/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Can you check ORD-FAIL for me?", "sessionId": "a1b2c3d4-5678-90ab-cdef-1234567890ab"}'

{
  "sessionId": "a1b2c3d4-5678-90ab-cdef-1234567890ab",
  "reply": "I wasn't able to look up that order just now — the lookup service timed out. Could you try again in a moment?",
  "steps": 2,
  "truncated": false,
  "usage": { "inputTokens": 701, "outputTokens": 39 }
}

That's the tool-error path from earlier, end to end: the get_order_status handler threw, executeTool caught it, the model got the failure as a normal message instead of a crash, and it answered like a person would — not a stack trace in sight. And the multi-tool case, in one request:

curl -X POST http://localhost:3000/api/agent/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What is the status of ORD-1001, and if I return it how much of the $89.99 do I get back after a 15% restocking fee?"}'

{
  "sessionId": "f0e1d2c3-4b5a-69d8-c7e6-0a1b2c3d4e5f",
  "reply": "ORD-1001 has shipped via UPS and should arrive by July 2, 2026. If you return it, you'd get back $76.49 after the 15% restocking fee.",
  "steps": 3,
  "truncated": false,
  "usage": { "inputTokens": 845, "outputTokens": 61 }
}

steps: 3 here — one call to get_order_status, one to calculate, then the final answer combining both. That $76.49 isn't the model doing arithmetic in its head: calculate actually runs 89.99 * 0.85 through the real parser from earlier (which returns 76.49149999999999, full floating-point precision and all), and the model rounds that to cents for a reply a person can read.

Testing it without spending a request

The agent loop takes a client in its constructor. In production that's a real Groq instance. In tests, it's a plain object with a scripted chat.completions.create() that returns whatever response you tell it to:

test("handles malformed tool-call arguments from the model without crashing", async () => {
  const client = makeScriptedClient([
    {
      choices: [{
        finish_reason: "tool_calls",
        message: {
          role: "assistant",
          content: null,
          tool_calls: [{ id: "call_3", type: "function", function: { name: "calculate", arguments: "{not json" } }]
        }
      }],
      usage: { prompt_tokens: 10, completion_tokens: 5 }
    },
    {
      choices: [{ finish_reason: "stop", message: { role: "assistant", content: "Sorry, something went wrong." } }],
      usage: { prompt_tokens: 12, completion_tokens: 6 }
    }
  ]);

  const agent = new Agent({
    client,
    model: "openai/gpt-oss-120b",
    tools: [calculatorStub],
    systemPrompt: "test",
    logger: silentLogger
  });

  const result = await agent.run("Send a malformed tool call");

  const toolCallStep = result.trace.find((step) => step.type === "tool_call");
  assert.equal(toolCallStep.isError, true);
  assert.match(toolCallStep.result, /invalid JSON arguments/);
});

That test exists because it's a real failure mode in this format, not a hypothetical one: Groq's models occasionally send back arguments that don't parse cleanly, and the only way to find that out before a user does is to write the test that assumes it'll happen.
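The test above relies on a makeScriptedClient helper that isn't shown. A minimal sketch of how one could be written follows; the calls array and the convention of throwing any Error found in the script are assumptions of this sketch, not necessarily what the repo does.

```javascript
// Hypothetical test double: hands back scripted responses in order.
function makeScriptedClient(responses) {
  const queue = [...responses];
  const calls = []; // every request the agent made, for later assertions

  return {
    calls,
    chat: {
      completions: {
        async create(params) {
          calls.push(params);
          if (queue.length === 0) {
            throw new Error("Scripted client ran out of responses");
          }
          const next = queue.shift();
          // An Error in the script simulates an API-level failure.
          if (next instanceof Error) throw next;
          return next;
        }
      }
    }
  };
}

// Scripting a 429 followed by a normal answer:
// const rateLimited = Object.assign(new Error("rate limited"), { status: 429 });
// const client = makeScriptedClient([rateLimited, finalAnswerResponse]);
```

Since callModel reads err.status to decide whether to retry, attaching a status property to a plain Error is enough to exercise the retry path without a real HTTP client.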

The full suite (21 tests across the loop, the calculator parser, and the tools) runs on Node's built-in test runner — no Jest, no extra dependency:

npm test

What the mocked tests actually catch

Worth being precise about what's verified here and what isn't. The 21 tests run against a scripted mock client, so they prove the loop's behavior: a 429 gets retried with backoff, a 401 doesn't get retried at all, a tool that throws gets turned into a clean error message instead of crashing the process, and malformed tool arguments don't take down the server either.

None of that touches Groq's actual servers, and it doesn't need to — that's the point of mocking the client. What it can't tell you is whether your specific model picks the right tool for a given prompt, or how it behaves under real latency.

For that, the repo ships a small smoke-test script. Start the server, then run it against a live key:

npm start                          # terminal 1
bash scripts/live-smoke-test.sh    # terminal 2

It runs nine checks in sequence: a normal order lookup, an unknown order ID, the simulated outage on ORD-FAIL, the calculator, the FAQ search, one turn that needs two tools at once, a follow-up that checks session memory, an invalid request, and a burst of twelve rapid-fire calls to confirm the local rate limiter kicks in. Watching that run once tells you more about how your model handles tool selection than any number of mocked tests can. On Groq's free tier, it costs nothing but a couple of minutes.

If you'd rather watch it think in real time, npm run cli -- --verbose prints every tool call as it happens, which is the faster way to see why the model reached for a given tool.

Taking it further

This is a teaching example. Before anything like it goes near real users:

Move sessions from the in-memory Map to Redis or a database
Add auth in front of the chat endpoint: right now anyone who can reach the port can spend your rate limit
Watch for 429s under real load: the free tier's per-minute cap is easy to hit with more than a couple of concurrent users, and Groq's Developer tier (still no minimum spend) raises that ceiling
Trim or summarize long conversation history before it eats your context window
Replace the mock order and FAQ data with real lookups
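For the history-trimming item, one simple approach is to keep only the most recent messages while never cutting a turn in half. The sketch below is an assumption about how that could work, not code from the repo; trimHistory and the default of 20 messages are made up for illustration.

```javascript
// Hypothetical history trimmer: keeps roughly the last maxMessages
// messages, but always starts the kept window at a user message.
function trimHistory(history, maxMessages = 20) {
  if (history.length <= maxMessages) return history;

  let start = history.length - maxMessages;
  // Skip forward past any assistant/tool messages so the window never
  // begins with a tool result whose matching tool_calls message was dropped.
  while (start < history.length && history[start].role !== "user") start++;

  // If no user message remains in the window this returns [], which is
  // still valid input for run(): it appends the new user message itself.
  return history.slice(start);
}

// Usage sketch, before handing history back to the agent:
// const { history } = await agent.run(message, { history: trimHistory(saved) });
```

Starting at a user message matters because of the contract noted earlier: a tool message has to follow the assistant message that carries its tool_calls, so slicing at an arbitrary index can produce a history the API rejects.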

Get the code

Full source, with the test suite and a setup guide, is on GitHub: https://github.com/zyvop27-cmyk/zyvop-blogs/tree/main/ai-agent-node

Clone it, drop in your own GROQ_API_KEY, and ask it about an order. The agent loop, the tools, and the error handling are all small enough to read in one sitting — which is the point.

Originally published on ZyVOP
