NexGenData
Reverse-Engineering Your Competitor's Tech Stack at Scale (Without Paying BuiltWith)

If you are a founder, a CTO, or a product strategist, your competitor's GitHub commits are the most expensive document in their company that you cannot read. The second most expensive is their tech stack — and that one you can read, because it ships in the response of every HTTPS request to their public site.

A competitor that swapped Mixpanel for PostHog, Heroku for Fly.io, Algolia for Typesense, or Segment for self-hosted RudderStack has just made a public infrastructure decision after weeks or months of internal evaluation. They paid the procurement cost, the migration cost, the engineering-time cost, and the integration cost, and they decided the new tool was worth it. That decision is information.

If you are evaluating the same swap, their decision is signal: they have already done the work you are about to do. If you are competing against them, their decision is intelligence: the new tool is a better fit for at least one of you. If you are building a tool that targets users of either, their decision is pipeline: prospects who just migrated are in an 18-month "we just chose this" honeymoon, and prospects who got migrated away from are in a renewal-anger window.

This post is the under-$100/month pipeline for tracking competitor tech-stack changes over time, at the cadence of a daily snapshot. The reference implementation uses the wappalyzer-replacement actor for the detection layer, plus ~120 lines of Python for the diff layer. BuiltWith and SimilarTech offer the same thing as a SaaS product — for $295–$995/month per seat, with seat-based licensing that makes it expensive to share with the eight people on your team who would actually use the data.

Why competitor stack data is structurally cheap to collect

Public web infrastructure is, by design, observable. Your competitor's site responds to HTTPS requests with HTTP headers, HTML, JavaScript, and resource URLs. Each of those carries fingerprints of the underlying technology:

  • A Server: cloudflare header confirms Cloudflare is in front.
  • A <script src="https://cdn.segment.com/analytics.js/v1/..."> tag confirms Segment.
  • A window.dataLayer JavaScript global after page load suggests Google Tag Manager (gtag.js creates the same global, so treat it as a hint rather than proof).
  • A cf-ray: header value confirms Cloudflare's edge network.
  • A <meta name="generator" content="Webflow"> confirms Webflow as the CMS.
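
The fingerprints in this list can be sketched as a few regex rules applied to a single response. This is a minimal illustration, not the actor's ruleset: the `RULES` table and the `detect` function are hypothetical names, and a real ruleset carries many more patterns per technology.

```python
import re

# Illustrative rules only: (technology, where to look, pattern).
# "header:<name>" means the named response header; "html" means the page body.
RULES = [
    ("Cloudflare", "header:server", re.compile(r"cloudflare", re.I)),
    ("Cloudflare", "header:cf-ray", re.compile(r".+")),
    ("Segment", "html",
     re.compile(r'<script[^>]+src="https://cdn\.segment\.com/analytics\.js', re.I)),
    ("Webflow", "html",
     re.compile(r'<meta[^>]+name="generator"[^>]+content="Webflow', re.I)),
]

def detect(headers: dict, html: str) -> set:
    """Return the set of technology names whose rules match this response."""
    lowered = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    found = set()
    for tech, source, pattern in RULES:
        if source == "html":
            haystack = html
        else:
            haystack = lowered.get(source.split(":", 1)[1], "")
        if haystack and pattern.search(haystack):
            found.add(tech)
    return found
```

Calling `detect(response_headers, response_body)` on a fetched homepage returns names such as `{"Cloudflare", "Segment"}`; signals that only exist after JavaScript runs (like a global variable) need a rendered page and are out of scope for this sketch.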

Unlike scraping a logged-in product surface or guessing at internal services, fingerprinting the marketing site is fully sanctioned: the data is being broadcast publicly to every visitor, and the act of "visit a website and inspect what loaded" is what every browser already does. You are not bypassing access controls — you are reading the response.

The cost of doing this for one competitor is trivial. The cost of doing it for 50 competitors, daily, with a diff layer that surfaces changes, is what differentiates a real intel pipeline from a screenshot in a Notion doc.

The four signal classes

Before we build, it is worth being clear about what changes you can detect and what they mean.

1. Net additions. A new technology appears that was not in yesterday's snapshot. Examples: PostHog appears, Stripe Tax appears, Sentry appears. Among single-technology changes, net additions carry the most signal: they imply the competitor has a new active project, often a new product surface, often a new team.

2. Net removals. A technology disappears. Examples: Segment is gone, Optimizely is gone, Heroku-* headers are gone. Net removals are the second-highest signal: they imply a vendor swap, a renewal that did not renew, or a deprecation. Once confirmed, removals are also more reliable than additions, because an integrated tool tends to load on every page; a detection that stays missing across 5 consecutive daily samples is real.

3. Version bumps. Same technology, new version. React 17 → React 18, Next.js 12 → 14, WordPress 6.2 → 6.4. Version bumps are noisy because most teams update on a long tail; they are useful only in aggregate ("competitor adopts new major version of X within N weeks of release").

4. Category shifts. The same job is now done by a different tool. Yesterday: Heap (analytics category). Today: Mixpanel. The category did not change but the vendor did. These are the most actionable competitive intelligence events because they reflect a deliberate vendor evaluation.

A good pipeline surfaces all four classes and tags each event by signal strength.
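
One way to tag events by signal strength is a small record type plus a ranking table. This is a sketch under assumptions: the `StackEvent` class, the `kind` labels, and the numeric weights are illustrative choices that put within-category vendor swaps first, then additions, removals, and version bumps.

```python
from dataclasses import dataclass

# Assumed ranking (higher = stronger signal); tune the weights to your own market.
SIGNAL_STRENGTH = {
    "category_shift": 4,  # same job, different vendor
    "addition": 3,
    "removal": 2,
    "version_bump": 1,
}

@dataclass
class StackEvent:
    domain: str
    kind: str    # one of the SIGNAL_STRENGTH keys
    detail: str  # free text, e.g. "Heap -> Mixpanel" or "+Sentry"

    @property
    def strength(self) -> int:
        return SIGNAL_STRENGTH[self.kind]

def rank_events(events: list) -> list:
    """Sort events so the strongest signals come first; ties keep their input order."""
    return sorted(events, key=lambda e: e.strength, reverse=True)
```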

Architecture

Daily cron
   |
   v
Fetch tech stack for N competitor domains  --->  wappalyzer-replacement actor
   |
   v
Snapshot to JSON-per-domain in dated folder
   |
   v
Diff today's snapshot vs. yesterday's per domain
   |
   v
Classify diffs (addition / removal / version-bump / category-shift)
   |
   v
Slack / email digest of significant changes

The whole pipeline is a single Python script triggered by cron, plus an S3 bucket (or Git repo) for the snapshot history.

Step 1: Define the watchlist

A simple YAML file lists the competitors and any optional metadata. The metadata is useful for routing alerts ("if a direct competitor swaps a tool we sell against, page the GTM team; if an indirect competitor does the same, just log it").

watchlist:
  - domain: rival-saas-1.com
    tier: direct
    category: analytics
  - domain: rival-saas-2.com
    tier: direct
    category: analytics
  - domain: adjacent-vendor.com
    tier: indirect
    category: data-platform
  # ... 40-200 more

For most B2B SaaS companies, the watchlist is 30–80 domains: 5–10 direct competitors, 20–40 adjacent vendors, 10–30 reference customers (companies you want to monitor for the case-study angle).
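
The routing rule described above (page for direct competitors swapping a tool you sell against, log everything else) can be sketched as a small function over one watchlist entry. The function name, the return labels, and the `our_category` parameter are hypothetical; the entry shape mirrors the YAML above.

```python
def route_alert(entry: dict, swap_category: str, our_category: str) -> str:
    """Decide where a confirmed swap should go, based on watchlist metadata.

    entry: one watchlist item, e.g. {"domain": ..., "tier": "direct", "category": ...}
    swap_category: category of the swapped tool, e.g. "analytics"
    our_category: the category we sell into
    """
    sells_against = swap_category == our_category
    if entry["tier"] == "direct" and sells_against:
        return "page_gtm"  # direct competitor swapped a tool we sell against
    return "log_only"      # indirect competitors and unrelated categories
```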

Step 2: Daily snapshot

import os
import json
import yaml
from datetime import date
from pathlib import Path
from urllib.parse import urlparse
from apify_client import ApifyClient

client = ApifyClient(os.environ["APIFY_TOKEN"])

def snapshot_today(watchlist_path: str, snapshot_dir: str):
    """Fetch today's tech stack for every watchlist domain; write one JSON file per domain."""
    with open(watchlist_path) as f:
        watchlist = yaml.safe_load(f)["watchlist"]

    urls = [f"https://{w['domain']}" for w in watchlist]
    run = client.actor("nexgendata/wappalyzer-replacement").call(
        run_input={"urls": urls, "render_js": True}
    )
    items = list(client.dataset(run["defaultDatasetId"]).iterate_items())

    today_dir = Path(snapshot_dir) / date.today().isoformat()
    today_dir.mkdir(parents=True, exist_ok=True)
    for item in items:
        # Keep only the hostname so a path in the URL cannot leak into the filename.
        domain = urlparse(item["url"]).netloc
        with open(today_dir / f"{domain}.json", "w") as f:
            json.dump(item["technologies"], f, indent=2)

snapshot_today("watchlist.yaml", "./snapshots")

For a 50-domain watchlist this run completes in under 5 minutes and costs $0.50 per snapshot ($15/month at daily cadence).

Step 3: Diff against yesterday

Diffs are simple set operations on the technology names, plus a version comparison for technologies that report a version.

import json
from datetime import date, timedelta
from pathlib import Path

def diff_snapshots(domain: str, snapshot_dir: str):
    today = date.today()
    yesterday = today - timedelta(days=1)

    today_path = Path(snapshot_dir) / today.isoformat() / f"{domain}.json"
    yesterday_path = Path(snapshot_dir) / yesterday.isoformat() / f"{domain}.json"

    # Without both snapshots there is nothing to compare.
    if not today_path.exists() or not yesterday_path.exists():
        return {"status": "no_baseline"}

    today_techs = {t["name"]: t for t in json.loads(today_path.read_text())}
    yesterday_techs = {t["name"]: t for t in json.loads(yesterday_path.read_text())}

    added = set(today_techs) - set(yesterday_techs)
    removed = set(yesterday_techs) - set(today_techs)

    version_bumps = []
    for name in set(today_techs) & set(yesterday_techs):
        v_today = today_techs[name].get("version")
        v_yesterday = yesterday_techs[name].get("version")
        if v_today and v_yesterday and v_today != v_yesterday:
            version_bumps.append({"name": name, "from": v_yesterday, "to": v_today})

    return {
        "status": "diffed",
        "added": [{"name": n, "categories": today_techs[n]["categories"]} for n in added],
        "removed": [{"name": n, "categories": yesterday_techs[n]["categories"]} for n in removed],
        "version_bumps": version_bumps,
    }

Step 4: Filter false positives

Daily detection is noisy. A site might temporarily fail to load a CDN, a CDN might briefly serve a different file, or anti-bot defenses might serve you a different page than they did the prior day. The mitigation is to require N consecutive samples before declaring a real change.

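def load_names(domain: str, snapshot_dir: str, day: date):
    """Technology names in the snapshot for `day`, or None if that snapshot is missing."""
    path = Path(snapshot_dir) / day.isoformat() / f"{domain}.json"
    if not path.exists():
        return None
    return {t["name"] for t in json.loads(path.read_text())}

def confirmed_change(domain: str, snapshot_dir: str, days: int = 3):
    """Return only changes that have persisted for `days` consecutive snapshots."""
    today = date.today()
    # Baseline = last snapshot before the window; a change counts only if every
    # snapshot inside the window differs from the baseline in the same way.
    baseline = load_names(domain, snapshot_dir, today - timedelta(days=days))
    recent = [load_names(domain, snapshot_dir, today - timedelta(days=i)) for i in range(days)]
    if baseline is None or any(r is None for r in recent):
        return {"added": set(), "removed": set()}  # incomplete window: confirm nothing
    persistent_added = set.intersection(*(r - baseline for r in recent))
    persistent_removed = set.intersection(*(baseline - r for r in recent))
    return {"added": persistent_added, "removed": persistent_removed}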

The 3-day rule cuts roughly 80% of noise. For high-value alerts (direct competitor in your category), drop to 2 days; for low-value (indirect competitor in adjacent category), require 5 days.
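
The per-alert thresholds above can be kept in one small helper so the persistence window is chosen per domain rather than hard-coded. This is a sketch: the function name and the `same_category` flag are assumptions, and the 2/3/5 values simply restate the rule of thumb in the text.

```python
def persistence_days(tier: str, same_category: bool) -> int:
    """How many consecutive snapshots a change must survive before alerting.

    tier: "direct" or "indirect", as in the watchlist
    same_category: True if the competitor operates in your own category
    """
    if tier == "direct" and same_category:
        return 2  # high-value alert: accept a shorter confirmation window
    if tier == "indirect" and not same_category:
        return 5  # low-value alert: demand more evidence
    return 3      # default rule
```

The result would be passed as the `days` argument of whatever confirmation function the pipeline uses.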

Step 5: Categorize the diff

A raw diff like added: [Klaviyo, Postscript] is less useful than an annotated one like added: [Klaviyo (marketing-automation)], which tells you the competitor previously had no marketing-automation tool. The categorization layer:

CATEGORY_OF_INTEREST = {
    "analytics": ["Mixpanel", "Amplitude", "Heap", "PostHog", "Segment", "Rudderstack"],
    "feature_flags": ["LaunchDarkly", "Split.io", "Optimizely", "Statsig"],
    "search": ["Algolia", "Typesense", "Meilisearch", "Elasticsearch"],
    "payments": ["Stripe", "Braintree", "Adyen", "Recurly"],
    "crm": ["HubSpot", "Salesforce", "Pipedrive"],
    "support": ["Intercom", "Zendesk", "Help Scout", "Front"],
    "errors": ["Sentry", "Rollbar", "Bugsnag", "Honeybadger"],
}

def classify_swap(added: set, removed: set):
    """Detect within-category swaps (e.g. Heap removed, Mixpanel added = analytics swap)."""
    swaps = []
    for category, tools in CATEGORY_OF_INTEREST.items():
        added_in_cat = {t for t in added if t in tools}
        removed_in_cat = {t for t in removed if t in tools}
        if added_in_cat and removed_in_cat:
            swaps.append({
                "category": category,
                "from": list(removed_in_cat),
                "to": list(added_in_cat),
            })
    return swaps

A within-category swap is the highest-signal event in the entire pipeline. "Direct competitor X just swapped Mixpanel for PostHog" is a Slack-paging event for any analytics-tool vendor. It tells you: their evaluation chose your competitor (or your product) over the incumbent, the swap is fresh, and the prospect/competitor is in the post-evaluation honeymoon.

Step 6: Digest

The output is a daily Slack message or email, posted to the GTM channel:

COMPETITOR STACK CHANGES — 2026-05-18

DIRECT competitors (3 changes):
  rival-saas-1.com  +  Sentry (errors)
  rival-saas-2.com  -  Heap, +  Mixpanel  (SWAP: analytics)
  rival-saas-3.com  +  Stripe Tax (payments)

INDIRECT competitors (2 changes):
  adjacent-vendor.com  -  Segment (analytics)
  partner-prospect.com  +  HubSpot (crm)

Confirmed changes (>=3 day persistence). Full diff: snapshots/2026-05-18/diff.json

Posting it takes about five lines of Python around slack_sdk's WebClient.chat_postMessage. The hard work was upstream.
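
Building the message text itself is plain string formatting. The sketch below assumes each confirmed change has already been reduced to a dict with `tier`, `domain`, and `summary` keys; that shape and the `format_digest` name are illustrative, not part of the actor's output.

```python
from datetime import date

def format_digest(changes: list, day: date, min_days: int = 3) -> str:
    """Render confirmed changes as a plain-text digest grouped by competitor tier.

    changes: dicts with 'tier' ("direct"/"indirect"), 'domain', and 'summary'
             (hypothetical shape, e.g. summary="-  Heap, +  Mixpanel  (SWAP: analytics)")
    """
    lines = [f"COMPETITOR STACK CHANGES - {day.isoformat()}", ""]
    for tier in ("direct", "indirect"):
        group = [c for c in changes if c["tier"] == tier]
        if not group:
            continue  # skip empty sections entirely
        lines.append(f"{tier.upper()} competitors ({len(group)} changes):")
        lines.extend(f"  {c['domain']}  {c['summary']}" for c in group)
        lines.append("")
    lines.append(f"Confirmed changes (>={min_days} day persistence).")
    return "\n".join(lines)
```

The returned string can then be passed as the `text` argument to `chat_postMessage`, along with whichever channel your team uses.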

What this costs

Roughly $30/month for a 100-domain watchlist at daily cadence:

  • Detection: 100 × 30 days × $0.01 = $30/month
  • Compute (running the diff script): negligible, fits in a free GitHub Actions cron
  • Storage (90-day rolling snapshot history at ~5KB per domain per day): 100 × 90 × 5KB ≈ 45MB total, effectively free in any S3 tier
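
The arithmetic behind these line items is easy to re-run for a different watchlist size. A minimal sketch, assuming the $0.01 per-domain detection price and ~5KB per snapshot quoted above (function names are illustrative):

```python
def monthly_detection_cost(domains: int, price_per_domain: float = 0.01, days: int = 30) -> float:
    """Detection spend in dollars for one snapshot per domain per day."""
    return domains * days * price_per_domain

def snapshot_storage_mb(domains: int, retention_days: int = 90, kb_per_snapshot: float = 5.0) -> float:
    """Size in MB of a rolling snapshot history (1 MB taken as 1000 KB)."""
    return domains * retention_days * kb_per_snapshot / 1000
```

With 100 domains this gives about $30/month of detection and 45 MB of 90-day history; a 50-domain watchlist halves both.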

Compare to the alternatives:

| Tool | Cost (100 domains daily) | What you get |
| --- | --- | --- |
| BuiltWith Pro | $495/month flat (10k lookups, about 3 snapshots per day of 100 domains) | Hosted UI, historical data back ~5 years, 70k fingerprints |
| SimilarTech Pro | $290/month (limited domains) | Hosted UI, marketing-tech focus |
| HG Insights | Enterprise contract ($30k+/yr) | ABM-tuned, sales seat licenses |
| Wappalyzer Enterprise | $250–$5,000/month | Hosted UI, daily snapshots |
| Build it yourself + actor | $30/month + 1 engineer-day setup | OSS ruleset, fully scriptable, pipes into your tools |

The hosted tools win on UI and historical data. They lose on script-ability, custom logic, and per-seat economics. If your CTO and 4 engineers all want to run ad-hoc queries on the data, BuiltWith requires 5 seats; the actor pipeline requires zero.

Real intelligence patterns

Once the pipeline is running, a few specific patterns to look for:

The "same-quarter swap" pattern. Three or more direct competitors swap the same tool within a quarter. This is the strongest possible signal that the incumbent has lost the market and the new tool has won it. (When Segment lost most of mid-market analytics to Rudderstack and PostHog over 2024, this pattern was visible in stack-data weeks before it surfaced in any analyst report.)
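
Detecting this pattern is a group-by over confirmed swap events. The sketch below assumes each swap is stored as a dict with `domain`, `date`, `from`, and `to` keys (a hypothetical shape) and reports every vendor pair that three or more distinct domains swapped within the same calendar quarter.

```python
from collections import defaultdict
from datetime import date

def same_quarter_swaps(swaps: list, min_domains: int = 3) -> list:
    """Find (from, to) vendor swaps made by at least `min_domains` domains in one quarter."""
    domains_by_key = defaultdict(set)
    for s in swaps:
        quarter = f"{s['date'].year}-Q{(s['date'].month - 1) // 3 + 1}"
        domains_by_key[(quarter, s["from"], s["to"])].add(s["domain"])
    return [
        {"quarter": q, "from": old, "to": new, "domains": sorted(domains)}
        for (q, old, new), domains in sorted(domains_by_key.items())
        if len(domains) >= min_domains
    ]
```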

The "marketing site vs. app subdomain" delta. Your competitor's www.competitor.com shows the public marketing stack. Their app.competitor.com shows the customer-facing product stack. The delta between the two — tools the marketing team uses but the product team does not, and vice versa — is informative. (A competitor whose marketing site loads HubSpot but whose app subdomain loads Pendo is running two different stacks for two different teams; that tells you something about their org structure.)
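
Given two detected stacks as sets of technology names, the delta is three set operations. A minimal sketch (the function name and output keys are illustrative):

```python
def stack_delta(marketing: set, app: set) -> dict:
    """Split two detected stacks into marketing-only, app-only, and shared tools."""
    return {
        "marketing_only": sorted(marketing - app),
        "app_only": sorted(app - marketing),
        "shared": sorted(marketing & app),
    }
```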

The "fresh subdomain" pattern. A new subdomain appears (e.g., ai.competitor.com or partners.competitor.com). Run the actor against the new subdomain to see what stack it shipped with. New surfaces often reveal new product directions before any marketing announcement.

The "stack drift" pattern. A competitor's stack does not change for 18 months. They are either in maintenance mode (good news for you) or they have a moat large enough that they don't need to evaluate alternatives (different read).

The "vendor pile-up" pattern. A competitor adds a tool but does not remove the previous one. Three months later you can see both Heap and Mixpanel on their site. This is usually a half-finished migration, which is intelligence about their internal velocity.
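
A pile-up shows up as two or more tools from the same category being present in every recent snapshot. The sketch below assumes snapshots are available as sets of technology names, oldest first, and takes a category map of the kind used for swap classification; the function name and the `min_snapshots` default are illustrative.

```python
def vendor_pileups(snapshots: list, categories: dict, min_snapshots: int = 3) -> dict:
    """Return {category: [tools]} where 2+ tools of one category appear in every recent snapshot.

    snapshots: list of sets of technology names, oldest first
    categories: mapping of category name -> list of tool names
    """
    recent = snapshots[-min_snapshots:]
    if len(recent) < min_snapshots:
        return {}  # not enough history to call anything persistent
    pileups = {}
    for category, tools in categories.items():
        always_present = set(tools).intersection(*recent)
        if len(always_present) >= 2:
            pileups[category] = sorted(always_present)
    return pileups
```

To match the three-month horizon in the text, feed it weekly or monthly samples rather than the last three daily snapshots.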

Customizing fingerprints for your category

The OSS Wappalyzer ruleset has 251 fingerprints, which covers the top 95% of mainstream technologies. For category-specific competitive intelligence (say, you want to detect every CDP and every reverse-ETL tool), you may need custom fingerprints.

The actor accepts custom fingerprints as input, in the standard Wappalyzer JSON format:

{
  "Hightouch": {
    "cats": [76],
    "scriptSrc": ["hightouch\\.com/.*\\.js"],
    "html": ["<script[^>]+hightouch\\.io"]
  },
  "Census": {
    "cats": [76],
    "scriptSrc": ["census\\.app\\.com/.*\\.js"]
  }
}

Drop the JSON into the actor's custom_fingerprints input and your daily snapshot will detect those tools alongside the 251 bundled. Custom fingerprints are how you turn a generic stack-detection tool into a category-specific intel platform.
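
Because fingerprint patterns are regular expressions, a typo can silently disable a rule. A small pre-flight check, sketched here, compiles every pattern before the JSON is handed to the actor. It only inspects the `scriptSrc` and `html` fields used in the example above, and the function name is hypothetical; whether the actor compiles patterns with the same regex dialect as Python is an assumption.

```python
import re

PATTERN_FIELDS = ("scriptSrc", "html")  # only the fields used in the example fingerprints

def validate_fingerprints(fingerprints: dict) -> list:
    """Return (technology, field, pattern, error) for every pattern that fails to compile."""
    problems = []
    for tech, rule in fingerprints.items():
        for field in PATTERN_FIELDS:
            patterns = rule.get(field, [])
            if isinstance(patterns, str):  # tolerate a single string instead of a list
                patterns = [patterns]
            for pattern in patterns:
                try:
                    re.compile(pattern)
                except re.error as exc:
                    problems.append((tech, field, pattern, str(exc)))
    return problems
```

An empty list means every pattern compiled; anything else names the technology and field to fix.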

Operational guardrails

A few things that have bitten teams running this pipeline:

Expect anti-bot defenses. If a competitor sits behind Cloudflare Bot Fight Mode at high security, you may be served a challenge page instead of the real site. The actor includes residential-proxy rotation and stealth defaults, which resolves this for ~95% of sites. The remaining 5% will return inconsistent stacks, and you should flag them as "low-confidence" in your pipeline.
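
One way to flag such domains is to measure how much the detected stack fluctuates from day to day. The sketch below uses mean Jaccard similarity between consecutive snapshots; the function names and the 0.7 threshold are assumptions to tune against your own data.

```python
def stability(snapshots: list) -> float:
    """Mean Jaccard similarity between consecutive snapshots (1.0 means identical).

    snapshots: list of sets of technology names, oldest first
    """
    if len(snapshots) < 2:
        return 1.0
    scores = []
    for prev, cur in zip(snapshots, snapshots[1:]):
        union = prev | cur
        scores.append(len(prev & cur) / len(union) if union else 1.0)
    return sum(scores) / len(scores)

def is_low_confidence(snapshots: list, threshold: float = 0.7) -> bool:
    """True when the stack changes so much day to day that detections look unreliable."""
    return stability(snapshots) < threshold
```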

Don't crawl beyond the homepage unless you mean to. Most stack signals are on the homepage. Crawling deeper increases cost without much marginal signal, and it looks more like scraping.

Be aware of detection limits. Wappalyzer-style fingerprinting tells you what is loaded on the public site. It does not tell you what runs in the competitor's backend, what runs internally, or what they bought but have not deployed. A competitor that signed a Snowflake contract last quarter will not appear as "running Snowflake" until something Snowflake-shaped loads on a public surface.

Filter your watchlist down. It is tempting to add every company you have ever wondered about. The signal-to-noise ratio of a 200-domain watchlist is much worse than a 30-domain watchlist of companies you actually care about. Start narrow.

Snapshot the diff, not just the state. Disk space is cheap; rebuilding a year of diff history from scratch is not. Persist daily diff JSON alongside the snapshot JSON.
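
Persisting the diff can be a single write next to the day's snapshots. A minimal sketch, assuming the diffs are collected into one `{domain: diff}` dict and stored as `diff.json` in the dated folder (the function name is illustrative):

```python
import json
from datetime import date
from pathlib import Path

def persist_diff(diffs: dict, snapshot_dir: str, day: date) -> Path:
    """Write {domain: diff} to <snapshot_dir>/<YYYY-MM-DD>/diff.json and return the path."""
    day_dir = Path(snapshot_dir) / day.isoformat()
    day_dir.mkdir(parents=True, exist_ok=True)
    out_path = day_dir / "diff.json"
    # default=sorted turns any sets inside the diff into sorted lists for JSON.
    out_path.write_text(json.dumps(diffs, indent=2, sort_keys=True, default=sorted))
    return out_path
```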

What you do with the output

The pipeline emits a daily diff. The interesting work is what you do with it.

  • GTM team. Direct-competitor swaps in your category get routed to the AE who owns the competitive teardown for that account list. "Rival-1 just dropped Heap" is a battlecard update.
  • Product team. Within-category swaps inform your own evaluation cycle. "Three competitors moved off Segment to PostHog in Q1" is useful when your VP Eng is debating Segment renewal.
  • Marketing team. Tool-specific landing pages. If you sell a Mixpanel competitor and four direct competitors just adopted Mixpanel, your "Mixpanel alternatives" landing page just got more relevant.
  • Sales engineering. Pre-call research. Before any meeting with a prospect that is in your watchlist, the SE can pull the latest stack and walk in knowing the integration story.

Putting it together

The pipeline is about 150 lines of Python, $30/month of detection cost, and one engineer-day of setup. It produces a daily intelligence feed that is, for most B2B markets, comparable to or better than what a $495/month BuiltWith Pro subscription gives you, plus full programmability, custom fingerprints, and zero seat licensing.

If you want to skip the build, the wappalyzer-replacement actor handles the detection layer end to end. Schedule it daily, persist the JSON, run a 30-line diff script. The competitive-intelligence channel is the highest-leverage Slack channel in your company, and standing it up is a Tuesday afternoon.


NexGenData publishes 195+ actors for competitive intelligence and market research workflows. Pay-per-result, no contracts, no per-seat fees.

Related actors for the competitive-intelligence stack:

  • company-data-aggregator — eight-source OSINT aggregator (WHOIS, DNS, CT logs, GitHub, npm, tech headers) for full competitor profiling beyond just the marketing site.
  • jobs-tech-stack-extractor — scans competitor job postings for tech-stack mentions, complementing runtime detection with hiring-intent signals (often 6–12 months ahead of public deployment).
  • shopify-store-directory — for ecommerce competitive analysis: pulls Shopify stores by category with estimated monthly revenue bands, pairs with stack fingerprinting.
