DEV Community

NexGenData

Wappalyzer Paywalled Itself in 2023. Here's the OSS-Powered Replacement

Wappalyzer died on May 24, 2023. Here's the playbook.

"Died" is a little unfair — Wappalyzer, the company, is very much alive. What died was the open-source project that underpinned roughly a decade of technology-stack fingerprinting across the web. On that date in May 2023, the maintainers announced the fingerprint database was going closed-source. The browser extension stopped updating. The GitHub repository was frozen. The community-contributed technologies/*.json files — the actual ruleset that detected Shopify, Next.js, jQuery, and thousands of others — were removed from the public repo and folded into Wappalyzer's paid SaaS product.

The paid product itself is not unreasonable for enterprises: $250/month for the Team tier at 500 lookups/month, up to $5,000/month for the agency tier at 50,000/month. The problem is that for the tens of thousands of developers who were running the open-source CLI locally, scripting the extension, or self-hosting the detection library inside their own OSINT tools, this was a hard break. The replacement commercial offerings — BuiltWith at $295-$995/month, HG Insights at low-five-figure annual contracts, SimilarTech on comparable enterprise pricing — were not economically viable for the "I want to know what tech stack 10k domains use" scripting workflow that Wappalyzer OSS had served.

Meanwhile, the pre-paywall fingerprint rulesets survived on GitHub forks, and the broader OSS community picked them back up. The enthec/webappanalyzer, tunetheweb/wappalyzer, and dochne/wappalyzer forks are all actively maintained as of 2026, with 251 technology fingerprints in the current bundled ruleset and a steady contribution stream.

This post is a migration guide from paywalled Wappalyzer to a Puppeteer-powered, OSS-fingerprint-driven replacement. The reference implementation is the wappalyzer-replacement actor, which bundles the current OSS ruleset, runs Puppeteer for JavaScript-execution fingerprinting, and returns Wappalyzer-compatible JSON at $0.01 per tech-detection call — 100x cheaper than BuiltWith's enterprise tier at comparable accuracy.

Fingerprint counts and pricing are current as of Q2 2026; check each vendor for live rates.

Why fingerprinting is harder than it looks

Technology fingerprinting sounds trivial — look for jQuery in the page source, detect jQuery. In practice the space is full of traps.

Minified and bundled code hides framework signatures. Modern sites don't ship jquery-3.6.0.js. They ship vendor.a7f2c9e.js — a Webpack bundle that has jQuery, lodash, date-fns, and twelve other libraries concatenated and minified. Your fingerprints must match obscure patterns that survive minification, not readable tokens.

JavaScript-first SPAs don't render anything server-side. Fetching the HTML for a React or Vue app gets you a <div id="root"></div> and a script tag. Every piece of interesting tech-stack information — the framework, the UI library, the analytics tags, the A/B testing tools — materializes only after JavaScript executes. A pure curl + regex fingerprinter misses most of the modern web.

Tech stacks have dozens of leakage points and only some are reliable. HTTP headers, cookies, meta tags, script URLs, global JavaScript variables, inline JSON blobs, favicon hashes, font CDN URLs, DOM attributes, CSS class prefixes, XHR response signatures. Each source has different reliability. X-Powered-By: Express is definitive; presence of a class named ant-btn is strong but not definitive evidence of Ant Design; a <meta name="generator" content="WordPress 6.4"> is trivially spoofable.
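One way to handle these mixed-reliability sources is to give each signal type an explicit weight and combine matches into a single confidence score. A minimal sketch of that idea — the signal names and weights here are illustrative, not the actual ruleset's values:

```python
# Illustrative per-source reliability weights (not the real ruleset's numbers).
SIGNAL_WEIGHTS = {
    "header": 100,        # e.g. X-Powered-By: Express -- definitive
    "js_global": 90,      # e.g. window.__NEXT_DATA__ -- strong
    "css_class": 70,      # e.g. an ant-btn class -- suggestive
    "meta_generator": 50, # e.g. <meta name="generator"> -- spoofable
}

def combined_confidence(matched_signals):
    """Combine matched signal types into a 0-100 confidence score.

    Takes the strongest signal as the floor, then adds a small bonus
    for each corroborating signal, capped at 100.
    """
    if not matched_signals:
        return 0
    weights = sorted((SIGNAL_WEIGHTS[s] for s in matched_signals), reverse=True)
    score = weights[0]
    for w in weights[1:]:
        score = min(100, score + w // 10)  # corroboration bonus
    return score
```

A spoofable meta tag alone stays low, while a definitive header alone is enough — which is exactly the asymmetry the examples above describe.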

Fingerprints go stale fast. Libraries rev, CDNs rotate URLs, frameworks rename their global variables. A rule that matched React 17 might not match React 18. A rule looking for window.__NEXT_DATA__ worked until Next.js 13 started using App Router and stopped emitting it on some routes. Maintained fingerprints matter.

Anti-bot systems block naive fetchers. Cloudflare Bot Fight Mode, PerimeterX, DataDome, and Akamai Bot Manager will happily return a challenge page to your unauthenticated Puppeteer instance. You need residential proxies, stealth plugins, or both, before you can even get to the fingerprinting phase.

Wappalyzer, to its credit, handled most of these. The replacement has to handle them too.

Who still needs this

The use cases have not gone away, even as the tooling has become harder to access:

  • Sales intelligence teams qualifying leads by tech stack. "Companies using Shopify Plus and Klaviyo and Gorgias" is a useful ICP filter; running it requires tech-stack data.
  • Competitive analysis. Which of our competitors shipped a ReCharts dashboard in the last six months? Which moved off Segment?
  • Security researchers looking for vulnerable version footprints across the web. Drupal 7, Magento 1.x, WordPress <5.8 at scale for responsible disclosure.
  • Ad tech and partnerships. "Sites running GA4 but not yet on Google Signals" is an outreach list.
  • M&A diligence. Before buying a SaaS company, confirming their self-reported tech stack against what's actually in production.
  • Marketing attribution audits. Comparing declared marketing stack to observed pixels.
  • Web archaeology and research. Academic work on web technology diffusion.

None of these need a $5k/month subscription. Most need a scriptable CLI or API that emits "domain → list of detected technologies" on demand.

Comparison: Wappalyzer Enterprise vs. the alternatives

| Tool | Price | Coverage | JS execution | Fingerprint count | Self-host | Best for |
|------|-------|----------|--------------|-------------------|-----------|----------|
| Wappalyzer Enterprise | $250-$5,000/mo | Whole web | Yes | ~2,700 (closed) | No | Legacy customers on contracts |
| BuiltWith Pro | $295-$995/mo | Whole web | Limited | ~70,000 (proprietary) | No | Broadest coverage, historical data |
| HG Insights | Enterprise | Whole web | Yes | Proprietary | No | Enterprise lead scoring |
| SimilarTech | Enterprise | Whole web | Yes | Proprietary | No | Marketing tech stack focus |
| whatcms.org | $99/mo tiered | CMS only | No | ~500 CMS-focused | No | CMS detection, narrow scope |
| Nuclei + tech templates | Free | Whole web | No | ~800 tech templates | Yes | Security-adjacent, OSS-first |
| OSS Wappalyzer forks | Free | Whole web | DIY | 251 bundled | Yes | Devs willing to operate it |
| wappalyzer-replacement | $0.01/detection | Whole web | Yes (Puppeteer) | 251 bundled + custom | Hybrid | Pay-per-use OSS ruleset + infra |

A few honest calls:

  • BuiltWith has the most fingerprints in the market and historical data no one else has. If you need the long tail of obscure tech and the historical view, pay BuiltWith.
  • HG Insights and SimilarTech are pure enterprise plays; they won't sell to a solo developer.
  • Nuclei is a security scanner first, fingerprinter second. Templates skew toward CVE detection. Good for security teams, awkward for marketing teams.
  • OSS forks work great if you're willing to operate your own Chrome Headless fleet. The moment you hit the first Cloudflare challenge, you remember why it's not zero-ops.
  • wappalyzer-replacement sits in the middle: OSS ruleset, managed infrastructure, pay per detection.

The 251 fingerprints

The bundled ruleset in the current tunetheweb/wappalyzer fork covers 251 technologies across these categories:

  • Web frameworks: Next.js, Remix, Nuxt, SvelteKit, Astro, Gatsby, Hugo, Jekyll, Django, Rails, Laravel, Express, Fastify, FastAPI.
  • UI libraries: React, Vue, Svelte, Angular, Preact, Alpine.js, Stimulus, HTMX, Solid.
  • Component libraries: Material UI, Ant Design, Chakra UI, Radix UI, shadcn/ui, Bootstrap, Tailwind CSS.
  • CMS: WordPress, Shopify, Webflow, Ghost, Contentful, Sanity, Strapi, Directus, WooCommerce, Magento, BigCommerce, Drupal, Joomla, Squarespace, Wix, Framer.
  • Analytics: GA4, Mixpanel, Amplitude, Heap, Segment, Posthog, Plausible, Fathom, Simple Analytics, Matomo.
  • Ad/marketing tech: HubSpot, Marketo, Pardot, Klaviyo, Intercom, Drift, Crisp, Customer.io, Mailchimp, ActiveCampaign.
  • CDN/hosting: Cloudflare, Fastly, Akamai, AWS CloudFront, Vercel, Netlify, Cloudflare Pages, GitHub Pages, Fly.io, Railway.
  • Payment: Stripe, PayPal, Square, Shopify Payments, Adyen, Braintree, Klarna, Affirm.
  • Experimentation: Optimizely, VWO, LaunchDarkly, Statsig, GrowthBook, Split.
  • Monitoring: Sentry, Datadog RUM, LogRocket, FullStory, Hotjar, Smartlook.
  • Search: Algolia, Typesense, Meilisearch, ElasticSearch.
  • Tag management: GTM, Tealium, Segment.

251 is fewer than BuiltWith's 70k, but it covers the head of the distribution — the technologies that matter for 95% of sales/marketing/competitive-analysis workflows. The actor accepts custom fingerprint rules too, so your team can contribute proprietary rules without round-tripping through the OSS repo.

Architecture

```
[domain list]
     |
     v
[wappalyzer-replacement actor]
     |
     +-> Puppeteer headless Chromium
     |      - navigates to domain
     |      - waits for network idle
     |      - evaluates fingerprint JS in page context
     |      - dumps cookies, globals, meta, script tags
     |
     +-> HTTP headers / HTML fetcher (fallback)
     |      - for static sites, faster path
     |
     +-> 251 OSS fingerprint rules
     |      - regex matches against:
     |        headers, cookies, HTML, JS globals,
     |        DOM attributes, script URLs, meta tags
     |
     +-> Version extraction
     |      - where fingerprints capture version groups
     |
     v
[JSON: { domain, technologies: [ { name, version?, confidence, categories } ] }]
```

The actor runs Chromium inside an Apify container, applies stealth plugins (evasive User-Agent, WebGL fingerprint randomization, navigator.webdriver removal) and rotates through Apify's datacenter + residential proxy pool. For domains gated by Cloudflare Bot Fight Mode, the residential path typically succeeds; for harder targets (PerimeterX, DataDome on max-difficulty) you can explicitly request a residential proxy via input param at modest additional cost.
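The fingerprint-matching stage in that diagram reduces to regexes applied to the evidence the browser collected. A stripped-down sketch of that step — the rule below imitates the shape of the OSS technologies/*.json schema, and the evidence dict is a simplification of what the actor actually dumps:

```python
import re

# A simplified rule in the spirit of the OSS technologies/*.json schema.
NEXTJS_RULE = {
    "name": "Next.js",
    "headers": {"x-powered-by": r"Next\.js"},
    "globals": [r"^__NEXT_DATA__$"],
    "scripts": [r"/_next/static/"],
}

def match_rule(rule, evidence):
    """Return True if any regex in the rule matches the page evidence.

    evidence = {"headers": {name: value}, "globals": [...], "scripts": [...]}
    """
    for header, pattern in rule.get("headers", {}).items():
        if re.search(pattern, evidence["headers"].get(header, ""), re.I):
            return True
    for pattern in rule.get("globals", []):
        if any(re.search(pattern, g) for g in evidence["globals"]):
            return True
    for pattern in rule.get("scripts", []):
        if any(re.search(pattern, s) for s in evidence["scripts"]):
            return True
    return False
```

Because globals and script URLs only exist after JavaScript runs, the evidence-collection step (Puppeteer) is what makes this rule useful for SPAs; the matching itself is cheap.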

Code examples

Python: fingerprint a single domain

```python
import os

from apify_client import ApifyClient

# Read the token from the environment rather than hard-coding it.
client = ApifyClient(os.environ["APIFY_TOKEN"])

run = client.actor("nexgendata/wappalyzer-replacement").call(run_input={
    "domains": ["stripe.com"],
    "use_residential_proxy": False,
    "include_version": True,
})

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    for tech in item["technologies"]:
        version = f" {tech['version']}" if tech.get("version") else ""
        print(f"  - {tech['name']}{version} ({tech['confidence']}%)")
```

Sample output:

```
  - Next.js 14.2 (100%)
  - React 18.2 (100%)
  - Stripe.js (100%)
  - Cloudflare (100%)
  - Segment (95%)
  - Sentry (90%)
  - HubSpot (85%)
  - Google Tag Manager (100%)
  - Intercom (100%)
```

curl: fingerprint at scale

```bash
curl -X POST "https://api.apify.com/v2/acts/nexgendata~wappalyzer-replacement/run-sync-get-dataset-items?token=$APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "domains": ["shopify.com", "vercel.com", "supabase.com"],
    "use_residential_proxy": false,
    "timeout_per_domain_s": 30
  }'
```

Returns a JSON array with one object per domain, each with a technologies array. Suitable for cron-driven weekly scans.
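A minimal consumer of that response for a cron-driven scan might look like this — the response shape assumed here is the one shown throughout this post (one object per domain, each with a technologies array):

```python
import json

def sites_using(items, tech_name):
    """From a run-sync-get-dataset-items response, list domains using a tech."""
    return [
        item["domain"]
        for item in items
        if any(t["name"] == tech_name for t in item["technologies"])
    ]

# Example against a canned response body:
raw = """[
  {"domain": "shopify.com", "technologies": [{"name": "Cloudflare", "confidence": 100}]},
  {"domain": "vercel.com",  "technologies": [{"name": "Next.js", "confidence": 100}]}
]"""
items = json.loads(raw)
```

Diffing this week's `sites_using` output against last week's is the whole "who adopted X" workflow.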

Node.js: enrich a CRM with tech-stack tags

```javascript
const { ApifyClient } = require('apify-client');
const apify = new ApifyClient({ token: process.env.APIFY_TOKEN });

async function enrichLead(domain) {
  const run = await apify.actor('nexgendata/wappalyzer-replacement').call({
    domains: [domain],
    include_version: true,
  });
  const { items } = await apify.dataset(run.defaultDatasetId).listItems();
  const tech = items[0]?.technologies || [];

  return {
    domain,
    uses_shopify: tech.some(t => t.name === 'Shopify'),
    uses_klaviyo: tech.some(t => t.name === 'Klaviyo'),
    uses_gorgias: tech.some(t => t.name === 'Gorgias'),
    ecommerce_stack: tech.filter(t =>
      ['Shopify', 'BigCommerce', 'WooCommerce', 'Magento'].includes(t.name)
    ).map(t => t.name),
  };
}

(async () => {
  const enriched = await enrichLead('allbirds.com');
  console.log(enriched);
})();
```

This pattern — detect specific technologies, emit booleans for the CRM — is the dominant use case for sales intelligence teams.

Python: bulk scan with category filter

```python
run = client.actor("nexgendata/wappalyzer-replacement").call(run_input={
    "domains": [
        "notion.so", "linear.app", "figma.com",
        "clickup.com", "monday.com"
    ],
    "categories_filter": ["Analytics", "A/B Testing", "Customer Data Platform"],
    "include_version": False,
})

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    tags = [t["name"] for t in item["technologies"]]
    print(f"{item['domain']}: {', '.join(tags)}")
```

The categories_filter input drops fingerprints outside the requested categories before emitting results. Useful when you only care about a specific layer of the stack.

curl: detect custom internal library

```bash
curl -X POST "https://api.apify.com/v2/acts/nexgendata~wappalyzer-replacement/run-sync-get-dataset-items?token=$APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "domains": ["competitor-a.com", "competitor-b.com"],
    "custom_fingerprints": [
      {
        "name": "AcmeTracker",
        "category": "Analytics",
        "scripts": ["acme-tracker\\.v[0-9]+\\.js"],
        "globals": ["__acmeTrackQueue"]
      }
    ]
  }'
```

Custom fingerprints follow the same schema as the OSS ruleset — regex matches against scripts, globals, cookies, headers, or HTML. Internal teams often maintain a small private ruleset for their own products and competitors, layered on top of the public 251.

Worked example: rebuilding a paywalled sales-ops playbook

A 30-person ecommerce agency had a 2022-era playbook that went like this:

  1. Pull a list of 10,000 Shopify stores in the US with Alexa rank between 100k and 1M (RIP).
  2. Fingerprint each to detect which ones were on Klaviyo (their affiliate partner) but NOT yet on Gorgias (their service offering).
  3. Contact the intersection — roughly 400-600 stores per pull.

When Wappalyzer's CLI stopped updating and BuiltWith wanted $295/month for the Basic tier (too shallow anyway — the Klaviyo/Gorgias breakdowns required their Pro tier at $995/month), the playbook was shelved.

With the replacement actor:

  1. Shopify store list comes from a companion shopify-store-directory actor ($50 for 10k stores).
  2. Fingerprint the 10k with wappalyzer-replacement at $0.01/detection = $100.
  3. Filter in-memory to uses_klaviyo && !uses_gorgias = ~450 stores.
  4. Export to the CRM for outreach.

Total cost: $150, done in about 90 minutes including the proxy warmup. The BuiltWith-equivalent would have been $995/month minimum, and would have required manually downloading CSVs rather than running headless in a pipeline.
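Step 3 of that pipeline, the in-memory filter, is a few lines over the dataset items. A sketch against the output schema shown earlier (field names as in the examples above):

```python
def klaviyo_not_gorgias(items):
    """Keep stores on Klaviyo (affiliate partner) but not yet on Gorgias."""
    hits = []
    for item in items:
        names = {t["name"] for t in item["technologies"]}
        if "Klaviyo" in names and "Gorgias" not in names:
            hits.append(item["domain"])
    return hits
```

The same set-membership pattern generalizes to any "uses A but not B" ICP filter.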

The agency now re-runs this monthly. Yearly spend: ~$1,800. Previous equivalent: $12,000-$20,000. The savings paid for a junior SDR.

Accuracy and confidence scoring

The actor returns a confidence score between 0 and 100 for each detection. This is not noise — it maps directly to the OSS fingerprint format's weight system:

  • 100: Single-source definitive match. Example: X-Powered-By: Shopify header.
  • 90-99: Multiple weak signals OR one strong signal. Example: React detected via both a react-root DOM attribute and a __REACT_DEVTOOLS_GLOBAL_HOOK__ global.
  • 70-89: Single strong signal. Example: class name prefixed ant- (Ant Design), not independently confirmed by a script URL.
  • 50-69: Single weak signal. Example: a meta tag claiming a framework that isn't otherwise observable.
  • Below 50: Not emitted by default; can be surfaced via min_confidence: 0 input.

Downstream filtering on confidence >= 80 gives you roughly BuiltWith-grade accuracy. Below that, false-positive rates rise noticeably.
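That cutoff is a one-line filter on each result's technologies array — a sketch over the output schema shown earlier:

```python
def high_confidence(technologies, min_confidence=80):
    """Drop detections below the confidence cutoff."""
    return [t for t in technologies if t["confidence"] >= min_confidence]
```

Applied to the Stripe sample output above, this would keep everything except detections in the 50-79 band.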

Handling anti-bot systems

At scale, anti-bot gates are the real cost driver. The actor has three modes:

  1. Direct HTTP fetch. Fastest, cheapest. Works for static sites and sites without bot protection. Succeeds on ~70% of domains.
  2. Puppeteer with datacenter proxy. Default for sites that need JavaScript execution. Handles SPAs. Succeeds on ~92% of domains. Fails on sites with serious Cloudflare/PerimeterX challenges.
  3. Puppeteer with residential proxy. Opt-in via use_residential_proxy: true. Higher per-run cost (additional ~$0.02 proxy surcharge). Succeeds on ~98%+ of domains including Cloudflare Bot Fight Mode and most PerimeterX configurations.

The default is to try mode 1, fall back to mode 2 on first failure. Residential opt-in is manual because the cost is not trivial at scale.
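That escalation policy amounts to a try-in-order loop. A sketch with injected fetchers standing in for the three modes (the function names and error handling here are illustrative, not the actor's internals):

```python
def fingerprint_with_fallback(domain, modes):
    """Try each fetch mode in order; return the first successful result.

    modes: list of (label, fetch_fn) tuples, cheapest first. Each fetch_fn
    returns page evidence on success or raises on failure (e.g. when it
    hits a bot challenge or times out).
    """
    errors = []
    for label, fetch in modes:
        try:
            return label, fetch(domain)
        except Exception as exc:  # challenge page, timeout, etc.
            errors.append((label, exc))
    raise RuntimeError(f"all modes failed for {domain}: {errors}")
```

Keeping residential proxies out of the default mode list is what makes the cost profile predictable: the expensive path only runs when you explicitly add it.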

Gotchas

Common issues you'll hit at scale:

  • Cookie banners. Some detection paths (e.g., Tealium) fire only after consent. The actor clicks common accept buttons on loaded pages; rare ones slip through.
  • Lazy-loaded tags. GA4 via GTM container loads asynchronously; if you navigate and fingerprint within 2 seconds, you'll miss it. The actor uses networkidle2 waits to mitigate, but extremely lazy tags still require explicit wait_ms: 5000 overrides.
  • Shadow DOM. Web Components (custom elements with Shadow DOM) hide their internals. Fingerprints relying on class name prefixes miss them. Workaround is specific component-tag fingerprints.
  • Fingerprint drift. A React 17 rule may not match a specific React 18 bundle. Contribute an update to the OSS repo, or supply a custom_fingerprints override until the upstream rule updates.
  • Sites behind authentication. You cannot fingerprint what you can't see. Logged-out homepage gives you marketing stack only; actual app stack (the SPA framework, the UI lib) often requires an authenticated session. Some teams maintain a test account and pass cookies via input.
  • International sites. A .de site routed through Cloudflare's Germany POP will often have slightly different headers than the same site hit from a US IP. Detection is usually fine, but version extraction occasionally varies.
  • CDN masking. Fastly, Cloudflare, and Akamai strip Server and X-Powered-By headers from origin responses. Detection falls back to cookies and JavaScript globals, which cover most but not all cases.

FAQ

Is this really compatible with Wappalyzer's output format?

Close enough. The actor emits Wappalyzer's output schema — array of objects with name, version, confidence, categories. Wappalyzer's internal slug field is also present. If you had code parsing Wappalyzer JSON, it will parse this with no changes.

What's the relationship to the OSS forks?

The actor ships the current tunetheweb/wappalyzer fingerprint bundle at build time and updates on roughly a biweekly cadence. If a new fingerprint lands in the fork, you'll see it in the actor within 2 weeks. You can also bring your own rules via the custom_fingerprints input.

Can I self-host the same stack?

Yes. The fingerprint rules are MIT-licensed and publicly available. If you have a Chrome Headless fleet, a proxy rotator, and patience, you can run the equivalent locally. The actor is for teams who don't want to operate Chrome fleets.

How does this compare to BuiltWith for coverage?

Breadth: BuiltWith wins, 70k vs 251 fingerprints. Depth on the top 95% of relevant technologies: roughly comparable. Version detection: comparable. Historical data: BuiltWith wins (they have years of historical scans; the actor is point-in-time).

What about rate limits?

The actor itself has no per-customer rate limit. Apify's platform limits concurrent runs per account based on tier. At the Starter tier ($49/month) you can run ~32 concurrent actors, enough to fingerprint ~10,000 domains/hour with the default timeout.

Does it handle JavaScript-heavy SPAs correctly?

Yes. Puppeteer navigates, waits for networkidle2, then runs the fingerprinting script in the page context. Detection for React, Vue, Next.js, Gatsby, Nuxt, Svelte, etc., all work out of the box.

How do I contribute a new fingerprint?

The OSS forks are actively accepting PRs. github.com/tunetheweb/wappalyzer is the most active as of 2026. File format is JSON; schema is documented in the README. Once merged upstream, the actor picks it up in the next build.

Can I use this for security scanning?

You can, but Nuclei is a better fit for vulnerability detection — its templates are tuned for CVE patterns, not just technology presence. Use this actor to identify technology + version, then feed into a vulnerability lookup (NVD, GitHub Advisory Database).
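As a sketch of that handoff: map a detected name and version to a CPE string and an NVD query URL. The NVD CVE API 2.0 endpoint and the CPE 2.3 format are real; the small name-to-CPE mapping table below is illustrative, and building a correct, complete mapping is the real work in this pipeline:

```python
from urllib.parse import urlencode

# Illustrative name -> (vendor, product) CPE mapping; a real table is larger.
CPE_MAP = {
    "WordPress": ("wordpress", "wordpress"),
    "Drupal": ("drupal", "drupal"),
}

def nvd_query_url(tech_name, version):
    """Build an NVD CVE API 2.0 query URL for a detected technology version."""
    vendor, product = CPE_MAP[tech_name]
    cpe = f"cpe:2.3:a:{vendor}:{product}:{version}:*:*:*:*:*:*:*"
    return "https://services.nvd.nist.gov/rest/json/cves/2.0?" + urlencode(
        {"cpeName": cpe}
    )
```

Fetching that URL (rate-limited; NVD asks for an API key at volume) returns the CVEs affecting the exact detected version.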

What's next

If this fits your workflow, a few related actors in the same pipeline:

  • shopify-store-directory — pulls Shopify stores by country/category with estimated monthly revenue bands, pairs well with fingerprinting for ecommerce sales intelligence.
  • company-data-aggregator — eight-source OSINT aggregator (WHOIS, DNS, CT logs, GitHub, npm, tech headers) for broader company profiling.
  • jobs-tech-stack-extractor — scans company job postings for tech stack mentions, complementing runtime detection with hiring-intent signals.

Conclusion

Wappalyzer going closed-source was annoying, but the 251 fingerprints in the current OSS forks are sufficient for 95% of real-world tech-stack detection workflows. What you need on top of the fingerprint ruleset is infrastructure: Chrome headless, proxy rotation, anti-bot bypass, concurrency management. That is what the wappalyzer-replacement actor bundles at $0.01 per detection. BuiltWith still wins on raw fingerprint count and historical data; use them if that's what you need. For everyone else, paying per detection against an OSS ruleset is the right economics — and it scales down to indie-dev budgets in a way the enterprise tools will not.
