You're doing competitive analysis on a competitor's site. Or you're qualifying sales leads and need to know which prospects run Shopify. Or you're on a security audit and need to confirm which CMS versions are deployed across a client's domain portfolio.
In all of these cases, you need to answer the same question: what technology is this website actually running?
This post walks through how to detect website tech stacks — from manual approaches using raw HTTP inspection, to automated detection using an API. By the end you'll have working Python code for both approaches.
## The Manual Approach: What Browsers Know That You Don't See
Before reaching for any tool, it helps to understand what signals technology leaves behind. There are four main layers to inspect.
### 1. HTTP Response Headers

Servers often leak stack information directly in response headers. A quick `curl -I` can reveal a surprising amount:
```python
import httpx

def check_headers(url: str) -> dict:
    resp = httpx.get(url, follow_redirects=True, timeout=10)
    interesting = {}
    header_signals = {
        "x-powered-by": "runtime/framework",
        "server": "web server",
        "x-generator": "CMS",
        "x-drupal-cache": "Drupal",
        "x-shopify-stage": "Shopify",
        "x-wix-request-id": "Wix",
    }
    for header, label in header_signals.items():
        if header in resp.headers:
            interesting[label] = resp.headers[header]
    return interesting

print(check_headers("https://example.com"))
```
Common findings: `X-Powered-By: PHP/8.1.2`, `Server: nginx`, `X-Generator: WordPress 6.4`.
### 2. HTML Source Patterns
The page HTML itself is full of fingerprints — meta tags, script paths, CSS class names, and comment blocks:
```python
import re
import httpx

def check_html(url: str) -> list[str]:
    resp = httpx.get(url, follow_redirects=True, timeout=10)
    html = resp.text
    detected = []
    patterns = {
        "WordPress": [
            r'/wp-content/',
            r'/wp-includes/',
            r'<meta name="generator" content="WordPress',
        ],
        "Shopify": [
            r'cdn\.shopify\.com',
            r'Shopify\.theme',
        ],
        "Next.js": [
            r'__NEXT_DATA__',
            r'/_next/static/',
        ],
        "Nuxt.js": [
            r'__NUXT__',
            r'/_nuxt/',
        ],
        "Wix": [
            r'static\.parastorage\.com',
            r'X-Wix-Meta-Site-Id',
        ],
    }
    for tech, fingerprints in patterns.items():
        if any(re.search(p, html) for p in fingerprints):
            detected.append(tech)
    return detected
```
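To exercise those fingerprints without a live request, the same matching loop works on any HTML string. A minimal offline sketch, using a trimmed copy of the pattern table:

```python
import re

# Trimmed copy of the fingerprint table from check_html above.
patterns = {
    "WordPress": [r'/wp-content/', r'/wp-includes/'],
    "Next.js": [r'__NEXT_DATA__', r'/_next/static/'],
}

def detect_in_html(html: str) -> list[str]:
    # A technology counts as detected if any one fingerprint matches.
    return [
        tech for tech, fingerprints in patterns.items()
        if any(re.search(p, html) for p in fingerprints)
    ]

sample = '<script src="/wp-content/themes/foo/app.js"></script>'
print(detect_in_html(sample))  # ['WordPress']
```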
### 3. DNS Records

DNS records can reveal infrastructure choices — CDN providers, email providers, and hosting platforms often leave clear trails. The `dns.resolver` module below comes from the `dnspython` package (`pip install dnspython`):
```python
import dns.resolver

def check_dns(domain: str) -> dict:
    results = {}
    try:
        cname = dns.resolver.resolve(domain, 'CNAME')
        results['cname'] = [str(r) for r in cname]
    except Exception:
        pass
    try:
        mx = dns.resolver.resolve(domain, 'MX')
        mx_hosts = [str(r.exchange) for r in mx]
        if any('google' in h for h in mx_hosts):
            results['email'] = 'Google Workspace'
        elif any('outlook' in h or 'microsoft' in h for h in mx_hosts):
            results['email'] = 'Microsoft 365'
    except Exception:
        pass
    try:
        txt = dns.resolver.resolve(domain, 'TXT')
        for record in txt:
            record_str = str(record)
            if 'v=spf1' in record_str:
                results['spf'] = record_str
    except Exception:
        pass
    return results
```
### 4. TLS Certificate Details
The TLS certificate's Subject Alternative Names (SANs) can reveal CDN providers and related domains:
```python
import ssl
import socket

def check_tls(domain: str) -> dict:
    ctx = ssl.create_default_context()
    with ctx.wrap_socket(socket.socket(), server_hostname=domain) as s:
        s.connect((domain, 443))
        cert = s.getpeercert()
    info = {
        "issuer": dict(x[0] for x in cert["issuer"]),
        "subject": dict(x[0] for x in cert["subject"]),
        "san": [v for _, v in cert.get("subjectAltName", [])],
    }
    return info
```
Cloudflare-issued certs, for example, are a dead giveaway that a site is behind Cloudflare's CDN.
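Building on `check_tls`, a small helper can turn issuer details into readable hints. The issuer strings matched here are assumptions based on common certificate authorities, not an exhaustive mapping:

```python
def issuer_hints(info: dict) -> list[str]:
    """Turn the dict returned by check_tls into human-readable hints."""
    org = info.get("issuer", {}).get("organizationName", "").lower()
    hints = []
    if "cloudflare" in org:
        hints.append("Behind Cloudflare (Cloudflare-issued cert)")
    if "let's encrypt" in org:
        hints.append("Let's Encrypt (automated issuance)")
    return hints

print(issuer_hints({"issuer": {"organizationName": "Cloudflare, Inc."}}))
# ['Behind Cloudflare (Cloudflare-issued cert)']
```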
## The Problem With DIY Detection
Writing these checks yourself is instructive, but there are serious maintenance problems with the DIY approach:
**Fingerprint databases go stale fast.** New frameworks ship constantly. WordPress updates its patterns. Next.js changes its build output. Your regex collection will bit-rot within months.

**Edge cases are everywhere.** Sites behind CDNs mask headers. Headless CMS setups don't emit typical CMS fingerprints. Some sites serve different content to bots vs. browsers.

**Coverage gaps.** The four layers above catch the obvious cases. But what about detecting that a site uses Segment vs. Heap for analytics? Or Intercom vs. Drift for live chat? Or that their checkout is actually Stripe.js hosted on their own domain? Each of those requires its own fingerprint logic.
You'd be rebuilding Wappalyzer — which, notably, was archived in 2023 after being acquired. The commercial alternative, BuiltWith, runs $295/month at minimum. Neither is practical for a side project or small team.
## Using an API Instead
The Technology Detection API on RapidAPI handles fingerprint maintenance, framework coverage, and edge cases for you. It checks headers, HTML content, DNS, scripts, cookies, and meta tags across hundreds of technologies in a single call.
Install the Python client:
```bash
pip install techdetect
```
Then detect the tech stack of any URL:
```python
from techdetect import TechDetectClient

client = TechDetectClient(api_key="your_rapidapi_key")
result = client.detect("https://shopify.com")

for tech in result.technologies:
    print(f"{tech.name} ({tech.category}): confidence {tech.confidence}%")
```
## Real Output on Real Sites
Here's what you actually get back for a few popular sites.
A WordPress site:
```json
{
  "url": "https://techcrunch.com",
  "technologies": [
    { "name": "WordPress", "category": "CMS", "confidence": 99 },
    { "name": "PHP", "category": "Programming Language", "confidence": 95 },
    { "name": "MySQL", "category": "Database", "confidence": 80 },
    { "name": "Cloudflare", "category": "CDN", "confidence": 97 },
    { "name": "Google Analytics 4", "category": "Analytics", "confidence": 91 },
    { "name": "Jetpack", "category": "WordPress Plugin", "confidence": 88 }
  ]
}
```
A Shopify store:
```json
{
  "url": "https://allbirds.com",
  "technologies": [
    { "name": "Shopify", "category": "Ecommerce", "confidence": 100 },
    { "name": "Shopify Plus", "category": "Ecommerce", "confidence": 85 },
    { "name": "Klaviyo", "category": "Email Marketing", "confidence": 92 },
    { "name": "Yotpo", "category": "Reviews", "confidence": 78 },
    { "name": "Cloudflare", "category": "CDN", "confidence": 97 }
  ]
}
```
A Next.js app:
```json
{
  "url": "https://vercel.com",
  "technologies": [
    { "name": "Next.js", "category": "JavaScript Framework", "confidence": 99 },
    { "name": "React", "category": "JavaScript Library", "confidence": 99 },
    { "name": "Vercel", "category": "PaaS", "confidence": 100 },
    { "name": "TypeScript", "category": "Programming Language", "confidence": 75 }
  ]
}
```
The confidence scores matter here. A score of 99 means the detection is based on a definitive fingerprint (like a meta tag that only WordPress emits). A score of 60-75 means the signal is suggestive but not conclusive.
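If you only want the sure bets, filtering on that score is trivial. A sketch using plain dicts in the same shape as the JSON responses above (the threshold of 90 is my choice, not an API recommendation):

```python
def high_confidence(technologies: list[dict], threshold: int = 90) -> list[str]:
    """Keep only detections at or above the confidence threshold."""
    return [t["name"] for t in technologies if t["confidence"] >= threshold]

detections = [
    {"name": "Next.js", "category": "JavaScript Framework", "confidence": 99},
    {"name": "TypeScript", "category": "Programming Language", "confidence": 75},
]
print(high_confidence(detections))  # ['Next.js']
```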
## Putting It Together: Bulk Detection
Here's a more practical pattern — detecting stacks across a list of URLs and filtering by technology:
```python
import csv
from techdetect import TechDetectClient

client = TechDetectClient(api_key="your_rapidapi_key")

urls = [
    "https://example1.com",
    "https://example2.com",
    "https://example3.com",
]

results = []
for url in urls:
    try:
        result = client.detect(url)
        tech_names = [t.name for t in result.technologies]
        results.append({
            "url": url,
            "cms": next((t.name for t in result.technologies if t.category == "CMS"), "Unknown"),
            "technologies": ", ".join(tech_names),
        })
    except Exception as e:
        results.append({"url": url, "cms": "Error", "technologies": str(e)})

# Write to CSV
with open("tech_results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["url", "cms", "technologies"])
    writer.writeheader()
    writer.writerows(results)

print("Done. Results written to tech_results.csv")
```
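Once the rows are collected, qualifying them by platform is a one-liner. A sketch against the same row shape used above:

```python
def rows_for_cms(rows: list[dict], cms: str) -> list[str]:
    """Return the URLs whose detected CMS matches `cms`."""
    return [row["url"] for row in rows if row["cms"] == cms]

sample_rows = [
    {"url": "https://a.com", "cms": "WordPress", "technologies": "WordPress, PHP"},
    {"url": "https://b.com", "cms": "Shopify", "technologies": "Shopify, Klaviyo"},
]
print(rows_for_cms(sample_rows, "Shopify"))  # ['https://b.com']
```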
## Pricing Reality Check
If you're doing this at any kind of scale, here's what the options actually cost:
| Tool | Price | Notes |
|---|---|---|
| BuiltWith | $295/month | Full database access; no API on lower tiers |
| Wappalyzer | ~$250/month | Archived in 2023; APIs shutting down |
| SimilarTech | $199/month | Bulk-only; no per-URL API |
| DIY (self-maintained) | Engineering time | Stale within months |
| Technology Detection API | $9/month | 2,000 lookups/month on Pro plan |
For a side project, internal tool, or early-stage startup, the math is straightforward.
## Source Code and Further Reading
The full Python client, including async support and rate limiting, is on GitHub: github.com/dapdevsoftware/techdetect-python
```bash
pip install techdetect
```
API docs and a free tier (no credit card required) are available at RapidAPI. The free plan covers enough requests to prototype and validate your use case before committing to anything.
If you're building something with this — a lead gen tool, a competitive intelligence dashboard, a CMS auditor — drop a comment below. Happy to talk through the architecture.