DEV Community: Ayi NEDJIMI

Building a CVSS Score Calculator and Risk Prioritizer in Python

Ayi NEDJIMI — Tue, 21 Jul 2026 10:04:25 +0000

Every security team drowns in CVEs. The NVD publishes hundreds of new vulnerabilities each week, and triaging them manually is unsustainable. CVSS (Common Vulnerability Scoring System) gives you a base score — but knowing a vulnerability is rated 9.8 does not tell you whether to patch it tonight or next month.

What CVSS v3.1 actually measures

CVSS v3.1 decomposes into three metric groups:

Base: inherent characteristics — attack vector, complexity, privileges required, user interaction, scope, and confidentiality/integrity/availability impact
Temporal: how exploitability evolves over time (exploit code maturity, remediation level, report confidence)
Environmental: adjustments for your specific context (modified base metrics, CIA requirements)
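
For comparison with the base formula implemented below, the temporal score in the v3.1 specification is the base score multiplied by three weights and rounded up to one decimal. The following is a minimal sketch under that reading of the specification; the dictionary and function names are illustrative and not part of the original article.

```python
import math

# Temporal metric weights from the CVSS v3.1 specification ("X" = Not Defined)
EXPLOIT_MATURITY = {"X": 1.0, "H": 1.0, "F": 0.97, "P": 0.94, "U": 0.91}
REMEDIATION_LEVEL = {"X": 1.0, "U": 1.0, "W": 0.97, "T": 0.96, "O": 0.95}
REPORT_CONFIDENCE = {"X": 1.0, "C": 1.0, "R": 0.96, "U": 0.92}


def roundup(value: float) -> float:
    """Round up to one decimal using the integer method from the v3.1 spec."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def temporal_score(base_score: float, e: str = "X", rl: str = "X", rc: str = "X") -> float:
    """Base score scaled by exploit maturity, remediation level, report confidence."""
    return roundup(
        base_score * EXPLOIT_MATURITY[e] * REMEDIATION_LEVEL[rl] * REPORT_CONFIDENCE[rc]
    )


# A 10.0 base score with functional exploit code, an official fix, confirmed report
print(temporal_score(10.0, e="F", rl="O", rc="C"))  # -> 9.3
```

Leaving all three metrics at "X" (Not Defined) returns the base score unchanged, which is why feeds that publish only base metrics lose nothing by skipping this step.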

The base score is not a simple weighted average. It uses a non-linear formula defined in the specification, with separate impact sub-score calculations depending on whether the vulnerability scope changes. Implementing it from scratch is the best way to understand where the number comes from — and where it stops being useful.

Implementing the CVSS v3.1 base score formula

from dataclasses import dataclass
from typing import Literal
import math

# Metric weights from CVSS v3.1 specification
AV_WEIGHTS = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2}
AC_WEIGHTS = {"L": 0.77, "H": 0.44}
PR_WEIGHTS = {
    "N": {"changed": 0.85, "unchanged": 0.85},
    "L": {"changed": 0.68, "unchanged": 0.62},
    "H": {"changed": 0.50, "unchanged": 0.27},
}
UI_WEIGHTS = {"N": 0.85, "R": 0.62}
CIA_WEIGHTS = {"N": 0.00, "L": 0.22, "H": 0.56}

@dataclass
class CVSSv3Base:
    attack_vector: Literal["N", "A", "L", "P"]
    attack_complexity: Literal["L", "H"]
    privileges_required: Literal["N", "L", "H"]
    user_interaction: Literal["N", "R"]
    scope: Literal["U", "C"]        # Unchanged or Changed
    confidentiality: Literal["N", "L", "H"]
    integrity: Literal["N", "L", "H"]
    availability: Literal["N", "L", "H"]

    def calculate(self) -> float:
        scope_key = "changed" if self.scope == "C" else "unchanged"

        iss = (
            1
            - (1 - CIA_WEIGHTS[self.confidentiality])
            * (1 - CIA_WEIGHTS[self.integrity])
            * (1 - CIA_WEIGHTS[self.availability])
        )

        if self.scope == "U":
            impact = 6.42 * iss
        else:
            impact = 7.52 * (iss - 0.029) - 3.25 * ((iss - 0.02) ** 15)

        if impact <= 0:
            return 0.0

        exploitability = (
            8.22
            * AV_WEIGHTS[self.attack_vector]
            * AC_WEIGHTS[self.attack_complexity]
            * PR_WEIGHTS[self.privileges_required][scope_key]
            * UI_WEIGHTS[self.user_interaction]
        )

        if self.scope == "U":
            raw = min(impact + exploitability, 10)
        else:
            raw = min(1.08 * (impact + exploitability), 10)

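        # Roundup as defined in Appendix A of the v3.1 specification. Integer
        # arithmetic avoids float artifacts (e.g. 4.000000000000002 -> 4.1)
        # that a plain math.ceil(raw * 10) / 10 can produce.
        scaled = round(raw * 100000)
        if scaled % 10000 == 0:
            return scaled / 100000.0
        return (math.floor(scaled / 10000) + 1) / 10.0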

Test against CVE-2021-44228 (Log4Shell), which carries base metrics AV:N/AC:L/PR:N/UI:N/S:C/C:H/I:H/A:H:

log4shell = CVSSv3Base(
    attack_vector="N",
    attack_complexity="L",
    privileges_required="N",
    user_interaction="N",
    scope="C",
    confidentiality="H",
    integrity="H",
    availability="H",
)
print(log4shell.calculate())  # -> 10.0

This matches NVD's published score. The scope == "C" branch swaps in a different impact curve and applies a 1.08 multiplier before capping at 10, because a scope change means the vulnerability can affect resources beyond the vulnerable component itself.
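
NVD and most other feeds publish base metrics as a vector string like the one above rather than as separate fields. A minimal sketch of a parser for the eight base metrics follows; parse_vector and REQUIRED_METRICS are hypothetical names, and temporal or environmental metrics in the string are simply ignored here.

```python
# Allowed values for the eight CVSS v3.x base metrics
REQUIRED_METRICS = {
    "AV": {"N", "A", "L", "P"},
    "AC": {"L", "H"},
    "PR": {"N", "L", "H"},
    "UI": {"N", "R"},
    "S": {"U", "C"},
    "C": {"N", "L", "H"},
    "I": {"N", "L", "H"},
    "A": {"N", "L", "H"},
}


def parse_vector(vector: str) -> dict:
    """Split a CVSS v3.x vector string into its base metrics, validating each value."""
    parts = vector.strip().split("/")
    if parts and parts[0].startswith("CVSS:"):
        parts = parts[1:]  # drop the "CVSS:3.1" version prefix

    metrics = {}
    for part in parts:
        name, sep, value = part.partition(":")
        if not sep:
            raise ValueError(f"malformed metric: {part!r}")
        if name not in REQUIRED_METRICS:
            continue  # temporal/environmental metrics are out of scope here
        if value not in REQUIRED_METRICS[name]:
            raise ValueError(f"invalid value {value!r} for metric {name}")
        metrics[name] = value

    missing = sorted(REQUIRED_METRICS.keys() - metrics.keys())
    if missing:
        raise ValueError(f"missing base metrics: {', '.join(missing)}")
    return metrics


m = parse_vector("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:C/C:H/I:H/A:H")
print(m["AV"], m["S"])  # -> N C
```

The resulting dictionary maps directly onto the dataclass fields, for example attack_vector=m["AV"] and scope=m["S"].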

From scores to priorities: the risk prioritizer

Raw CVSS scores treat all 10.0s identically. A 10.0 on a dev machine with no internet access is less urgent than an 8.5 on your public-facing authentication service. The prioritizer multiplies the base score by context factors you actually control.

This model considers four dimensions: CVSS score, internet exposure, whether exploit code is publicly available, and asset criticality (rated 1-5):

from dataclasses import dataclass
import math

@dataclass
class VulnerabilityRecord:
    cve_id: str
    cvss_score: float
    asset_name: str
    internet_exposed: bool
    exploit_public: bool
    asset_criticality: int  # 1 (low) to 5 (critical)

    def risk_score(self) -> float:
        exposure = 1.4 if self.internet_exposed else 1.0
        exploit = 1.5 if self.exploit_public else 1.0
        criticality = 0.5 + (self.asset_criticality * 0.2)
        # Divide by the largest possible multiplier (1.4 * 1.5 * 1.5 = 3.15) so
        # the result stays on the 0-10 scale that the priority thresholds assume.
        raw = self.cvss_score * exposure * exploit * criticality / 3.15
        return min(math.ceil(raw * 100) / 100, 10.0)

    def priority(self) -> str:
        score = self.risk_score()
        if score >= 8.0:
            return "P1 - Patch immediately"
        elif score >= 6.0:
            return "P2 - Patch within 72 hours"
        elif score >= 4.0:
            return "P3 - Patch within 2 weeks"
        return "P4 - Schedule normally"


def prioritize(vulns: list[VulnerabilityRecord]) -> list[VulnerabilityRecord]:
    return sorted(vulns, key=lambda v: v.risk_score(), reverse=True)


# Example
vulns = [
    VulnerabilityRecord("CVE-2021-44228", 10.0, "api-gateway", True, True, 5),
    VulnerabilityRecord("CVE-2023-12345", 7.8, "internal-wiki", False, False, 2),
    VulnerabilityRecord("CVE-2024-99999", 9.1, "auth-service", True, False, 4),
]

for v in prioritize(vulns):
    print(f"{v.cve_id} | risk={v.risk_score():.2f} | {v.priority()}")

Output:

CVE-2021-44228 | risk=10.00 | P1 - Patch immediately
CVE-2024-99999 | risk=5.26 | P3 - Patch within 2 weeks
CVE-2023-12345 | risk=2.23 | P4 - Schedule normally

The internal wiki vulnerability scores 7.8 on CVSS but falls to P4 because it has no internet exposure, no public exploit, and low asset criticality. That is the useful signal a raw CVSS score cannot give you.

Pulling live CVE data from the NVD API v2

Instead of entering scores manually, fetch them directly from NVD:

import httpx

NVD_API = "https://services.nvd.nist.gov/rest/json/cves/2.0"

def fetch_cvss_score(cve_id: str, api_key: str | None = None) -> float | None:
    headers = {}
    if api_key:
        headers["apiKey"] = api_key

    resp = httpx.get(
        NVD_API,
        params={"cveId": cve_id},
        headers=headers,
        timeout=10,
    )
    resp.raise_for_status()
    data = resp.json()

    vulns = data.get("vulnerabilities", [])
    if not vulns:
        return None

    metrics = vulns[0]["cve"].get("metrics", {})
    # Prefer v3.1, fall back to v3.0
    cvss_data = metrics.get("cvssMetricV31") or metrics.get("cvssMetricV30")
    if not cvss_data:
        return None

    return cvss_data[0]["cvssData"]["baseScore"]

NVD rate-limits unauthenticated requests to roughly 5 per 30 seconds. Register for a free API key at nvd.nist.gov to unlock 50 per 30 seconds — necessary for any batch job that covers more than a handful of packages.
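
Those limits translate directly into a minimum spacing between requests. Below is a minimal sketch of a client-side throttle, assuming the 5 and 50 requests per 30 seconds quoted above. min_interval and throttled are hypothetical helper names, and the injectable sleep and clock parameters exist only to make the helper testable.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")
WINDOW_SECONDS = 30.0


def min_interval(api_key: Optional[str]) -> float:
    """Seconds to leave between requests to stay under the quoted NVD limits."""
    requests_per_window = 50 if api_key else 5
    return WINDOW_SECONDS / requests_per_window


def throttled(
    items: Iterable[T],
    interval: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[T]:
    """Yield items no faster than one per `interval` seconds."""
    last: Optional[float] = None
    for item in items:
        if last is not None:
            wait = interval - (clock() - last)
            if wait > 0:
                sleep(wait)
        last = clock()
        yield item


print(min_interval(None), min_interval("some-key"))  # -> 6.0 0.6
```

A batch job could then loop with `for cve_id in throttled(cve_ids, min_interval(api_key))` and call fetch_cvss_score inside the loop. Even spacing is more conservative than the rolling window strictly requires, which is the safer side to err on.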

One practical gotcha: some CVEs only carry a v2 score (cvssMetricV2). If you need full coverage, fall back to v2 as a last resort and normalize the score, since v2 and v3 are not directly comparable.
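
One way to add that fallback without mixing scales silently is to return the CVSS version alongside the score. This is a sketch that assumes the same metrics layout used by fetch_cvss_score above; extract_score is a hypothetical name, and how a v2 score gets normalized is left to the caller because the right mapping depends on your policy.

```python
from typing import Optional, Tuple

# Preference order: newest scoring system first, v2 only as a last resort
METRIC_KEYS = (
    ("cvssMetricV31", "3.1"),
    ("cvssMetricV30", "3.0"),
    ("cvssMetricV2", "2.0"),
)


def extract_score(metrics: dict) -> Optional[Tuple[float, str]]:
    """Return (base_score, cvss_version) from an NVD metrics object, or None."""
    for key, version in METRIC_KEYS:
        entries = metrics.get(key)
        if entries:
            return entries[0]["cvssData"]["baseScore"], version
    return None


sample = {"cvssMetricV2": [{"cvssData": {"baseScore": 7.5}}]}
print(extract_score(sample))  # -> (7.5, '2.0')
```

Callers can then branch on the version, for instance by flagging v2-only records for manual review instead of feeding them straight into the prioritizer.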

Integrating into a vulnerability management workflow

The individual pieces above become valuable when automated. A practical nightly pipeline looks like this:

Pull CVEs published or modified in the last 24 hours from NVD
Filter to packages present in your SBOM (Software Bill of Materials)
Instantiate VulnerabilityRecord objects using your asset inventory
Sort by risk_score() and post P1/P2 results to Slack or Jira

Step 3 is where consistency matters most. Ad hoc criticality ratings are the most common reason a P4 becomes an incident: different teams classify the same asset differently, and the triage logic inherits the discrepancy. Formalizing asset classification early — ideally through a security hardening checklist covering your asset categories and criticality criteria — prevents that drift before it causes damage.

Two external data sources worth layering on top of CVSS:

EPSS (Exploit Prediction Scoring System): a daily probabilistic score for exploitation likelihood in the next 30 days, available free from first.org/epss. An EPSS score above 0.5 is a reliable signal to escalate regardless of CVSS severity.
CISA KEV (Known Exploited Vulnerabilities): a curated list of CVEs with confirmed in-the-wild exploitation. Any CVE on KEV should be treated as P1 unconditionally, regardless of your context-weighted score. The list is a free JSON feed at cisa.gov/known-exploited-vulnerabilities-catalog.

Combining CVSS, EPSS, and KEV gives you a three-layer triage. KEV entries are always P1. High-EPSS items get bumped one tier. Everything else falls through the context-weighted risk score. This covers the three main failure modes: vulnerabilities already exploited in the wild that your own scoring ranks low (KEV fixes this), low-CVSS vulnerabilities that are likely to be exploited soon (EPSS fixes this), and high-CVSS vulnerabilities that are irrelevant in context (the prioritizer fixes this).
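
The three layers can be sketched as a single function. This assumes the 0.5 EPSS threshold mentioned above and short tier labels such as "P3" (for example the first two characters of what priority() returns); triage_tier and its parameters are hypothetical names, not part of the original code.

```python
from typing import Optional

TIERS = ["P1", "P2", "P3", "P4"]
EPSS_ESCALATION_THRESHOLD = 0.5  # threshold suggested in the text; tune to taste


def triage_tier(context_tier: str, in_kev: bool, epss: Optional[float]) -> str:
    """Combine the context-weighted tier with KEV membership and an EPSS score."""
    # Layer 1: confirmed exploitation in the wild overrides everything else
    if in_kev:
        return "P1"

    index = TIERS.index(context_tier)

    # Layer 2: a high exploitation probability bumps the item up one tier
    if epss is not None and epss > EPSS_ESCALATION_THRESHOLD:
        index = max(index - 1, 0)

    # Layer 3: otherwise the context-weighted tier stands
    return TIERS[index]


print(triage_tier("P4", in_kev=True, epss=None))   # -> P1
print(triage_tier("P3", in_kev=False, epss=0.8))   # -> P2
print(triage_tier("P4", in_kev=False, epss=0.1))   # -> P4
```

Keeping the override order explicit makes the policy easy to audit: a reviewer can see at a glance that no context weighting can demote a KEV entry.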

The takeaway

CVSS scores are input, not output. A base score ignores your environment, your exposure, and whether anyone is actually exploiting the vulnerability. The context-weighted risk prioritizer on top — even with four simple multipliers — produces a fundamentally more actionable result.

The code here is a working core that still needs hardening (retries, error handling, tests) before production use. Connect it to the NVD API for live scores, your SBOM for package filtering, and your asset inventory for criticality ratings, and you have the core of an internal vulnerability management pipeline without paying for a dedicated commercial tool. Start small: one Slack webhook for P1s, one cron job, and a spreadsheet for the asset inventory. Refine the scoring multipliers based on your own incident history over time.

I run AYI NEDJIMI Consultants, a cybersecurity consulting firm. We publish free security hardening checklists — PDF and Excel.

How to Build an AI Agent That Monitors Security Advisories

Ayi NEDJIMI — Mon, 20 Jul 2026 10:04:58 +0000

Security teams drown in advisory noise. NVD alone publishes hundreds of CVEs per week, the GitHub Advisory Database adds more, and every vendor has their own feed. Without filtering, your team either ignores everything or spends hours manually triaging bulletins to find the three that actually matter for your stack.

This guide shows how to build a lightweight Python agent that fetches advisories, uses a language model to triage and summarize them, and dispatches alerts only for what is relevant to your specific tech stack.

What We're Building

The agent has three stages:

Fetch — pull advisories from NVD API v2
Triage — pass each advisory through an LLM that classifies it against your dependency list
Alert — send a formatted summary to Slack for advisories that score above your threshold

Dependencies: httpx for HTTP, any OpenAI-compatible SDK for LLM calls, and the standard library for everything else.

Fetching Advisories from NVD API v2

NIST moved to the v2 API and retired the v1 endpoint in 2023. An API key is optional, but it raises the rate limit from 5 to 50 requests per rolling 30-second window.

import os
import httpx
from datetime import datetime, timedelta, timezone
from typing import Iterator

NVD_API_URL = "https://services.nvd.nist.gov/rest/json/cves/2.0"
# Optional but strongly recommended; read from the environment, not hardcoded
NVD_API_KEY = os.environ.get("NVD_API_KEY")

def fetch_recent_cves(hours_back: int = 24) -> Iterator[dict]:
    now = datetime.now(timezone.utc)
    start = now - timedelta(hours=hours_back)

    params = {
        "pubStartDate": start.strftime("%Y-%m-%dT%H:%M:%S.000"),
        "pubEndDate":   now.strftime("%Y-%m-%dT%H:%M:%S.000"),
        "resultsPerPage": 100,
    }
    headers = {"apiKey": NVD_API_KEY} if NVD_API_KEY else {}

    start_index = 0
    total = None

    while total is None or start_index < total:
        params["startIndex"] = start_index
        r = httpx.get(NVD_API_URL, params=params, headers=headers, timeout=30)
        r.raise_for_status()
        data = r.json()

        total = data["totalResults"]
        vulnerabilities = data.get("vulnerabilities", [])

        for item in vulnerabilities:
            cve = item["cve"]
            yield {
                "id": cve["id"],
                "description": next(
                    (d["value"] for d in cve.get("descriptions", []) if d["lang"] == "en"),
                    "",
                ),
                "cvss_score": _extract_cvss(cve),
                "published": cve["published"],
            }

        start_index += len(vulnerabilities)
        if not vulnerabilities:
            break

def _extract_cvss(cve: dict) -> float | None:
    metrics = cve.get("metrics", {})
    for key in ("cvssMetricV31", "cvssMetricV30", "cvssMetricV2"):
        if key in metrics:
            return metrics[key][0]["cvssData"]["baseScore"]
    return None

The pagination loop walks the result set 100 entries at a time (the page size requested above; the API accepts larger pages). On a typical day you will get 200–400 CVEs; filtering happens in the next stage.

LLM-Powered Triage

The raw feed is too large to alert on. We want to keep only advisories that affect packages we actually use. Rather than maintaining a complex regex ruleset against CVE descriptions, we pass each advisory to a language model with our stack manifest.

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from environment

STACK_MANIFEST = (
    "Python packages: fastapi, sqlalchemy, pydantic, httpx, celery, redis-py\n"
    "Go modules: gin, gorm, pgx, fiber\n"
    "Infrastructure: nginx 1.24, postgresql 15, redis 7"
)

TRIAGE_PROMPT = (
    "You are a security triage assistant. Given a CVE and a tech stack, determine:\n"
    "1. Whether this CVE affects any component in the stack (yes/no)\n"
    "2. A one-sentence plain-English summary of the impact\n"
    "3. An urgency level: critical (patch today), high (patch this week), low (monitor)\n\n"
    "Respond in JSON only with keys: affects_stack (bool), summary (str), urgency (str).\n\n"
    "Tech stack:\n{stack}\n\nCVE {cve_id} (CVSS: {cvss}):\n{description}"
)

def triage_cve(cve: dict) -> dict | None:
    prompt = TRIAGE_PROMPT.format(
        stack=STACK_MANIFEST,
        cve_id=cve["id"],
        cvss=cve["cvss_score"] or "N/A",
        description=cve["description"],
    )

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )

    result = json.loads(response.choices[0].message.content)

    if not result.get("affects_stack"):
        return None

    return {**cve, **result}

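Using response_format={"type": "json_object"} constrains the reply to syntactically valid JSON, so there is no free-form text to parse. It does not guarantee that the expected keys are present or that urgency is one of the three allowed values, so validate the parsed object before acting on it.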
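
JSON mode constrains the syntax of the reply, not its keys or values, so a small validation step before trusting the result is worthwhile. This sketch assumes the three keys requested in TRIAGE_PROMPT; parse_triage and ALLOWED_URGENCY are hypothetical names.

```python
import json
from typing import Optional

ALLOWED_URGENCY = {"critical", "high", "low"}


def parse_triage(raw: Optional[str]) -> Optional[dict]:
    """Parse and validate the model's reply; return None if it is unusable."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None  # not JSON at all, or the reply content was None

    if not isinstance(data, dict):
        return None

    affects = data.get("affects_stack")
    summary = data.get("summary")
    urgency = data.get("urgency")

    # Reject wrong types rather than coercing: "yes" is not a bool
    if not isinstance(affects, bool) or not isinstance(summary, str):
        return None
    if not isinstance(urgency, str) or urgency.lower() not in ALLOWED_URGENCY:
        return None

    return {"affects_stack": affects, "summary": summary, "urgency": urgency.lower()}


print(parse_triage('{"affects_stack": true, "summary": "RCE in nginx", "urgency": "High"}'))
```

In triage_cve, this would replace the bare json.loads call. Treating an invalid reply as "needs manual review" is safer than treating it as "does not affect the stack".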

At roughly $0.00015 per 1K input tokens and an average CVE description of ~300 tokens, triaging 400 CVEs costs under $0.10/day.

Dispatching Alerts to Slack

Only critical and high-urgency advisories that affect your stack make it to the alert stage.

import httpx

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/your/webhook/url"

def send_slack_alert(cves: list[dict]) -> None:
    if not cves:
        return

    blocks: list[dict] = [
        {
            "type": "header",
            "text": {
                "type": "plain_text",
                "text": f"Security Advisory Alert: {len(cves)} finding(s)",
            },
        }
    ]

    for cve in cves:
        urgency_tag = "[CRITICAL]" if cve["urgency"] == "critical" else "[HIGH]"
        blocks.append({
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": (
                    f"*{urgency_tag} {cve['id']}* (CVSS {cve['cvss_score'] or '?'})\n"
                    f"{cve['summary']}\n"
                    f"<https://nvd.nist.gov/vuln/detail/{cve['id']}|View on NVD>"
                ),
            },
        })

    httpx.post(SLACK_WEBHOOK_URL, json={"blocks": blocks}, timeout=10).raise_for_status()

Putting It Together

def run_advisory_agent(hours_back: int = 24, min_cvss: float = 5.0) -> None:
    print(f"Fetching CVEs from the last {hours_back}h...")
    cves = list(fetch_recent_cves(hours_back))
    print(f"  Fetched {len(cves)} CVEs total")

    # Pre-filter by CVSS before spending LLM tokens
    candidates = [c for c in cves if (c["cvss_score"] or 0) >= min_cvss]
    print(f"  {len(candidates)} above CVSS {min_cvss} — triaging with LLM...")

    relevant = []
    for cve in candidates:
        result = triage_cve(cve)
        if result and result["urgency"] in ("critical", "high"):
            relevant.append(result)

    print(f"  {len(relevant)} relevant advisories — dispatching alert")
    send_slack_alert(relevant)

if __name__ == "__main__":
    run_advisory_agent()

Schedule this with cron or a cloud scheduler to run every 24 hours. A complete checklist for hardening this pipeline — including API key management, retry logic, and alert deduplication — is available at ayinedjimi-consultants.fr/checklists.

The Takeaway

The agent pattern here — fetch, pre-filter cheaply, triage with an LLM, alert — scales to any advisory source. You can swap NVD for the GitHub Advisory Database, OSV.dev, or vendor-specific feeds without changing the triage or alert layers.

A few practical notes:

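Pre-filter by CVSS before LLM calls. Anything below your threshold (5.0 in the code above) is discarded without touching the model, which cuts token costs. The code treats a missing score as 0, so CVEs that NVD has not scored yet are dropped as well; route those to triage separately if you cannot afford to miss them.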
Keep your stack manifest updated. An outdated manifest is worse than no manifest — it gives false confidence that an advisory does not affect you.
Add deduplication. Store processed CVE IDs in a SQLite table to avoid re-alerting on the same advisory across runs.
Tune the CVSS threshold. CVSS 5.0+ as a pre-filter and "critical"/"high" from the LLM is a reasonable starting point. Adjust based on your team's noise tolerance.
Rotate your NVD API key like any other credential. A key raises the limit to 50 requests per rolling 30 seconds, which is more than enough for a daily agent.
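
The deduplication note above can be sketched with the standard library's sqlite3 module. The table name, file name, and function names here are assumptions for illustration only.

```python
import sqlite3


def open_seen_store(path: str = "seen_cves.db") -> sqlite3.Connection:
    """Open (or create) the store of CVE IDs that have already been alerted on."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen_cves ("
        "cve_id TEXT PRIMARY KEY, "
        "first_seen TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def mark_if_new(conn: sqlite3.Connection, cve_id: str) -> bool:
    """Record the CVE ID; return True only the first time it is seen."""
    # INSERT OR IGNORE leaves existing rows alone, so rowcount tells us
    # whether this ID was actually new.
    cur = conn.execute("INSERT OR IGNORE INTO seen_cves (cve_id) VALUES (?)", (cve_id,))
    conn.commit()
    return cur.rowcount == 1


conn = open_seen_store(":memory:")  # in-memory database for the demo
print(mark_if_new(conn, "CVE-2021-44228"))  # -> True
print(mark_if_new(conn, "CVE-2021-44228"))  # -> False
```

In run_advisory_agent, the check would sit just before relevant.append(result), so the LLM call is still made but a repeat alert is not sent. Moving it before triage_cve saves tokens as well, at the cost of never re-triaging a CVE whose details change.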


Implementing Webhook Signature Verification (GitHub, Stripe, Slack)

Ayi NEDJIMI — Sat, 18 Jul 2026 10:01:43 +0000

When a third-party service fires a webhook at your endpoint, nothing in HTTP prevents someone else from sending a forged request to that same URL. Without signature verification, any attacker who discovers your endpoint can trigger deployments, replay payment events, or inject bot commands. GitHub, Stripe, and Slack all sign their webhook payloads with HMAC-SHA256 — but each platform has its own header format, timestamp handling, and edge cases. Getting the implementation wrong either breaks the integration silently or leaves a security gap that's easy to miss in code review.

The Core Mechanism: HMAC-SHA256 and Constant-Time Comparison

All three platforms follow the same underlying model: the sender signs the raw request body with a shared secret using HMAC-SHA256, places the hex digest in a request header, and your server recomputes the expected signature to compare.

The one rule that applies everywhere: use constant-time comparison. A regular == check is vulnerable to timing attacks — an attacker can infer the correct signature one byte at a time by measuring response latency. Python's hmac.compare_digest handles this correctly:

import hmac
import hashlib

def verify_signature(secret: bytes, body: bytes, received_sig: str, expected_prefix: str = "") -> bool:
    mac = hmac.new(secret, body, hashlib.sha256)
    expected = expected_prefix + mac.hexdigest()
    return hmac.compare_digest(expected, received_sig)

The same compute-then-compare pattern underlies all three integrations below, which inline it with each platform's header format.

GitHub Webhook Verification

GitHub puts the signature in X-Hub-Signature-256, prefixed with sha256=. The signature covers the raw request body exactly as received.

from fastapi import FastAPI, Request, HTTPException
import hmac, hashlib

app = FastAPI()
GITHUB_SECRET = b"your_webhook_secret_here"

@app.post("/webhooks/github")
async def github_webhook(request: Request):
    body = await request.body()

    sig_header = request.headers.get("X-Hub-Signature-256", "")
    if not sig_header.startswith("sha256="):
        raise HTTPException(status_code=400, detail="Missing X-Hub-Signature-256")

    expected = "sha256=" + hmac.new(GITHUB_SECRET, body, hashlib.sha256).hexdigest()

    if not hmac.compare_digest(sig_header, expected):
        raise HTTPException(status_code=401, detail="Signature mismatch")

    event_type = request.headers.get("X-GitHub-Event", "unknown")
    payload = await request.json()
    # handle payload...
    return {"event": event_type, "status": "accepted"}

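The single most common failure here is verifying against anything other than the raw bytes. Starlette's Request caches the body, so calling request.body() and then request.json(), as above, is safe. What breaks verification is computing the HMAC over re-serialized JSON: the bytes no longer match what GitHub signed, and every request is rejected. Always read and verify the raw bytes first.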

GitHub does not embed a timestamp in its signature, which means there is no built-in replay protection. For high-impact events (merges to main, deployment triggers) you should add your own: store a hash of each processed payload in Redis or a database with a TTL and reject duplicates. Because nothing in the signature expires, a captured request stays valid after the TTL lapses, so choose a TTL that covers the period in which a replay would still do damage.
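
A minimal in-process sketch of that duplicate check follows. ReplayCache is a hypothetical class; the 300-second default is only a placeholder to tune, and a multi-worker deployment would need a shared store such as Redis instead of a local dictionary.

```python
import hashlib
import time
from typing import Callable, Dict


class ReplayCache:
    """Remember payload hashes for a limited time and flag repeats."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._expiry_by_digest: Dict[str, float] = {}

    def seen_before(self, body: bytes) -> bool:
        """Return True if this exact body was already processed within the TTL."""
        now = self._clock()
        # Drop expired entries so the cache does not grow without bound
        self._expiry_by_digest = {
            digest: expiry
            for digest, expiry in self._expiry_by_digest.items()
            if expiry > now
        }
        digest = hashlib.sha256(body).hexdigest()
        if digest in self._expiry_by_digest:
            return True
        self._expiry_by_digest[digest] = now + self._ttl
        return False


cache = ReplayCache()
print(cache.seen_before(b'{"action": "deploy"}'))  # -> False
print(cache.seen_before(b'{"action": "deploy"}'))  # -> True
```

In the handler, the check belongs after the signature comparison succeeds, so that forged requests cannot fill the cache.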

Stripe Webhook Verification

Stripe's Stripe-Signature header includes both a timestamp (t=) and one or more signatures (v1=). The signature covers the string {timestamp}.{raw_body}. Stripe enforces replay protection by design: you are expected to reject requests where the timestamp is more than 300 seconds old.

import time
import hmac, hashlib
from fastapi import FastAPI, Request, HTTPException

app = FastAPI()
STRIPE_SECRET = b"whsec_your_stripe_endpoint_secret"
TOLERANCE_SECONDS = 300

@app.post("/webhooks/stripe")
async def stripe_webhook(request: Request):
    body = await request.body()
    sig_header = request.headers.get("Stripe-Signature", "")

    # Parse "t=...,v1=...,v1=..." format
    parts = {}
    for item in sig_header.split(","):
        k, _, v = item.partition("=")
        parts.setdefault(k.strip(), []).append(v.strip())

    timestamp = parts.get("t", [None])[0]
    signatures = parts.get("v1", [])

    if not timestamp or not signatures:
        raise HTTPException(status_code=400, detail="Malformed Stripe-Signature header")

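    # A non-numeric timestamp should be a 400, not an unhandled ValueError
    try:
        ts = int(timestamp)
    except ValueError:
        raise HTTPException(status_code=400, detail="Malformed Stripe-Signature header")

    if abs(time.time() - ts) > TOLERANCE_SECONDS:
        raise HTTPException(status_code=400, detail="Webhook timestamp outside tolerance")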

    signed_payload = f"{timestamp}.".encode() + body
    expected = hmac.new(STRIPE_SECRET, signed_payload, hashlib.sha256).hexdigest()

    if not any(hmac.compare_digest(s, expected) for s in signatures):
        raise HTTPException(status_code=401, detail="Invalid Stripe signature")

    return {"status": "ok"}

Note the loop over all v1= entries: during a webhook secret rotation, Stripe sends both the old and new signatures in the same header. Your endpoint needs to accept either, otherwise you'll drop events during the rotation window. This is a detail that bites teams the first time they rotate a secret in production.

Slack Webhook Verification

Slack uses X-Slack-Signature (with a v0= prefix) and X-Slack-Request-Timestamp as separate headers. The signed payload format is v0:{timestamp}:{raw_body} — note the colon separators, not a period like Stripe.

import time
import hmac, hashlib
from fastapi import FastAPI, Request, HTTPException

app = FastAPI()
SLACK_SIGNING_SECRET = b"your_slack_app_signing_secret"
TOLERANCE_SECONDS = 300

@app.post("/webhooks/slack")
async def slack_webhook(request: Request):
    body = await request.body()

    timestamp = request.headers.get("X-Slack-Request-Timestamp", "")
    slack_sig = request.headers.get("X-Slack-Signature", "")

    if not timestamp or not slack_sig:
        raise HTTPException(status_code=400, detail="Missing Slack signature headers")

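    # A non-numeric timestamp should be a 400, not an unhandled ValueError
    try:
        ts = int(timestamp)
    except ValueError:
        raise HTTPException(status_code=400, detail="Malformed timestamp header")

    if abs(time.time() - ts) > TOLERANCE_SECONDS:
        raise HTTPException(status_code=400, detail="Request too old")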

    basestring = f"v0:{timestamp}:".encode() + body
    expected = "v0=" + hmac.new(SLACK_SIGNING_SECRET, basestring, hashlib.sha256).hexdigest()

    if not hmac.compare_digest(slack_sig, expected):
        raise HTTPException(status_code=401, detail="Invalid Slack signature")

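    payload = await request.json()
    if payload.get("type") == "url_verification":
        # Slack's one-time endpoint check: echo the challenge back
        return {"challenge": payload.get("challenge")}

    # handle other events...
    return {"status": "ok"}

The url_verification branch handles Slack's URL verification challenge: when you first configure the endpoint in Slack's app settings, it sends a POST with a challenge field that you must echo back. This only runs after the signature check passes. The handler assumes JSON bodies, as sent by the Events API; slash commands and interactive components arrive form-encoded and need a different parser.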

Common Pitfalls to Avoid

Raw bytes, always. Web frameworks that automatically parse JSON or form data will give you a deserialized object. Re-serializing it for signing produces different bytes due to whitespace and key ordering. Read raw bytes before any parsing.

HTTPS is non-negotiable. An HMAC signature protects integrity, so a man-in-the-middle cannot forge or alter a signed request without the secret, but over plain HTTP they can read every payload and capture valid requests to replay. Enforce HTTPS at the load balancer or reverse proxy level and return a hard error on plain HTTP requests.

Do not log secrets in startup output. Webhook secrets are credentials. Storing them in environment variables is fine; printing WEBHOOK_SECRET=... at boot is not.

Test rejection, not just acceptance. CI pipelines commonly test the happy path — a valid signature succeeds. Add a negative test that sends a tampered body and confirms your endpoint returns 401. If that test doesn't exist, you won't notice when a future refactor breaks the verification path.
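
A negative test of that kind can be sketched without any web framework by exercising the verification logic directly. The sign and is_valid helpers below are hypothetical stand-ins that mirror the GitHub scheme above, and the secret is a throwaway test value.

```python
import hashlib
import hmac

SECRET = b"test-secret"  # throwaway value for tests only


def sign(secret: bytes, body: bytes) -> str:
    """Produce a GitHub-style signature header value for a body."""
    return "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()


def is_valid(secret: bytes, body: bytes, header: str) -> bool:
    """Constant-time check of a received header against the expected signature."""
    return hmac.compare_digest(sign(secret, body), header)


def test_valid_signature_is_accepted() -> None:
    body = b'{"action": "opened"}'
    assert is_valid(SECRET, body, sign(SECRET, body))


def test_tampered_body_is_rejected() -> None:
    header = sign(SECRET, b'{"action": "opened"}')
    assert not is_valid(SECRET, b'{"action": "closed"}', header)


def test_wrong_secret_is_rejected() -> None:
    body = b'{"action": "opened"}'
    assert not is_valid(SECRET, body, sign(b"other-secret", body))


def test_missing_prefix_is_rejected() -> None:
    body = b'{"action": "opened"}'
    bare_digest = sign(SECRET, body).removeprefix("sha256=")
    assert not is_valid(SECRET, body, bare_digest)
```

These run under pytest as written. An endpoint-level version would send the same tampered body through the framework's test client and assert on the 401 status.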

For a structured list of webhook and API security controls worth implementing, our security hardening checklists cover this and related patterns in both PDF and Excel formats.

The Takeaway

GitHub, Stripe, and Slack all use HMAC-SHA256, but the differences in header names, payload format, and timestamp handling are enough to cause silent verification failures if you copy-paste between integrations. The implementation checklist is short:

Read raw bytes before any deserialization
Use hmac.compare_digest — not ==
Enforce a timestamp tolerance window where the platform signs a timestamp (Stripe, Slack), and deduplicate payloads yourself where it does not (GitHub)
Handle multi-signature headers during secret rotation (Stripe)
Write a test that sends an invalid signature and confirms rejection

The patterns above work without any third-party webhook library. The dependencies are hmac and hashlib, both in the Python standard library. Less is more when it comes to security-critical code paths.


Building a Rate Limiter in Go: Token Bucket vs Sliding Window

Ayi NEDJIMI — Fri, 17 Jul 2026 10:04:58 +0000

Rate limiting is one of those features that looks simple until you're paged at 2 am because a client is hammering your API and taking down the service. Adding it after the fact is always more painful than doing it right the first time — so this post covers two practical algorithms, when to pick each, and how to implement both cleanly in Go.

Why Rate Limiting Matters

Without rate limiting, a single misbehaving client can exhaust your CPU, database connections, or third-party API quota. This is both a reliability problem and a security vector: credential stuffing, brute-force attacks, and scraping bots all rely on the absence of effective limits. Any production API exposed to the internet should have rate limiting in place before the first deploy.

A secondary concern is fairness: in a multi-tenant service, one noisy client should not degrade the experience for everyone else. Rate limiting enforces that contract.

Algorithm 1: Token Bucket

The token bucket algorithm is the most common choice. The idea is straightforward: a bucket holds up to capacity tokens. Tokens are refilled at a fixed rate. Each request consumes one token. If the bucket is empty, the request is rejected.

Why it is popular: it naturally handles burst traffic. A client that has been idle can consume several tokens at once up to the bucket capacity, then gets throttled to the refill rate. This matches how most real-world clients behave — SDKs retry with backoff, mobile apps batch operations, and occasional bursts are expected.

Here is a thread-safe token bucket in pure Go with no external dependencies:

package ratelimit

import (
    "sync"
    "time"
)

type TokenBucket struct {
    mu       sync.Mutex
    tokens   float64
    capacity float64
    rate     float64 // tokens per second
    lastTime time.Time
}

func NewTokenBucket(capacity, ratePerSec float64) *TokenBucket {
    return &TokenBucket{
        tokens:   capacity,
        capacity: capacity,
        rate:     ratePerSec,
        lastTime: time.Now(),
    }
}

func (tb *TokenBucket) Allow() bool {
    tb.mu.Lock()
    defer tb.mu.Unlock()

    now := time.Now()
    elapsed := now.Sub(tb.lastTime).Seconds()
    tb.lastTime = now

    tb.tokens = clamp(tb.capacity, tb.tokens+elapsed*tb.rate)

    if tb.tokens < 1 {
        return false
    }
    tb.tokens--
    return true
}

func clamp(max, val float64) float64 {
    if val > max {
        return max
    }
    return val
}

The catch: the token bucket is permissive by design. A client with a full bucket can fire a significant spike in a very short window. For APIs where consistent throughput ceilings matter more than burst tolerance — auth flows, payment endpoints, account recovery — that can be a problem.

Algorithm 2: Sliding Window Counter

The sliding window algorithm provides smoother, stricter enforcement. Instead of a bucket, it tracks request counts in a rolling time window. The common approximation uses two fixed buckets (current window and previous window) and interpolates based on how far into the current window you are — accurate enough for rate limiting without storing every individual timestamp.

package ratelimit

import (
    "sync"
    "time"
)

type SlidingWindow struct {
    mu          sync.Mutex
    limit       int
    windowSize  time.Duration
    currCount   int
    prevCount   int
    windowStart time.Time
}

func NewSlidingWindow(limit int, window time.Duration) *SlidingWindow {
    return &SlidingWindow{
        limit:       limit,
        windowSize:  window,
        windowStart: time.Now(),
    }
}

func (sw *SlidingWindow) Allow() bool {
    sw.mu.Lock()
    defer sw.mu.Unlock()

    now := time.Now()
    elapsed := now.Sub(sw.windowStart)

    if elapsed >= sw.windowSize {
        periods := int(elapsed / sw.windowSize)
        if periods > 1 {
            sw.prevCount = 0
        } else {
            sw.prevCount = sw.currCount
        }
        sw.currCount = 0
        sw.windowStart = sw.windowStart.Add(time.Duration(periods) * sw.windowSize)
        elapsed = now.Sub(sw.windowStart)
    }

    fraction := float64(elapsed) / float64(sw.windowSize)
    estimated := float64(sw.prevCount)*(1-fraction) + float64(sw.currCount)

    if int(estimated) >= sw.limit {
        return false
    }

    sw.currCount++
    return true
}

This gives you a much tighter ceiling. A client that exhausts their quota in the first half of a window will not get a free reset at the boundary — the interpolated calculation accounts for recent history.

Wiring It Into HTTP Middleware

Both implementations expose the same Allow() bool interface, so you can drop either one behind a shared middleware. Here is a per-IP middleware for the standard net/http package:

package main

import (
    "net/http"
    "sync"
    "time"
)

type clientLimiters struct {
    mu       sync.Mutex
    limiters map[string]*SlidingWindow
}

func newClientLimiters() *clientLimiters {
    return &clientLimiters{limiters: make(map[string]*SlidingWindow)}
}

func (cl *clientLimiters) get(ip string) *SlidingWindow {
    cl.mu.Lock()
    defer cl.mu.Unlock()
    if _, ok := cl.limiters[ip]; !ok {
        cl.limiters[ip] = NewSlidingWindow(100, time.Minute)
    }
    return cl.limiters[ip]
}

func RateLimitMiddleware(cl *clientLimiters, next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        ip := r.RemoteAddr
        if !cl.get(ip).Allow() {
            w.Header().Set("Retry-After", "60")
            http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
            return
        }
        next.ServeHTTP(w, r)
    })
}

func main() {
    cl := newClientLimiters()
    mux := http.NewServeMux()
    mux.HandleFunc("/api/data", func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("ok"))
    })
    http.ListenAndServe(":8080", RateLimitMiddleware(cl, mux))
}

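One production note: r.RemoteAddr is host:port, and the ephemeral port changes with every new connection, so keying the map on it as shown gives each connection its own limiter. Strip the port first with net.SplitHostPort. Behind a reverse proxy you will want X-Forwarded-For or X-Real-IP instead, but validate those headers carefully: a client that controls them can bypass per-IP limiting trivially. Trust them only on traffic from known proxy addresses.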
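The trust rule can be sketched as follows. This is a Python illustration of the logic, not a drop-in for the Go middleware; `TRUSTED_PROXIES`, `split_host`, and `client_ip` are hypothetical names, and the 10.0.0.0/8 range is only a placeholder for your own proxy addresses.

```python
import ipaddress
from typing import Optional

# Hypothetical allowlist: the networks your own reverse proxies live in.
TRUSTED_PROXIES = [ipaddress.ip_network("10.0.0.0/8")]


def split_host(remote_addr: str) -> str:
    """Strip the port from 'host:port' or '[v6host]:port' (a port is assumed present)."""
    host, sep, _port = remote_addr.rpartition(":")
    return host.strip("[]") if sep else remote_addr


def is_trusted(ip) -> bool:
    return any(ip in net for net in TRUSTED_PROXIES)


def client_ip(remote_addr: str, forwarded_for: Optional[str]) -> str:
    peer = ipaddress.ip_address(split_host(remote_addr))
    # Only look at X-Forwarded-For when the direct peer is one of our proxies.
    if forwarded_for and is_trusted(peer):
        # Walk right to left: the right-most untrusted hop is the client.
        for hop in reversed([h.strip() for h in forwarded_for.split(",")]):
            try:
                hop_ip = ipaddress.ip_address(hop)
            except ValueError:
                break  # malformed entry: stop trusting the header
            if not is_trusted(hop_ip):
                return str(hop_ip)
    return str(peer)


print(client_ip("203.0.113.7:51234", "1.2.3.4"))             # header ignored
print(client_ip("10.0.0.5:4000", "198.51.100.9, 10.0.0.8"))  # header honored
```

Reading the header from the right matters because everything to the left of the first trusted hop is client-supplied and can be forged.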

Choosing Between the Two

Token bucket is the right default for general-purpose API rate limiting. It is simple to reason about, tolerates legitimate burst traffic, and is easy to tune. Most third-party integrations expect some burst allowance.

Sliding window is better when you are protecting sensitive endpoints where consistent throughput ceilings matter: authentication routes, password resets, payment processing. It is also easier to communicate to clients: "100 requests per minute, always" rather than "100 per minute with burst capacity."

For multi-instance deployments, move state out of process memory into Redis. The logic stays identical — you are replacing in-process maps with Redis atomic operations (INCR + EXPIRE for fixed windows, or Lua scripts for sliding window accuracy).

Additional hardening to layer on once the basics are working: per-endpoint limits on top of per-client limits, logging of rejected requests with client ID and reason (rejected requests are often the earliest signal of a brute-force or scraping attempt), and alerting when rejection rates spike.

The security hardening checklists we publish include rate limiting baselines for common web stacks — a useful starting point when configuring limits for production.

The Takeaway

Token bucket and sliding window both work. Token bucket is simpler and more permissive; sliding window is stricter and harder to game. The implementations above run in-process with no external dependencies and are straightforward to test with a simple ticker loop.

The bigger trap is skipping rate limiting entirely and bolting it on after an incident. The credential stuffing or scraping event will eventually happen — having limits in place before that day costs a few hours now and saves a lot more later.

I run AYI NEDJIMI Consultants, a cybersecurity consulting firm. We publish free security hardening checklists — PDF and Excel.

How to Detect and Block SQL Injection Attempts Programmatically

Ayi NEDJIMI — Thu, 16 Jul 2026 10:06:42 +0000

SQL injection has been on the OWASP Top 10 for over two decades and still accounts for a significant share of confirmed breaches every year. ORMs and prepared statements solve the root cause, but real applications — legacy admin panels, dynamic report builders, third-party integrations — keep introducing risk faster than teams can audit them. Detecting and blocking injection attempts at the middleware layer gives you a defense-in-depth layer that catches misconfigurations before they reach the database.

Why Regex Alone Falls Short

Most tutorials show a list of patterns and stop there. The problem is that attackers do not send raw payloads. Common evasion techniques include:

URL encoding: ' OR 1=1 -- becomes %27%20OR%201%3D1%20--
Double encoding: %2527 decodes to %27 which decodes to '
Case variation: UnIoN SeLeCt
Comment injection: UN/**/ION SEL/**/ECT
Whitespace substitution: tabs, newlines, and form-feeds instead of spaces

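A detection layer that runs pattern matching on the raw input string will miss most of these: case-insensitive matching handles case variation, but encoded and comment-padded payloads pass straight through. The first step is normalization before matching.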

Building a SQL Injection Detector in Python

The following module normalizes input through multiple decoding passes before applying compiled regex patterns:

import re
import urllib.parse

SQLI_PATTERNS = [
    r"(\bor\b|\band\b)\s+[\w']+\s*=\s*[\w']+",
    r"union\s+(all\s+)?select",
    r";\s*(drop|delete|truncate|update|insert)\s+",
    r"--\s*($|\s)",
    r"/\*.*?\*/",
    r"\bexec\s*\(",
    r"\bwaitfor\s+delay\b",
    r"\bxp_cmdshell\b",
    r"'\s*;\s*--",
    r"\bload_file\s*\(",
    r"\binto\s+outfile\b",
    r"\bsleep\s*\(\s*\d+",
    r"\bbenchmark\s*\(",
]

COMPILED = [re.compile(p, re.IGNORECASE | re.DOTALL) for p in SQLI_PATTERNS]


def normalize(value: str) -> str:
    # Two decoding passes cover double-encoded payloads
    step1 = urllib.parse.unquote_plus(value)
    step2 = urllib.parse.unquote_plus(step1)
    # Collapse comment-based whitespace substitution
    step3 = re.sub(r"/\*[^*]*\*/", " ", step2)
    return step3.lower().strip()


def is_sqli(value: str) -> bool:
    normalized = normalize(value)
    return any(pattern.search(normalized) for pattern in COMPILED)


def scan_params(params: dict) -> list[str]:
    flagged = []
    for key, value in params.items():
        if isinstance(value, list):
            if any(is_sqli(str(v)) for v in value):
                flagged.append(key)
        elif is_sqli(str(value)):
            flagged.append(key)
    return flagged

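The two-pass URL decoding is deliberate: %2527 → %27 → '. Without it, double-encoded payloads pass through clean. The comment stripping replaces each inline comment with a space, so UNION/**/SELECT becomes union select before pattern matching runs. A keyword split by a comment, such as UN/**/ION, becomes un ion and will not match; SQL parsers generally treat a comment as a token separator, so that form does not execute as UNION on the database either.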
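A quick way to see what normalization buys you is to run one pattern against the raw and the normalized form of the same payload. The sketch below reuses the normalize logic from the module above with a single pattern; the payload strings are made-up examples.

```python
import re
import urllib.parse

UNION_SELECT = re.compile(r"union\s+(all\s+)?select", re.IGNORECASE)


def normalize(value: str) -> str:
    once = urllib.parse.unquote_plus(value)    # first decoding pass
    twice = urllib.parse.unquote_plus(once)    # second pass for double encoding
    return re.sub(r"/\*[^*]*\*/", " ", twice).lower().strip()


payloads = {
    "plain": "1 UNION SELECT password FROM users",
    "url_encoded": "1%20UNION%20SELECT%20password%20FROM%20users",
    "double_encoded": "1%2520UNION%2520SELECT%2520password%2520FROM%2520users",
    "comment_as_space": "1 UNION/**/SELECT password FROM users",
}

for name, payload in payloads.items():
    raw_hit = bool(UNION_SELECT.search(payload))
    normalized_hit = bool(UNION_SELECT.search(normalize(payload)))
    print(f"{name:17} raw={raw_hit!s:5} normalized={normalized_hit}")
```

Only the plain payload matches in raw form; all four match after normalization.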

FastAPI Middleware Integration

Wire the detector into a middleware that inspects query parameters and JSON bodies on every request:

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
import logging
import json
import datetime

app = FastAPI()
logger = logging.getLogger("sqli_guard")


def log_event(ip: str, path: str, flagged: list[str], raw: dict) -> None:
    event = {
        # utcnow() is deprecated since Python 3.12; use an aware UTC timestamp
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "event_type": "sqli_attempt",
        "source_ip": ip,
        "endpoint": path,
        "flagged_params": flagged,
        "raw_values": {k: raw.get(k) for k in flagged},
    }
    logger.warning(json.dumps(event))


@app.middleware("http")
async def sqli_guard(request: Request, call_next):
    all_params: dict = {}
    flagged: list[str] = []

    # Query string
    query_params = dict(request.query_params)
    all_params.update(query_params)
    flagged += scan_params(query_params)

    # JSON body
    if request.method in ("POST", "PUT", "PATCH"):
        ctype = request.headers.get("content-type", "")
        if "application/json" in ctype:
            try:
                body = await request.json()
                if isinstance(body, dict):
                    all_params.update(body)
                    flagged += scan_params(body)
            except Exception:
                pass

    if flagged:
        ip = request.client.host if request.client else "unknown"
        log_event(ip, str(request.url.path), flagged, all_params)
        # Return 400, not 403 — avoid confirming a WAF is present
        return JSONResponse(
            status_code=400,
            content={"error": "Invalid request parameters"},
        )

    return await call_next(request)

The 400 vs 403 choice matters. A 403 tells an attacker a detection layer blocked them and encourages payload variation. A generic 400 looks like a validation error and leaks less information.

AST-Based Detection for Dynamic Query Builders

When you control query construction — a report builder, an admin search — validate the generated SQL structurally before execution. The sqlglot library parses SQL into an AST without needing a live database connection:

import sqlglot
from sqlglot import exp


def is_safe_query(query: str) -> bool:
    """
    Returns False if the query contains structural anomalies
    that should never appear from legitimate application code.
    """
    try:
        # RAISE (not IGNORE) so malformed SQL actually reaches the except branch
        statements = sqlglot.parse(query, error_level=sqlglot.ErrorLevel.RAISE)
    except Exception:
        return False  # Parse failure is itself suspicious

    # Stacked statements are almost always injection
    if len(statements) > 1:
        return False

    if not statements:
        return False

    dangerous_node_types = (
        exp.Drop,
        exp.Truncate,
        exp.Command,
        exp.Union,
    )

    for node in statements[0].walk():
        if isinstance(node, dangerous_node_types):
            return False

    return True


# Usage before executing a dynamically constructed query
def run_report_query(user_filters: str, db_cursor):
    base = "SELECT id, name, created_at FROM reports WHERE "
    query = base + user_filters  # legacy code you cannot rewrite today

    if not is_safe_query(query):
        raise ValueError("Query structure rejected by safety check")

    db_cursor.execute(query)
    return db_cursor.fetchall()

sqlglot is dialect-aware (PostgreSQL, MySQL, SQLite, BigQuery, and more) and catches structural anomalies that bypass lexical checks. Second-order injection — where a payload is stored and later interpolated into a query — is only catchable at query execution time, which is exactly why this layer matters even when inputs were validated at ingestion.

Testing Your Detection Layer

Use sqlmap in a staging environment against your own API to verify coverage:

# Basic scan against a query parameter
sqlmap -u "http://localhost:8000/search?q=test" --level=3 --risk=2 --batch

# Test a JSON body endpoint
sqlmap -u "http://localhost:8000/users" \
  --data='{"email":"test@example.com"}' \
  -H "Content-Type: application/json" \
  --level=3 --batch

Check that your middleware logs every blocked attempt and that no sqlmap payload gets a 200 response. If any slip through, add the specific pattern to SQLI_PATTERNS with a comment noting which evasion technique it targets.
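Payloads that slipped through are worth keeping as a regression list next to the patterns. A minimal harness might look like this; `find_gaps` and `toy_detector` are hypothetical names, and the toy detector is only there so the sketch runs on its own (swap in the real is_sqli).

```python
from typing import Callable, Iterable


def find_gaps(
    detector: Callable[[str], bool],
    malicious: Iterable[str],
    benign: Iterable[str],
) -> dict[str, list[str]]:
    """Return payloads the detector missed and benign inputs it flagged."""
    return {
        "missed": [p for p in malicious if not detector(p)],
        "false_positives": [b for b in benign if detector(b)],
    }


# Toy stand-in detector, deliberately incomplete.
def toy_detector(value: str) -> bool:
    lowered = value.lower()
    return "union select" in lowered or "' or " in lowered


report = find_gaps(
    toy_detector,
    malicious=["1 UNION SELECT 1", "x' OR '1'='1", "1; DROP TABLE users"],
    benign=["O'Reilly", "union station"],
)
print(report)
```

Tracking false positives alongside misses keeps a new pattern from quietly blocking legitimate input such as names with apostrophes.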

For teams building out a structured security review process, a security hardening checklist covering injection prevention, authentication controls, and HTTP security headers is a practical complement to programmatic detection.

The Takeaway

This detection layer is not a replacement for parameterized queries — it is a second line of defense. The correct order of operations:

Use prepared statements or an ORM for all database interaction (fix the root cause)
Add middleware-level detection to catch misconfigurations in code you have not audited yet
Log structured events and feed them into your incident response pipeline
Run sqlmap against staging on every major release to verify coverage has not regressed

The middleware shown here adds little latency (two URL-decoding passes and a dozen or so compiled regex searches per parameter) and gives your security team early visibility into who is probing your endpoints before they find something that works. A sudden spike of SQLi attempts against a single endpoint frequently precedes a targeted manual exploit attempt on that route.

API Key Rotation Without Downtime: Patterns and Implementation

Ayi NEDJIMI — Wed, 15 Jul 2026 10:06:07 +0000

Rotating API keys is routine maintenance — until it causes a production incident. The window between revoking an old key and deploying a new one is where things break: services return 401s, alerts fire, and someone spends 20 minutes tracing an outage back to a key swap they thought was safe. This article covers two patterns that make rotation genuinely zero-downtime, with working code for both.

Why Naive Rotation Breaks Things

The standard advice: generate a new key, update your environment variable, redeploy. Simple. But this creates a race condition that's easy to miss in testing and painful in production:

T=0: Old key revoked
T=+N (seconds to minutes): New key propagated to all running instances
During T=0 to T=+N: Every API call returns 401

Even "zero-downtime" deployment strategies don't help here if the key is injected into the container environment at startup. Rolling updates replace pods gradually — the old pods carry the old key, the new pods get the new key, but if the old key was revoked before the rollout completes, the old pods start failing.

The fix isn't to deploy faster — it's to accept both keys simultaneously during a transition window.

Pattern 1: Dual-Key Rotation

The principle: your validation logic always accepts two keys — the current and the previous. When you rotate, the old current becomes the new previous (with an expiry timestamp), and a fresh key becomes current. During the overlap window, both keys work.

import time
import secrets
import hashlib
from dataclasses import dataclass
from typing import Optional

@dataclass
class StoredKey:
    key_hash: str
    created_at: float
    expires_at: Optional[float] = None

class DualKeyStore:
    def __init__(self):
        self.current: Optional[StoredKey] = None
        self.previous: Optional[StoredKey] = None

    @staticmethod
    def _hash(raw: str) -> str:
        return hashlib.sha256(raw.encode()).hexdigest()

    def rotate(self, overlap_seconds: int = 300) -> str:
        # Returns the new raw key. Store it securely — shown only once.
        raw = secrets.token_urlsafe(32)
        new_key = StoredKey(key_hash=self._hash(raw), created_at=time.time())

        if self.current:
            self.current.expires_at = time.time() + overlap_seconds
            self.previous = self.current

        self.current = new_key
        return raw

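    def is_valid(self, raw: str) -> bool:
        h = self._hash(raw)
        now = time.time()

        # Check both slots the same way: constant-time hash comparison,
        # then expiry. Honoring expires_at on current as well lets an
        # emergency revocation backdate either key.
        for stored in (self.current, self.previous):
            if stored is None:
                continue
            if not secrets.compare_digest(stored.key_hash, h):
                continue
            if stored.expires_at is None or stored.expires_at > now:
                return True

        return False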

A few things worth noting: raw keys are never stored — only SHA-256 hashes. The raw key is returned exactly once, at generation. The overlap_seconds parameter controls how long old clients have to migrate before their key stops working; 300 seconds is a reasonable default for most deployment pipelines.
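The overlap behavior is easy to exercise without sleeping if the clock is injectable. The following is a condensed variant of the store for illustration only; `OverlapStore` and its `clock` parameter are assumptions of this sketch, not part of the class above.

```python
import hashlib
import secrets
from typing import Callable, Optional


class OverlapStore:
    """Condensed dual-key store with an injectable clock for testing."""

    def __init__(self, clock: Callable[[], float]):
        self._clock = clock
        self._current: Optional[str] = None                 # hash of the active key
        self._previous: Optional[tuple[str, float]] = None  # (hash, expires_at)

    @staticmethod
    def _hash(raw: str) -> str:
        return hashlib.sha256(raw.encode()).hexdigest()

    def rotate(self, overlap_seconds: float = 300) -> str:
        raw = secrets.token_urlsafe(32)
        if self._current is not None:
            self._previous = (self._current, self._clock() + overlap_seconds)
        self._current = self._hash(raw)
        return raw

    def is_valid(self, raw: str) -> bool:
        h = self._hash(raw)
        if self._current is not None and secrets.compare_digest(self._current, h):
            return True
        if self._previous is not None:
            prev_hash, expires_at = self._previous
            return secrets.compare_digest(prev_hash, h) and expires_at > self._clock()
        return False


now = [1_000.0]                       # fake clock, advanced by hand
store = OverlapStore(clock=lambda: now[0])

old_key = store.rotate()
new_key = store.rotate(overlap_seconds=300)

print(store.is_valid(old_key), store.is_valid(new_key))  # True True (inside overlap)
now[0] += 301
print(store.is_valid(old_key), store.is_valid(new_key))  # False True (overlap elapsed)
```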

Pattern 2: Versioned Key Tokens

A cleaner approach, loosely modeled on the structured prefixes Stripe puts in its API keys (sk_live_, sk_test_): embed a version identifier in the key itself. The validator extracts the version, loads the corresponding signing secret, and verifies. Rotation means adding a new version and setting an expiry on the old one.

package apikey

import (
    "crypto/hmac"
    "crypto/sha256"
    "encoding/base64"
    "fmt"
    "strings"
    "time"
)

type KeyVersion struct {
    Secret    []byte
    ExpiresAt time.Time
}

type VersionedStore struct {
    versions map[int]*KeyVersion
    latest   int
}

func NewVersionedStore() *VersionedStore {
    return &VersionedStore{versions: make(map[int]*KeyVersion)}
}

func (s *VersionedStore) AddVersion(v int, secret []byte, expiresAt time.Time) {
    s.versions[v] = &KeyVersion{Secret: secret, ExpiresAt: expiresAt}
    if v > s.latest {
        s.latest = v
    }
}

// Issue returns a signed token embedding the version and client ID.
// clientID must not contain "_", because Validate splits the token on it.
func (s *VersionedStore) Issue(clientID string) string {
    kv := s.versions[s.latest]
    mac := hmac.New(sha256.New, kv.Secret)
    mac.Write([]byte(clientID))
    sig := base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
    return fmt.Sprintf("v%d_%s_%s", s.latest, clientID, sig)
}

// Validate returns (clientID, true) on success.
func (s *VersionedStore) Validate(token string) (string, bool) {
    parts := strings.SplitN(token, "_", 3)
    if len(parts) != 3 {
        return "", false
    }
    var ver int
    if _, err := fmt.Sscanf(parts[0], "v%d", &ver); err != nil {
        return "", false
    }
    clientID, sig := parts[1], parts[2]

    kv, ok := s.versions[ver]
    if !ok || (!kv.ExpiresAt.IsZero() && time.Now().After(kv.ExpiresAt)) {
        return "", false
    }

    mac := hmac.New(sha256.New, kv.Secret)
    mac.Write([]byte(clientID))
    expected := base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
    if !hmac.Equal([]byte(sig), []byte(expected)) {
        return "", false
    }
    return clientID, true
}

To rotate: call AddVersion with a new version number and a fresh secret. Set ExpiresAt on the old version to time.Now().Add(10 * time.Minute). Both versions validate during the overlap window; expired versions automatically return ("", false) afterward.

The version prefix in the token (v2_, v3_) is not a security feature — don't rely on it alone. The HMAC signature is what provides authenticity.
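For readers following along in Python, here is a rough port of the same token scheme with a rotation walk-through. It is a sketch under assumptions: the module-level `VERSIONS` dict, the placeholder secrets, and the explicit `now` parameter are choices made for this example, not the Go API above.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional

# version -> (secret, expires_at or None). Secrets here are placeholders.
VERSIONS: dict[int, tuple[bytes, Optional[float]]] = {}


def _sign(secret: bytes, client_id: str) -> str:
    digest = hmac.new(secret, client_id.encode(), hashlib.sha256).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode()


def issue(client_id: str) -> str:
    if "_" in client_id:
        raise ValueError("client_id must not contain '_'")
    version = max(VERSIONS)                      # always sign with the latest
    secret, _ = VERSIONS[version]
    return f"v{version}_{client_id}_{_sign(secret, client_id)}"


def validate(token: str, now: Optional[float] = None) -> Optional[str]:
    """Return the client ID if the token is authentic and unexpired, else None."""
    now = time.time() if now is None else now
    parts = token.split("_", 2)
    if len(parts) != 3:
        return None
    prefix, client_id, sig = parts
    number = prefix[1:]
    if not prefix.startswith("v") or not (number.isascii() and number.isdigit()):
        return None
    entry = VERSIONS.get(int(number))
    if entry is None:
        return None
    secret, expires_at = entry
    if expires_at is not None and now > expires_at:
        return None
    expected = _sign(secret, client_id)
    return client_id if hmac.compare_digest(sig.encode(), expected.encode()) else None


# Rotation walk-through with a fixed reference time.
t0 = 1_000.0
VERSIONS[1] = (b"old-secret-placeholder", None)
old_token = issue("svc-billing")

VERSIONS[2] = (b"new-secret-placeholder", None)       # add the new version
VERSIONS[1] = (VERSIONS[1][0], t0 + 600)              # old one gets ten more minutes

print(validate(old_token, now=t0 + 1))     # svc-billing (inside overlap)
print(validate(old_token, now=t0 + 601))   # None (v1 expired)
print(validate(issue("svc-billing"), now=t0 + 601))   # svc-billing (v2)
```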

Propagating the New Key

The validation logic is the easy part. The harder problem is making sure every service instance receives the new key before the old one expires.

Three strategies in increasing order of robustness:

Rolling restart with secret manager: Store the current key in your secret manager (Vault, AWS Secrets Manager, Doppler). When rotating, update the secret, then trigger a rolling restart. Each new pod reads the current key at startup. This works if your overlap window exceeds your deployment time — which isn't always true for large clusters.

Hot reload: The application polls the secret manager every 30–60 seconds and updates its in-memory key store without restarting. This decouples rotation from deployment entirely. The downside is added complexity: you need to handle the poll gracefully — don't crash if the secret manager is temporarily unavailable.

Internal key endpoint: Consumers call an internal /internal/keys/active endpoint to retrieve the current key. Your key service handles rotation server-side; clients pick up changes on the next poll or at startup. This is how most high-availability systems are wired — it makes rotation a pure backend operation with no client coordination required.
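The hot-reload strategy hinges on one rule: a failed poll must never wipe the keys already in memory. A minimal sketch of that rule follows; `KeyCache` and the `fetch` callable are hypothetical stand-ins for whatever client your secret manager provides.

```python
import time
from typing import Callable, Optional


class KeyCache:
    """Keeps the last successfully fetched key set; never drops it on error."""

    def __init__(self, fetch: Callable[[], set[str]], interval: float = 30.0):
        self._fetch = fetch          # hypothetical call into your secret manager
        self._interval = interval
        self._keys: set[str] = set()
        self._next_poll = 0.0
        self.last_error: Optional[Exception] = None

    def keys(self, now: Optional[float] = None) -> set[str]:
        now = time.monotonic() if now is None else now
        if now >= self._next_poll:
            self._next_poll = now + self._interval
            try:
                self._keys = set(self._fetch())
                self.last_error = None
            except Exception as exc:   # secret manager unreachable, bad payload, ...
                self.last_error = exc  # keep serving the previous keys
        return self._keys


# Simulated secret manager: succeeds, fails once, then returns the rotated set.
responses = iter([{"k1"}, RuntimeError("secret manager down"), {"k1", "k2"}])


def fake_fetch() -> set[str]:
    item = next(responses)
    if isinstance(item, Exception):
        raise item
    return item


cache = KeyCache(fake_fetch, interval=30)
print(cache.keys(now=0))    # first poll succeeds
print(cache.keys(now=30))   # poll fails, previous keys still served
print(cache.keys(now=60))   # poll succeeds, rotated set picked up
```

In a real service the poll would usually run on a background thread or timer, and `last_error` would feed a metric or alert rather than sit on the object.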

For teams looking to integrate these patterns into a broader hardening strategy, the security hardening checklists on our site cover key lifecycle management alongside secrets scanning, CI/CD controls, and runtime monitoring.

Edge Cases to Handle

Clock skew: Expiry timestamps compared across multiple nodes can fail unexpectedly if clocks differ by more than a few seconds. Add a 30-second skew buffer to your expiry window, or synchronize with NTP.

Concurrent rotation: If two operators trigger a rotation simultaneously, you may lose track of which key is "previous." Wrap the rotate operation in a distributed lock — a Redis SET NX with a TTL or a database advisory lock is sufficient.

Emergency revocation: You need a break-glass path to immediately invalidate a key, bypassing the overlap window. In the Python example, set the affected slot to None (store.current, store.previous, or both); a cleared slot can no longer match any key. In the Go version, set ExpiresAt = time.Now().Add(-time.Second) on the affected version. Test this path before you need it.

Audit logs: Log every key validation failure with timestamp, source IP, and which key version was attempted. A sudden spike in failures from an unexpected IP is often the first signal that a compromised key is being tested elsewhere.

The Takeaway

The rule is simple: never let the system reach a state where no valid key exists. Both patterns above enforce this — dual-key rotation through an explicit overlap window, versioned keys through version-scoped expiry.

Your overlap window length is the only real tuning parameter. Make it longer than your worst-case deployment time, but short enough that a leaked key doesn't stay valid indefinitely. Five to ten minutes covers most pipelines without being reckless.

Pick the pattern that matches your infrastructure: dual-key is simpler to implement and reason about; versioned tokens are cleaner at scale and give you an audit trail per key version baked into the token format itself.

Building a Secrets Scanner for CI/CD Pipelines

Ayi NEDJIMI — Tue, 14 Jul 2026 10:04:31 +0000

Hardcoded secrets are the gift that keeps on giving — to attackers. API keys, tokens, database passwords, and private keys end up in repositories more often than you'd think, and once committed, they live in git history forever unless you rewrite it. Building a scanner that runs in your CI/CD pipeline catches them before they propagate.

Why Default Solutions Fall Short

Tools like git-secrets or trufflehog are useful but come with tradeoffs: they're either too slow for pre-commit hooks, or they miss context-specific patterns your codebase uses. More importantly, most teams install them and forget about them — no visibility into what's being flagged, no metrics, no alerting.

What you actually want is a lightweight scanner you control, with:

Custom patterns for your stack (internal tokens, environment-specific formats)
Entropy analysis for base64/hex blobs that don't match known patterns
A clean CI integration that fails fast and reports clearly
JSON output you can pipe into a SIEM or Slack notification

Let's build it.

The Scanner Core in Python

The foundation is a recursive file walker with regex matching against a curated pattern list.

import re
import os
import json
import math
import sys
from pathlib import Path
from dataclasses import dataclass
from typing import Optional

PATTERNS = {
    "aws_access_key":      r"AKIA[0-9A-Z]{16}",
    "aws_secret_key":      r"(?i)aws.{0,20}secret.{0,20}['\"][0-9a-zA-Z/+]{40}['\"]",
    # ghp_ is the classic personal access token; github_pat_ is the fine-grained one
    "github_classic_pat":  r"ghp_[0-9a-zA-Z]{36}",
    "github_fine_grained_pat": r"github_pat_[0-9a-zA-Z_]{82}",
    "stripe_secret":       r"sk_(live|test)_[0-9a-zA-Z]{24,}",
    "stripe_publishable":  r"pk_(live|test)_[0-9a-zA-Z]{24,}",
    "sendgrid_api_key":    r"SG\.[0-9a-zA-Z\-_]{22}\.[0-9a-zA-Z\-_]{43}",
    "jwt_token":           r"eyJ[a-zA-Z0-9_-]{10,}\.[a-zA-Z0-9_-]{10,}\.[a-zA-Z0-9_-]{10,}",
    "private_key_header":  r"-----BEGIN (RSA |EC |OPENSSH )?PRIVATE KEY-----",
    "generic_password":    r"(?i)(password|passwd|pwd)\s*[:=]\s*['\"][^'\"]{8,}['\"]",
    "generic_api_key":     r"(?i)(api[_-]?key|apikey)\s*[:=]\s*['\"][^'\"]{16,}['\"]",
    "db_connection_str":   r"(?i)(mongodb|postgres|mysql|redis)://[^@\s]+:[^@\s]+@",
}

SKIP_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico",
                   ".pdf", ".zip", ".tar", ".gz", ".bin", ".exe", ".lock"}

SKIP_DIRS = {".git", "node_modules", "__pycache__", ".venv", "vendor", "dist", "build"}


@dataclass
class Finding:
    file: str
    line: int
    pattern_name: str
    matched_text: str
    entropy: Optional[float] = None


def shannon_entropy(data: str) -> float:
    if not data:
        return 0.0
    freq: dict[str, int] = {}
    for c in data:
        freq[c] = freq.get(c, 0) + 1
    return -sum(
        (count / len(data)) * math.log2(count / len(data))
        for count in freq.values()
    )


def scan_file(path: Path, compiled: dict) -> list[Finding]:
    findings = []
    try:
        text = path.read_text(errors="replace")
    except (PermissionError, IsADirectoryError):
        return findings

    for lineno, line in enumerate(text.splitlines(), 1):
        for name, pattern in compiled.items():
            match = pattern.search(line)
            if match:
                matched = match.group(0)
                entropy = shannon_entropy(matched) if len(matched) > 12 else None
                findings.append(Finding(
                    file=str(path),
                    line=lineno,
                    pattern_name=name,
                    matched_text=matched[:80],
                    entropy=round(entropy, 2) if entropy else None,
                ))
    return findings


def scan_directory(root: str) -> list[Finding]:
    compiled = {name: re.compile(pat) for name, pat in PATTERNS.items()}
    all_findings: list[Finding] = []

    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for filename in filenames:
            path = Path(dirpath) / filename
            if path.suffix.lower() in SKIP_EXTENSIONS:
                continue
            all_findings.extend(scan_file(path, compiled))

    return all_findings


if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    findings = scan_directory(root)
    output = [vars(f) for f in findings]
    print(json.dumps(output, indent=2))
    sys.exit(1 if findings else 0)

The exit code matters: sys.exit(1) on findings makes your CI pipeline fail automatically without any additional configuration.

Entropy-Based Detection

Regex patterns catch known formats. High-entropy strings catch everything else — random tokens, secrets generated by internal systems, base64-encoded keys that don't match a known provider's format.

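Shannon entropy measures randomness. Computed per string as above, common words and phrases tend to score around 3 to 4 bits per character or lower, while a properly generated API key or secret usually lands above 4.5. The estimate is capped by both alphabet and length: a hex string can never exceed 4.0, and a string of n characters can never exceed log2(n), so a 20-character candidate tops out near 4.32 and cannot reach a 4.5 threshold.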
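To get a feel for where a threshold sits, it helps to compute a few values by hand. The sketch below restates the entropy function so it runs on its own and adds `max_entropy`, a hypothetical helper giving the ceiling for a string of a given length and alphabet.

```python
import math
from collections import Counter


def shannon_entropy(data: str) -> float:
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def max_entropy(length: int, alphabet_size: int) -> float:
    """Upper bound on the per-string estimate: limited by length and alphabet."""
    return math.log2(min(length, alphabet_size))


print(shannon_entropy("password"))           # 2.75
print(shannon_entropy("0123456789abcdef"))   # 4.0
print(round(max_entropy(20, 64), 2))         # 4.32
print(round(max_entropy(40, 64), 2))         # 5.32
```

Two practical consequences: hex-encoded secrets will never trip a 4.5 threshold, and base64-style candidates need at least 23 characters before they can.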

Add a second pass that flags high-entropy strings in common assignment contexts:

import re
import math

HIGH_ENTROPY_PATTERN = re.compile(
    r"(?ix)"
    r"(?:secret|token|key|password|credential|auth)[^\n]{0,10}"
    r"['\"\s=:]+"
    r"([a-zA-Z0-9+/=_\-]{20,})"
)

ENTROPY_THRESHOLD = 4.5


def scan_for_high_entropy(text: str, filename: str) -> list[dict]:
    results = []
    for lineno, line in enumerate(text.splitlines(), 1):
        for match in HIGH_ENTROPY_PATTERN.finditer(line):
            candidate = match.group(1)
            entropy = shannon_entropy(candidate)
            if entropy >= ENTROPY_THRESHOLD:
                results.append({
                    "file": filename,
                    "line": lineno,
                    "pattern_name": "high_entropy_string",
                    "matched_text": candidate[:60],
                    "entropy": round(entropy, 2),
                })
    return results

Run this after the pattern pass. A combined signal — pattern match and high entropy — gives you very high confidence it's a real secret rather than a false positive like a CSS hash or a long test fixture ID.

Wiring It Into CI/CD

GitHub Actions

Create .github/workflows/secrets-scan.yml:

name: Secrets Scan

on:
  push:
    branches: ["**"]
  pull_request:

jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
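      - name: Run secrets scanner
        # Explicit bash turns on pipefail, so the scanner's exit code
        # is not masked by tee. stderr stays out of the JSON file.
        shell: bash
        run: |
          python scripts/scan_secrets.py . | tee scan-results.json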
      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: secrets-scan-results
          path: scan-results.json

fetch-depth: 0 fetches full history, which you need if you want to diff only changed files across branches. For a full-repo scan on every push, the default shallow clone works fine.
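If you do go the diff route, the scanner needs the added lines and their line numbers from a unified diff (for example the output of git diff between the base branch and HEAD). Here is one way to extract them; `added_lines` is a hypothetical helper and the sample diff is made up.

```python
import re

HUNK_HEADER = re.compile(r"^@@ -\d+(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def added_lines(diff_text: str) -> list[tuple[str, int, str]]:
    """Extract (file, line_number, text) for every added line in a unified diff."""
    results: list[tuple[str, int, str]] = []
    current_file = ""
    old_left = new_left = new_lineno = 0

    for line in diff_text.splitlines():
        if old_left > 0 or new_left > 0:          # inside a hunk body
            if line.startswith("+"):
                results.append((current_file, new_lineno, line[1:]))
                new_lineno += 1
                new_left -= 1
            elif line.startswith("-"):
                old_left -= 1
            elif not line.startswith("\\"):       # context line
                new_lineno += 1
                old_left -= 1
                new_left -= 1
            continue

        if line.startswith("+++ "):
            current_file = line[4:].removeprefix("b/")
            continue

        header = HUNK_HEADER.match(line)
        if header:
            # A missing count in the hunk header means 1.
            old_left = int(header.group(1) or 1)
            new_lineno = int(header.group(2))
            new_left = int(header.group(3) or 1)
    return results


SAMPLE_DIFF = "\n".join([
    "diff --git a/app/config.py b/app/config.py",
    "--- a/app/config.py",
    "+++ b/app/config.py",
    "@@ -1,3 +1,4 @@",
    " import os",
    "-DEBUG = True",
    "+DEBUG = False",
    '+API_KEY = "abc"',
    " PORT = 8080",
])

for row in added_lines(SAMPLE_DIFF):
    print(row)
```

Counting hunk lines from the header, instead of guessing from prefixes, keeps added content that itself starts with "+++" or "---" from being mistaken for a file header.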

For GitLab CI, the equivalent is a job in .gitlab-ci.yml that runs the same script with allow_failure: false and uploads the JSON as an artifact.

Reducing False Positives Without Losing Coverage

The biggest operational problem with secrets scanners is alert fatigue. If your scanner fires on every test fixture or example config, engineers start ignoring it — which is worse than having no scanner at all.

Allowlist by file path, not pattern. Adding a pattern to an ignore list silences it everywhere. Instead, allow specific directories: tests/fixtures/, docs/examples/, .env.example. Your scanner stays strict on production code while being quiet on intentional examples.

Require entropy threshold for generic patterns. The generic_password and generic_api_key patterns above will generate noise if used alone. Only surface them when the matched value also has entropy above 3.8.

Be strict about format. For AWS keys, verify that AKIA is followed by exactly 16 uppercase alphanumeric characters. Tightening the regex almost eliminates false positives for known service formats.

Diff-only mode for PRs, full-scan on main. On pull requests, scan only the changed lines. On pushes to main, scan the full tree. This keeps PR checks fast without reducing coverage where it matters most.
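The first two rules can be combined into a single post-filter over the scanner's findings. This is a sketch: `should_report`, `ALLOWED_PATHS`, and the quoted-value extraction are assumptions of the example, and findings are taken to be dicts shaped like the scanner's JSON output.

```python
import math
import re
from fnmatch import fnmatchcase

ALLOWED_PATHS = ["tests/fixtures/*", "docs/examples/*", ".env.example"]
GENERIC_PATTERNS = {"generic_password", "generic_api_key"}
GENERIC_ENTROPY_MIN = 3.8
QUOTED_VALUE = re.compile(r"['\"]([^'\"]+)['\"]")


def shannon_entropy(data: str) -> float:
    if not data:
        return 0.0
    total = len(data)
    counts = {c: data.count(c) for c in set(data)}
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def should_report(finding: dict) -> bool:
    # Allowlist by path: intentional examples stay quiet.
    path = finding["file"].replace("\\", "/").removeprefix("./")
    if any(fnmatchcase(path, pattern) for pattern in ALLOWED_PATHS):
        return False
    # Generic patterns only count when the quoted value looks random.
    if finding["pattern_name"] in GENERIC_PATTERNS:
        match = QUOTED_VALUE.search(finding["matched_text"])
        value = match.group(1) if match else finding["matched_text"]
        return shannon_entropy(value) >= GENERIC_ENTROPY_MIN
    return True


findings = [
    {"file": "./tests/fixtures/creds.py", "pattern_name": "aws_access_key",
     "matched_text": "AKIAABCDEFGHIJKLMNOP"},
    {"file": "app/settings.py", "pattern_name": "generic_password",
     "matched_text": 'password = "changeme1"'},
    {"file": "app/settings.py", "pattern_name": "generic_password",
     "matched_text": 'password = "q7Zr!x2Lk9@Vb4Wm"'},
]
print([should_report(f) for f in findings])  # [False, False, True]
```

Measuring entropy on the quoted value instead of the whole match avoids the keyword and punctuation skewing the score.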

For teams managing multiple services, pairing your scanner with a standardized security hardening checklist makes it easier to enforce consistent configuration across repos — same patterns, same thresholds, same allowlist policy applied everywhere.

The Takeaway

A secrets scanner in CI is table stakes for any production system. The implementation above is roughly 100 lines of Python with zero dependencies beyond the standard library — there is no excuse not to run it.

The key design decisions: explicit pattern names (so findings are actionable, not just "found something"), entropy scoring (so you catch what regex misses), and a non-zero exit code on findings (so CI actually fails instead of just logging). Start with the pattern list here, extend it for your internal token formats, and wire it into your pipeline before the next deploy.

Implementing JWT Authentication Securely in Go Fiber

Ayi NEDJIMI — Mon, 13 Jul 2026 10:06:47 +0000

JWT authentication is one of those things that looks simple until you see the CVEs. Algorithm confusion, weak secrets, missing expiry — most JWT vulnerabilities aren't exotic. They come from implementation shortcuts that seemed harmless at the time. This article covers how to do JWT auth correctly in Go Fiber, including token generation, validation, and the refresh flow.

Why JWT Auth Fails in Practice

The spec is fine. The failures happen in implementation:

Algorithm confusion attacks: if you don't explicitly lock the expected signing algorithm, an attacker can switch from RS256 to HS256 and sign tokens with your public key
Weak secrets: HS256 with a short or predictable secret is brute-forceable; RFC 7518 requires an HMAC key at least as long as the hash output, which is 256 bits (32 bytes) for HS256
Missing expiry: tokens that live forever turn a leaked credential into a permanent backdoor
No revocation mechanism: a long-lived token you can't invalidate is a liability

Fixing these requires maybe 20 extra lines of code. Let's go through them.

Project Setup

We need three dependencies:

go get github.com/gofiber/fiber/v2
go get github.com/golang-jwt/jwt/v5
go get github.com/redis/go-redis/v9

Use golang-jwt/jwt — it's the actively maintained fork of dgrijalva/jwt-go, which was archived in 2021. Don't use the original.

Project structure:

.
├── main.go
├── handlers/
│   ├── auth.go
│   └── refresh.go
└── middleware/
    └── jwt.go

Generating Tokens

Two hard rules: use a secret of at least 32 characters, and always set an expiry. Both should be enforced at the code level, not left to documentation.

// handlers/auth.go
package handlers

import (
    "fmt"
    "os"
    "time"

    "github.com/golang-jwt/jwt/v5"
)

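type Claims struct {
    // UserID uses its own JSON key: tagging it "sub" would collide with
    // RegisteredClaims.Subject, which already serializes as "sub".
    UserID string `json:"uid"`
    Role   string `json:"role"`
    jwt.RegisteredClaims
}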

func GenerateAccessToken(userID, role string) (string, error) {
    secret := os.Getenv("JWT_SECRET")
    if len(secret) < 32 {
        return "", fmt.Errorf("JWT_SECRET must be at least 32 characters, got %d", len(secret))
    }

    claims := Claims{
        UserID: userID,
        Role:   role,
        RegisteredClaims: jwt.RegisteredClaims{
            ExpiresAt: jwt.NewNumericDate(time.Now().Add(15 * time.Minute)),
            IssuedAt:  jwt.NewNumericDate(time.Now()),
            Issuer:    "myapp",
            Subject:   userID,
        },
    }

    token := jwt.NewWithClaims(jwt.SigningMethodHS256, claims)
    return token.SignedString([]byte(secret))
}

15-minute access tokens are short on purpose. Pair them with refresh tokens (covered next) rather than extending the TTL to avoid dealing with expiry.

Validating Tokens: Where Most Bugs Live

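The algorithm confusion attack works like this: your server signs tokens with RS256 using a private key and verifies them with the matching public key. An attacker obtains that public key (often exposed at /.well-known/jwks.json), crafts a token with alg: HS256, and signs it using the public key bytes as the HMAC secret. If your verifier accepts whichever algorithm the token claims and hands the same key material to the HMAC check, the forged token verifies as valid. Pinning the expected algorithm closes this off, and it costs nothing to do even in an HS256-only service like this one.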
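The mechanics are easier to see with a deliberately naive verifier. The sketch below is Python using only the standard library rather than Go, and everything in it (the `verify` helper, the placeholder PEM bytes, the `allowed_algs` parameter) is illustrative and not part of golang-jwt. It forges an HS256 token keyed with public-key bytes and shows that pinning the expected algorithm rejects it.

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_hs256(payload: dict, key: bytes) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = b64url(json.dumps(header).encode()) + "." + b64url(json.dumps(payload).encode())
    sig = hmac.new(key, signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + b64url(sig)


def verify(token: str, key: bytes, allowed_algs=None) -> bool:
    """Hypothetical verifier; allowed_algs=None means 'trust the header'."""
    header_b64, payload_b64, sig_b64 = token.split(".")
    alg = json.loads(b64url_decode(header_b64)).get("alg")
    if allowed_algs is not None and alg not in allowed_algs:
        return False  # algorithm pinned: reject before touching the key
    if alg == "HS256":
        expected = hmac.new(key, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256).digest()
        return hmac.compare_digest(expected, b64url_decode(sig_b64))
    return False  # RS256 verification is omitted from this sketch


# Placeholder bytes standing in for a published RSA public key
PUBLIC_KEY_PEM = b"-----BEGIN PUBLIC KEY-----\n(placeholder)\n-----END PUBLIC KEY-----\n"

forged = sign_hs256({"sub": "admin"}, PUBLIC_KEY_PEM)
naive_accepts = verify(forged, PUBLIC_KEY_PEM)              # trusts the header
pinned_accepts = verify(forged, PUBLIC_KEY_PEM, {"RS256"})  # server expects RS256 only
print(naive_accepts, pinned_accepts)  # prints: True False
```

The naive call accepts the forgery because it lets the token choose the algorithm; the pinned call never reaches the HMAC check.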

The fix is a single type assertion in the key function:

// middleware/jwt.go
package middleware

import (
    "fmt"
    "os"

    "github.com/gofiber/fiber/v2"
    "github.com/golang-jwt/jwt/v5"
    "yourapp/handlers"
)

func JWTProtected() fiber.Handler {
    return func(c *fiber.Ctx) error {
        authHeader := c.Get("Authorization")
        if len(authHeader) < 8 || authHeader[:7] != "Bearer " {
            return c.Status(fiber.StatusUnauthorized).JSON(fiber.Map{
                "error": "missing or malformed authorization header",
            })
        }

        tokenStr := authHeader[7:]
        secret := []byte(os.Getenv("JWT_SECRET"))

        token, err := jwt.ParseWithClaims(
            tokenStr,
            &handlers.Claims{},
            func(t *jwt.Token) (interface{}, error) {
                // This check prevents algorithm confusion attacks
                if _, ok := t.Method.(*jwt.SigningMethodHMAC); !ok {
                    return nil, fmt.Errorf("unexpected signing method: %v", t.Header["alg"])
                }
                return secret, nil
            },
        )

        if err != nil || !token.Valid {
            return c.Status(fiber.StatusUnauthorized).JSON(fiber.Map{
                "error": "invalid or expired token",
            })
        }

        claims, ok := token.Claims.(*handlers.Claims)
        if !ok {
            return c.Status(fiber.StatusUnauthorized).JSON(fiber.Map{"error": "invalid claims"})
        }

        c.Locals("userID", claims.UserID)
        c.Locals("role", claims.Role)
        return c.Next()
    }
}

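The t.Method.(*jwt.SigningMethodHMAC) assertion is the guard. If the token arrives claiming alg: RS256 (or none), the assertion fails, the key function returns an error, and the token is rejected before any further processing. Note that the assertion accepts every HMAC variant (HS256, HS384, HS512); to pin HS256 exactly, also pass jwt.WithValidMethods([]string{"HS256"}) as an option to ParseWithClaims.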

Wiring It Into Fiber

// main.go
package main

import (
    "log"

    "github.com/gofiber/fiber/v2"
    "github.com/joho/godotenv"
    "yourapp/handlers"
    "yourapp/middleware"
)

func main() {
    if err := godotenv.Load(); err != nil {
        log.Fatal("could not load .env file")
    }

    app := fiber.New()

    // Public routes
    app.Post("/auth/login", handlers.Login)
    app.Post("/auth/refresh", handlers.RefreshToken)

    // Protected routes — middleware applied at group level
    api := app.Group("/api", middleware.JWTProtected())
    api.Get("/profile", handlers.GetProfile)
    api.Post("/settings", handlers.UpdateSettings)

    log.Fatal(app.Listen(":3000"))
}

Applying the middleware at the group level means new routes added under /api are protected by default, so there is no risk of accidentally exposing a route by forgetting to attach the middleware to it.

Refresh Tokens: Making Short Expiry Practical

Short access tokens only work if refresh is seamless. Store refresh tokens in Redis with a 7-day TTL. On each use, rotate them — issue a new refresh token and delete the old one. This limits the exposure window for a stolen token.

// handlers/refresh.go
package handlers

import (
    "context"
    "crypto/rand"
    "encoding/hex"
    "time"

    "github.com/gofiber/fiber/v2"
    "github.com/redis/go-redis/v9"
)

var RDB *redis.Client

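func generateSecureToken() string {
    b := make([]byte, 32)
    // A failing CSPRNG must never yield a predictable token, so treat it as fatal
    if _, err := rand.Read(b); err != nil {
        panic("crypto/rand unavailable: " + err.Error())
    }
    return hex.EncodeToString(b)
}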

func RefreshToken(c *fiber.Ctx) error {
    body := struct {
        RefreshToken string `json:"refresh_token"`
    }{}
    if err := c.BodyParser(&body); err != nil || body.RefreshToken == "" {
        return c.Status(fiber.StatusBadRequest).JSON(fiber.Map{"error": "invalid request body"})
    }

    ctx := context.Background()
    key := "refresh:" + body.RefreshToken

    // GETDEL (Redis 6.2+) reads and deletes the token in a single atomic command,
    // so two concurrent requests cannot both redeem the same refresh token
    userID, err := RDB.GetDel(ctx, key).Result()
    if err == redis.Nil {
        return c.Status(fiber.StatusUnauthorized).JSON(fiber.Map{"error": "refresh token invalid or expired"})
    }
    if err != nil {
        return c.Status(fiber.StatusInternalServerError).JSON(fiber.Map{"error": "internal error"})
    }

    // Rotate: the old token is already gone, so only the replacement needs storing
    newRefreshToken := generateSecureToken()
    if err := RDB.Set(ctx, "refresh:"+newRefreshToken, userID, 7*24*time.Hour).Err(); err != nil {
        return c.Status(fiber.StatusInternalServerError).JSON(fiber.Map{"error": "token rotation failed"})
    }

    // The role is hardcoded here; a real service would look it up for userID
    accessToken, err := GenerateAccessToken(userID, "user")
    if err != nil {
        return c.Status(fiber.StatusInternalServerError).JSON(fiber.Map{"error": "token generation failed"})
    }

    return c.JSON(fiber.Map{
        "access_token":  accessToken,
        "refresh_token": newRefreshToken,
    })
}

Redeeming the token with GETDEL makes the read and the delete one atomic step, which is what stops two concurrent requests from both succeeding with the same token. A plain go-redis Pipeline would not give that guarantee: it only batches commands into one round trip, and a separate GET issued before it leaves a window in which both requests still see the token. If storing the replacement fails, the old token is already consumed and the user has to log in again, which is the safe direction to fail in.

Note that crypto/rand is used for generating the refresh token — never math/rand. The difference is cryptographic unpredictability.
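The single-use property of rotation can be sketched without Redis. The Python snippet below uses an in-memory dict as a stand-in for the store (expiry is left out), and the `RefreshStore` class and its method names are hypothetical. `dict.pop` plays the role of an atomic get-and-delete, and `secrets.token_hex` is the Python counterpart of reading from crypto/rand.

```python
import secrets
from typing import Optional


class RefreshStore:
    """In-memory stand-in for the Redis refresh-token store (no TTL handling)."""

    def __init__(self) -> None:
        self._tokens: dict[str, str] = {}

    def issue(self, user_id: str) -> str:
        token = secrets.token_hex(32)  # 32 random bytes from the OS CSPRNG
        self._tokens[token] = user_id
        return token

    def rotate(self, old_token: str) -> Optional[str]:
        # pop() reads and removes in one step, so a token can be redeemed once
        user_id = self._tokens.pop(old_token, None)
        if user_id is None:
            return None  # unknown, expired, or already used
        return self.issue(user_id)


store = RefreshStore()
first = store.issue("user-42")
second = store.rotate(first)    # succeeds and returns a new token
replayed = store.rotate(first)  # the old token is gone, so this fails
print(second is not None, replayed is None)  # prints: True True
```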

The Takeaway

JWT is a solved problem when you implement it without shortcuts:

Lock the signing algorithm in the key function — the type assertion is non-negotiable and stops algorithm confusion cold
Keep access tokens short (15 minutes) and pair them with rotating, server-stored refresh tokens
Enforce secret length in code, ideally once at startup rather than only when the first token is generated, so a weak secret fails fast with a clear error instead of being silently accepted
Use golang-jwt/jwt/v5, not the archived original package

These aren't edge cases. Algorithm confusion has a CVE (CVE-2015-9235), weak secrets get cracked in seconds with offline attacks, and tokens without expiry are a staple of incident post-mortems.

For teams doing a broader API security review, the free security hardening checklists cover JWT configuration alongside session management, CORS, and input validation — available as PDF and Excel.

I run AYI NEDJIMI Consultants, a cybersecurity consulting firm. We publish free security hardening checklists — PDF and Excel.

How to Build a Real-Time Threat Intelligence Feed Aggregator in Python

Ayi NEDJIMI — Sun, 12 Jul 2026 10:02:12 +0000

Security teams drown in alerts. One underappreciated root cause: nobody is systematically collecting and correlating threat intelligence before incidents happen. A threat intelligence feed aggregator pulls IOCs (indicators of compromise) from multiple sources — OSINT feeds, commercial APIs, government advisories — into a single normalized store you can query in real time.

Here's how to build a production-ready one in Python.

What We're Building

The goal is a daemon that:

Polls multiple threat intel sources on a schedule (AbuseIPDB, AlienVault OTX, CISA KEV)
Normalizes IOCs into a common schema
Stores them in SQLite (easy to swap for Postgres later)
Exposes a lookup function to check if a given IP, domain, or hash is known-bad

We're skipping STIX/TAXII for simplicity, but the data model is compatible if you need to go that route later.

Defining the Data Model

Every source has different field names, different formats, different confidence scores. The first thing you need is a normalized IOC schema:

from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum

def _utcnow() -> datetime:
    # Timezone-aware "now"; datetime.utcnow() is deprecated and returns naive values
    return datetime.now(timezone.utc)

class IOCType(str, Enum):
    IP = "ip"
    DOMAIN = "domain"
    URL = "url"
    HASH_MD5 = "md5"
    HASH_SHA256 = "sha256"
    CVE = "cve"

@dataclass
class IOC:
    value: str
    ioc_type: IOCType
    source: str
    confidence: int  # 0-100
    tags: list[str] = field(default_factory=list)
    first_seen: datetime = field(default_factory=_utcnow)
    last_seen: datetime = field(default_factory=_utcnow)
    raw: dict = field(default_factory=dict)  # preserve original payload

Keep the raw payload. You'll want it when a field matters later that you didn't normalize today.

Writing the Feed Adapters

Each source gets its own adapter class that returns a list of IOC objects. Here's a minimal one for AbuseIPDB:

import httpx
from datetime import datetime

class AbuseIPDBAdapter:
    BASE_URL = "https://api.abuseipdb.com/api/v2"

    def __init__(self, api_key: str):
        self.api_key = api_key

    def fetch(self, limit: int = 1000) -> list[IOC]:
        headers = {"Key": self.api_key, "Accept": "application/json"}
        params = {"confidenceMinimum": 75, "limit": limit}
        resp = httpx.get(
            f"{self.BASE_URL}/blacklist",
            headers=headers,
            params=params,
            timeout=30,
        )
        resp.raise_for_status()
        data = resp.json()

        iocs = []
        for entry in data.get("data", []):
            iocs.append(IOC(
                value=entry["ipAddress"],
                ioc_type=IOCType.IP,
                source="abuseipdb",
                confidence=entry.get("abuseConfidenceScore", 75),
                # Drop empty strings so a missing usageType yields [] instead of [""]
                tags=[t for t in entry.get("usageType", "").split(",") if t],
                last_seen=datetime.fromisoformat(
                    entry["lastReportedAt"].replace("Z", "+00:00")
                ),
                raw=entry,
            ))
        return iocs

The pattern is always the same: fetch() returns list[IOC]. This makes the aggregation loop trivial to extend with new sources.
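As an illustration of how little a second source needs, here is a sketch of the parsing half of a CISA KEV adapter. It assumes the catalog's JSON layout (a top-level `vulnerabilities` list whose entries carry `cveID`, `vendorProject`, `product`, and `dateAdded`), redefines a trimmed-down `IOC` so the snippet runs on its own, and uses a made-up sample payload. The confidence value of 100 is likewise an assumption, on the grounds that KEV lists confirmed exploitation. Check the field names against the live feed before relying on it.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class IOC:  # trimmed-down copy of the article's schema, for a standalone example
    value: str
    ioc_type: str
    source: str
    confidence: int
    tags: list[str] = field(default_factory=list)
    first_seen: Optional[datetime] = None
    raw: dict = field(default_factory=dict)


def parse_kev(payload: dict) -> list[IOC]:
    """Turn a KEV-style catalog payload into normalized IOC objects."""
    iocs = []
    for entry in payload.get("vulnerabilities", []):
        added = datetime.strptime(entry["dateAdded"], "%Y-%m-%d").replace(tzinfo=timezone.utc)
        tags = [t for t in (entry.get("vendorProject"), entry.get("product")) if t]
        iocs.append(IOC(
            value=entry["cveID"],
            ioc_type="cve",
            source="cisa_kev",
            confidence=100,  # assumption: KEV lists only confirmed exploitation
            tags=tags,
            first_seen=added,
            raw=entry,
        ))
    return iocs


# Made-up payload in the assumed shape; not real catalog data
sample = {"vulnerabilities": [
    {"cveID": "CVE-0000-0001", "vendorProject": "ExampleVendor",
     "product": "ExampleProduct", "dateAdded": "2026-01-15"},
]}
iocs = parse_kev(sample)
print(iocs[0].value, iocs[0].tags, iocs[0].first_seen.isoformat())
```

Keeping the parsing separate from the HTTP call makes the adapter testable without network access; a `fetch()` method would only download the JSON and hand it to `parse_kev`.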

The Aggregation Loop

We want this running continuously, each adapter on its own schedule (CVE feeds change daily, IP reputation can change hourly):

import sqlite3
import schedule
import time
import json
import logging
from pathlib import Path

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger(__name__)

DB_PATH = Path("threat_intel.db")

def init_db(conn: sqlite3.Connection):
    conn.execute("""
        CREATE TABLE IF NOT EXISTS iocs (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            value TEXT NOT NULL,
            ioc_type TEXT NOT NULL,
            source TEXT NOT NULL,
            confidence INTEGER DEFAULT 0,
            tags TEXT DEFAULT '[]',
            first_seen TEXT,
            last_seen TEXT,
            raw TEXT DEFAULT '{}',
            UNIQUE(value, source)
        )
    """)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_iocs_value ON iocs(value)")
    conn.commit()

def upsert_iocs(conn: sqlite3.Connection, iocs: list[IOC]):
    for ioc in iocs:
        conn.execute("""
            INSERT INTO iocs
              (value, ioc_type, source, confidence, tags, first_seen, last_seen, raw)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?)
            ON CONFLICT(value, source) DO UPDATE SET
                confidence = excluded.confidence,
                last_seen  = excluded.last_seen,
                tags       = excluded.tags,
                raw        = excluded.raw
        """, (
            ioc.value, ioc.ioc_type.value, ioc.source, ioc.confidence,
            json.dumps(ioc.tags), ioc.first_seen.isoformat(),
            ioc.last_seen.isoformat(), json.dumps(ioc.raw),
        ))
    conn.commit()
    logger.info("Upserted %d IOCs from %s", len(iocs), iocs[0].source if iocs else "?")

def run_adapter(adapter, conn: sqlite3.Connection):
    try:
        iocs = adapter.fetch()
        upsert_iocs(conn, iocs)
    except Exception as e:
        logger.error("Adapter %s failed: %s", type(adapter).__name__, e)

def main():
    conn = sqlite3.connect(DB_PATH)
    init_db(conn)

    abuse_adapter = AbuseIPDBAdapter(api_key="YOUR_ABUSEIPDB_KEY")

    schedule.every(1).hours.do(run_adapter, abuse_adapter, conn)
    run_adapter(abuse_adapter, conn)  # run once on startup

    while True:
        schedule.run_pending()
        time.sleep(30)

if __name__ == "__main__":
    main()

A few things worth noting: error handling per adapter means one failing source doesn't block the others. The UNIQUE(value, source) constraint plus upsert means you can safely re-run without duplicates. The 30-second sleep in the loop is intentional — schedule is not async, and you don't need async here.
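The no-duplicates claim is easy to check in isolation. This sketch uses an in-memory SQLite database and a cut-down table (only the columns needed to show the behavior); the sample IP comes from a documentation range. Note that `ON CONFLICT ... DO UPDATE` needs SQLite 3.24 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE iocs (
        value TEXT NOT NULL,
        source TEXT NOT NULL,
        confidence INTEGER DEFAULT 0,
        last_seen TEXT,
        UNIQUE(value, source)
    )
""")

UPSERT = """
    INSERT INTO iocs (value, source, confidence, last_seen)
    VALUES (?, ?, ?, ?)
    ON CONFLICT(value, source) DO UPDATE SET
        confidence = excluded.confidence,
        last_seen  = excluded.last_seen
"""

# Same (value, source) pair reported twice with different scores
conn.execute(UPSERT, ("203.0.113.7", "abuseipdb", 80, "2026-07-01T00:00:00+00:00"))
conn.execute(UPSERT, ("203.0.113.7", "abuseipdb", 95, "2026-07-02T00:00:00+00:00"))
# A different source reporting the same IP gets its own row
conn.execute(UPSERT, ("203.0.113.7", "otx", 60, "2026-07-02T00:00:00+00:00"))
conn.commit()

rows = conn.execute("SELECT source, confidence FROM iocs ORDER BY source").fetchall()
print(rows)  # [('abuseipdb', 95), ('otx', 60)]
```

The second write updates the existing row in place, while the same IP from another source is kept separately, which is what lets lookups report per-source confidence.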

Querying the Store

The point of all this is fast lookups at runtime — when a log event contains an IP or domain, you want to know immediately if it's in your intel store:

def lookup_ioc(
    conn: sqlite3.Connection,
    value: str,
    min_confidence: int = 70,
) -> list[dict]:
    cursor = conn.execute("""
        SELECT value, ioc_type, source, confidence, tags, last_seen
        FROM iocs
        WHERE value = ? AND confidence >= ?
        ORDER BY confidence DESC
    """, (value, min_confidence))
    return [
        {
            "value": r[0], "type": r[1], "source": r[2],
            "confidence": r[3], "tags": json.loads(r[4]), "last_seen": r[5],
        }
        for r in cursor.fetchall()
    ]

# Integration example: check every extracted IP from a log line
matches = lookup_ioc(conn, "185.220.101.45")
if matches:
    logger.warning("Known threat actor: %s", matches)

The min_confidence parameter matters. Set it to 50 and you'll get noisy results. Set it to 90 and you'll miss things. 70 is a reasonable starting point — tune it per source based on observed false positive rates in your environment.

What to Add Next

This scaffold handles the core use case in under 200 lines. Here's what a production deployment adds:

More feed sources: AlienVault OTX has a Python SDK (OTXv2). Feodo Tracker is free with no auth — just a CSV download. The CISA KEV catalog is a JSON file updated daily at a stable URL.

Operational hardening:

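Rate limiting per adapter: most free-tier APIs have hourly or daily caps. Add time.sleep() between requests or a token bucket.
IOC expiration: IPs rotate. Add a scheduled cleanup: DELETE FROM iocs WHERE datetime(last_seen) < datetime('now', '-30 days'). Wrapping the column in datetime() normalizes the stored ISO 8601 strings (T separator, UTC offset) so they compare correctly with SQLite's own timestamp format.
Parallel fetching: once you have five adapters, switch to concurrent.futures.ThreadPoolExecutor for the fetch phase. Keep DB writes on a single thread; by default a sqlite3 connection is bound to the thread that created it, and a single writer also avoids contention.
Alerting: call a Telegram or Slack webhook inside run_adapter when high-confidence IOCs are found.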
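For the expiration item, one detail is worth a sketch: the adapters store timestamps as ISO 8601 strings with a T separator, so the comparison below wraps the column in SQLite's `datetime()` to normalize it first. The table is cut down to two columns, and the `purge_stale` helper name and its 30-day default are illustrative.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def purge_stale(conn: sqlite3.Connection, max_age_days: int = 30) -> int:
    """Delete IOCs not seen within max_age_days; returns the number removed."""
    cur = conn.execute(
        "DELETE FROM iocs WHERE datetime(last_seen) < datetime('now', ?)",
        (f"-{max_age_days} days",),
    )
    conn.commit()
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE iocs (value TEXT, last_seen TEXT)")
now = datetime.now(timezone.utc)
conn.executemany("INSERT INTO iocs VALUES (?, ?)", [
    ("198.51.100.1", (now - timedelta(days=45)).isoformat(timespec="seconds")),
    ("198.51.100.2", (now - timedelta(days=2)).isoformat(timespec="seconds")),
])

removed = purge_stale(conn)
remaining = [r[0] for r in conn.execute("SELECT value FROM iocs")]
print(removed, remaining)  # 1 ['198.51.100.2']
```

Scheduling it is one more line in the existing loop, for example a daily `schedule` job that calls the helper with the shared connection.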

For teams who want to go deeper on hardening the infrastructure around this kind of tooling — API exposure, secret management, network segmentation — the security checklists we publish cover those patterns step by step.

The Takeaway

A threat intel aggregator is not a 10,000-line project. The core is under 200 lines: a normalized data model, one adapter class per source, a scheduling loop, and an indexed SQLite store. Start there, add sources one by one, and measure your false positive rate before adding complexity.

The mistake most teams make is waiting for a commercial TIP budget that never gets approved. You can have something useful running this afternoon.


Semantic deduplication for large text datasets

Ayi NEDJIMI — Sat, 11 Jul 2026 10:03:21 +0000

When you build a dataset for ML training or a RAG knowledge base, exact deduplication is not enough. Copy-paste duplicates are easy to catch with a hash. Paraphrases, reformulations, and semantically equivalent sentences are not. Running standard MinHash or Jaccard on them gives near-zero similarity even when they carry identical information. The result: bloated corpora, biased models, and retrieval systems that return the same fact dressed in different words.

Semantic deduplication fixes this by comparing meaning instead of tokens.

Why Exact and Fuzzy Dedup Fall Short

Exact deduplication works by hashing document content — fast, but it only catches bit-for-bit identical strings. Fuzzy deduplication (MinHash, SimHash, Jaccard on n-grams) extends this to near-verbatim copies. It handles most copy-paste variants well.

Consider these two sentences:

"The server returned a 500 error"
"An internal server error was encountered"

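Word-level Jaccard similarity is 0.2: only "server" and "error" are shared among 10 distinct words, and on word bigrams the overlap drops to zero. A sentence-embedding model scores the pair far higher, because both sentences describe the same event. Fuzzy dedup keeps both; semantic dedup removes one once the pair clears your threshold.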
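The token overlap can be computed exactly with a few lines of Python. This sketch uses plain set-based Jaccard over lowercased whitespace tokens (the quantity MinHash approximates), first on single words and then on word bigrams; the `shingles` helper is illustrative.

```python
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a or b) else 0.0


def shingles(text: str, n: int) -> set:
    """Set of n-word tuples from lowercased, whitespace-split text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


s1 = "The server returned a 500 error"
s2 = "An internal server error was encountered"

word_sim = jaccard(shingles(s1, 1), shingles(s2, 1))    # 2 shared of 10 distinct words
bigram_sim = jaccard(shingles(s1, 2), shingles(s2, 2))  # no shared bigrams
print(word_sim, bigram_sim)  # 0.2 0.0
```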

For production RAG pipelines, curated training sets, or any corpus where redundancy directly affects downstream quality, this is a meaningful gap.

The Core Approach: Embeddings + Similarity Threshold

The pipeline has three steps:

Embed all documents into a dense vector space using a sentence embedding model
Find pairs (or clusters) of vectors above a similarity threshold
Keep one representative per cluster, discard the rest

Step 2 is the hard part. Naive pairwise comparison is O(n²) — acceptable at 10k documents, unusable at 1M.

Basic semantic dedup for small datasets

from sentence_transformers import SentenceTransformer
import numpy as np

def semantic_dedup(texts: list[str], threshold: float = 0.87) -> list[str]:
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(texts, batch_size=64, show_progress_bar=True)

    # L2-normalize for cosine similarity via dot product
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = (embeddings / norms).astype("float32")

    keep = []
    removed = set()

    for i in range(len(texts)):
        if i in removed:
            continue
        keep.append(i)
        sims = normalized[i] @ normalized[i + 1:].T
        duplicates = np.where(sims >= threshold)[0] + i + 1
        removed.update(duplicates.tolist())

    return [texts[i] for i in keep]


texts = [
    "The server returned a 500 error",
    "An internal server error was encountered",
    "The database connection failed",
    "Failed to connect to the database",
    "User authentication succeeded",
]

result = semantic_dedup(texts, threshold=0.82)
print(f"Before: {len(texts)}, After: {len(result)}")
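# If both paraphrase pairs score above the threshold, this prints "Before: 5, After: 3".
# The actual similarities depend on the embedding model, so check them on your own data.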

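This loop computes one row of similarities at a time, so memory stays proportional to the embeddings themselves, but the work is still O(n²) comparisons; in practice that is workable up to roughly 50k documents. Materializing the full similarity matrix instead would need about 40 GB at 100k documents (100,000² float32 values at 4 bytes each).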
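The memory figures follow from simple arithmetic, sketched here for float32 values at 4 bytes each. The helper names are illustrative, and 384 is the embedding width of all-MiniLM-L6-v2.

```python
BYTES_PER_FLOAT32 = 4


def full_matrix_gb(n_docs: int) -> float:
    """Memory for a dense n x n similarity matrix, in GB."""
    return n_docs * n_docs * BYTES_PER_FLOAT32 / 1e9


def embeddings_gb(n_docs: int, dim: int = 384) -> float:
    """Memory for the n x dim embedding matrix itself, in GB."""
    return n_docs * dim * BYTES_PER_FLOAT32 / 1e9


for n in (10_000, 50_000, 100_000):
    print(f"{n:>7} docs: matrix {full_matrix_gb(n):6.1f} GB, embeddings {embeddings_gb(n):.3f} GB")
```

The embeddings grow linearly while the matrix grows quadratically, which is why the pairwise approach runs out of road long before the vectors themselves become a problem.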

Scaling to Millions of Documents with FAISS

For large-scale deduplication, nearest neighbor search replaces the full pairwise comparison. FAISS builds an index over your embeddings and returns the K nearest neighbors for each vector; its approximate (ANN) index types do this without computing every distance.

The strategy: instead of a similarity matrix, you find clusters of near-duplicates and keep the lowest-index member of each cluster. Union-Find with path compression handles the cluster merging in near-linear time.

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
from collections import defaultdict

def semantic_dedup_large(
    texts: list[str],
    threshold: float = 0.87,
    k_neighbors: int = 10,
    batch_size: int = 512,
) -> list[str]:
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(
        texts,
        batch_size=batch_size,
        show_progress_bar=True,
        normalize_embeddings=True,
    ).astype("float32")

    dim = embeddings.shape[1]
    index = faiss.IndexFlatIP(dim)
    index.add(embeddings)

    distances, indices = index.search(embeddings, k_neighbors + 1)

    parent = list(range(len(texts)))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x: int, y: int) -> None:
        px, py = find(x), find(y)
        if px != py:
            parent[px] = py

    for i in range(len(texts)):
        for j, sim in zip(indices[i], distances[i]):
            if j != i and sim >= threshold:
                union(i, j)

    clusters: dict[int, list[int]] = defaultdict(list)
    for i in range(len(texts)):
        clusters[find(i)].append(i)

    kept = sorted(min(members) for members in clusters.values())
    return [texts[i] for i in kept]

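With IndexFlatIP (exact search), memory is not the bottleneck: 5M documents at 384 dimensions take about 7.7 GB as float32 (5,000,000 × 384 × 4 bytes), which fits on a machine with 32 GB RAM. Search time is the limit, because a flat index still compares every query against every stored vector. For 10M+ documents, swap in IndexIVFFlat with a coarse quantizer; you trade roughly 1–2% recall for a 10–20x speedup, depending on how nlist and nprobe are tuned.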
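One property of the Union-Find step is worth seeing in isolation: merging is transitive. If A is close to B and B is close to C, all three land in one cluster even when A and C are below the threshold themselves. The sketch below replays the same find/union logic on hand-written similarity pairs; the pair values are made up for illustration.

```python
from collections import defaultdict


def cluster(n: int, pairs: list[tuple[int, int, float]], threshold: float) -> list[list[int]]:
    """Group indices 0..n-1 by merging every pair at or above the threshold."""
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j, sim in pairs:
        if sim >= threshold:
            parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return sorted(groups.values())


# Made-up similarities: 0~1 and 1~2 clear the threshold, 0~2 does not
pairs = [(0, 1, 0.90), (1, 2, 0.89), (0, 2, 0.70), (3, 4, 0.50)]
clusters = cluster(5, pairs, threshold=0.87)
kept = [min(c) for c in clusters]
print(clusters, kept)  # [[0, 1, 2], [3], [4]] [0, 3, 4]
```

Document 2 is dropped even though its similarity to the kept document 0 is only 0.70. If chaining like this merges too much on your data, raising the threshold or re-checking each member against the kept representative are two possible mitigations.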

Choosing the Right Threshold

The threshold controls aggressiveness. There is no universal value; it depends on your domain.

Threshold	Effect
0.95+	Near-verbatim only — very conservative
0.87–0.94	Strong semantic overlap removed — good for training data
0.75–0.86	Topically similar content merged — aggressive
< 0.75	Will over-deduplicate most corpora

For RAG knowledge bases, 0.87–0.90 is a reliable starting point. Meta's SemDeDup paper uses a similar range on large-scale web crawls.

Do not pick a threshold blindly. Sample 50 pairs that fall within ±0.03 of your candidate threshold and label them manually as duplicate/not-duplicate. This takes 30 minutes and prevents days of debugging degraded model behavior.
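Pulling the labeling sample is a few lines once you have scored pairs. The sketch below assumes the pairs are available as `(i, j, similarity)` tuples; the `sample_boundary_pairs` helper is hypothetical and the scores are synthetic, generated only to make the example runnable.

```python
import random


def sample_boundary_pairs(scored_pairs, threshold, band=0.03, k=50, seed=0):
    """Return up to k (i, j, sim) pairs whose similarity is within +/- band of threshold."""
    eps = 1e-9  # tolerate float rounding at the band edges
    near = [p for p in scored_pairs if abs(p[2] - threshold) <= band + eps]
    rng = random.Random(seed)  # fixed seed so the labeling sample is reproducible
    return rng.sample(near, min(k, len(near)))


# Synthetic scores for illustration only
score_rng = random.Random(1)
scored = [(i, i + 1, score_rng.uniform(0.5, 1.0)) for i in range(2000)]

to_label = sample_boundary_pairs(scored, threshold=0.87)
print(len(to_label), min(s for *_, s in to_label), max(s for *_, s in to_label))
```

Labeling only the band around the candidate threshold concentrates the manual effort where the decision actually flips.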

Incremental Deduplication in Practice

In production, you rarely re-embed your entire corpus from scratch. The common pattern is incremental: you have an existing indexed corpus, and you are adding new documents.

Index the existing corpus once. For each new document, query against the index. If the nearest neighbor exceeds your threshold, drop the new document. If not, add it to the index.

This keeps encoding cost proportional to new data, not total corpus size. It also means your FAISS index grows over time — plan for periodic re-indexing if you care about ANN recall staying consistent.
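The incremental loop can be sketched with a brute-force list standing in for the FAISS index. Everything here is illustrative: the `IncrementalDeduper` class is hypothetical, the vectors are tiny hand-written examples rather than real embeddings, and a production version would query a FAISS index instead of scanning a Python list.

```python
import math


def normalize(vec: list[float]) -> list[float]:
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


class IncrementalDeduper:
    def __init__(self, threshold: float = 0.87) -> None:
        self.threshold = threshold
        self.vectors: list[list[float]] = []  # stand-in for the ANN index

    def add_if_new(self, vec: list[float]) -> bool:
        """Index vec unless an existing vector is at or above the threshold."""
        v = normalize(vec)
        for existing in self.vectors:
            cosine = sum(a * b for a, b in zip(v, existing))
            if cosine >= self.threshold:
                return False  # near-duplicate: drop the new document
        self.vectors.append(v)
        return True


dedup = IncrementalDeduper(threshold=0.87)
results = [
    dedup.add_if_new([1.0, 0.0, 0.0]),    # first document, always kept
    dedup.add_if_new([0.99, 0.05, 0.0]),  # nearly parallel to the first
    dedup.add_if_new([0.0, 1.0, 0.0]),    # orthogonal, so it is new
]
print(results, len(dedup.vectors))  # [True, False, True] 2
```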

When processing user-generated content or scraped data through these pipelines, review your data handling procedures carefully. A semantic deduplication stage can inadvertently surface sensitive patterns in your corpus. If you are building data pipelines in a security-sensitive context, the security hardening checklists for data pipelines cover the relevant controls.

The Takeaway

Exact and fuzzy deduplication leave real semantic duplicates in your dataset. Semantic dedup via embeddings + ANN search is mature, accessible, and worth adding as a standard pipeline step.

Start with all-MiniLM-L6-v2 and threshold 0.87. Test on a 10k sample, manually inspect ~50 boundary pairs, then scale with FAISS. The quality gains in downstream models — less hallucination, more diverse retrieval results, cleaner training signal — are measurable and worth the extra pipeline step.


AI-Powered Code Documentation Generator from Source Code

Ayi NEDJIMI — Thu, 09 Jul 2026 10:06:13 +0000

Writing documentation is the task developers deprioritize and teams regret. Legacy codebases accumulate thousands of undocumented functions, and onboarding becomes weeks of archaeology. Language models now make it feasible to generate accurate, contextual docstrings directly from source — and wiring this into a working tool takes less than an hour.

The Architecture

The pipeline is: parse → filter → prompt → patch. Each step is testable in isolation. No frameworks, no magic — just Python's ast module, direct API calls, and careful line-number arithmetic.

The core design decision: work with AST nodes and line numbers, not regex. Regex breaks on decorators, multi-line signatures, and nested functions. The AST does not.

Parsing Functions from Source

Python's ast module gives us precise control. We need the function source text, its line number for insertion, and whether it already has a docstring — that last bit ensures idempotency when you run the tool on a partially-documented codebase.

import ast
from dataclasses import dataclass

@dataclass
class FunctionInfo:
    name: str
    source: str
    lineno: int
    col_offset: int
    has_docstring: bool

def extract_functions(source_code: str) -> list[FunctionInfo]:
    tree = ast.parse(source_code)
    results = []

    for node in ast.walk(tree):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue

        # ast.get_docstring returns None when there is no docstring, which gives
        # a real bool here instead of whatever the chained `and` happens to return
        has_doc = ast.get_docstring(node, clean=False) is not None

        func_source = ast.get_source_segment(source_code, node)
        if func_source:
            results.append(FunctionInfo(
                name=node.name,
                source=func_source,
                lineno=node.lineno,
                col_offset=node.col_offset,
                has_docstring=has_doc,
            ))

    return results

ast.get_source_segment requires Python 3.8+. We track col_offset so the inserted docstring gets the correct indentation — class methods are indented 4 spaces deeper than module-level functions, and getting this wrong produces a SyntaxError on the next parse.

Generating Docstrings with a Language Model

Prompt structure matters more than model choice. A structured prompt that specifies Google-style format — one-line summary, Args, Returns, Raises — produces output you'd actually commit. A vague "document this function" produces noise.

import httpx
import os

def generate_docstring(func: FunctionInfo) -> str:
    prompt = (
        "Generate a concise Google-style docstring for this Python function.\n"
        "Include: one-line summary, Args (if any), Returns, Raises (if applicable).\n"
        "Output only the docstring body — no triple quotes, no function signature.\n\n"
        f"Function:\n{func.source}"
    )

    resp = httpx.post(
        os.environ["LLM_API_BASE"] + "/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['LLM_API_KEY']}"},
        json={
            "model": os.environ.get("LLM_MODEL", "gpt-4o-mini"),
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.2,
            "max_tokens": 256,
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"].strip()

Temperature at 0.2 is deliberate — you want factual, repeatable output, not creative reinterpretation. The same function should produce near-identical docstrings on consecutive runs. Any OpenAI-compatible endpoint works here: a local inference server, a managed API, or a fine-tuned model. Swap LLM_API_BASE and you are done.

Patching the Source Without Breaking It

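String replacement fails on duplicate function names in different classes. The correct approach: insert docstrings by line number, processing functions in reverse order. One caveat in the version below: the docstrings dict is keyed by function name, so two same-named methods would receive the same docstring; key it by line number if your codebase has such duplicates.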

def patch_source(
    source_code: str,
    functions: list[FunctionInfo],
    docstrings: dict[str, str],
) -> str:
    lines = source_code.splitlines(keepends=True)

    # Reverse order so earlier insertions don't shift later line numbers
    to_patch = sorted(
        [(f, docstrings[f.name]) for f in functions if f.name in docstrings],
        key=lambda x: x[0].lineno,
        reverse=True,
    )

    for func, content in to_patch:
        # Walk forward past multi-line signatures to find the closing colon
        insert_at = func.lineno
        while insert_at <= len(lines):
            if lines[insert_at - 1].rstrip().endswith(":"):
                break
            insert_at += 1

        indent = " " * (func.col_offset + 4)
        doc_lines = content.splitlines()
        opening = indent + '"""' + doc_lines[0] + "\n"
        middle  = "".join(indent + line + "\n" for line in doc_lines[1:])
        closing = indent + '"""' + "\n"
        lines.insert(insert_at, opening + middle + closing)

    return "".join(lines)

The reverse-order insight: insert a 5-line docstring at line 20 and every function below shifts by 5. Process top-to-bottom and you corrupt every subsequent line number. Process bottom-to-top and the problem disappears.

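The "walk forward to find the colon" loop handles multi-line function signatures, which are increasingly common with type annotations and keyword arguments that span several lines. It is a heuristic, though: a signature line with a trailing comment, or a one-line function such as def f(): return 1, will not match as intended.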
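An alternative to scanning for the colon, sketched here as an assumption rather than as part of the tool above, is to ask the AST where the body starts: the first statement in `node.body` carries its own line number. The `body_start_lines` helper is hypothetical. It skips functions whose body shares a line with the signature, since there is no separate line to insert before.

```python
import ast

SAMPLE = '''\
def one_liner(): return 1

def commented(x):  # a trailing comment defeats the endswith check
    return x

def multi(
    a: int,
    b: int = 0,
) -> int:
    return a + b
'''


def body_start_lines(source: str) -> dict[tuple[str, int], int]:
    """Map (function name, def line) to the 1-based line of its first body statement."""
    lines = source.splitlines()
    starts = {}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        first = node.body[0]
        prefix = lines[first.lineno - 1][:first.col_offset]
        if prefix.strip() == "":  # the body begins on its own line
            starts[(node.name, node.lineno)] = first.lineno
    return starts


starts = body_start_lines(SAMPLE)
print(starts)  # {('commented', 3): 4, ('multi', 6): 10}
```

A docstring would then be inserted just before the reported line, indented to the first statement's `col_offset`. Keying the result by name and line number also sidesteps collisions between same-named methods.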

Wiring It Into a CLI

import argparse

def main():
    parser = argparse.ArgumentParser(description="Generate docstrings from source")
    parser.add_argument("file", help="Python file to document")
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args()

    with open(args.file) as f:
        source = f.read()

    funcs = [fn for fn in extract_functions(source) if not fn.has_docstring]
    if not funcs:
        print("No undocumented functions found.")
        return

    print(f"Generating {len(funcs)} docstring(s)...")
    docstrings = {}
    for fn in funcs:
        print(f"  → {fn.name}")
        docstrings[fn.name] = generate_docstring(fn)

    patched = patch_source(source, funcs, docstrings)

    if args.dry_run:
        print(patched)
    else:
        with open(args.file, "w") as f:
            f.write(patched)
        print(f"Done. {len(docstrings)} docstring(s) written.")

if __name__ == "__main__":
    main()

With one small addition this drops into a pre-commit hook: add a check mode that exits non-zero when undocumented functions exist, or run the tool in auto-fix mode to generate the docstrings before the commit lands.
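A check mode for the hook could look like the sketch below. It is an assumption layered on the tool, not part of it: the `undocumented` and `check` helpers are hypothetical, and the exit-code convention (0 for clean, 1 for missing docstrings) is the usual one for pre-commit hooks.

```python
import ast


def undocumented(source: str) -> list[str]:
    """Names of functions in source that have no docstring."""
    return [
        node.name
        for node in ast.walk(ast.parse(source))
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and ast.get_docstring(node) is None
    ]


def check(source: str) -> int:
    """Return a process exit code: 1 if anything lacks a docstring, else 0."""
    missing = undocumented(source)
    for name in missing:
        print(f"missing docstring: {name}")
    return 1 if missing else 0


SAMPLE = '''
def documented():
    """Already has a docstring."""

async def fetch(url):
    return url
'''

exit_code = check(SAMPLE)  # prints "missing docstring: fetch"
print(exit_code)           # 1
```

In a real hook the result would be passed to `sys.exit()` so the commit is blocked when the code is non-zero.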

Security Before You Deploy This Internally

Before rolling this out to a team, audit what leaves your network. Source code often carries internal service names, database schemas, API structures, and proprietary business logic. Two concrete mitigations worth implementing:

Self-host the model — the architecture above works with any OpenAI-compatible endpoint. Point LLM_API_BASE at a local inference server and no code leaves your infrastructure. For teams evaluating their LLM deployment security posture, we publish free security hardening checklists that include AI inference infrastructure controls covering data exposure, model access policies, and audit logging.

Strip before sending — replace actual identifiers with generic placeholders before the prompt and restore them after. An AST-based approach makes this mechanical: rename all ast.Name nodes to varN, generate the docstring, then restore the originals in a post-processing step.
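A rough sketch of the stripping step, under stated assumptions: the `Anonymizer` class is hypothetical, it renames variable and parameter names only (function names, attribute names, and string literals pass through untouched, so a real version would need to cover those too), and it relies on `ast.unparse`, which requires Python 3.9+.

```python
import ast
import builtins


class Anonymizer(ast.NodeTransformer):
    """Rename variables and parameters to var0, var1, ... and remember the mapping."""

    def __init__(self) -> None:
        self.mapping: dict[str, str] = {}

    def _alias(self, name: str) -> str:
        if hasattr(builtins, name):
            return name  # keep len, int, print, ... readable for the model
        if name not in self.mapping:
            self.mapping[name] = f"var{len(self.mapping)}"
        return self.mapping[name]

    def visit_Name(self, node: ast.Name) -> ast.Name:
        node.id = self._alias(node.id)
        return node

    def visit_arg(self, node: ast.arg) -> ast.arg:
        self.generic_visit(node)  # also rewrite names inside annotations
        node.arg = self._alias(node.arg)
        return node


source = "def total(price, qty):\n    return price * qty + tax_rate\n"
anonymizer = Anonymizer()
stripped = ast.unparse(anonymizer.visit(ast.parse(source)))

# Inverse mapping, used to restore real names in the generated docstring
restore = {alias: name for name, alias in anonymizer.mapping.items()}
print(stripped)
print(restore)
```

Restoring names in the model's output is then a matter of replacing each alias with its original. Whether docstrings generated from anonymized code stay useful is something to evaluate on your own codebase, since descriptive names carry much of the meaning the model works from.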

Developer tooling that silently exfiltrates source code to third-party APIs is a documented audit finding in regulated environments. The architecture above keeps that control explicit.

The Takeaway

The LLM call is the easy part. The AST manipulation and backwards patching logic deserve thorough unit tests — write cases for nested functions, async generators, class methods, decorated functions with multi-line signatures, and files that mix documented and undocumented functions.

The model layer is intentionally thin: one function, two environment variables. Today it is a cloud API; next quarter it might be a fine-tuned model running on-premises. Keep it swappable and the rest of the tool stays useful regardless of what the ecosystem looks like in six months.


Building a Document Classification System with LLMs

Ayi NEDJIMI — Wed, 08 Jul 2026 10:03:30 +0000

You have thousands of support tickets, contracts, or incident reports landing in a single queue. Someone needs to route them — to the right team, the right priority tier, or the right archive bucket. A traditional ML classifier needs labeled training data you probably don't have. A language model already understands text; the question is how to turn that understanding into a reliable, auditable pipeline.

This article walks through building a practical document classifier backed by a language model API, with structured output, uncertainty handling, and batch processing.

Why Not a Fine-Tuned Classifier?

Fine-tuning a BERT-style model is still the right answer if you have 10,000+ labeled examples and need sub-50ms latency. But for most teams the situation looks different: you have a taxonomy of 10–50 categories, a handful of examples per category, and no one available to label thousands of documents.

A language model handles this with zero-shot or few-shot classification. You describe the categories in plain text, optionally provide 1–3 examples each, and the model classifies. The tradeoff: higher per-request cost and 150–500ms latency. For batch processing and low-frequency classification pipelines, that's acceptable.
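As a rough sketch of what "describe the categories, optionally provide examples" looks like in practice, the helper below assembles chat messages for either mode. The name build_messages and the exact prompt wording are assumptions made for illustration. Labeled examples are passed as (text, label) pairs, and the output format is deliberately left out of the prompt here.

```python
from typing import Optional


def build_messages(
    categories: dict[str, str],
    text: str,
    examples: Optional[list[tuple[str, str]]] = None,
) -> list[dict[str, str]]:
    """Build chat messages for zero-shot (no examples) or few-shot classification."""
    category_block = "\n".join(f"- {name}: {desc}" for name, desc in categories.items())
    messages = [
        {
            "role": "system",
            "content": (
                "You are a document classifier. Choose exactly one category.\n"
                f"Categories:\n{category_block}"
            ),
        }
    ]
    # Each labeled example becomes a user/assistant turn the model can imitate.
    for example_text, label in examples or []:
        if label not in categories:
            raise ValueError(f"unknown label in example: {label}")
        messages.append({"role": "user", "content": example_text})
        messages.append({"role": "assistant", "content": label})
    # The document to classify always comes last.
    messages.append({"role": "user", "content": text})
    return messages
```

Calling it without examples gives the zero-shot prompt. Passing one to three pairs per category turns the same call into few-shot classification without changing the rest of the pipeline.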

Prompt Structure and Structured Output

The single most important decision is forcing structured output. Don't ask the model to explain its classification in prose; ask it to return JSON. Most LLM APIs support constrained output (JSON mode or tool calls), which makes malformed responses rare. JSON mode constrains syntax only, not your schema, so you still need to check the keys and values that come back.

Here's a minimal Python implementation:

import json
import os
from openai import OpenAI

client = OpenAI()
MODEL = os.environ.get("LLM_MODEL", "gpt-4o-mini")

CATEGORIES = {
    "billing": "Invoice issues, payment failures, subscription changes",
    "technical": "Bugs, crashes, integration errors, API failures",
    "security": "Suspected breaches, unusual logins, credential compromise",
    "general": "Anything that does not fit the above categories",
}

def classify_document(text: str) -> dict:
    category_block = "\n".join(
        f"- {name}: {desc}" for name, desc in CATEGORIES.items()
    )

    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a document classifier. "
                    "Return only valid JSON with keys: "
                    "category (string), confidence (float 0.0-1.0), reasoning (string)."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Categories:\n{category_block}\n\n"
                    f"Document:\n{text}\n\n"
                    "Classify this document."
                ),
            },
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )

    result = json.loads(response.choices[0].message.content)

    # Normalize unexpected category values
    if result.get("category") not in CATEGORIES:
        result["category"] = "general"

    return result

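Setting temperature=0 matters because classification should be repeatable, not creative. It makes outputs far more consistent, although most providers do not guarantee identical results across calls. The response_format constraint makes malformed JSON rare, but it does not check your schema, and a response cut off at the token limit can still fail to parse, so keep error handling around json.loads.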
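One way to make the parsing step defensive is a small validator that never raises and marks anything unusable as invalid. This is a sketch under assumptions: parse_classification and the valid flag are hypothetical names, and ALLOWED simply mirrors the keys of CATEGORIES above.

```python
import json
import math

ALLOWED = {"billing", "technical", "security", "general"}  # mirrors CATEGORIES keys


def parse_classification(raw, allowed=ALLOWED) -> dict:
    """Parse raw model output into a checked dict; never raises."""
    invalid = {"category": None, "confidence": 0.0, "reasoning": "", "valid": False}
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return invalid  # truncated, empty, or non-string output
    if not isinstance(data, dict):
        return invalid

    category = data.get("category")
    if not isinstance(category, str) or category not in allowed:
        return invalid  # the model invented a category or omitted the key

    try:
        confidence = float(data.get("confidence"))
    except (TypeError, ValueError):
        confidence = 0.0  # missing or non-numeric, e.g. "high"
    if not math.isfinite(confidence):
        confidence = 0.0
    confidence = min(max(confidence, 0.0), 1.0)  # clamp into [0, 1]

    return {
        "category": category,
        "confidence": confidence,
        "reasoning": str(data.get("reasoning", "")),
        "valid": True,
    }
```

Returning an explicit valid flag, rather than quietly rewriting the category, lets downstream code count these cases and send them to review.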

Batch Processing with Async

One document at a time is fine for prototypes. Production workflows need throughput. The pattern below uses asyncio with a semaphore to cap concurrent API calls and avoid hitting rate limits:

import asyncio
import json
import os
from openai import AsyncOpenAI

async_client = AsyncOpenAI()
MODEL = os.environ.get("LLM_MODEL", "gpt-4o-mini")

async def classify_batch(
    documents: list[dict],
    max_concurrent: int = 10,
) -> list[dict]:
    semaphore = asyncio.Semaphore(max_concurrent)

    async def _classify_one(doc: dict) -> dict:
        async with semaphore:
            try:
                response = await async_client.chat.completions.create(
                    model=MODEL,
                    messages=[
                        {
                            "role": "system",
                            "content": "Classify to JSON: {category, confidence, reasoning}",
                        },
                        {
                            "role": "user",
                            "content": (
                                "Categories: billing, technical, security, general\n\n"
                                f"Document:\n{doc['text']}"
                            ),
                        },
                    ],
                    response_format={"type": "json_object"},
                    temperature=0,
                )
                clf = json.loads(response.choices[0].message.content)
                return {**doc, "classification": clf, "error": None}
            except Exception as e:
                return {**doc, "classification": None, "error": str(e)}

    return await asyncio.gather(*[_classify_one(d) for d in documents])


if __name__ == "__main__":
    sample_docs = [
        {"id": 1, "text": "My card was charged twice for the same invoice."},
        {"id": 2, "text": "The /webhook endpoint returns 500 on every POST request."},
        {"id": 3, "text": "An unknown IP from Eastern Europe changed our admin credentials at 3am."},
    ]

    results = asyncio.run(classify_batch(sample_docs))
    for r in results:
        clf = r["classification"]
        if clf:
            print(f"[{r['id']}] {clf['category']} ({clf['confidence']:.2f}) — {clf['reasoning'][:60]}")
        else:
            print(f"[{r['id']}] ERROR: {r['error']}")

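Throughput is roughly max_concurrent divided by average request latency, capped by your provider's rate limits. With 10 concurrent requests at 500ms per call the ideal figure is 1,200 documents per minute; in practice 500–700 per minute is more typical once rate limiting and longer documents slow things down. Errors are caught per-document so a single API failure doesn't abort the entire batch.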
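The concurrency cap and the per-document error isolation can be checked without touching the API. The sketch below swaps the real call for a stub: run_capped, demo, and fake_classify are hypothetical names, and the stub fails on purpose for one document to show that the rest of the batch still completes.

```python
import asyncio


async def run_capped(items, worker, max_concurrent=10):
    """Run worker(item) for every item with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def _one(item):
        async with semaphore:
            try:
                return {"item": item, "result": await worker(item), "error": None}
            except Exception as exc:
                # Failures stay attached to the item that caused them.
                return {"item": item, "result": None, "error": str(exc)}

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(_one(item) for item in items))


async def demo():
    in_flight = 0
    peak = 0

    async def fake_classify(doc_id):
        # Stand-in for the API call; tracks how many calls overlap.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        try:
            await asyncio.sleep(0.01)
            if doc_id == 3:
                raise TimeoutError("simulated API timeout")
            return "general"
        finally:
            in_flight -= 1

    results = await run_capped(list(range(8)), fake_classify, max_concurrent=3)
    return results, peak
```

Running asyncio.run(demo()) returns eight results in input order, with the failure recorded only on document 3 and the peak overlap never exceeding the cap of 3.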

Routing on Confidence

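The model's confidence field is self-reported and not statistically calibrated, so a 0.9 does not mean a 90% chance of being correct. It is still a usable signal for routing decisions, provided you check the thresholds against labeled data: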

≥ 0.85: auto-classify and route immediately
0.65–0.84: auto-classify, but flag for periodic spot-check
< 0.65: send to human review queue
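The three tiers above map directly onto a small routing function. This is a minimal sketch: route_by_confidence and the tier names are hypothetical, and a missing or unusable confidence value is treated as the lowest tier, which is an assumption rather than something the thresholds themselves dictate.

```python
from typing import Optional

AUTO_THRESHOLD = 0.85        # at or above: route immediately
SPOT_CHECK_THRESHOLD = 0.65  # at or above: route, but flag for spot-check


def route_by_confidence(classification: Optional[dict]) -> str:
    """Map a classification result onto one of three routing tiers."""
    if not classification:
        return "human_review"  # failed or missing classification
    try:
        confidence = float(classification.get("confidence", 0.0))
    except (TypeError, ValueError):
        return "human_review"  # confidence was missing or not numeric
    if confidence >= AUTO_THRESHOLD:
        return "auto_route"
    if confidence >= SPOT_CHECK_THRESHOLD:
        return "auto_route_spot_check"
    return "human_review"
```

Keeping the thresholds as named constants makes it easy to tune them per category once you have evaluation data.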

For security-related documents, apply the thresholds more conservatively than for other categories, for example by sending a larger share to human review, because a misclassified security incident has real consequences. If you're building a triage pipeline, validate the approach against your organization's security incident handling checklist before pushing to production.

You can also extend the schema with a secondary_category field for documents that straddle two categories. A password reset failure, for example, can be both a technical and a security event; downstream routing logic can then decide which queue takes priority based on business rules.
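A priority table is one simple way to express such a business rule. In the sketch below, QUEUE_PRIORITY and pick_queue are hypothetical, and the ordering (security first) is an assumed policy, not a recommendation that fits every organization.

```python
# Assumed business rule: lower number wins when a document fits two categories.
QUEUE_PRIORITY = {"security": 0, "billing": 1, "technical": 2, "general": 3}


def pick_queue(classification: dict) -> str:
    """Choose the higher-priority queue among category and secondary_category."""
    candidates = [
        classification.get("category"),
        classification.get("secondary_category"),
    ]
    known = [c for c in candidates if isinstance(c, str) and c in QUEUE_PRIORITY]
    if not known:
        return "general"  # neither field named a known category
    return min(known, key=QUEUE_PRIORITY.__getitem__)
```

Because the rule lives in data rather than in the prompt, changing which queue wins does not require re-evaluating the classifier.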

Evaluating Before You Ship

Label 200–400 documents manually before going live. Compute precision and recall per category, not just overall accuracy: a classifier that routes 90% of documents to "general" can look decent on aggregate accuracy when "general" dominates the sample, but is useless in practice.
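Per-category metrics need nothing beyond the standard library. The function below is a minimal sketch with a hypothetical name, per_category_metrics. It takes the hand-assigned labels and the model's predictions as parallel lists and reports 0.0 where a ratio is undefined, which is a convention chosen here for simplicity.

```python
from collections import Counter


def per_category_metrics(y_true: list[str], y_pred: list[str]) -> dict[str, dict]:
    """Compute precision, recall, and support for each category."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this label, but it was wrong
            fn[truth] += 1  # missed the true label

    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        metrics[label] = {
            "precision": tp[label] / predicted if predicted else 0.0,
            "recall": tp[label] / actual if actual else 0.0,
            "support": actual,
        }
    return metrics
```

A classifier that labels everything "general" on a sample of eight general, one security, and one billing document scores 80% accuracy, yet this breakdown shows recall of 0.0 for both security and billing.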

Common failure modes to anticipate:

Overlapping category definitions. If two categories share similar descriptions, the model will be inconsistent. Rewrite them to be mutually exclusive and add counter-examples to the prompt for ambiguous cases.

Short documents. One- or two-sentence texts give the model less signal. Concatenate subject lines, sender metadata, or document headers into the classification input before sending it to the model.

Domain drift. Customer phrasing evolves over time. A support ticket about "my plan auto-renewed" looks different from "my subscription charged me unexpectedly," even though they map to the same category. Monitor per-category error rates in production and refresh your category descriptions when the error rate for a category rises past a threshold you set, or on a fixed schedule (quarterly is typical for stable domains).

Track classification latency and error rates alongside the model metrics. An API timeout that silently falls back to "general" is a reliability bug, not just a model quality problem.

The Takeaway

LLM-based document classification gets you from zero to a working pipeline in a day, without labeled training data. The decisions that matter most are: force structured output so malformed responses are rare and easy to detect, set temperature to 0 for consistency, write non-overlapping category descriptions, and handle low-confidence outputs explicitly with a routing tier rather than hiding uncertainty.

The per-document cost will exceed a fine-tuned model, but the time saved on data labeling usually justifies it for teams under 10 engineers. Once production traffic generates a labeled dataset, you can use it to train a cheaper, faster model if your volume eventually demands it.
