How to Technically Audit a Vendor's AI System - What's Actually Running Under the Hood

I've been asked to audit AI vendor systems several times now — reviewing what the system is actually doing vs what was sold. The patterns I find are consistent enough to write up.

This post is the engineering guide to running that audit. All code below is for authorised testing of systems you have legitimate access to.

Step 1: Inspect the API Call Pattern

The most revealing thing about any AI system is what external API calls it makes. A system making OpenAI calls is using a third-party LLM. A system making no external model calls is either self-hosted or entirely rule-based. The pattern tells you the architecture.

# Run with: mitmproxy -s ai_audit.py
# Note: only use on systems you are authorised to test

import mitmproxy.http
from datetime import datetime, timezone
import json

AI_API_DOMAINS = {
    "api.openai.com": "OpenAI",
    "api.anthropic.com": "Anthropic",
    "api.cohere.ai": "Cohere",
    "generativelanguage.googleapis.com": "Google Gemini",
    "api.together.xyz": "Together AI",
    "api.mistral.ai": "Mistral",
}

class AIVendorInspector:
    def __init__(self):
        self.calls = []

    def response(self, flow: mitmproxy.http.HTTPFlow):
        host = flow.request.pretty_host
        ai_provider = next(
            (name for domain, name in AI_API_DOMAINS.items() if domain in host),
            None
        )
        entry = {
            "timestamp": datetime.utcnow().isoformat(),
            "host": host,
            "path": flow.request.path,
            "ai_provider": ai_provider,
            "request_bytes": len(flow.request.content),
        }
        if ai_provider and flow.request.content:
            try:
                body = json.loads(flow.request.content)
                entry["model"] = body.get("model", "unknown")
            except Exception:
                pass
        self.calls.append(entry)

    def verdict(self) -> dict:
        total = len(self.calls)
        ai_calls = [c for c in self.calls if c["ai_provider"]]
        ratio = len(ai_calls) / max(total, 1)
        providers = list(set(c["ai_provider"] for c in ai_calls))
        models = list(set(c.get("model","?") for c in ai_calls if c.get("model")))

        if ratio == 0:
            assessment = "NO EXTERNAL AI DETECTED — system appears rule-based or self-hosted"
        elif ratio < 0.10:
            assessment = "MINIMAL AI — LLM called in <10% of operations, likely for formatting only"
        elif ratio < 0.50:
            assessment = "PARTIAL AI — AI used in some operations; investigate which decisions it makes"
        else:
            assessment = "SIGNIFICANT AI — External model calls central to operation"

        return {
            "total_requests": total,
            "ai_api_calls": len(ai_calls),
            "ai_call_ratio_pct": round(ratio * 100, 1),
            "providers_detected": providers,
            "models_detected": models,
            "assessment": assessment,
        }

    def done(self):
        # mitmproxy shutdown hook: persist the verdict so it can be reviewed
        # after the capture session ends (the output path is arbitrary).
        with open("ai_audit_verdict.json", "w") as f:
            json.dump(self.verdict(), f, indent=2)


addons = [AIVendorInspector()]

Key signal: if external model API calls are zero, the intelligence is rule-based or self-hosted rather than a third-party LLM. If calls are minimal and the only traffic is OpenAI requests for formatting at the end of a chain, that's Pattern 1 (rules engine + GPT wrapper).
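
To make that call mechanical, here is a minimal sketch built on the verdict() output above. The function name, the thresholds, and the verdict file path are my own illustration, not part of the mitmproxy addon itself.

def looks_like_gpt_wrapper(verdict: dict) -> bool:
    """Heuristic: a single external provider plus a low call ratio is the
    typical signature of a rules engine with an LLM bolted on for formatting."""
    single_provider = len(verdict["providers_detected"]) == 1
    minimal_usage = 0 < verdict["ai_call_ratio_pct"] < 10
    return single_provider and minimal_usage

# Example, using the JSON the addon writes on shutdown:
# import json
# with open("ai_audit_verdict.json") as f:
#     verdict = json.load(f)
# if looks_like_gpt_wrapper(verdict):
#     print("Likely Pattern 1: rules engine + GPT wrapper")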

Step 2: Semantic Consistency Testing

Genuine NLP models produce consistent outputs across semantically equivalent inputs phrased differently. Keyword-matching systems show high variance on paraphrases because they match surface patterns, not meaning.

import random
from typing import Callable

def paraphrase_variants(text: str, n: int = 20) -> list[str]:
    """
    Generates semantically equivalent variants of an input.
    High output variance across these → keyword-based system.
    Low variance → genuine semantic understanding.
    """
    variants = []

    strategies = [
        # Add filler words
        lambda t: f"please {t}",
        lambda t: f"I need to {t}",
        lambda t: f"can you help me {t}",
        # Move the first word to the end (word-order perturbation)
        lambda t: " ".join(t.split()[1:] + [t.split()[0]]) if len(t.split()) > 2 else t,
        # Add trailing context
        lambda t: f"{t} as soon as possible",
        lambda t: f"{t} for our team",
        # Common abbreviations
        lambda t: t.replace("and", "&").replace("with", "w/"),
        # Introduce a minor typo (transpose two adjacent characters)
        lambda t: t[:5] + t[6] + t[5] + t[7:] if len(t) > 8 else t,
    ]

    for _ in range(n):
        strategy = random.choice(strategies)
        variants.append(strategy(text))

    return list(set(variants))[:n]

def test_semantic_consistency(
    model_fn: Callable[[str], float],
    test_inputs: list[str],
    range_threshold: float = 0.18
) -> dict:
    """
    model_fn: takes text input, returns a confidence/score float.
    Returns whether the model behaves like keyword matching or semantic understanding.
    """
    results = []
    for input_text in test_inputs:
        variants = paraphrase_variants(input_text, n=15)
        scores = [model_fn(v) for v in variants]
        score_range = max(scores) - min(scores)
        results.append({
            "original": input_text,
            "score_range": round(score_range, 3),
            "min": round(min(scores), 3),
            "max": round(max(scores), 3),
            "signal": "KEYWORD_BASED" if score_range > range_threshold else "SEMANTIC",
        })

    keyword_count = sum(1 for r in results if r["signal"] == "KEYWORD_BASED")
    overall = "KEYWORD_MATCHING" if keyword_count > len(results) * 0.5 else "SEMANTIC_UNDERSTANDING"

    return {
        "inputs_tested": len(results),
        "keyword_signal_count": keyword_count,
        "overall_verdict": overall,
        "details": results,
    }
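To run this against a vendor system, wrap their scoring endpoint in a function that matches the model_fn signature. The sketch below is an assumption about the shape of that API: the URL, auth header, request body, and response field are placeholders for whatever the vendor actually exposes.

import os
import requests

# Hypothetical adapter around the vendor's scoring API; substitute the real
# endpoint, auth scheme, and response field for the system under test.
VENDOR_URL = "https://vendor.example.com/v1/score"

def vendor_score(text: str) -> float:
    resp = requests.post(
        VENDOR_URL,
        headers={"Authorization": f"Bearer {os.environ['VENDOR_API_KEY']}"},
        json={"input": text},
        timeout=30,
    )
    resp.raise_for_status()
    return float(resp.json()["confidence"])

report = test_semantic_consistency(
    vendor_score,
    test_inputs=[
        "schedule a follow-up call with the customer",
        "escalate this ticket to the billing team",
    ],
)
print(report["overall_verdict"], report["keyword_signal_count"])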

Step 3: Benchmark Validation — Their Set vs Your Data

Vendors quote accuracy on the test set they curated for the sale. The number that actually matters is how the same model performs on a sample of your production data, so evaluate both and compare the gap.

from typing import Callable

def compare_benchmarks(
    model_fn: Callable,  # vendor's inference endpoint
    vendor_test_set: list[dict],    # [{"input": ..., "label": ...}]
    your_production_sample: list[dict],  # same format, from your actual data
) -> dict:
    """
    Compares model performance on vendor's curated test set vs your real data.
    Large delta = demo-tuned. Small delta = robust model.
    """
    def evaluate(dataset):
        correct = []
        for item in dataset:
            pred = model_fn(item["input"])
            correct.append(1 if pred == item["label"] else 0)
        return sum(correct) / len(correct) if correct else 0.0

    vendor_acc = evaluate(vendor_test_set)
    prod_acc = evaluate(your_production_sample)
    delta = vendor_acc - prod_acc

    return {
        "vendor_test_accuracy": round(vendor_acc, 4),
        "your_production_accuracy": round(prod_acc, 4),
        "accuracy_delta": round(delta, 4),
        "demo_tuning_risk": (
            "HIGH — significant gap; model likely optimised for demo scenarios"
            if delta > 0.15 else
            "MEDIUM — some gap; worth investigating the test set composition"
            if delta > 0.08 else
            "LOW — performance consistent across datasets"
        ),
        "recommendation": (
            "REQUEST vendor's test set composition. "
            "Insist on production SLA based on YOUR data, not their benchmark."
            if delta > 0.08 else
            "Benchmark appears robust. Still request right-to-audit clause in contract."
        ),
    }
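Wiring this up against real files looks something like the sketch below. The endpoint, field names, and JSONL paths are assumptions for illustration; each JSONL line is expected to hold {"input": ..., "label": ...}.

import json
import os
import requests

def vendor_predict(text: str):
    # Hypothetical wrapper around the vendor's classification endpoint; swap in
    # the real URL, auth, and response field for the system under test.
    resp = requests.post(
        "https://vendor.example.com/v1/classify",
        headers={"Authorization": f"Bearer {os.environ['VENDOR_API_KEY']}"},
        json={"input": text},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["label"]

def load_jsonl(path: str) -> list[dict]:
    # Each line: {"input": "...", "label": "..."}
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

results = compare_benchmarks(
    model_fn=vendor_predict,
    vendor_test_set=load_jsonl("vendor_test_set.jsonl"),
    your_production_sample=load_jsonl("production_sample.jsonl"),
)
print(results["demo_tuning_risk"])
print(results["recommendation"])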

The Audit Report — What to Produce

After these three tests, you have the evidence for a vendor conversation:

AI VENDOR TECHNICAL AUDIT
==========================
Vendor + Product: [name]
Audit Date: [date]
System Under Test: [description]

FINDINGS:
1. API Inspection
   Total requests observed: [X]
   External AI API calls: [Y] ([Z]%)
   Providers detected: [list]
   Assessment: [from verdict()]

2. Semantic Consistency Test
   Inputs tested: [N]
   Keyword-pattern signals: [count]
   Overall: [KEYWORD_MATCHING / SEMANTIC_UNDERSTANDING]

3. Benchmark Comparison
   Vendor test accuracy: [X%]
   Your production accuracy: [Y%]
   Delta: [Z%]
   Demo-tuning risk: [level]

SUMMARY:
[One paragraph honest assessment of what the system is actually doing]

CONTRACTUAL RECOMMENDATIONS:
- Require production accuracy SLA tied to YOUR data, not vendor benchmarks
- Insist on right-to-audit the model architecture
- Add exit clause if performance benchmarks not met within 6 months
- Request quarterly performance reports on production metrics
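To keep the report reproducible, the findings section can be assembled directly from the three result dicts. This is a minimal sketch; the field names match the functions earlier in this post, and the summary and contractual sections still need a human.

def build_audit_report(vendor: str, api: dict, semantic: dict, bench: dict) -> str:
    """Assembles the FINDINGS section from the dicts returned by
    AIVendorInspector.verdict(), test_semantic_consistency(), and compare_benchmarks()."""
    return "\n".join([
        "AI VENDOR TECHNICAL AUDIT",
        "==========================",
        f"Vendor + Product: {vendor}",
        "",
        "FINDINGS:",
        "1. API Inspection",
        f"   Total requests observed: {api['total_requests']}",
        f"   External AI API calls: {api['ai_api_calls']} ({api['ai_call_ratio_pct']}%)",
        f"   Providers detected: {', '.join(api['providers_detected']) or 'none'}",
        f"   Assessment: {api['assessment']}",
        "",
        "2. Semantic Consistency Test",
        f"   Inputs tested: {semantic['inputs_tested']}",
        f"   Keyword-pattern signals: {semantic['keyword_signal_count']}",
        f"   Overall: {semantic['overall_verdict']}",
        "",
        "3. Benchmark Comparison",
        f"   Vendor test accuracy: {bench['vendor_test_accuracy']:.1%}",
        f"   Your production accuracy: {bench['your_production_accuracy']:.1%}",
        f"   Delta: {bench['accuracy_delta']:.1%}",
        f"   Demo-tuning risk: {bench['demo_tuning_risk']}",
    ])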

This report gives you the evidence to renegotiate, exit, or make a justified case for rebuilding with a vendor who can actually deliver.

Sunil — CEO, Ailoitte. We run AI Stack Audits for companies re-evaluating enterprise AI vendors. 2-day turnaround. ailoitte.com
