
angufibo lincoln

Introducing FlameIQ — Deterministic Performance Regression Detection for Python

The Problem

Performance regressions are invisible in code review.

A careless refactor that recompiles a regex on every function call. A new dependency that adds 40ms to your p95 latency. A database query that quietly stops using an index. None of these show up in a diff. They accumulate silently across hundreds of commits — a 3ms latency increase here, a 2% throughput drop there — until they add up to an expensive production incident.

Type checkers enforce correctness automatically. Linters enforce style automatically. Nothing enforces performance — until now.


Introducing FlameIQ

Today we are releasing FlameIQ v1.0.0 — an open-source, deterministic, CI-native performance regression engine for Python.

```shell
pip install flameiq-core
```

FlameIQ compares your current benchmark results against a stored baseline and fails your CI pipeline if any metric exceeds its configured threshold — the same way a type checker fails your build on a type error.


Quick Start

Step 1 — Initialise

```shell
cd my-project
flameiq init
```

Step 2 — Run your benchmarks and produce a metrics file (e.g. benchmark.json, matching the v1 schema)

```json
{
  "schema_version": 1,
  "metadata": {
    "commit": "abc123",
    "branch": "main",
    "environment": "ci"
  },
  "metrics": {
    "latency": {
      "mean": 120.5,
      "p95": 180.0,
      "p99": 240.0
    },
    "throughput": 950.2,
    "memory_mb": 512.0
  }
}
```
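FlameIQ does not prescribe a benchmark runner; any script that emits this JSON shape works. A minimal sketch of such a harness (the helper names and hard-coded metadata here are illustrative, not part of FlameIQ):

```python
import json
import statistics
import time

def percentile(samples, pct):
    """Nearest-rank percentile of a sample list."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(round(pct / 100 * len(ordered))))
    return ordered[index]

def run_benchmark(fn, iterations=1000):
    """Time fn() repeatedly; return per-call latencies in milliseconds."""
    latencies = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies

def write_metrics(latencies, path="benchmark.json"):
    """Write the latencies out in the v1 metrics schema shown above."""
    metrics = {
        "schema_version": 1,
        "metadata": {"commit": "abc123", "branch": "main", "environment": "ci"},
        "metrics": {
            "latency": {
                "mean": statistics.mean(latencies),
                "p95": percentile(latencies, 95),
                "p99": percentile(latencies, 99),
            },
            # Calls per second, derived from total wall time in ms.
            "throughput": 1000 * len(latencies) / sum(latencies),
        },
    }
    with open(path, "w") as f:
        json.dump(metrics, f, indent=2)
```

In CI you would fill `metadata` from the environment (e.g. the commit SHA) rather than hard-coding it.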

Step 3 — Set a baseline

```shell
flameiq baseline set --metrics benchmark.json
```

Step 4 — Compare on every PR

```shell
flameiq compare --metrics current.json --fail-on-regression
```

Output:

```
  Metric           Baseline    Current      Change   Threshold  Status
  ────────────────────────────────────────────────────────────────────
  latency.p95       2.45 ms     4.51 ms     +84.08%    ±10.0%  REGRESSION
  throughput        412.30      231.50      -43.84%    ±10.0%  REGRESSION

  ✗ REGRESSION — 2 metric(s) exceeded threshold.
```

Exit code 1. Pipeline fails. Regression caught before merge.


A Real Example: Catching a Regex Regression

Here is the kind of bug FlameIQ is designed to catch. A developer refactors a text processing function and accidentally recompiles the regex on every call:

```python
import re

# FAST — original implementation
def clean(text: str) -> str:
    text = re.sub(r"[^\w\s]", "", text)   # Python caches compiled regex
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()

# SLOW — regressed implementation
def clean(text: str) -> str:
    punct_re = re.compile(r"[^\w\s]")     # recompiled on every call!
    space_re = re.compile(r"\s+")         # recompiled on every call!
    text = punct_re.sub("", text)
    text = space_re.sub(" ", text).strip()
    return text.lower()
```

This is invisible in code review. The logic is identical. The diff looks clean.
FlameIQ catches it with an 84% p95 latency increase — well above the 10% threshold.
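You can reproduce the effect locally with `timeit`; the exact ratio depends on the interpreter, input, and platform, so treat this as an illustration rather than a benchmark:

```python
import re
import timeit

def clean_fast(text: str) -> str:
    # String patterns passed to re.sub hit re's internal compiled-pattern cache.
    text = re.sub(r"[^\w\s]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()

def clean_slow(text: str) -> str:
    # re.compile runs on every call; even with re's pattern cache this
    # adds per-call lookup and construction overhead.
    punct_re = re.compile(r"[^\w\s]")
    space_re = re.compile(r"\s+")
    text = punct_re.sub("", text)
    text = space_re.sub(" ", text).strip()
    return text.lower()

sample = "Hello,   World!  This is a    test..."
assert clean_fast(sample) == clean_slow(sample)  # logic is identical

fast = timeit.timeit(lambda: clean_fast(sample), number=10_000)
slow = timeit.timeit(lambda: clean_slow(sample), number=10_000)
print(f"fast {fast:.3f}s  slow {slow:.3f}s  ratio {slow / fast:.2f}x")
```

Moving the `re.compile` calls to module level restores the fast path.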


GitHub Actions Integration

```yaml
- name: Install FlameIQ
  run: pip install flameiq-core

- name: Restore baseline cache
  uses: actions/cache@v4
  with:
    path: .flameiq/
    key: flameiq-${{ github.base_ref }}

- name: Run benchmarks
  run: python run_benchmarks.py > metrics.json

- name: Check for regressions
  run: flameiq compare --metrics metrics.json --fail-on-regression
```

Key Design Decisions

Deterministic by design
Given identical inputs, FlameIQ always produces identical outputs. No randomness, no network calls, no datetime.now(). Safe for any CI environment including air-gapped infrastructure.

No vendor dependency
Baselines are local JSON files. No SaaS account. No API keys. No telemetry. Your performance data stays on your infrastructure.

Direction-aware thresholds
FlameIQ knows that latency increases are regressions and throughput decreases are regressions. Thresholds are sign-aware per metric type — no manual configuration required for known metrics.
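As a sketch of what sign-aware comparison means (the direction table and function below are my illustration, not FlameIQ's internals):

```python
# Direction of "worse" per metric: +1 means an increase is a regression
# (latency, memory); -1 means a decrease is (throughput).
DIRECTION = {
    "latency.p95": +1,
    "latency.p99": +1,
    "memory_mb": +1,
    "throughput": -1,
}

def is_regression(name: str, baseline: float, current: float,
                  threshold_pct: float) -> bool:
    """Flag a regression only when the change moves in the 'worse'
    direction for this metric and exceeds the threshold."""
    change_pct = (current - baseline) / baseline * 100
    return change_pct * DIRECTION[name] > threshold_pct

assert is_regression("latency.p95", 2.45, 4.51, 10.0)     # +84% latency
assert is_regression("throughput", 412.3, 231.5, 5.0)     # -44% throughput
assert not is_regression("latency.p95", 2.45, 2.00, 10.0) # latency improved
```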

Statistical mode
For noisy benchmark environments, FlameIQ can apply the Mann-Whitney U test alongside threshold comparison. A regression is only declared if both the threshold is exceeded and the result is statistically significant.
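For reference, here is a dependency-free sketch of the two-sided Mann-Whitney U test using the normal approximation; FlameIQ's actual implementation may differ (e.g. tie correction, exact small-sample tables):

```python
import math

def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test via the normal approximation.
    Returns (U, p). Reasonable for samples of ~20+ points; ties get
    average ranks (no tie correction applied here)."""
    combined = sorted((v, i) for i, v in enumerate(list(a) + list(b)))
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank for the tie group
        for k in range(i, j + 1):
            ranks[combined[k][1]] = avg_rank
        i = j + 1
    n1, n2 = len(a), len(b)
    r1 = sum(ranks[:n1])            # rank sum of the first sample
    u1 = r1 - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma if sigma else 0.0
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return u, p
```

Under this scheme, clearly separated baseline and current latency samples yield a small p-value, while overlapping noisy samples do not, which is exactly the false-positive filter the statistical mode provides.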

Versioned schema
The metrics schema is versioned (currently v1) with a formal specification. The threshold algorithm and statistical methodology are both fully documented in /specs.
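A consumer of the format can gate on the version field before trusting the payload; a minimal sketch (the function and error messages are mine, not FlameIQ's API):

```python
SUPPORTED_SCHEMA_VERSIONS = {1}

def validate_metrics(doc: dict) -> dict:
    """Reject unknown schema versions and missing sections up front."""
    version = doc.get("schema_version")
    if version not in SUPPORTED_SCHEMA_VERSIONS:
        raise ValueError(f"unsupported schema_version: {version!r}")
    if not isinstance(doc.get("metrics"), dict):
        raise ValueError("missing or invalid 'metrics' section")
    return doc
```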


HTML Reports

```shell
flameiq report --metrics current.json --output report.html
```

Generates a self-contained HTML report with a full metric diff table, regression highlights, and trend analysis. No external assets — works offline.


Configuration

flameiq.yaml (created by flameiq init):

```yaml
thresholds:
  latency.p95:   10%    # Allow up to 10% latency increase
  latency.p99:   15%
  throughput:    -5%    # Allow up to 5% throughput decrease
  memory_mb:      8%

baseline:
  strategy: rolling_median
  rolling_window: 5

statistics:
  enabled: false
  confidence: 0.95

provider: json
```
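With `strategy: rolling_median`, the baseline for each metric is presumably the median of the last `rolling_window` runs, which keeps a single noisy run from shifting the baseline. A sketch of that idea (not FlameIQ's code):

```python
import statistics

def rolling_median_baseline(history, window=5):
    """Baseline per metric: median of the most recent `window` runs.
    `history` is a list of {metric_name: value} dicts, oldest first."""
    recent = history[-window:]
    names = recent[-1].keys()
    return {name: statistics.median(run[name] for run in recent)
            for name in names}

history = [
    {"latency.p95": 2.40},
    {"latency.p95": 2.50},
    {"latency.p95": 9.90},  # one noisy run
    {"latency.p95": 2.45},
    {"latency.p95": 2.42},
]
baseline = rolling_median_baseline(history, window=5)
# The outlier run does not drag the baseline up.
```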

Try the Demo

We built a demo project — demo-flameiq — that walks through the full regression detection workflow using a real Python library:

👉 https://github.com/flameiq/demo-flameiq



Feedback, issues, and contributions are welcome. If you have caught a regression with FlameIQ or have a use case we haven't considered, open an issue or start a discussion on GitHub.


Tags: python opensource devtools ci performance
