
angufibo lincoln

Introducing FlameIQ — Deterministic Performance Regression Detection for Python

The Problem

Performance regressions are invisible in code review.

A careless refactor that recompiles a regex on every function call. A new dependency that adds 40ms to your p95 latency. A database query that quietly stops using an index. None of these show up in a diff. They accumulate silently across hundreds of commits — a 3ms latency increase here, a 2% throughput drop there — until they add up to an expensive production incident.

Type checkers enforce correctness automatically. Linters enforce style automatically. Nothing enforces performance — until now.


Introducing FlameIQ

Today we are releasing FlameIQ v1.0.0 — an open-source, deterministic, CI-native performance regression engine for Python.

```shell
pip install flameiq-core
```

FlameIQ compares your current benchmark results against a stored baseline and fails your CI pipeline if any metric exceeds its configured threshold — the same way a type checker fails your build on a type error.


Quick Start

Step 1 — Initialise

```shell
cd my-project
flameiq init
```

Step 2 — Run your benchmarks and produce a metrics file (e.g. benchmark.json, matching the v1 schema)

```json
{
  "schema_version": 1,
  "metadata": {
    "commit": "abc123",
    "branch": "main",
    "environment": "ci"
  },
  "metrics": {
    "latency": {
      "mean": 120.5,
      "p95": 180.0,
      "p99": 240.0
    },
    "throughput": 950.2,
    "memory_mb": 512.0
  }
}
```
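FlameIQ does not prescribe a benchmark runner; any script that emits this JSON shape works. A minimal sketch of such a harness (the helper names and hard-coded metadata here are illustrative, not part of FlameIQ):

```python
import json
import statistics
import time

def percentile(samples, pct):
    """Nearest-rank percentile of a sample list."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(round(pct / 100 * len(ordered))))
    return ordered[index]

def run_benchmark(fn, iterations=1000):
    """Time fn() repeatedly; return per-call latencies in milliseconds."""
    latencies = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies

def write_metrics(latencies, path="benchmark.json"):
    """Write the latencies out in the v1 metrics schema shown above."""
    metrics = {
        "schema_version": 1,
        "metadata": {"commit": "abc123", "branch": "main", "environment": "ci"},
        "metrics": {
            "latency": {
                "mean": statistics.mean(latencies),
                "p95": percentile(latencies, 95),
                "p99": percentile(latencies, 99),
            },
            # Calls per second, derived from total wall time in ms.
            "throughput": 1000 * len(latencies) / sum(latencies),
        },
    }
    with open(path, "w") as f:
        json.dump(metrics, f, indent=2)
```

In CI you would fill `metadata` from the environment (e.g. the commit SHA) rather than hard-coding it.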

Step 3 — Set a baseline

```shell
flameiq baseline set --metrics benchmark.json
```

Step 4 — Compare on every PR

```shell
flameiq compare --metrics current.json --fail-on-regression
```

Output:

```
  Metric           Baseline    Current      Change   Threshold  Status
  ────────────────────────────────────────────────────────────────────
  latency.p95       2.45 ms     4.51 ms     +84.08%    ±10.0%  REGRESSION
  throughput        412.30      231.50      -43.84%    ±10.0%  REGRESSION

  ✗ REGRESSION — 2 metric(s) exceeded threshold.
```

Exit code 1. Pipeline fails. Regression caught before merge.


A Real Example: Catching a Regex Regression

Here is the kind of bug FlameIQ is designed to catch. A developer refactors a text processing function and accidentally recompiles the regex on every call:

```python
import re

# FAST — original implementation
def clean(text: str) -> str:
    text = re.sub(r"[^\w\s]", "", text)   # Python caches compiled regex
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()

# SLOW — regressed implementation
def clean(text: str) -> str:
    punct_re = re.compile(r"[^\w\s]")     # recompiled on every call!
    space_re = re.compile(r"\s+")         # recompiled on every call!
    text = punct_re.sub("", text)
    text = space_re.sub(" ", text).strip()
    return text.lower()
```

This is invisible in code review. The logic is identical. The diff looks clean.
FlameIQ catches it with an 84% p95 latency increase — well above the 10% threshold.
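You can reproduce the effect locally with `timeit`; the exact ratio depends on the interpreter, input, and platform, so treat this as an illustration rather than a benchmark:

```python
import re
import timeit

def clean_fast(text: str) -> str:
    # String patterns passed to re.sub hit re's internal compiled-pattern cache.
    text = re.sub(r"[^\w\s]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()

def clean_slow(text: str) -> str:
    # re.compile runs on every call; even with re's pattern cache this
    # adds per-call lookup and construction overhead.
    punct_re = re.compile(r"[^\w\s]")
    space_re = re.compile(r"\s+")
    text = punct_re.sub("", text)
    text = space_re.sub(" ", text).strip()
    return text.lower()

sample = "Hello,   World!  This is a    test..."
assert clean_fast(sample) == clean_slow(sample)  # logic is identical

fast = timeit.timeit(lambda: clean_fast(sample), number=10_000)
slow = timeit.timeit(lambda: clean_slow(sample), number=10_000)
print(f"fast {fast:.3f}s  slow {slow:.3f}s  ratio {slow / fast:.2f}x")
```

Moving the `re.compile` calls to module level restores the fast path.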


GitHub Actions Integration

```yaml
- name: Install FlameIQ
  run: pip install flameiq-core

- name: Restore baseline cache
  uses: actions/cache@v4
  with:
    path: .flameiq/
    key: flameiq-${{ github.base_ref }}

- name: Run benchmarks
  run: python run_benchmarks.py > metrics.json

- name: Check for regressions
  run: flameiq compare --metrics metrics.json --fail-on-regression
```

Key Design Decisions

Deterministic by design
Given identical inputs, FlameIQ always produces identical outputs. No randomness, no network calls, no datetime.now(). Safe for any CI environment including air-gapped infrastructure.

No vendor dependency
Baselines are local JSON files. No SaaS account. No API keys. No telemetry. Your performance data stays on your infrastructure.

Direction-aware thresholds
FlameIQ knows that latency increases are regressions and throughput decreases are regressions. Thresholds are sign-aware per metric type — no manual configuration required for known metrics.
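As a sketch of what sign-aware comparison means (the direction table and function below are my illustration, not FlameIQ's internals):

```python
# Direction of "worse" per metric: +1 means an increase is a regression
# (latency, memory); -1 means a decrease is (throughput).
DIRECTION = {
    "latency.p95": +1,
    "latency.p99": +1,
    "memory_mb": +1,
    "throughput": -1,
}

def is_regression(name: str, baseline: float, current: float,
                  threshold_pct: float) -> bool:
    """Flag a regression only when the change moves in the 'worse'
    direction for this metric and exceeds the threshold."""
    change_pct = (current - baseline) / baseline * 100
    return change_pct * DIRECTION[name] > threshold_pct

assert is_regression("latency.p95", 2.45, 4.51, 10.0)     # +84% latency
assert is_regression("throughput", 412.3, 231.5, 5.0)     # -44% throughput
assert not is_regression("latency.p95", 2.45, 2.00, 10.0) # latency improved
```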

Statistical mode
For noisy benchmark environments, FlameIQ can apply the Mann-Whitney U test alongside threshold comparison. A regression is only declared if both the threshold is exceeded and the result is statistically significant.
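For reference, here is a dependency-free sketch of the two-sided Mann-Whitney U test using the normal approximation; FlameIQ's actual implementation may differ (e.g. tie correction, exact small-sample tables):

```python
import math

def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test via the normal approximation.
    Returns (U, p). Reasonable for samples of ~20+ points; ties get
    average ranks (no tie correction applied here)."""
    combined = sorted((v, i) for i, v in enumerate(list(a) + list(b)))
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank for the tie group
        for k in range(i, j + 1):
            ranks[combined[k][1]] = avg_rank
        i = j + 1
    n1, n2 = len(a), len(b)
    r1 = sum(ranks[:n1])            # rank sum of the first sample
    u1 = r1 - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma if sigma else 0.0
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return u, p
```

Under this scheme, clearly separated baseline and current latency samples yield a small p-value, while overlapping noisy samples do not, which is exactly the false-positive filter the statistical mode provides.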

Versioned schema
The metrics schema is versioned (currently v1) with a formal specification. The threshold algorithm and statistical methodology are both fully documented in /specs.
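A consumer of the format can gate on the version field before trusting the payload; a minimal sketch (the function and error messages are mine, not FlameIQ's API):

```python
SUPPORTED_SCHEMA_VERSIONS = {1}

def validate_metrics(doc: dict) -> dict:
    """Reject unknown schema versions and missing sections up front."""
    version = doc.get("schema_version")
    if version not in SUPPORTED_SCHEMA_VERSIONS:
        raise ValueError(f"unsupported schema_version: {version!r}")
    if not isinstance(doc.get("metrics"), dict):
        raise ValueError("missing or invalid 'metrics' section")
    return doc
```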


HTML Reports

```shell
flameiq report --metrics current.json --output report.html
```

Generates a self-contained HTML report with a full metric diff table, regression highlights, and trend analysis. No external assets — works offline.


Configuration

flameiq.yaml (created by flameiq init):

```yaml
thresholds:
  latency.p95:   10%    # Allow up to 10% latency increase
  latency.p99:   15%
  throughput:    -5%    # Allow up to 5% throughput decrease
  memory_mb:      8%

baseline:
  strategy: rolling_median
  rolling_window: 5

statistics:
  enabled: false
  confidence: 0.95

provider: json
```
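With `strategy: rolling_median`, the baseline for each metric is presumably the median of the last `rolling_window` runs, which keeps a single noisy run from shifting the baseline. A sketch of that idea (not FlameIQ's code):

```python
import statistics

def rolling_median_baseline(history, window=5):
    """Baseline per metric: median of the most recent `window` runs.
    `history` is a list of {metric_name: value} dicts, oldest first."""
    recent = history[-window:]
    names = recent[-1].keys()
    return {name: statistics.median(run[name] for run in recent)
            for name in names}

history = [
    {"latency.p95": 2.40},
    {"latency.p95": 2.50},
    {"latency.p95": 9.90},  # one noisy run
    {"latency.p95": 2.45},
    {"latency.p95": 2.42},
]
baseline = rolling_median_baseline(history, window=5)
# The outlier run does not drag the baseline up.
```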

Try the Demo

We built a demo project — demo-flameiq — that walks through the full regression detection workflow using a real Python library:

👉 https://github.com/flameiq/demo-flameiq



Feedback, issues, and contributions are welcome. If you have caught a regression with FlameIQ or have a use case we haven't considered, open an issue or start a discussion on GitHub.


Tags: python opensource devtools ci performance
