The Problem
Performance regressions are invisible in code review.
A careless refactor that recompiles a regex on every function call. A new dependency that adds 40ms to your p95 latency. A database query that misses an index. None of these show up in a diff. They accumulate silently across hundreds of commits — a 3ms latency increase here, a 2% throughput drop there — until they become expensive production incidents.
Type checkers enforce correctness automatically. Linters enforce style automatically. Nothing enforces performance — until now.
Introducing FlameIQ
Today we are releasing FlameIQ v1.0.0 — an open-source, deterministic, CI-native performance regression engine for Python.
pip install flameiq-core
FlameIQ compares your current benchmark results against a stored baseline and fails your CI pipeline if any metric exceeds its configured threshold — the same way a type checker fails your build on a type error.
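At its core this is a relative-change check per metric. A minimal sketch of the idea (hypothetical helpers, not FlameIQ's actual code):

```python
def pct_change(baseline: float, current: float) -> float:
    """Relative change of current vs. baseline, in percent."""
    return (current - baseline) / baseline * 100.0

def exceeds_threshold(baseline: float, current: float, threshold_pct: float) -> bool:
    """True if the metric moved more than threshold_pct in either direction."""
    return abs(pct_change(baseline, current)) > threshold_pct

# p95 latency going from 180 ms to 210 ms against a ±10% threshold:
print(exceeds_threshold(180.0, 210.0, 10.0))  # prints True (+16.7% change)
```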
Quick Start
Step 1 — Initialise
cd my-project
flameiq init
Step 2 — Run your benchmarks and produce a metrics file
```json
{
  "schema_version": 1,
  "metadata": {
    "commit": "abc123",
    "branch": "main",
    "environment": "ci"
  },
  "metrics": {
    "latency": {
      "mean": 120.5,
      "p95": 180.0,
      "p99": 240.0
    },
    "throughput": 950.2,
    "memory_mb": 512.0
  }
}
```
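FlameIQ doesn't care how you benchmark; any harness that can emit this JSON shape works. A sketch of summarising raw latency samples into a conforming file (the percentile helper is illustrative, not part of FlameIQ):

```python
import json
import statistics

def write_metrics(samples_ms: list[float], throughput: float, memory_mb: float,
                  commit: str, branch: str, path: str = "benchmark.json") -> None:
    """Summarise raw latency samples into the schema v1 metrics shape."""
    samples = sorted(samples_ms)

    def _pct(p: float) -> float:
        # Nearest-rank style percentile over the sorted sample (illustrative).
        return samples[min(len(samples) - 1, int(p * len(samples)))]

    doc = {
        "schema_version": 1,
        "metadata": {"commit": commit, "branch": branch, "environment": "ci"},
        "metrics": {
            "latency": {
                "mean": statistics.fmean(samples),
                "p95": _pct(0.95),
                "p99": _pct(0.99),
            },
            "throughput": throughput,
            "memory_mb": memory_mb,
        },
    }
    with open(path, "w") as f:
        json.dump(doc, f, indent=2)
```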
Step 3 — Set a baseline
flameiq baseline set --metrics benchmark.json
Step 4 — Compare on every PR
flameiq compare --metrics current.json --fail-on-regression
Output:
```
Metric         Baseline    Current    Change     Threshold   Status
───────────────────────────────────────────────────────────────────
latency.p95    2.45 ms     4.51 ms    +84.08%    ±10.0%      REGRESSION
throughput     412.30      231.50     -43.84%    ±10.0%      REGRESSION

✗ REGRESSION — 2 metric(s) exceeded threshold.
```
Exit code 1. Pipeline fails. Regression caught before merge.
A Real Example: Catching a Regex Regression
Here is the kind of bug FlameIQ is designed to catch. A developer refactors a text-processing function and accidentally moves regex construction into the per-call path:
```python
import re

# FAST — original implementation: patterns compiled once, at import time
_PUNCT_RE = re.compile(r"[^\w\s]")
_SPACE_RE = re.compile(r"\s+")

def clean(text: str) -> str:
    text = _PUNCT_RE.sub("", text)
    text = _SPACE_RE.sub(" ", text).strip()
    return text.lower()
```

```python
# SLOW — regressed implementation
def clean(text: str) -> str:
    punct_re = re.compile(r"[^\w\s]")  # avoidable work on every call!
    space_re = re.compile(r"\s+")      # avoidable work on every call!
    text = punct_re.sub("", text)
    text = space_re.sub(" ", text).strip()
    return text.lower()
```
This is invisible in code review. The logic is identical. The diff looks clean.
FlameIQ catches it with an 84% p95 latency increase — well above the 10% threshold.
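You can reproduce the gap locally with the standard library's `timeit` before FlameIQ ever sees it. A sketch (function names and the sample input are illustrative; the measured ratio will vary by machine):

```python
import re
import timeit

_PUNCT_RE = re.compile(r"[^\w\s]")
_SPACE_RE = re.compile(r"\s+")

def clean_fast(text: str) -> str:
    # Patterns compiled once, at import time.
    text = _PUNCT_RE.sub("", text)
    text = _SPACE_RE.sub(" ", text).strip()
    return text.lower()

def clean_slow(text: str) -> str:
    # Regex construction sits in the per-call path.
    text = re.compile(r"[^\w\s]").sub("", text)
    text = re.compile(r"\s+").sub(" ", text).strip()
    return text.lower()

sample = "Hello,   World!! This  is a   test..."
fast = timeit.timeit(lambda: clean_fast(sample), number=50_000)
slow = timeit.timeit(lambda: clean_slow(sample), number=50_000)
print(f"fast {fast:.3f}s  slow {slow:.3f}s  ratio {slow / fast:.2f}x")
```

Both versions return identical results, which is exactly why the diff sails through review.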
GitHub Actions Integration
```yaml
- name: Install FlameIQ
  run: pip install flameiq-core

- name: Restore baseline cache
  uses: actions/cache@v4
  with:
    path: .flameiq/
    key: flameiq-${{ github.base_ref }}

- name: Run benchmarks
  run: python run_benchmarks.py > metrics.json

- name: Check for regressions
  run: flameiq compare --metrics metrics.json --fail-on-regression
```
Key Design Decisions
Deterministic by design
Given identical inputs, FlameIQ always produces identical outputs. No randomness, no network calls, no datetime.now(). Safe for any CI environment including air-gapped infrastructure.
No vendor dependency
Baselines are local JSON files. No SaaS account. No API keys. No telemetry. Your performance data stays on your infrastructure.
Direction-aware thresholds
FlameIQ knows that latency increases are regressions and throughput decreases are regressions. Thresholds are sign-aware per metric type — no manual configuration required for known metrics.
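Sketched in miniature, sign-awareness is just a per-metric direction flag (illustrative, not FlameIQ's internals):

```python
# Direction of "worse" per metric family: +1 means an increase is a
# regression (latency, memory), -1 means a decrease is one (throughput).
DIRECTION = {"latency": +1, "memory": +1, "throughput": -1}

def is_regression(family: str, baseline: float, current: float,
                  threshold_pct: float) -> bool:
    change_pct = (current - baseline) / baseline * 100.0
    # Only movement in the "worse" direction can trip the threshold.
    return change_pct * DIRECTION[family] > threshold_pct

print(is_regression("latency", 180.0, 210.0, 10.0))     # prints True
print(is_regression("throughput", 950.0, 1100.0, 5.0))  # prints False: improvement
```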
Statistical mode
For noisy benchmark environments, FlameIQ can apply the Mann-Whitney U test alongside threshold comparison. A regression is only declared if both the threshold is exceeded and the result is statistically significant.
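To make the two-gate idea concrete, here is a sketch with a hand-rolled Mann-Whitney U test using the normal approximation, for a latency-style metric where increases are bad (FlameIQ's exact methodology is documented in /specs; with SciPy available you would reach for `scipy.stats.mannwhitneyu` instead):

```python
import math

def mann_whitney_u_p(xs: list[float], ys: list[float]) -> float:
    """Two-sided p-value via the normal approximation (no tie correction)."""
    n1, n2 = len(xs), len(ys)
    # Rank the pooled samples, averaging ranks across ties.
    pooled = sorted((v, i) for i, v in enumerate(xs + ys))
    ranks = [0.0] * (n1 + n2)
    k = 0
    while k < len(pooled):
        j = k
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[k][0]:
            j += 1
        avg = (k + j) / 2 + 1  # ranks are 1-based
        for m in range(k, j + 1):
            ranks[pooled[m][1]] = avg
        k = j + 1
    r1 = sum(ranks[:n1])
    u1 = r1 - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu) / sigma
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided

def regression(baseline: list[float], current: list[float],
               threshold_pct: float, alpha: float = 0.05) -> bool:
    """Flag only if the mean shift exceeds the threshold AND is significant."""
    mean_b = sum(baseline) / len(baseline)
    mean_c = sum(current) / len(current)
    change_pct = (mean_c - mean_b) / mean_b * 100.0  # latency-style: up is worse
    return change_pct > threshold_pct and mann_whitney_u_p(baseline, current) < alpha
```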
Versioned schema
The metrics schema is versioned (currently v1) with a formal specification. The threshold algorithm and statistical methodology are both fully documented in /specs.
HTML Reports
flameiq report --metrics current.json --output report.html
Generates a self-contained HTML report with a full metric diff table, regression highlights, and trend analysis. No external assets — works offline.
Configuration
flameiq.yaml (created by flameiq init):
```yaml
thresholds:
  latency.p95: 10%   # Allow up to 10% latency increase
  latency.p99: 15%
  throughput: -5%    # Allow up to 5% throughput decrease
  memory_mb: 8%

baseline:
  strategy: rolling_median
  rolling_window: 5

statistics:
  enabled: false
  confidence: 0.95

provider: json
```
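The `rolling_median` strategy reads as: the baseline for each metric is the median of the last `rolling_window` accepted runs, so a single noisy run barely moves it. A sketch under that reading (not FlameIQ's exact code):

```python
from statistics import median

def rolling_baseline(history: list[float], window: int = 5) -> float:
    """Baseline for a metric: median of the most recent `window` runs."""
    return median(history[-window:])

# A single noisy spike among recent p95 samples barely moves the baseline.
runs = [180.0, 182.0, 179.0, 310.0, 181.0]
print(rolling_baseline(runs))  # prints 181.0
```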
Try the Demo
We built a demo project — demo-flameiq — that walks through the full regression-detection workflow using a real Python library:
👉 https://github.com/flameiq/demo-flameiq
Links
- PyPI: https://pypi.org/project/flameiq-core/
- Documentation: https://flameiq-core.readthedocs.io
- Source: https://github.com/flameiq/flameiq-core
- Demo project: https://github.com/flameiq/demo-flameiq
Feedback, issues, and contributions are welcome. If you have caught a regression with FlameIQ or have a use case we haven't considered, open an issue or start a discussion on GitHub.
Tags: python opensource devtools ci performance