How I Built BreachAlpha: Quantifying Cybersecurity Breach Impact Using Event Study Methodology

#cybersecurity #opensource #python #showdev

A few months ago I kept running into the same wall while talking to security practitioners: they had solid technical evidence of a breach's severity but no credible financial number to bring to business stakeholders. I decided to fix that.
The result is BreachAlpha, an open-source tool that uses event study methodology to measure how breaches move stock prices, and an XGBoost classifier to predict breach severity.

The methodology (why it is actually rigorous)

Event study methodology comes from financial economics. The idea is simple: isolate the impact of a specific event on an asset's price by comparing actual returns to expected returns (based on the market's movement). The difference is the "abnormal return."

For breaches, the math is:
AR = R_stock - R_market
CAR = sum of AR over event window

When Equifax disclosed the 2017 breach, the market dropped that week too. Event study separates the market-wide drop from the Equifax-specific drop. The CAR over a (-5, +30) trading day window gives you the net financial impact attributable to the breach.
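To make the arithmetic concrete, here is a minimal sketch of the market-adjusted calculation in plain Python. The function names and the return figures are made up for illustration; they are not taken from the BreachAlpha codebase or from real Equifax prices.

```python
from typing import Sequence


def abnormal_returns(stock: Sequence[float], market: Sequence[float]) -> list[float]:
    """AR_t = R_stock,t - R_market,t for each day in the event window."""
    if len(stock) != len(market):
        raise ValueError("stock and market return series must have the same length")
    return [s - m for s, m in zip(stock, market)]


def cumulative_abnormal_return(stock: Sequence[float], market: Sequence[float]) -> float:
    """CAR = sum of AR over the event window."""
    return sum(abnormal_returns(stock, market))


# Hypothetical daily returns for a 3-day (-1, +1) window, as decimal fractions.
stock_returns = [-0.010, -0.130, -0.020]
market_returns = [-0.005, -0.010, 0.002]

car = cumulative_abnormal_return(stock_returns, market_returns)
print(f"CAR(-1,+1) = {car:.3f}")  # prints: CAR(-1,+1) = -0.147
```

In this toy window the stock fell about 16% in total, but 1.3 points of that was the market moving, so the breach-attributable figure is about -14.7%.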

The market prices in company size, sector dynamics, and breach-specific context, so a market-based estimate is specific to the company that was breached. That makes it more honest than parametric cost models that rely on averages.

Architecture overview

    breachalpha/   FastAPI + XGBoost backend
    frontend/      React + Vite + Tailwind
    tests/         144 tests, 11 modules

The feature engine computes five core signals:

Abnormal return at Day 0, 1, 5, 30
CAR over (-1,+1) and (-5,+30) windows
Volatility spike (ratio of post-breach to pre-breach realized vol)
Volume change
Recovery time in trading days
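
Two of the signals in this list can be sketched in a few lines. The volatility ratio follows the definition given above. The recovery rule (first close at or above the last pre-event close) is an assumption made for illustration, since the exact definition is not spelled out here, and all function names are hypothetical.

```python
from statistics import stdev
from typing import Optional, Sequence


def realized_vol(returns: Sequence[float]) -> float:
    """Sample standard deviation of daily returns (not annualized)."""
    return stdev(returns)


def volatility_spike(pre: Sequence[float], post: Sequence[float]) -> float:
    """Ratio of post-breach to pre-breach realized volatility."""
    return realized_vol(post) / realized_vol(pre)


def recovery_days(closes: Sequence[float], event_idx: int) -> Optional[int]:
    """Trading days from the event until the close is back at or above the
    last pre-event close; None if that never happens within the series.
    ASSUMPTION: this recovery rule is illustrative only."""
    if event_idx < 1:
        raise ValueError("need at least one pre-event close")
    baseline = closes[event_idx - 1]
    for offset, price in enumerate(closes[event_idx:]):
        if price >= baseline:
            return offset
    return None


# Hypothetical inputs: post-event returns swing twice as hard as pre-event ones.
pre_returns = [0.01, -0.01, 0.01, -0.01]
post_returns = [0.02, -0.02, 0.02, -0.02]
spike = volatility_spike(pre_returns, post_returns)

closes = [100.0, 90.0, 92.0, 97.0, 101.0]  # event happens at index 1
days = recovery_days(closes, event_idx=1)
print(round(spike, 6), days)
```

A flat pre-event window would make the ratio divide by zero, so a real implementation needs a guard for that case.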

These five signals feed an XGBoost classifier that outputs a Low/Medium/High/Critical severity label plus a 0-100 risk score, calculated as a weighted sum of the per-class probabilities.
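
As a rough sketch of what a weighted probability sum can look like: the weights below (evenly spaced from 0 to 100) are an assumption chosen for illustration, not the values BreachAlpha uses, and `risk_score` is a hypothetical name.

```python
# ASSUMPTION: evenly spaced class weights; the real weights may differ.
SEVERITY_WEIGHTS = {
    "Low": 0.0,
    "Medium": 100 / 3,
    "High": 200 / 3,
    "Critical": 100.0,
}


def risk_score(probabilities: dict[str, float]) -> float:
    """Collapse per-class probabilities into a single 0-100 score."""
    if set(probabilities) != set(SEVERITY_WEIGHTS):
        raise ValueError("expected exactly the four severity classes")
    if abs(sum(probabilities.values()) - 1.0) > 1e-6:
        raise ValueError("class probabilities must sum to 1")
    return sum(SEVERITY_WEIGHTS[label] * p for label, p in probabilities.items())


# Hypothetical classifier output for one breach.
example = {"Low": 0.05, "Medium": 0.15, "High": 0.30, "Critical": 0.50}
score = risk_score(example)
print(round(score, 1))
```

With these weights a model that is certain of "Low" scores 0, one that is certain of "Critical" scores 100, and uncertainty between neighboring classes lands in between.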

Stock data pipeline

Reliable stock data is harder than it sounds. Yahoo Finance rate limits aggressively. So I built a four-source fallback chain:

    sources = [
        YFinanceSource(),        # primary, Chrome TLS fingerprint
        AlphaVantageSource(),    # fallback, 25 free calls/day
        NSEIndiaSource(),        # .NS/.BO tickers
        YahooScrapingSource(),   # last resort HTML scrape
    ]

Each source implements fetch() and supports_ticker(). The fetcher gates each source before calling it, so NSE India never tries to resolve a NASDAQ ticker.
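
The gating logic can be sketched like this. Only the method names `fetch()` and `supports_ticker()` come from the description above; their signatures, the `fetch_prices` helper, and the stub sources are assumptions for illustration.

```python
from typing import Optional, Protocol


class PriceSource(Protocol):
    # ASSUMPTION: signatures are guesses; only the method names are given.
    def supports_ticker(self, ticker: str) -> bool: ...
    def fetch(self, ticker: str) -> Optional[list[float]]: ...


def fetch_prices(ticker: str, sources: list[PriceSource]) -> list[float]:
    """Try each source in order, skipping any that do not support the ticker."""
    errors = []
    for source in sources:
        if not source.supports_ticker(ticker):
            continue  # gate: never call a source that cannot resolve this ticker
        try:
            data = source.fetch(ticker)
        except Exception as exc:  # a real fetcher would catch narrower errors
            errors.append(f"{type(source).__name__}: {exc}")
            continue
        if data:
            return data
    raise LookupError(f"no source returned data for {ticker!r}; errors: {errors}")


# Hypothetical stand-ins for real sources.
class RateLimitedStub:
    def supports_ticker(self, ticker: str) -> bool:
        return True

    def fetch(self, ticker: str) -> list[float]:
        raise RuntimeError("429 Too Many Requests")


class IndiaOnlyStub:
    def supports_ticker(self, ticker: str) -> bool:
        return ticker.endswith((".NS", ".BO"))

    def fetch(self, ticker: str) -> list[float]:
        return [1.0, 2.0]


class StaticStub:
    def supports_ticker(self, ticker: str) -> bool:
        return True

    def fetch(self, ticker: str) -> list[float]:
        return [100.0, 101.5]


chain = [RateLimitedStub(), IndiaOnlyStub(), StaticStub()]
prices = fetch_prices("EFX", chain)
print(prices)  # the first stub fails, the second is skipped, the third answers
```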
Stock data is cached locally with a 24h TTL. In testing this cut API calls by around 80% on repeated runs.
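
A 24h TTL cache does not need much code. The sketch below stores one JSON file per ticker; the file layout, the `cached_fetch` name, and the injectable `now` clock are assumptions, not the project's actual cache.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Callable

CACHE_TTL_SECONDS = 24 * 60 * 60  # 24h, as described above


def cached_fetch(
    ticker: str,
    fetch: Callable[[str], list],
    cache_dir: str,
    ttl: float = CACHE_TTL_SECONDS,
    now: Callable[[], float] = time.time,
) -> list:
    """Return cached prices if they are younger than ttl, else fetch and store."""
    path = Path(cache_dir) / f"{ticker}.json"
    if path.exists():
        entry = json.loads(path.read_text())
        if now() - entry["fetched_at"] < ttl:
            return entry["prices"]
    prices = fetch(ticker)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"fetched_at": now(), "prices": prices}))
    return prices


# Hypothetical fetcher that records how often it is actually called.
calls = []


def fake_fetch(ticker: str) -> list:
    calls.append(ticker)
    return [100.0, 101.5]


with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch("EFX", fake_fetch, tmp)   # miss: hits the fetcher
    second = cached_fetch("EFX", fake_fetch, tmp)  # hit: served from disk
    # Pretend two days have passed so the entry counts as expired.
    later = lambda: time.time() + 2 * CACHE_TTL_SECONDS
    third = cached_fetch("EFX", fake_fetch, tmp, now=later)

print(len(calls))  # the fetcher ran for the first and third call only
```

Passing the clock in as a parameter keeps expiry testable without sleeping.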

Three engineering decisions worth stealing

Decouple domain exceptions from HTTP

Services raise BreachAlphaError subclasses (TickerNotFoundError, InsufficientDataError, etc.). A single global exception handler in server.py translates them to HTTP status codes. Business logic never imports from FastAPI.
This means services are fully testable without spinning up a web server, and switching frameworks later would be a one-file change.
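
Here is a framework-free sketch of that split. The exception names come from the text above; the status codes, the `STATUS_BY_ERROR` table, and the `to_http` helper are assumptions for illustration. In FastAPI, a handler registered with `@app.exception_handler(BreachAlphaError)` could call something like `to_http` and wrap the result in a `JSONResponse`.

```python
class BreachAlphaError(Exception):
    """Base class for domain errors; it knows nothing about HTTP."""


class TickerNotFoundError(BreachAlphaError):
    pass


class InsufficientDataError(BreachAlphaError):
    pass


# ASSUMPTION: these status codes are illustrative, not the project's mapping.
STATUS_BY_ERROR = {
    TickerNotFoundError: 404,
    InsufficientDataError: 422,
}


def to_http(exc: BreachAlphaError) -> tuple[int, dict]:
    """Translate a domain error into (status_code, JSON body).

    Walks the exception's MRO so subclasses inherit their parent's mapping;
    anything unmapped falls back to 500.
    """
    body = {"error": type(exc).__name__, "detail": str(exc)}
    for cls in type(exc).__mro__:
        if cls in STATUS_BY_ERROR:
            return STATUS_BY_ERROR[cls], body
    return 500, body


status, body = to_http(TickerNotFoundError("no ticker found for 'Acme Corp'"))
print(status, body["error"])
```

The service layer only ever sees the exception classes, so its tests can assert on `TickerNotFoundError` directly without an HTTP client.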

Route factories with injected dependencies

    from fastapi import APIRouter
    from slowapi import Limiter  # assumption: Limiter comes from slowapi; swap in your own

    def create_score_routes(limiter: Limiter) -> APIRouter:
        router = APIRouter()

        # ... route definitions

        return router

The rate limiter gets injected, not imported as a global. Tests pass a mock limiter. This pattern scales well as the number of route modules grows.
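
ProcessPoolExecutor for CPU-bound feature computation

Feature computation is CPU-heavy. Async/await and threads do not help here because of the GIL, which lets only one thread run Python bytecode at a time. ProcessPoolExecutor runs the work in separate processes, so it actually parallelizes across cores:

    from concurrent.futures import ProcessPoolExecutor

    with ProcessPoolExecutor() as executor:
        # submit() returns immediately; result() blocks until the worker finishes
        future = executor.submit(compute_features, price_data, breach_date)
        features = future.result()
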
On a 4-core machine this roughly halves computation time for batch scoring.
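
For batch scoring, the speedup comes from having several jobs in flight at once. A minimal sketch with `executor.map`, where `compute_features_stub` and `score_batch` are hypothetical stand-ins rather than BreachAlpha functions:

```python
from concurrent.futures import ProcessPoolExecutor
from typing import Iterable, Optional


def compute_features_stub(job: tuple[str, list[float]]) -> tuple[str, float]:
    """Stand-in for the real feature computation: returns (ticker, summed AR).

    It must be a module-level function so worker processes can import it.
    """
    ticker, abnormal = job
    return ticker, sum(abnormal)


def score_batch(
    jobs: Iterable[tuple[str, list[float]]],
    max_workers: Optional[int] = None,
) -> dict[str, float]:
    """Fan the jobs out across worker processes and collect results by ticker."""
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        # map() schedules all jobs up front and yields results in input order.
        return dict(executor.map(compute_features_stub, jobs))
```

Call `score_batch` from under an `if __name__ == "__main__":` guard. Platforms that start workers by spawning a fresh interpreter re-import the main module, and without the guard that re-import would try to create the pool again.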

API surface

The core endpoints:

    POST /api/score            # score a single company
    POST /api/score/auto       # auto-search breach data then score
    POST /api/explain          # step-by-step calculation breakdown
    POST /api/upload/analyze   # batch score from CSV/XLSX
    GET  /api/breach-search    # search breach incidents

Example curl:

    curl -X POST http://localhost:8000/api/score \
      -H "Content-Type: application/json" \
      -d '{
        "company": "Equifax",
        "breach_type": "data_leak",
        "records_affected": 147000000,
        "breach_date": "2017-09-07"
      }'

Response includes risk score, severity prediction, confidence, per-class probabilities, and all the raw feature values so you can audit the calculation.

Running it locally

    git clone https://github.com/AshayK003/BreachAlpha.git
    cd BreachAlpha
    python -m venv .venv && source .venv/bin/activate
    pip install -e ".[dev]"
    uvicorn breachalpha.server:app --reload --port 8000

    # separate terminal
    cd frontend && npm install && npm run dev

Frontend at localhost:3000, backend at localhost:8000. The model bootstraps on synthetic data in about 2 seconds the first time.

What I want to improve

The biggest limitation right now is the training data. Synthetic data works for the interface and for demos, but a model trained on real, labeled breach events would be significantly more accurate. If you have access to structured historical breach data (VCDB, OSF DataBreaches, similar), I would love to collaborate.
Sector-adjusted baselines are also on the list. A breach hitting a healthcare company has a different risk profile than the same breach at a retail chain, and the model should reflect that.

Contributing

Contributions need to pass the 144-test suite, and coverage is enforced at a 60% minimum. Main contribution areas right now:

Expanding the known tickers dictionary (currently 200+ companies)
Additional data sources
Real breach training data
Docker Compose setup for easier deployment

If you work in security research, quant finance, or you are building anything around cyber risk quantification, I would genuinely appreciate feedback on the methodology and the feature set.
Repo Link: https://github.com/AshayK003/BreachAlpha