YMori

Posted on Mar 4 • Edited on Jul 5

Beyond Marcel: Adding Bayesian Regression to NPB Baseball Predictions — A 15-Step Journey

#baseball #python #bayesian #datascience

Introduction

In a previous project, I built an NPB (Nippon Professional Baseball) player projection system using the Marcel method — a simple "3-year weighted average + regression to the mean" approach.

GitHub: npb-prediction

Marcel performed surprisingly well: it beat ML models (LightGBM/XGBoost) for pitcher ERA prediction. But it had clear limitations:

Limitation	Marcel's approach
New foreign players	Use league average (can't use previous league stats)
Point estimates only	No uncertainty quantification
Uniform age adjustment	+0.3%/year for all players
Ignores skill metrics	Can't leverage K% or BB%

To address these, I started a new project: Bayesian regression on top of Marcel.

GitHub: npb-bayes-projection

Here's what happened over 15 steps.

Key Terms (for first-time readers)

Term	Meaning
Marcel method	A simple projection method using a 3-year weighted average of past stats (weights: 5:4:3, recent years weighted higher)
Bayesian regression (Stan)	A statistical method that estimates probability distributions from data, capturing uncertainty in predictions
wOBA	Weighted On-Base Average — a batting metric that weights walks, singles, doubles, and home runs differently
ERA	Earned Run Average — earned runs allowed per 9 innings. A basic pitcher performance metric
K% (strikeout rate)	Percentage of plate appearances ending in a strikeout. More stable year-to-year than ERA
BB% (walk rate)	Percentage of plate appearances ending in a walk
BABIP	Batting Average on Balls In Play. Seasons far from ~0.300 suggest luck is a factor
MAE	Mean Absolute Error — average prediction miss. Lower is better
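The Marcel and MAE entries above can be sketched in a few lines. This is a minimal illustration, not the code from npb-prediction: the 5:4:3 weights come from the table, while the function names, the `regress_pa=1200` regression amount, and the sample numbers are assumptions, and the age adjustment is left out.

```python
def marcel_rate(rates, pas, league_rate, weights=(5, 4, 3), regress_pa=1200):
    """Project a rate stat from up to three past seasons, most recent first.

    rates/pas: per-season rate and plate appearances, most recent first.
    regress_pa: amount of league-average PA blended in (assumed value).
    """
    weighted_pa = sum(w * pa for w, pa in zip(weights, pas))
    weighted_sum = sum(w * pa * r for w, pa, r in zip(weights, pas, rates))
    # Regression to the mean: add `regress_pa` plate appearances of league average.
    return (weighted_sum + regress_pa * league_rate) / (weighted_pa + regress_pa)


def mae(predicted, actual):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical hitter: wOBA and PA for the last three seasons, most recent first.
projection = marcel_rate(
    rates=[0.350, 0.330, 0.310], pas=[500, 450, 400], league_rate=0.310
)
print(round(projection, 4))
```

The fewer plate appearances a player has, the more the league-average term dominates, which is why low-PA projections sit close to the mean.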

Step 1: Foreign Player Conversion Factors

NPB teams sign foreign players every year. If we can convert their previous-league stats to NPB scale, we should beat league-average predictions.

What I did

Identified 365 foreign players in NPB (2015-2025)
Matched 231 to FanGraphs using name normalization
Computed conversion ratios:

League	wOBA ratio (hitters)	ERA ratio (pitchers)	n (hitters / pitchers)
MLB→NPB	1.235	0.579	56 / 74
AAA→NPB	1.271	0.462	9 / 6

For MLB arrivals, hitter wOBA rises by ~24% and pitcher ERA falls by ~42%, reflecting the level difference between MLB and NPB.
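As a rough sketch of how such ratios are used, the snippet below scales a previous-league stat to NPB scale. The ratios are copied from the table. The helper names and the ratio-of-means estimator are assumptions, since the post does not say exactly how the ratios were computed.

```python
# Ratios copied from the table above (previous league -> NPB).
WOBA_RATIO = {"MLB": 1.235, "AAA": 1.271}
ERA_RATIO = {"MLB": 0.579, "AAA": 0.462}


def to_npb_scale(value, league, ratios):
    """Scale a previous-league stat to NPB scale with a multiplicative ratio."""
    return value * ratios[league]


def conversion_ratio(prev_values, npb_values):
    """One plausible estimator: ratio of means over matched players (assumption)."""
    prev_mean = sum(prev_values) / len(prev_values)
    npb_mean = sum(npb_values) / len(npb_values)
    return npb_mean / prev_mean


# Hypothetical inputs: a .300 wOBA hitter and a 4.50 ERA pitcher coming from MLB.
print(round(to_npb_scale(0.300, "MLB", WOBA_RATIO), 3))
print(round(to_npb_scale(4.50, "MLB", ERA_RATIO), 2))
```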

Lesson

Raw conversion factors performed worse than the baseline (league average). wOBA/ERA are "outcome metrics" heavily dependent on environment. I pivoted to using them as priors in a Bayesian model.

Steps 2-4: From PyMC to Stan — Discovering Skill Metrics

PyMC Hierarchical Model (failed)

The hierarchical model learned a shrinkage weight of w ≈ 0.12, which means it all but ignored previous-league stats.

Stan v1 — K%/BB% Features (success)

The key insight: K% and BB% are environment-independent skill metrics, unlike wOBA/ERA.

Hitter: npb_wOBA = lg_avg + β_woba·z_woba + β_K·z_K + β_BB·z_BB + noise
Pitcher: npb_ERA = lg_avg + β_era·z_era + β_fip·z_fip + β_K·z_K + β_BB·z_BB + noise
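To make the hitter formula concrete, here is a small NumPy stand-in rather than the Stan model itself. It assumes a Normal(0, prior_sd²) prior on each β and a known noise scale, which gives the posterior in closed form. The priors, noise level, and synthetic data are all hypothetical.

```python
import numpy as np


def zscore(x):
    """Standardise a feature to mean 0, standard deviation 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()


def posterior_beta(Z, y, lg_avg, prior_sd=0.05, noise_sd=0.03):
    """Posterior mean and covariance of beta for y = lg_avg + Z @ beta + noise.

    Assumes beta ~ Normal(0, prior_sd^2) and a known noise_sd (both hypothetical).
    """
    resid = np.asarray(y, dtype=float) - lg_avg
    precision = Z.T @ Z / noise_sd**2 + np.eye(Z.shape[1]) / prior_sd**2
    cov = np.linalg.inv(precision)
    mean = cov @ Z.T @ resid / noise_sd**2
    return mean, cov


rng = np.random.default_rng(0)
n = 200
# Synthetic previous-league features: wOBA, K%, BB% (not real player data).
Z = np.column_stack([zscore(rng.normal(size=n)) for _ in range(3)])
true_beta = np.array([0.010, -0.008, 0.005])
y = 0.320 + Z @ true_beta + rng.normal(scale=0.03, size=n)

beta_mean, beta_cov = posterior_beta(Z, y, lg_avg=0.320)
print(beta_mean.round(4), np.sqrt(np.diag(beta_cov)).round(4))
```

The diagonal of the covariance is what a point-estimate method cannot give: a standard deviation for each coefficient, and through it an interval around each projection.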

Results (2020-2025 backtest):

Model	MAE	Baseline MAE	MAE change vs. baseline
Hitter v0 (wOBA only)	0.0330	0.0337	-2.1%
Hitter v1 (+K%/BB%)	0.0325	0.0337	-3.8%
Pitcher v0 (ERA only)	0.749	0.749	±0%
Pitcher v1 (+K%/BB%/FIP)	0.736	0.749	-1.7%

Steps 5-6: Data Enrichment & Team Projections

Data Growth

Master database: 365 → 393 players
FanGraphs matches: 231 → 253
Hitter improvement: -3.8% → -5.1%

Monte Carlo Team Simulation

Add per-player noise (σ = Marcel backtest MAE)
Aggregate team RS/RA
Pythagorean expectation (exp=1.83) → win totals
Repeat 10,000 times

Backtest (2018-2025, 96 team-seasons): MAE = 6.41 wins, 80% CI coverage = 86.5%
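The four simulation steps can be sketched as follows. This is a simplified illustration that works directly in runs per player. The 1.83 exponent and the 10,000 repetitions come from the list above, while the 143-game season constant, the function name, and the roster numbers are assumptions.

```python
import numpy as np

GAMES = 143          # NPB regular-season length (assumed constant here)
PYTH_EXP = 1.83      # exponent quoted in the steps above


def simulate_wins(player_rs, player_ra, sigma_rs, sigma_ra, n_sims=10_000, seed=0):
    """Return simulated win totals for one team.

    player_rs / player_ra: projected runs contributed / allowed per player.
    sigma_rs / sigma_ra: per-player noise scale (the post uses Marcel backtest MAE).
    """
    rng = np.random.default_rng(seed)
    # Step 1: add independent noise to every player in every simulation.
    rs = rng.normal(player_rs, sigma_rs, size=(n_sims, len(player_rs)))
    ra = rng.normal(player_ra, sigma_ra, size=(n_sims, len(player_ra)))
    # Step 2: aggregate to team runs scored / allowed (kept positive).
    rs = np.clip(rs.sum(axis=1), 1.0, None)
    ra = np.clip(ra.sum(axis=1), 1.0, None)
    # Step 3: Pythagorean expectation -> win totals.
    win_pct = rs**PYTH_EXP / (rs**PYTH_EXP + ra**PYTH_EXP)
    return GAMES * win_pct


# Hypothetical team: 9 hitters creating runs, 10 pitchers allowing runs.
wins = simulate_wins(np.full(9, 60.0), np.full(10, 52.0), sigma_rs=8.0, sigma_ra=8.0)
low, high = np.percentile(wins, [10, 90])   # 80% interval
print(round(wins.mean(), 1), round(low, 1), round(high, 1))
```

The 10th and 90th percentiles of the simulated wins form the 80% interval whose coverage the backtest checks.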

Steps 7-9: Japanese Player Stan Model — The Real Challenge

Foreign players number only about 90 per year. The real target was the 1,300+ Japanese players.

Model Design

Hitter: actual_wOBA = Marcel_wOBA + δ_K·z_K + δ_BB·z_BB + δ_BABIP·z_babip + noise
Pitcher: actual_ERA = Marcel_ERA + δ_K·z_K + δ_BB·z_BB + noise

Key finding: K%/BB% are already embedded in wOBA (BB is a direct component), so they add little beyond what Marcel's wOBA already captures. The signal came instead from BABIP (the luck component): an unusually high BABIP tends to regress the following year.
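One way to picture the hitter formula is as a regression of the Marcel residual (actual minus Marcel) on the standardised features. The sketch below uses ordinary least squares on synthetic data in which only BABIP carries signal. It stands in for the Stan fit, and every number in it is made up.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
# Synthetic standardised features: previous-season K%, BB%, BABIP (not real data).
Z = rng.normal(size=(n, 3))
marcel_woba = rng.normal(0.320, 0.025, size=n)
true_delta = np.array([0.0, 0.0, -0.006])      # only BABIP carries signal here
actual_woba = marcel_woba + Z @ true_delta + rng.normal(scale=0.03, size=n)

# Regress the Marcel residual on the features (no intercept, as in the formula).
delta, *_ = np.linalg.lstsq(Z, actual_woba - marcel_woba, rcond=None)
corrected = marcel_woba + Z @ delta

mae_marcel = np.mean(np.abs(actual_woba - marcel_woba))
mae_corrected = np.mean(np.abs(actual_woba - corrected))
print(delta.round(4), round(mae_marcel, 5), round(mae_corrected, 5))
```

A negative δ_BABIP pulls the projection down for a hitter whose BABIP was above average, which is the regression effect described above.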

Scaling Fix

Independent RS/RA scaling was canceling Stan's systematic improvements. I developed marcel_anchored scaling instead: ΔMAE improved from -0.063 to -0.154 (2.4x better).

Steps 10-11: Pursuing Statistical Significance

I used Ridge regression to approximate the Stan Bayesian model, which made leave-one-out cross-validation (LOO-CV) fast.
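Ridge is a natural stand-in because its coefficients equal the posterior mean of a linear model with a Gaussian prior and fixed noise scale, and its leave-one-out predictions have a closed form. The sketch below shows that shortcut on synthetic data. Whether the project used this exact shortcut is an assumption, and the penalty `lam` is arbitrary.

```python
import numpy as np


def ridge_loo_predictions(Z, y, lam=1.0):
    """Exact leave-one-out predictions for ridge regression without intercept."""
    A = Z.T @ Z + lam * np.eye(Z.shape[1])
    H = Z @ np.linalg.solve(A, Z.T)            # hat matrix
    fitted = H @ y
    # LOO residual = in-sample residual / (1 - leverage).
    loo_resid = (y - fitted) / (1.0 - np.diag(H))
    return y - loo_resid


def ridge_fit_predict(Z_train, y_train, z_new, lam=1.0):
    """Plain ridge fit and prediction, used only to check the shortcut."""
    A = Z_train.T @ Z_train + lam * np.eye(Z_train.shape[1])
    return z_new @ np.linalg.solve(A, Z_train.T @ y_train)


rng = np.random.default_rng(2)
Z = rng.normal(size=(50, 3))                    # synthetic features
y = Z @ np.array([0.5, -0.3, 0.0]) + rng.normal(scale=0.5, size=50)

loo = ridge_loo_predictions(Z, y)
# Brute-force check for the first row: refit without it and predict it.
brute = ridge_fit_predict(Z[1:], y[1:], Z[0])
print(np.isclose(loo[0], brute))
```

One matrix solve replaces n separate refits, which is what makes LOO-CV over a couple of thousand player-seasons cheap.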

Player-Level (2018-2025)

Metric	n	Marcel MAE	Stan MAE	p-value	Bootstrap
Hitter wOBA	2,208	0.05023	0.04980	0.060	97.1%
Pitcher ERA	2,164	1.23008	1.22241	0.057	97.1%
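A bootstrap figure like the one in the last column can be obtained by resampling player-seasons with replacement and counting how often one model has the lower MAE. The sketch below is one such paired bootstrap on synthetic errors. It assumes this is what the Bootstrap column means, which the post does not spell out.

```python
import numpy as np


def bootstrap_win_share(err_a, err_b, n_boot=2_000, seed=0):
    """Share of bootstrap resamples in which model B has lower MAE than model A.

    err_a / err_b: absolute errors of the two models on the same player-seasons.
    """
    rng = np.random.default_rng(seed)
    diff = np.asarray(err_a) - np.asarray(err_b)     # > 0 where B is better
    idx = rng.integers(0, len(diff), size=(n_boot, len(diff)))
    return float(np.mean(diff[idx].mean(axis=1) > 0))


rng = np.random.default_rng(3)
# Synthetic prediction errors; model B is built to be somewhat better on average.
e = rng.normal(scale=0.06, size=500)
err_marcel = np.abs(e)
err_stan = np.abs(0.9 * e + rng.normal(scale=0.005, size=500))

share = bootstrap_win_share(err_marcel, err_stan)
print(share)
```

Resampling the paired differences, rather than each model's errors separately, keeps the comparison on the same player-seasons.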

The 5-Feature Discovery

Adding K/9 and BB/9 to the pitcher model:

Model	p-value	Bootstrap
ERA (3 features: K%, BB%, age)	0.607	68.9%
ERA (5 features: +K/9, BB/9)	0.012	99.3%

K% (per plate appearance) and K/9 (per inning) carry different information — using both captures pitcher skill more accurately.
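A small worked example shows why the two rates differ. Both pitchers below are hypothetical, and innings are given as decimals rather than in the .1/.2 box-score notation.

```python
def k_pct(strikeouts, batters_faced):
    """Strikeouts per batter faced."""
    return strikeouts / batters_faced


def k_per_9(strikeouts, innings):
    """Strikeouts per nine innings pitched."""
    return 9 * strikeouts / innings


# Two hypothetical pitchers with the same K% but different numbers of baserunners.
efficient = {"so": 150, "bf": 600, "ip": 160.0}   # fewer batters per inning
wild = {"so": 150, "bf": 600, "ip": 140.0}        # more walks and hits per inning

for p in (efficient, wild):
    print(round(k_pct(p["so"], p["bf"]), 3), round(k_per_9(p["so"], p["ip"]), 2))
```

The second pitcher allows more baserunners, so he faces more batters per inning and posts a higher K/9 at an identical K%. The gap between the two rates therefore carries information about walks and hits allowed.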

Steps 12-14: Improving Team Predictions

FA (free agent) attribution fix: Assign players who changed teams to their correct teams
League-average imputation: Fill uncovered PA/IP with league averages
Coverage improvement: Added 14 missing birthdays → PA_cov +2pp
4 new features: pa_stability, ip_stability, prev_babip_p, prev_woba_dev_sq
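The league-average imputation step can be sketched as below for team wOBA. The function name, the roster, and the 5,400 team PA total are hypothetical. The four new features are not shown because the post does not define them.

```python
def team_woba_with_imputation(players, team_pa, lg_woba):
    """PA-weighted team wOBA, filling unprojected PA with the league average.

    players: list of (projected_pa, projected_woba) for covered players.
    team_pa: expected total team plate appearances.
    """
    covered_pa = sum(pa for pa, _ in players)
    uncovered_pa = max(team_pa - covered_pa, 0)
    weighted = sum(pa * woba for pa, woba in players) + uncovered_pa * lg_woba
    return weighted / (covered_pa + uncovered_pa)


# Hypothetical roster covering 4,000 of 5,400 team PA.
roster = [(600, 0.350), (550, 0.330), (500, 0.310)] + [(470, 0.300)] * 5
team_woba = team_woba_with_imputation(roster, team_pa=5400, lg_woba=0.310)
print(round(team_woba, 4))
```

Better coverage shrinks the imputed share, so more of the team total comes from actual player projections.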

Step 15: Hitting the Ceiling

The Paradox

Player-level: Stan > Marcel (p=0.06, Bootstrap 97%)
Team-level: Stan < Marcel (win MAE 0.198 higher, i.e. worse)

Root Cause: PA-Weighted Aggregation

Quartile	PA Range	Stan Win Rate
Q1 (low PA)	30-64	55% (Stan strongest)
Q2	65-157	46%
Q3	158-361	44%
Q4 (high PA)	362-685	49%

Stan excels for low-PA players where Marcel is unreliable, but team RS is PA-weighted — Q4 (regulars) dominates. For regulars, Marcel's 3-year weighted average is already quite accurate, leaving little room for K%/BB%/BABIP corrections.
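A back-of-the-envelope calculation shows how lopsided the weighting is. It uses the PA ranges from the table and makes a crude assumption: each quartile holds the same number of players, and the midpoint of its range stands in for its average PA.

```python
# PA ranges copied from the quartile table above.
pa_ranges = {"Q1": (30, 64), "Q2": (65, 157), "Q3": (158, 361), "Q4": (362, 685)}

# Crude assumption: equal player counts per quartile, range midpoint as average PA.
midpoints = {q: (lo + hi) / 2 for q, (lo, hi) in pa_ranges.items()}
total = sum(midpoints.values())
pa_share = {q: m / total for q, m in midpoints.items()}

for q, share in pa_share.items():
    print(q, f"{share:.1%}")
```

Under this approximation Q4 carries a bit over half of all plate appearances and Q1 about 5%, so Stan's 55% win rate in Q1 barely moves the team total.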

min_pa_team Sweep

min_pa_team	Marcel MAE	Stan MAE	Δ
0 (current)	6.725	6.923	+0.198
50	6.682	6.903	+0.221
100	6.644	6.868	+0.224

Filtering out low-PA players lowers MAE for both models, but the Stan-Marcel gap widens (+0.198 → +0.224). The structural problem can't be solved by filtering.

Conclusions

Final Results

Level	Marcel MAE	Stan MAE	Verdict
Player (wOBA)	0.05023	0.04980	Stan better (p=0.06)
Player (ERA)	1.23008	1.22241	Stan better (p=0.06)
Team (wins)	6.725	6.923	Marcel better

Four Takeaways

Marcel is a remarkably strong baseline. The 3-year weighted average is highly accurate for regular players.
K%/BB% are valuable as environment-independent skill metrics, especially for low-PA players and cross-league comparisons.
Player-level improvements don't automatically translate to team-level gains. PA-weighted aggregation is dominated by high-PA players where Marcel is already strong.
Without batted-ball and pitch-level data (barrel rate, exit velocity, whiff rate), the next wall can't be broken.

What's Next

The plan is to apply these learnings to MLB, where Statcast data provides the batted-ball and pitch-quality metrics that NPB's public data lacks.

Data sources: Baseball Data Freak, NPB Official
