YMori

Posted on Mar 13 • Edited on Jun 23

I Built a WBC Quarterfinal Scouting App with MLB Statcast Data

#baseball #python #streamlit #datascience

What I Built

A Streamlit scouting dashboard for the WBC 2026 Quarterfinal: Japan vs Venezuela.

App: https://wbc-qf-jpn-ven.streamlit.app/
GitHub: https://github.com/yasumorishima/wbc-scouting

For the pool round, I built 30 team-level dashboards (20 teams). But quarterfinals are head-to-head matchups — you want to know "which pitch type is effective against this batter?" and "which zone has the highest opponent BA against this pitcher?" in one place.

5-Tab Structure

🎯 Tab 1: Matchup Preview

Venezuela's predicted starting lineup (9 batters) table, an alert for Machado (NPB player, no Statcast data), and a bench/pinch-hit candidates table.

Each batter expands into a full scouting report:

6 key metrics (AVG/OBP/SLG/OPS/K%/BB%) with MLB average comparison
Radar chart (5-axis, MLB average line overlay)
Zone heatmaps (3x3, 5x5) — BA and xwOBA by zone, split by vs LHP/RHP
Spray charts — split by vs LHP/RHP

Platoon splits (OPS/AVG/K%/BB% side by side)
Pitching plan — overall + vs LHP + vs RHP. Auto-generated from pitch type whiff rates, zone-level BA, count-split OPS, and platoon data
Defensive positioning — auto-generated from spray angle, ground ball rate, and exit velocity, split by pitcher handedness
Pitch type performance table (BA, SLG, Whiff%, Chase%)
Count-based performance (color-coded: green=hitter ahead, red=behind, amber=even)

At the bottom, there's a full analysis section for the starting pitcher (Ranger Suárez, LHP) with hitting approach (as LHB/RHB), arsenal table, movement chart, location heatmaps, platoon splits, and pitch selection by count — all in collapsible expanders.

📋 Tab 2: Game Plan

Statcast data organized by game phase:

Team weakness detection — batters with K% ≥ 22.4% (MLB avg), BB% < 8.3%, or platoon OPS gap ≥ 80 pts, auto-extracted with player names and values
Innings 1-3 vs Suárez (starter) — Batting: SP's K%/BB%/Whiff%/velocity and pitch mix. Pitching: per-batter AVG/K%/BB% grouped by lineup position (#1-3, #4-6, #7-9)
Innings 4-5 (2nd time through or bullpen transition) — Batting: bridge reliever stats. Pitching: MLB league-wide trend (opp OPS rises 15-20% on 2nd time through) plus batter classification by K% and BB%
Innings 6+ (high-leverage) — Batting: closer/setup K%/Whiff%/Chase%/velocity with pitcher type classification. Pitching: platoon matchup data for batters with significant splits, full per-batter stat line
Pinch-hit candidates — bench player AVG/OPS/K%

Every piece of text is driven by MLB Statcast numbers only. No coaching instructions — just data.

⚔️ Tab 3: Lineup Scouting

Team batting radar chart at the top (AVG/OBP/SLG/K%/BB%, 5-axis, MLB average line overlay). Below that, a full roster table and a dropdown selector for individual player analysis (metrics, scouting summary, pitching plan, defensive positioning, radar chart, zone heatmaps, spray charts, etc.).

🎱 Tab 4: Starting Pitcher Analysis

Ranger Suárez's pitching data. Metric cards (avg velocity, avg spin, whiff%, chase%, put away%, opp avg, etc.) and scouting summary, plus collapsible expanders for:

Hitting approach (as LHB / as RHB)
Arsenal table (velocity mph/km/h, break, whiff%, put away%) + movement chart
Pitch location heatmap + platoon splits
Pitch selection by count (donut charts) + count-based performance

🔥 Tab 5: Bullpen Scouting

Bullpen overview (all relievers' ERA, K%, velocity in one info box), then a dropdown selector for individual reliever analysis. Same structure as Tab 4 (metric cards, scouting summary, hitting approach, arsenal, heatmaps, count analysis).

Technical Highlights

Dynamic text generation from raw Statcast data

Six generator functions compute per-player analysis from pitch-by-pitch data:

Function	Purpose
`generate_player_summary()`	Batter scouting summary (strengths/weaknesses)
`generate_pitcher_summary()`	Pitcher scouting summary
`generate_pitching_plan()`	How to pitch to a batter (pitch types, zones, counts, platoon)
`generate_hitting_plan()`	How to hit a pitcher (hittable pitches, zones, counts)
`generate_defensive_positioning()`	Infield/outfield shift recommendation from spray data
`generate_sp_pitch_analysis()`	Starting pitcher's pitch-by-pitch analysis

Each function calculates stats from raw Statcast data and outputs only items that cross statistical thresholds:

# Example: identify the pitch type with highest opponent BA
hittable = sorted(
    [p for p in pt_stats if p["ba"] is not None],
    key=lambda x: x["ba"], reverse=True
)
if hittable and hittable[0]["ba"] >= 0.250:
    h = hittable[0]
    lines.append(
        f"- **Highest opp BA pitch:** {h['label']}"
        f" (BA .{int(h['ba']*1000):03d})"
    )

MLB average as baseline for every stat

A raw number like "SLG .476" is meaningless without context. Every stat shows the MLB average alongside it:

K% 28.3% (MLB avg 22.4%)
BB% 6.1% (MLB avg 8.3%)

Handedness-aware zone names

"Inside" and "outside" flip depending on batter handedness. _zone_names_for_bats() automatically adjusts zone labels so "inside high" is always correct relative to the batter's stance.

Glossary built into every section

Every stat has a ? tooltip (Streamlit's help parameter) showing its definition and MLB average. Count displays include a reading guide ("Balls-Strikes" format) with color legend (🟢 hitter ahead, 🔴 hitter behind, 🟡 even).

Data Source

Baseball Savant Statcast data (2024-2025 MLB regular season)
Retrieved via pybaseball

I Built a WBC 2026 Scouting Dashboard with MLB Statcast Data

DEV Community