DEV Community

YMori

Originally published at yasumorishima.github.io

MLB Pitcher Arsenal Evolution Dataset (2020-2025)

Introduction

I've published a Kaggle dataset that tracks the year-over-year evolution of MLB pitchers' arsenals from 2020 to 2025. This dataset enables analysis of how pitchers adjust their pitch mix, velocity, and spin rates over time.

Dataset Link: https://www.kaggle.com/datasets/yasunorim/mlb-pitcher-arsenal-2020-2025

Dataset Overview

This dataset provides per-pitch-type metrics, aggregated from pitch-level data, for MLB pitchers across six seasons (2020-2025).

Basic Information

  • Period: 2020-2025 seasons (6 seasons)
  • Rows: 4,253 rows (pitcher × season combinations)
  • Columns: 111 columns
  • Format: Wide format (1 row = 1 pitcher × 1 season)
  • Filter: Only pitchers with 100+ pitches in a season
  • Quality Score: 10.0/10 on Kaggle

Data Source

Data is sourced from MLB Advanced Media (Statcast) via the pybaseball library and aggregated by pitcher × season × pitch type.

Data Structure

Identifier Columns (3 columns)

  • player_id: MLB player ID
  • player_name: Player name
  • season: Season year (2020-2025)

Pitch Metrics (18 pitch types × 6 metrics = 108 columns)

For each pitch type, the dataset includes 6 metrics:

  1. usage_pct: Usage rate (0-100%)
  2. avg_speed: Average velocity (mph)
  3. avg_spin: Average spin rate (rpm)
  4. whiff_rate: Whiff rate (0-1)
  5. avg_pfx_x: Average horizontal movement (inches, gravity-adjusted)
  6. avg_pfx_z: Average vertical movement (inches, gravity-adjusted)
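As a rough illustration of how metrics like these can be derived from pitch-level rows, here is a minimal sketch. It is not the dataset's actual build script. The input column names (`pitcher`, `game_year`, `pitch_type`, `release_speed`, `release_spin_rate`, `description`) follow Statcast's CSV export, the whiff definition (swinging strikes divided by swings) is an assumption, the movement columns are left out for brevity, and the input rows are made up.

```python
import pandas as pd

# Made-up pitch-level rows; real input would be Statcast data (e.g. via pybaseball).
pitches = pd.DataFrame({
    "pitcher": [1, 1, 1, 1, 1],
    "game_year": [2024, 2024, 2024, 2024, 2024],
    "pitch_type": ["FF", "FF", "FF", "SL", "SL"],
    "release_speed": [95.0, 96.0, 94.0, 85.0, 87.0],
    "release_spin_rate": [2300.0, 2350.0, 2250.0, 2500.0, 2600.0],
    "description": ["swinging_strike", "foul", "ball",
                    "swinging_strike", "hit_into_play"],
})

# Assumed outcome groupings, not confirmed by the dataset description.
WHIFFS = {"swinging_strike", "swinging_strike_blocked"}
SWINGS = WHIFFS | {"foul", "foul_tip", "hit_into_play"}


def aggregate_arsenal(pitches: pd.DataFrame) -> pd.DataFrame:
    """Aggregate pitch-level rows to one row per pitcher x season x pitch type."""
    flagged = pitches.assign(
        is_swing=pitches["description"].isin(SWINGS),
        is_whiff=pitches["description"].isin(WHIFFS),
    )
    out = (
        flagged.groupby(["pitcher", "game_year", "pitch_type"])
        .agg(
            n_pitches=("release_speed", "size"),
            avg_speed=("release_speed", "mean"),
            avg_spin=("release_spin_rate", "mean"),
            swings=("is_swing", "sum"),
            whiffs=("is_whiff", "sum"),
        )
        .reset_index()
    )
    season_total = out.groupby(["pitcher", "game_year"])["n_pitches"].transform("sum")
    out["usage_pct"] = 100 * out["n_pitches"] / season_total  # 0-100 scale
    out["whiff_rate"] = out["whiffs"] / out["swings"]         # 0-1 scale
    return out


long_df = aggregate_arsenal(pitches)

# Pivot to a wide layout: one row per pitcher x season, columns like FF_usage_pct.
wide = long_df.pivot(index=["pitcher", "game_year"], columns="pitch_type",
                     values=["usage_pct", "avg_speed", "avg_spin", "whiff_rate"])
wide.columns = [f"{pitch}_{metric}" for metric, pitch in wide.columns]
wide = wide.reset_index()
```

With real data, a pitch-count filter (the dataset keeps pitchers with 100+ pitches in a season) would be applied on the per-season total before pivoting.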

Pitch Types (18 types)

FF (Four-seam), SI (Sinker), FC (Cutter), SL (Slider), CU (Curve), CH (Changeup), FS (Splitter), KC (Knuckle Curve), FO (Forkball), EP (Eephus), KN (Knuckleball), ST (Sweeper), SV (Slurve), and more.
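Because each pitch type contributes its own block of columns, some analyses are easier on a long table with one row per pitcher × season × pitch type. The sketch below assumes columns are named `<PITCH>_<metric>` (for example `SL_usage_pct`) and that pitch types a pitcher did not throw are stored as missing values. Both are assumptions to verify against the actual file, and `to_long` is a hypothetical helper, not part of the dataset.

```python
import pandas as pd

ID_COLS = ["player_id", "player_name", "season"]


def to_long(wide: pd.DataFrame) -> pd.DataFrame:
    """Reshape the wide table to one row per pitcher x season x pitch type."""
    melted = wide.melt(id_vars=ID_COLS, var_name="column", value_name="value")
    # Split e.g. "SL_usage_pct" into pitch type "SL" and metric "usage_pct".
    parts = melted["column"].str.split("_", n=1, expand=True)
    melted["pitch_type"] = parts[0]
    melted["metric"] = parts[1]
    long_df = (
        melted.pivot(index=ID_COLS + ["pitch_type"], columns="metric", values="value")
        .dropna(how="all")  # drop pitch types the pitcher never threw
        .reset_index()
    )
    long_df.columns.name = None
    return long_df


# Tiny made-up frame standing in for the real CSV.
demo = pd.DataFrame({
    "player_id": [1], "player_name": ["Demo, Pitcher"], "season": [2024],
    "FF_usage_pct": [60.0], "FF_avg_speed": [95.0],
    "SL_usage_pct": [40.0], "SL_avg_speed": [86.0],
    "KN_usage_pct": [float("nan")], "KN_avg_speed": [float("nan")],
})
long_demo = to_long(demo)
```

In this toy example the knuckleball row disappears because all of its metrics are missing, leaving one row each for FF and SL.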

Usage

Downloading the Data

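Download the CSV with the Kaggle CLI and load it with pandas. The `!` prefix runs a shell command from a notebook cell; in a terminal, run the command without it.

import pandas as pd

# Download via Kaggle API and extract the archive into the current directory
!kaggle datasets download -d yasunorim/mlb-pitcher-arsenal-2020-2025 --unzip

# Load data
df = pd.read_csv('pitcher_arsenal_evolution_2020_2025.csv')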
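After loading, a quick structural check can confirm that the file matches the description above. This is a minimal sketch: `check_structure` is a hypothetical helper, and the small frame below is made up so the example runs without the CSV.

```python
import pandas as pd

ID_COLS = ["player_id", "player_name", "season"]


def check_structure(df: pd.DataFrame) -> dict:
    """Summarize row count, metric column count, key uniqueness and seasons."""
    metric_cols = [c for c in df.columns if c not in ID_COLS]
    return {
        "n_rows": len(df),
        "n_metric_cols": len(metric_cols),
        # Wide format means each (player_id, season) pair should appear once.
        "unique_pitcher_seasons": not df.duplicated(["player_id", "season"]).any(),
        "seasons": sorted(df["season"].unique().tolist()),
    }


# Made-up stand-in for the real data.
demo = pd.DataFrame({
    "player_id": [1, 1, 2],
    "player_name": ["Pitcher A", "Pitcher A", "Pitcher B"],
    "season": [2020, 2021, 2021],
    "FF_usage_pct": [55.0, 50.0, 40.0],
    "FF_avg_speed": [94.1, 94.6, 92.3],
})
report = check_structure(demo)
```

On the real file, the description above implies 4,253 rows, 108 metric columns, unique pitcher-season pairs, and seasons 2020 through 2025.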

Analysis Notebook

A comprehensive analysis notebook is also available:

https://www.kaggle.com/code/yasunorim/pitcher-arsenal-analysis

The notebook includes:

  • Individual pitcher arsenal trend analysis
  • MLB-wide pitch type trends (2020-2025)
  • Velocity change analysis
  • Correlation heatmaps

Use Cases

1. Pitcher Arsenal Trend Analysis

Track year-over-year changes in a specific pitcher's repertoire:

# Yusei Kikuchi's pitch usage evolution (sorted so the x-axis runs in season order)
kikuchi = df[df['player_name'].str.contains('Kikuchi', case=False, na=False)].sort_values('season')
kikuchi.plot(x='season', y=['SL_usage_pct', 'FF_usage_pct', 'CH_usage_pct'], marker='o')

For Kikuchi, slider usage increased from ~20% in 2019 (a season outside this dataset's 2020-2025 range) to over 40% in 2022-2025, a significant shift in pitch mix strategy.
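To put numbers on a shift like this, the season-to-season difference can be computed directly. The sketch below uses a hypothetical helper, `usage_change`, on made-up values; the differences are taken per `player_id` so that a name fragment matching several pitchers does not mix their seasons.

```python
import pandas as pd


def usage_change(df: pd.DataFrame, name_fragment: str, column: str) -> pd.DataFrame:
    """Return season-ordered values of `column` with year-over-year differences."""
    mask = df["player_name"].str.contains(name_fragment, case=False, na=False)
    player = df.loc[mask, ["player_id", "season", column]].sort_values(
        ["player_id", "season"]
    )
    # Difference from the previous available season for the same pitcher.
    player["yoy_change"] = player.groupby("player_id")[column].diff()
    return player.reset_index(drop=True)


# Made-up values, deliberately out of season order.
demo = pd.DataFrame({
    "player_id": [10, 10, 10],
    "player_name": ["Demo, Pitcher"] * 3,
    "season": [2021, 2022, 2020],
    "SL_usage_pct": [20.0, 30.0, 15.0],
})
changes = usage_change(demo, "demo", "SL_usage_pct")
```

Note that `diff()` compares against the previous row, so a pitcher who missed a season would show a two-year change on the following row.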

2. MLB-Wide Trend Analysis

Visualize league-wide pitch type trends:

# Calculate yearly average usage rates
yearly_avg = df.groupby('season')[['FF_usage_pct', 'SI_usage_pct', 'SL_usage_pct']].mean()
yearly_avg.plot(kind='line', marker='o')

Recent trends show increasing usage of sliders and sweepers across MLB.
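One caveat when averaging usage columns: if pitch types a pitcher does not throw are stored as missing values (an assumption about this file), `mean()` skips them and reports the average among pitchers who throw the pitch, not across the whole league. The sketch below, with a hypothetical `league_usage` helper and made-up numbers, shows both readings.

```python
import pandas as pd


def league_usage(df: pd.DataFrame, pitch_types, treat_missing_as_zero: bool = True) -> pd.DataFrame:
    """Average usage per season for the given pitch types."""
    cols = [f"{p}_usage_pct" for p in pitch_types]
    usage = df[["season"] + cols]
    if treat_missing_as_zero:
        # Count a pitch a pitcher never throws as 0% usage.
        usage = usage.fillna(0)
    return usage.groupby("season")[cols].mean()


# Two made-up pitchers: only one throws a sweeper (ST).
demo = pd.DataFrame({
    "season": [2024, 2024],
    "SL_usage_pct": [20.0, 40.0],
    "ST_usage_pct": [30.0, float("nan")],
})
among_throwers = league_usage(demo, ["SL", "ST"], treat_missing_as_zero=False)
across_all = league_usage(demo, ["SL", "ST"], treat_missing_as_zero=True)
```

Here the sweeper average is 30% among pitchers who throw it but 15% across both pitchers, so it is worth stating which definition a trend chart uses. Both are unweighted averages over pitchers, not pitch-count-weighted league totals.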

3. Machine Learning Features

Use arsenal metrics as features for pitcher performance prediction:

# Pitch diversity: number of pitch types thrown more than 5% of the time
usage_cols = [col for col in df.columns if col.endswith('_usage_pct')]
df['pitch_diversity'] = (df[usage_cols] > 5).sum(axis=1)
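A threshold count treats a 6% pitch and a 50% pitch the same. As an alternative feature (a swapped-in technique, not something the dataset or notebook is known to use), Shannon entropy of the usage distribution captures how evenly a pitcher spreads usage across pitch types. The sketch below assumes usage columns end in `_usage_pct` and are on a 0-100 scale; `usage_entropy` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def usage_entropy(df: pd.DataFrame) -> pd.Series:
    """Shannon entropy (in bits) of each row's pitch usage distribution."""
    usage_cols = [c for c in df.columns if c.endswith("_usage_pct")]
    p = df[usage_cols].fillna(0) / 100.0
    # Renormalize so each row sums to 1 even if rounding left it slightly off.
    p = p.div(p.sum(axis=1), axis=0)
    # Replace zeros with 1 inside the log so that 0 * log2(0) contributes 0.
    log_p = np.log2(p.where(p > 0, 1.0))
    return -(p * log_p).sum(axis=1)


# Made-up rows: an even two-pitch mix and a single-pitch arsenal.
demo = pd.DataFrame({
    "FF_usage_pct": [50.0, 100.0],
    "SL_usage_pct": [50.0, float("nan")],
})
entropy = usage_entropy(demo)
```

An even two-pitch mix gives 1 bit and a single-pitch arsenal gives 0, so higher values indicate a more balanced mix.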

Related Datasets

I've published other MLB datasets on Kaggle. All of them have a quality score of 10.0/10.
