DEV Community

Aditya Kumar Pandey

Benchmarking Polars, DuckDB & Dask for RADIS: My GSoC 2026 Proposal Deep Dive

I've spent the last few months diving deep into RADIS — one of the
fastest open-source line-by-line spectroscopic codes available.
RADIS can simulate high-resolution infrared spectra of molecules
like CO₂, H₂O, and CH₄, and it's used by researchers studying
combustion diagnostics, exoplanet atmospheres, and plasma physics.

But while contributing to the codebase, I discovered a critical
problem hiding underneath the performance — one that could
eventually break RADIS for large databases entirely.

This post is my deep dive into that problem, the solution I'm
proposing for GSoC 2026, and the technical work I've already done
to prove it works.


The Problem: RADIS is Sitting on a Time Bomb

1. Vaex is Unmaintained

RADIS currently uses Vaex for lazy loading of large spectroscopic
databases. Vaex is a brilliant library — it uses memory mapping and
zero-copy lazy computations to handle datasets that don't fit in RAM.

But here's the uncomfortable truth: Vaex is no longer actively
maintained.

The last meaningful Vaex release was in 2023. There are no bug fixes,
no security patches, and compatibility with Python 3.13+ is broken.
RADIS currently requires vaex>=4.13, and as Vaex breaks with newer
Python releases (which is already starting to happen), RADIS users
will be completely unable to load large databases like HITEMP CO₂.

This is not a hypothetical risk. It is happening right now.

2. The Databases Are Getting Huge

Spectroscopic databases have grown dramatically:

| Database | Size | Status |
| --- | --- | --- |
| HITRAN CO | ~160K lines | Fits in RAM easily |
| HITEMP CO | ~1.1M lines | Benefits from lazy loading |
| HITEMP CO₂ | 100M+ lines, ~50GB | Cannot fit in memory |
| ExoMol | 10B+ lines | Requires streaming |

For HITEMP CO₂ — the database researchers need most for combustion
and climate modeling — Vaex's memory mapping loads the ENTIRE 50GB
file before any filtering happens. This creates a 3+ hour parsing
time just to start a calculation.

3. Dual Code Paths = Bugs and Inconsistencies

RADIS currently maintains parallel code paths for Pandas and Vaex
DataFrames. The config["DATAFRAME_ENGINE"] setting switches between
them, but many functions have if/else branches for both formats.

For example, set_broadening_coef() in the ExoMol pipeline defines
broadening coefficients as NumPy arrays — but with Vaex, these could
and should be lazy arrays. This dual maintenance creates subtle bugs
and makes the codebase harder to extend (see Issue #746).


The Solution: Polars + A Clean Abstraction Layer

My GSoC 2026 proposal introduces two things that solve all three
problems above simultaneously.

A. The DataFrameAdapter Pattern

Instead of having RADIS code call Pandas or Vaex APIs directly, I'm
introducing a DataFrameAdapter abstraction layer — an abstract
base class that all RADIS calculation code uses exclusively.

class DataFrameAdapter(ABC):
    @abstractmethod
    def load(self, path, columns=None): ...

    @abstractmethod
    def filter_range(self, col, wmin, wmax): ...

    @abstractmethod
    def select_columns(self, cols): ...

    @abstractmethod
    def compute(self): ...

    @abstractmethod
    def to_pandas(self): ...

    @abstractmethod
    def to_numpy(self): ...
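To make the pattern concrete, here is a minimal sketch of what an eager Pandas-backed adapter could look like. This is my own illustration, not the PR #981 code; the method bodies (and the abridged interface) are assumptions:

```python
from abc import ABC, abstractmethod

import pandas as pd


class DataFrameAdapter(ABC):
    # abridged version of the interface shown above
    @abstractmethod
    def filter_range(self, col, wmin, wmax): ...

    @abstractmethod
    def select_columns(self, cols): ...

    @abstractmethod
    def to_pandas(self): ...


class PandasAdapter(DataFrameAdapter):
    """Eager fallback: every operation runs immediately on an in-memory frame."""

    def __init__(self, df: pd.DataFrame):
        self._df = df

    def filter_range(self, col, wmin, wmax):
        # inclusive range filter, mirroring a wavenumber-window selection
        return PandasAdapter(self._df[self._df[col].between(wmin, wmax)])

    def select_columns(self, cols):
        return PandasAdapter(self._df[cols])

    def to_pandas(self):
        return self._df
```

Calculation code only ever talks to the adapter, so swapping in a lazy backend changes nothing upstream.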

This means:

  • PolarAdapter → Primary backend using Polars LazyFrame
  • PandasAdapter → Legacy fallback for backward compatibility
  • DuckDBAdapter → Secondary candidate with SQL interface
  • DaskAdapter → Optional backend for distributed cluster computing

The key insight: if a better library appears in 5 years, only a
new adapter class is needed — zero changes to RADIS calculation code.
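One way to wire that up is a small registry keyed by the config value. This is a hypothetical sketch; `register_adapter` and `get_adapter` are names I invented for illustration, not RADIS API:

```python
# Hypothetical adapter registry: config["DATAFRAME_ENGINE"] picks the class.
_ADAPTERS = {}


def register_adapter(name):
    """Class decorator that records a backend under a config name."""
    def decorator(cls):
        _ADAPTERS[name] = cls
        return cls
    return decorator


@register_adapter("pandas")
class PandasAdapter:
    pass  # legacy eager backend (body elided)


@register_adapter("polars")
class PolarAdapter:
    pass  # lazy Polars backend (body elided)


def get_adapter(engine: str):
    """Resolve the configured engine name to an adapter class."""
    try:
        return _ADAPTERS[engine]
    except KeyError:
        raise ValueError(
            f"Unknown DATAFRAME_ENGINE {engine!r}; available: {sorted(_ADAPTERS)}"
        )
```

With a registry like this, supporting a future backend really is a one-decorator change.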

B. Polars with Predicate Pushdown

Here's where the magic happens. Let me show you exactly what changes
with Polars.

Current behavior (Vaex/Pandas):

# User calls:
calc_spectrum(1900, 2300, molecule='CO')

# What RADIS does internally:
# Step 1: Open the ENTIRE HITEMP-CO database (~1.1M lines)
df = vaex.open('~/.radisdb/HITEMP-CO.hdf5')  # touches the whole file

# Step 2: Then filter
filtered = df[(df['wav'] >= 1900) & (df['wav'] <= 2300)]
# ~200K lines remain, but the whole database was already opened!

# Memory used: proportional to FULL database size 

Proposed behavior (Polars with predicate pushdown):

# User calls the EXACT SAME API:
calc_spectrum(1900, 2300, molecule='CO')

# What RADIS does internally with PolarAdapter:
# Step 1: Create a LAZY query — nothing is read yet
lazy_query = (
    pl.scan_parquet('~/.radisdb/HITEMP-CO.parquet')
    .filter(pl.col('wav').is_between(1900, 2300))
    .select(['wav', 'int', 'A', 'gamma_air'])
)

# Step 2: Polars pushes the filter DOWN to the Parquet reader
# Only rows matching wav range are read from disk
result = lazy_query.collect()

# Memory used: proportional to FILTERED data only 
# For 400 cm⁻¹ window on HITEMP CO₂: ~2-5GB instead of 50GB

For HITEMP-CO₂ (50GB), this means reading ~2-5GB instead of 50GB
for a typical 400 cm⁻¹ window query. That's a 10-25x reduction in
I/O, directly addressing the 3+ hour parsing time.


Early Benchmark Results

I've already run preliminary benchmarks comparing Vaex, Polars,
DuckDB, and PyArrow on HITRAN and HITEMP databases. Here's what
the data shows:

Cold Load Time (seconds, lower is better):

| Backend | HITRAN CO (160K) | HITEMP CO (1.1M) | HITEMP CO₂ (100M+) |
| --- | --- | --- | --- |
| Vaex | 5.1s | 4.8s | 45.3s |
| Polars | 0.9s | 1.1s | 6.3s |
| DuckDB | 3.4s | 1.5s | 12.1s |
| PyArrow | 5.3s | 1.4s | 10.5s |

Peak Memory Usage (MB, lower is better):

| Backend | HITRAN CO (160K) | HITEMP CO (1.1M) | HITEMP CO₂ (100M+) |
| --- | --- | --- | --- |
| Vaex | 82MB | 430MB | 10.4GB |
| Polars | 18MB | 63MB | 0.8GB |
| DuckDB | 23MB | 80MB | 1.9GB |
| PyArrow | 22MB | 93MB | 1.2GB |

**Polars emerges as the leading candidate**, showing the fastest
cold load time and lowest memory usage in preliminary tests, thanks
to its Rust-based engine and native predicate pushdown support with
Parquet. However, the final backend decision will be made after
comprehensive benchmarking during the community bonding period.
DuckDB remains a strong secondary candidate for its SQL query
interface, and Dask for distributed computing scenarios.

Note: These are based on my PR #981 implementation and the existing
vaex_vs_pandas_performance.py benchmark in the RADIS repo. Final
results will be validated during the community bonding period.


What I've Already Built

This isn't just a proposal — I've already started the implementation.

PR #981: Add Polars/Parquet lazy-loading backend with
DataFrameAdapter

This PR implements:

  • The core DataFrameAdapter abstract base class
  • PolarAdapter with lazy scanning, filter, select, compute methods
  • PandasAdapter as legacy fallback
  • Factory pattern via config["DATAFRAME_ENGINE"] in radis.json

Previous RADIS contributions:

  • PR #894: Refactored database I/O for better HDF5 file handling
  • PR #971: Added support for multiple broadening species (H2, He, CO2) beyond default air broadening
  • PR #924: Addressed spectroscopic computation optimization
  • PR #958: Enhanced DataFileManager operations for cached databases

Plus contributions to Astroquery (PR #3536, PR #19345) and JuliaAstro
(SpectralFitting.jl PR #241, PR #242, PR #203) demonstrating cross-ecosystem engagement.


The Migration: Zero Breaking Changes

One concern I anticipated: what about existing users who depend on
Vaex?

The answer is: nothing breaks.

# Existing users — zero changes needed:
config["DATAFRAME_ENGINE"] = "vaex"  # still works via PandasAdapter

# New default after GSoC:
config["DATAFRAME_ENGINE"] = "polars"  # 10-25x faster

# Migration is automatic:
# On first use after upgrade, existing HDF5 files are 
# auto-converted to Parquet. Originals kept as backup.

The DataFrameAdapter pattern means the switch is transparent to all
RADIS calculation code. calc_spectrum(), eq_spectrum(), and all
other user-facing APIs remain 100% unchanged.
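The auto-conversion step can be as simple as a convert-once guard. A hypothetical sketch (the path conventions, `.bak` suffix, and the injected `convert` callback are my assumptions, not the PR's actual logic):

```python
from pathlib import Path


def migrate_hdf5_to_parquet(hdf5_path: Path, convert) -> Path:
    """One-time migration: write a Parquet twin of an HDF5 database,
    keep the original as a .bak backup, and no-op on later calls.

    `convert(src, dst)` would wrap the actual HDF5 read / Parquet write.
    """
    parquet_path = hdf5_path.with_suffix(".parquet")
    if parquet_path.exists():
        return parquet_path  # already migrated, nothing to do
    convert(hdf5_path, parquet_path)
    # keep the original around so users can roll back
    hdf5_path.rename(hdf5_path.with_name(hdf5_path.name + ".bak"))
    return parquet_path
```

Because the guard checks for the Parquet file first, every call after the first upgrade is effectively free.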


Why This Matters for Science

This isn't just a software engineering improvement. Faster, more
memory-efficient database loading directly enables science that is
currently impossible or impractical:

Combustion Diagnostics: Real-time spectral fitting for industrial
combustion characterization requires HITEMP databases with 100M+ CO₂
lines. Faster filtering means faster spectral fitting — critical for
in-situ diagnostics in turbine engines.

Exoplanet Atmosphere Characterization: Researchers using the James
Webb Space Telescope need to characterize atmospheres using ExoMol
databases with 10B+ lines. Current memory limitations force them to
use truncated databases that lose spectral detail.

Atmospheric Science: Climate models that need complete HITRAN/GEISA
datasets for all greenhouse gases simultaneously can't run on standard
hardware today. Lazy loading makes this feasible.


What's Next: My GSoC 2026 Plan

If selected for GSoC 2026, here's what I'll deliver over 12 weeks:

Phase 1 — Community Bonding (Apr 30 - May 26):
Run comprehensive benchmarks, finalize DataFrameAdapter API design
with mentor consensus, set up CI pipeline.

Phase 2 — Coding Phase 1 (May 26 - Jul 12):
Complete DataFrameAdapter with PolarAdapter + PandasAdapter, refactor
all database loading functions, implement lazy loading for HITEMP CO₂,
H₂O, and other large databases, write 45+ unit tests.

Phase 3 — Coding Phase 2 (Jul 12 - Aug 25):
Implement configurable cache size limits + LRU eviction, fix
broadening coefficient lazy evaluation (Issue #746), ensure ExoJAX
interoperability, complete comprehensive documentation.
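The cache-eviction piece of Phase 2 can be prototyped with an `OrderedDict`. A sketch under the assumption that each cached entry reports its in-memory size (the class and method names are mine, not planned RADIS API):

```python
from collections import OrderedDict


class SizeLimitedCache:
    """Byte-capped LRU cache for loaded database chunks (illustrative)."""

    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self._items = OrderedDict()  # key -> (value, nbytes), LRU first
        self._total = 0

    def get(self, key):
        value, _ = self._items[key]
        self._items.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value, nbytes: int):
        if key in self._items:
            self._total -= self._items.pop(key)[1]
        self._items[key] = (value, nbytes)
        self._total += nbytes
        # evict least-recently-used entries until under the cap
        while self._total > self.max_bytes:
            _, (_, old_size) = self._items.popitem(last=False)
            self._total -= old_size
```

The same shape works whether the configured limit counts bytes, rows, or whole database files.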

Final Deliverables:

  • Production-ready DataFrameAdapter replacing all Vaex dependencies
  • 10-25x I/O reduction for large database queries
  • 45+ unit tests, 12+ integration tests, 90%+ code coverage
  • User guide + performance comparison documentation
  • Blog post series documenting the entire journey


Live Demo coming soon: I'm building a Google Colab
notebook showing real Vaex vs Polars benchmarks on HITRAN CO
data. Link will be updated here once published.


Conclusion

RADIS is an incredible scientific tool — but it's sitting on an
unmaintained dependency that could break it for the largest, most
scientifically valuable databases. The solution isn't just replacing
Vaex with Polars — it's building a clean abstraction layer that makes
RADIS future-proof regardless of which DataFrame library wins
the next 10 years.

I'm excited about this project because it sits at the intersection of
software engineering and real scientific impact. Every millisecond we
shave off a HITEMP CO₂ query is a millisecond closer to understanding
exoplanet atmospheres, improving combustion efficiency, and advancing
climate science.

If you're interested in following this project, you can find my work
at:

This post is part of my GSoC 2026 application to OpenAstronomy/RADIS.
The project: "Integrate a Modern Lazy-Loading Alternative for
Large-Scale Spectroscopic Database Processing."

