<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aditya Kumar Pandey</title>
    <description>The latest articles on DEV Community by Aditya Kumar Pandey (@adityapandeydev).</description>
    <link>https://dev.to/adityapandeydev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3839290%2Fbb02a6ae-6c27-48b4-b356-d22f7d84ef30.jpeg</url>
      <title>DEV Community: Aditya Kumar Pandey</title>
      <link>https://dev.to/adityapandeydev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/adityapandeydev"/>
    <language>en</language>
    <item>
      <title>Benchmarking Polars, DuckDB &amp; Dask for RADIS: My GSoC 2026 Proposal Deep Dive</title>
      <dc:creator>Aditya Kumar Pandey</dc:creator>
      <pubDate>Mon, 23 Mar 2026 05:55:31 +0000</pubDate>
      <link>https://dev.to/adityapandeydev/benchmarking-polars-duckdb-dask-for-radis-my-gsoc-2026-proposal-deep-dive-4e8b</link>
      <guid>https://dev.to/adityapandeydev/benchmarking-polars-duckdb-dask-for-radis-my-gsoc-2026-proposal-deep-dive-4e8b</guid>
      <description>&lt;p&gt;I've spent the last few months diving deep into RADIS — one of the &lt;br&gt;
fastest open-source line-by-line spectroscopic codes available. &lt;br&gt;
RADIS can simulate high-resolution infrared spectra of molecules &lt;br&gt;
like CO₂, H₂O, and CH₄, and it's used by researchers studying &lt;br&gt;
combustion diagnostics, exoplanet atmospheres, and plasma physics.&lt;/p&gt;

&lt;p&gt;But while contributing to the codebase, I discovered a critical &lt;br&gt;
problem hiding underneath the performance — one that could &lt;br&gt;
eventually break RADIS for large databases entirely.&lt;/p&gt;

&lt;p&gt;This post is my deep dive into that problem, the solution I'm &lt;br&gt;
proposing for GSoC 2026, and the technical work I've already done &lt;br&gt;
to prove it works.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Problem: RADIS is Sitting on a Time Bomb
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. Vaex is Unmaintained
&lt;/h3&gt;

&lt;p&gt;RADIS currently uses &lt;strong&gt;Vaex&lt;/strong&gt; for lazy loading of large spectroscopic&lt;br&gt;
databases. Vaex is a brilliant library — it uses memory mapping and &lt;br&gt;
zero-copy lazy computations to handle datasets that don't fit in RAM.&lt;/p&gt;

&lt;p&gt;But here's the uncomfortable truth: &lt;strong&gt;Vaex is no longer actively &lt;br&gt;
maintained.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The last meaningful Vaex release was in 2023. Since then there have been no bug fixes&lt;br&gt;
or security patches, and compatibility with Python 3.13+ is broken. &lt;br&gt;
RADIS currently requires &lt;code&gt;vaex&amp;gt;=4.13&lt;/code&gt;, so every new Python release &lt;br&gt;
that Vaex fails to support (which is already happening) leaves RADIS users &lt;br&gt;
unable to load large databases like HITEMP CO₂ on that Python version.&lt;/p&gt;

&lt;p&gt;This is not a hypothetical risk. It is happening right now.&lt;/p&gt;
&lt;h3&gt;
  
  
  2. The Databases Are Getting Huge
&lt;/h3&gt;

&lt;p&gt;Spectroscopic databases have grown dramatically:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Database&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;HITRAN CO&lt;/td&gt;
&lt;td&gt;~160K lines&lt;/td&gt;
&lt;td&gt;Fits in RAM easily&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HITEMP CO&lt;/td&gt;
&lt;td&gt;~1.1M lines&lt;/td&gt;
&lt;td&gt;Benefits from lazy loading&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HITEMP CO₂&lt;/td&gt;
&lt;td&gt;100M+ lines, ~50GB&lt;/td&gt;
&lt;td&gt;Cannot fit in memory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ExoMol&lt;/td&gt;
&lt;td&gt;10B+ lines&lt;/td&gt;
&lt;td&gt;Requires streaming&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For HITEMP CO₂ — the database researchers need most for combustion &lt;br&gt;
and climate modeling — Vaex's memory mapping loads the ENTIRE 50GB &lt;br&gt;
file before any filtering happens. This creates a 3+ hour parsing &lt;br&gt;
time just to start a calculation.&lt;/p&gt;
&lt;h3&gt;
  
  
  3. Dual Code Paths = Bugs and Inconsistencies
&lt;/h3&gt;

&lt;p&gt;RADIS currently maintains &lt;strong&gt;parallel code paths&lt;/strong&gt; for Pandas and Vaex&lt;br&gt;
DataFrames. The &lt;code&gt;config["DATAFRAME_ENGINE"]&lt;/code&gt; setting switches between &lt;br&gt;
them, but many functions have &lt;code&gt;if/else&lt;/code&gt; branches for both formats.&lt;/p&gt;

&lt;p&gt;For example, &lt;code&gt;set_broadening_coef()&lt;/code&gt; in the ExoMol pipeline defines &lt;br&gt;
broadening coefficients as NumPy arrays — but with Vaex, these could &lt;br&gt;
and should be lazy arrays. This dual maintenance creates subtle bugs &lt;br&gt;
and makes the codebase harder to extend (see &lt;br&gt;
&lt;a href="https://github.com/radis/radis/issues/746" rel="noopener noreferrer"&gt;Issue #746&lt;/a&gt;).&lt;/p&gt;


&lt;h2&gt;
  
  
  The Solution: Polars + A Clean Abstraction Layer
&lt;/h2&gt;

&lt;p&gt;My GSoC 2026 proposal introduces two things that solve all three &lt;br&gt;
problems above simultaneously.&lt;/p&gt;
&lt;h3&gt;
  
  
  A. The DataFrameAdapter Pattern
&lt;/h3&gt;

&lt;p&gt;Instead of having RADIS code call Pandas or Vaex APIs directly, I'm &lt;br&gt;
introducing a &lt;strong&gt;DataFrameAdapter abstraction layer&lt;/strong&gt; — an abstract &lt;br&gt;
base class that all RADIS calculation code uses exclusively.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
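&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;abc&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ABC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;abstractmethod&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DataFrameAdapter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ABC&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;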
    &lt;span class="nd"&gt;@abstractmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;

    &lt;span class="nd"&gt;@abstractmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;filter_range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wmin&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wmax&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;

    &lt;span class="nd"&gt;@abstractmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;select_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cols&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;

    &lt;span class="nd"&gt;@abstractmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;

    &lt;span class="nd"&gt;@abstractmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;to_pandas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;

    &lt;span class="nd"&gt;@abstractmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;to_numpy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PolarsAdapter&lt;/strong&gt; → Primary backend using Polars LazyFrame&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PandasAdapter&lt;/strong&gt; → Legacy fallback for backward compatibility
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DuckDBAdapter&lt;/strong&gt; → Secondary candidate with SQL interface&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DaskAdapter&lt;/strong&gt; → Optional for distributed cluster computing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key insight: &lt;strong&gt;if a better library appears in 5 years, only a &lt;br&gt;
new adapter class is needed — zero changes to RADIS calculation code.&lt;/strong&gt;&lt;/p&gt;
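&lt;p&gt;To make the "only a new adapter class is needed" idea concrete, here is a minimal sketch under stated assumptions. It is not the code from my PR: &lt;code&gt;ListAdapter&lt;/code&gt; and &lt;code&gt;lines_in_window&lt;/code&gt; are hypothetical names, the interface is trimmed to three of the methods above, and the backend is a plain Python list so the example runs without any DataFrame library.&lt;/p&gt;

```python
from abc import ABC, abstractmethod


class DataFrameAdapter(ABC):
    """Subset of the interface above, enough for the illustration."""

    @abstractmethod
    def filter_range(self, col, wmin, wmax): ...

    @abstractmethod
    def select_columns(self, cols): ...

    @abstractmethod
    def compute(self): ...


class ListAdapter(DataFrameAdapter):
    """Hypothetical toy backend: rows held as a list of dicts."""

    def __init__(self, rows):
        self._rows = list(rows)

    def filter_range(self, col, wmin, wmax):
        # Return a new adapter so that calls can be chained.
        return ListAdapter(r for r in self._rows if wmax >= r[col] >= wmin)

    def select_columns(self, cols):
        return ListAdapter({c: r[c] for c in cols} for r in self._rows)

    def compute(self):
        # Materialize the result as plain Python data.
        return list(self._rows)


def lines_in_window(adapter, wmin, wmax):
    """Stand-in for calculation code: it only touches the adapter interface."""
    windowed = adapter.filter_range("wav", wmin, wmax)
    return windowed.select_columns(["wav", "int"]).compute()


lines = ListAdapter([
    {"wav": 1850.0, "int": 1.0e-22, "A": 0.5},
    {"wav": 2100.0, "int": 3.0e-21, "A": 1.2},
    {"wav": 2400.0, "int": 5.0e-23, "A": 0.9},
])
selected = lines_in_window(lines, 1900, 2300)
print(selected)
```

&lt;p&gt;Because &lt;code&gt;lines_in_window&lt;/code&gt; never names a concrete backend, swapping &lt;code&gt;ListAdapter&lt;/code&gt; for a Polars-backed class would not change it.&lt;/p&gt;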
&lt;h3&gt;
  
  
  B. Polars with Predicate Pushdown
&lt;/h3&gt;

&lt;p&gt;Here's where the magic happens. Let me show you exactly what changes&lt;br&gt;
with Polars.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Current behavior (Vaex/Pandas):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# User calls:
&lt;/span&gt;&lt;span class="nf"&gt;calc_spectrum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1900&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;molecule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CO&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# What RADIS does internally:
# Step 1: Load ENTIRE HITEMP-CO database (~1.1M lines) into memory
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vaex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;~/.radisdb/HITEMP-CO.hdf5&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# reads the WHOLE file (50GB for HITEMP CO₂)
&lt;/span&gt;
&lt;span class="c1"&gt;# Step 2: Then filter
&lt;/span&gt;&lt;span class="n"&gt;filtered&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wav&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;1900&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wav&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;2300&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="c1"&gt;# ~200K lines remain, but the whole file was already loaded!
&lt;/span&gt;
&lt;span class="c1"&gt;# Memory used: proportional to FULL database size 
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Proposed behavior (Polars with predicate pushdown):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# User calls the EXACT SAME API:
&lt;/span&gt;&lt;span class="nf"&gt;calc_spectrum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1900&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;molecule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CO&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# What RADIS does internally with PolarsAdapter:
# Step 1: Create a LAZY query — nothing is read yet
&lt;/span&gt;&lt;span class="n"&gt;lazy_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scan_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;~/.radisdb/HITEMP-CO.parquet&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wav&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;is_between&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1900&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2300&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wav&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;int&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;A&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gamma_air&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: Polars pushes the filter DOWN to the Parquet reader
# Only row groups whose wav statistics overlap the range are read from disk
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lazy_query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Memory used: proportional to FILTERED data only 
# For 400 cm⁻¹ window on HITEMP CO₂: ~2-5GB instead of 50GB
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For HITEMP-CO₂ (50GB), this means reading &lt;strong&gt;~2-5GB instead of 50GB&lt;/strong&gt; &lt;br&gt;
for a typical 400 cm⁻¹ window query. That's a &lt;strong&gt;10-25x reduction in &lt;br&gt;
I/O&lt;/strong&gt; — directly addressing the 3+ hour parsing time.&lt;/p&gt;
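&lt;p&gt;The I/O saving comes from skipping whole chunks of the file. The sketch below is a toy model of that idea, not Polars internals: &lt;code&gt;RowGroup&lt;/code&gt; and &lt;code&gt;scan_with_pushdown&lt;/code&gt; are hypothetical names, and the min/max fields stand in for the per-row-group statistics a Parquet file stores. It also assumes the lines are sorted by wavenumber, since unsorted data would give every group a wide range and defeat the pruning.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Hypothetical stand-in for a Parquet row group with min/max statistics."""
    wav_min: float
    wav_max: float
    rows: list  # wavenumbers stored in this group


def scan_with_pushdown(groups, wmin, wmax):
    """Return matching wavenumbers and the number of row groups actually read."""
    matched, groups_read = [], 0
    for g in groups:
        # Read a group only if its [wav_min, wav_max] range overlaps the window;
        # every other group is skipped without touching its rows.
        if g.wav_max >= wmin and wmax >= g.wav_min:
            groups_read += 1
            matched.extend(w for w in g.rows if wmax >= w >= wmin)
    return matched, groups_read


def make_group(i):
    """Build group i: ten lines spaced 10 cm-1 apart, starting at 1000 + 100*i."""
    start = 1000.0 + 100 * i
    return RowGroup(start, start + 99, [start + k for k in range(0, 100, 10)])


# Ten groups covering 1000-2000 cm-1, sorted by wavenumber.
groups = [make_group(i) for i in range(10)]
matched, groups_read = scan_with_pushdown(groups, 1250, 1450)
print(f"read {groups_read} of {len(groups)} row groups, kept {len(matched)} lines")
```

&lt;p&gt;In this toy file only 3 of the 10 groups are read. The same ratio argument is behind the 50GB to ~2-5GB estimate above.&lt;/p&gt;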


&lt;h2&gt;
  
  
  Early Benchmark Results
&lt;/h2&gt;

&lt;p&gt;I've already run preliminary benchmarks comparing Vaex, Polars, &lt;br&gt;
DuckDB, and PyArrow on HITRAN and HITEMP databases. Here's what &lt;br&gt;
the data shows:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cold Load Time (seconds, lower is better):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Backend&lt;/th&gt;
&lt;th&gt;HITRAN CO (160K)&lt;/th&gt;
&lt;th&gt;HITEMP CO (1.1M)&lt;/th&gt;
&lt;th&gt;HITEMP CO₂ (100M+)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Vaex&lt;/td&gt;
&lt;td&gt;5.1s&lt;/td&gt;
&lt;td&gt;4.8s&lt;/td&gt;
&lt;td&gt;45.3s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Polars&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.9s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.1s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;6.3s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DuckDB&lt;/td&gt;
&lt;td&gt;3.4s&lt;/td&gt;
&lt;td&gt;1.5s&lt;/td&gt;
&lt;td&gt;12.1s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PyArrow&lt;/td&gt;
&lt;td&gt;5.3s&lt;/td&gt;
&lt;td&gt;1.4s&lt;/td&gt;
&lt;td&gt;10.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Peak Memory Usage (lower is better):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Backend&lt;/th&gt;
&lt;th&gt;HITRAN CO (160K)&lt;/th&gt;
&lt;th&gt;HITEMP CO (1.1M)&lt;/th&gt;
&lt;th&gt;HITEMP CO₂ (100M+)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Vaex&lt;/td&gt;
&lt;td&gt;82MB&lt;/td&gt;
&lt;td&gt;430MB&lt;/td&gt;
&lt;td&gt;10.4GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Polars&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;18MB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;63MB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.8GB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DuckDB&lt;/td&gt;
&lt;td&gt;23MB&lt;/td&gt;
&lt;td&gt;80MB&lt;/td&gt;
&lt;td&gt;1.9GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PyArrow&lt;/td&gt;
&lt;td&gt;22MB&lt;/td&gt;
&lt;td&gt;93MB&lt;/td&gt;
&lt;td&gt;1.2GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Polars emerges as the leading candidate&lt;/strong&gt;, showing &lt;br&gt;
the fastest cold load time and lowest memory usage in &lt;br&gt;
preliminary tests, thanks to its Rust-based engine and &lt;br&gt;
native predicate pushdown support with Parquet. However, &lt;br&gt;
the final backend decision will be made after comprehensive &lt;br&gt;
benchmarking during the community bonding period. DuckDB &lt;br&gt;
remains a strong secondary candidate for its SQL query &lt;br&gt;
interface, and Dask for distributed computing scenarios.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: These are based on my PR #981 implementation and the existing &lt;br&gt;
&lt;code&gt;vaex_vs_pandas_performance.py&lt;/code&gt; benchmark in the RADIS repo. Final &lt;br&gt;
results will be validated during the community bonding period.&lt;/em&gt;&lt;/p&gt;
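&lt;p&gt;For readers who want to reproduce this kind of measurement, a timing and memory harness could look roughly like the sketch below. It is not the benchmark script from the RADIS repo: &lt;code&gt;measure&lt;/code&gt; and &lt;code&gt;fake_loader&lt;/code&gt; are hypothetical names, and the loader is a stand-in for a real backend call. One caveat matters here: &lt;code&gt;tracemalloc&lt;/code&gt; only sees allocations made through Python's allocator, so memory allocated natively by a Rust or C++ engine would need a process-level measurement instead.&lt;/p&gt;

```python
import time
import tracemalloc


def measure(load_fn, *args, **kwargs):
    """Run load_fn once and return (result, seconds, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = load_fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


def fake_loader(n_lines):
    """Hypothetical stand-in for a backend's load call."""
    return [float(i) for i in range(n_lines)]


result, seconds, peak_bytes = measure(fake_loader, 100_000)
print(f"{len(result)} lines in {seconds:.3f} s, peak {peak_bytes / 1e6:.1f} MB")
```

&lt;p&gt;A true cold-load number also requires that the file is not already in the operating system's page cache, which this sketch does not control.&lt;/p&gt;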


&lt;h2&gt;
  
  
  What I've Already Built
&lt;/h2&gt;

&lt;p&gt;This isn't just a proposal — I've already started the implementation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PR #981: Add Polars/Parquet lazy-loading backend with &lt;br&gt;
DataFrameAdapter&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This PR implements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The core &lt;code&gt;DataFrameAdapter&lt;/code&gt; abstract base class&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PolarsAdapter&lt;/code&gt; with lazy scanning plus filter, select, and compute methods&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PandasAdapter&lt;/code&gt; as legacy fallback&lt;/li&gt;
&lt;li&gt;Factory pattern via &lt;code&gt;config["DATAFRAME_ENGINE"]&lt;/code&gt; in radis.json&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Previous RADIS contributions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PR #894&lt;/strong&gt;: Refactored database I/O for better HDF5 file handling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PR #971&lt;/strong&gt;: Added support for multiple broadening species (H2, He,
CO2) beyond default air broadening&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PR #924&lt;/strong&gt;: Addressed spectroscopic computation optimization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PR #958&lt;/strong&gt;: Enhanced DataFileManager operations for cached databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PR #981&lt;/strong&gt;: Add Polars/Parquet lazy-loading backend with DataFrameAdapter (ref #978, #658). Implemented the core DataFrameAdapter abstraction layer and the PolarsAdapter backend to replace Vaex with modern lazy loading&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Plus contributions to &lt;strong&gt;Astroquery&lt;/strong&gt; (PR #3536, PR #19345) and &lt;strong&gt;JuliaAstro&lt;/strong&gt; &lt;br&gt;
(SpectralFitting.jl PR #241, PR #242, PR #203) demonstrating cross-ecosystem engagement.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Migration: Zero Breaking Changes
&lt;/h2&gt;

&lt;p&gt;One concern I anticipated: &lt;em&gt;what about existing users who depend on &lt;br&gt;
Vaex?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The answer is: &lt;strong&gt;nothing breaks&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Existing users — zero changes needed:
&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DATAFRAME_ENGINE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vaex&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# still works via PandasAdapter
&lt;/span&gt;
&lt;span class="c1"&gt;# New default after GSoC:
&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DATAFRAME_ENGINE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;polars&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# ~10-25x less I/O on large databases
&lt;/span&gt;
&lt;span class="c1"&gt;# Migration is automatic:
# On first use after upgrade, existing HDF5 files are 
# auto-converted to Parquet. Originals kept as backup.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The DataFrameAdapter pattern means the switch is transparent to all &lt;br&gt;
RADIS calculation code. &lt;code&gt;calc_spectrum()&lt;/code&gt;, &lt;code&gt;eq_spectrum()&lt;/code&gt;, and all &lt;br&gt;
other user-facing APIs remain &lt;strong&gt;100% unchanged&lt;/strong&gt;.&lt;/p&gt;
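&lt;p&gt;One way the engine switch could be wired is a small factory keyed on the config value. The sketch below is an assumption about the shape of that code, not the implementation in my PR: &lt;code&gt;make_adapter&lt;/code&gt; and &lt;code&gt;_ADAPTERS&lt;/code&gt; are hypothetical names, the two adapter classes are empty placeholders, and routing the legacy &lt;code&gt;"vaex"&lt;/code&gt; value to the fallback backend simply mirrors the comment in the snippet above.&lt;/p&gt;

```python
class PandasAdapter:
    """Placeholder for the legacy fallback backend."""


class PolarsAdapter:
    """Placeholder for the Polars backend."""


# Hypothetical mapping from engine name to adapter class.
_ADAPTERS = {
    "pandas": PandasAdapter,
    "polars": PolarsAdapter,
    # Legacy name kept working by routing it to the fallback backend.
    "vaex": PandasAdapter,
}


def make_adapter(config):
    """Instantiate the adapter named by config["DATAFRAME_ENGINE"]."""
    engine = config.get("DATAFRAME_ENGINE", "polars").lower()
    try:
        return _ADAPTERS[engine]()
    except KeyError:
        known = ", ".join(sorted(_ADAPTERS))
        raise ValueError(
            f"Unknown DATAFRAME_ENGINE {engine!r}; expected one of: {known}"
        ) from None


adapter = make_adapter({"DATAFRAME_ENGINE": "vaex"})
print(type(adapter).__name__)
```

&lt;p&gt;Keeping the mapping in one place means an unknown engine name fails early with a readable error instead of deep inside a calculation.&lt;/p&gt;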




&lt;h2&gt;
  
  
  Why This Matters for Science
&lt;/h2&gt;

&lt;p&gt;This isn't just a software engineering improvement. Faster, more &lt;br&gt;
memory-efficient database loading directly enables science that is &lt;br&gt;
currently impossible or impractical:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Combustion Diagnostics&lt;/strong&gt;: Real-time spectral fitting for industrial &lt;br&gt;
combustion characterization requires HITEMP databases with 100M+ CO₂ &lt;br&gt;
lines. Faster filtering means faster spectral fitting — critical for &lt;br&gt;
in-situ diagnostics in turbine engines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exoplanet Atmosphere Characterization&lt;/strong&gt;: Researchers using the James &lt;br&gt;
Webb Space Telescope need to characterize atmospheres using ExoMol &lt;br&gt;
databases with 10B+ lines. Current memory limitations force them to &lt;br&gt;
use truncated databases that lose spectral detail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Atmospheric Science&lt;/strong&gt;: Climate models that need complete HITRAN/GEISA&lt;br&gt;
datasets for all greenhouse gases simultaneously can't run on standard &lt;br&gt;
hardware today. Lazy loading makes this feasible.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next: My GSoC 2026 Plan
&lt;/h2&gt;

&lt;p&gt;If selected for GSoC 2026, here's what I'll deliver over 12 weeks:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 1 — Community Bonding (Apr 30 - May 26):&lt;/strong&gt;&lt;br&gt;
Run comprehensive benchmarks, finalize DataFrameAdapter API design &lt;br&gt;
with mentor consensus, set up CI pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 2 — Coding Phase 1 (May 26 - Jul 12):&lt;/strong&gt;&lt;br&gt;
Complete DataFrameAdapter with PolarsAdapter + PandasAdapter, refactor &lt;br&gt;
all database loading functions, implement lazy loading for HITEMP CO₂,&lt;br&gt;
H₂O, and other large databases, write 45+ unit tests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 3 — Coding Phase 2 (Jul 12 - Aug 25):&lt;/strong&gt;&lt;br&gt;
Implement configurable cache size limits + LRU eviction, fix &lt;br&gt;
broadening coefficient lazy evaluation (Issue #746), ensure ExoJAX &lt;br&gt;
interoperability, complete comprehensive documentation.&lt;/p&gt;
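&lt;p&gt;The cache size limit with LRU eviction could work along these lines. This is a sketch of the general technique only: &lt;code&gt;SizeLimitedCache&lt;/code&gt; and its methods are hypothetical names, sizes are passed in by the caller rather than read from disk, and nothing here deletes real files.&lt;/p&gt;

```python
from collections import OrderedDict


class SizeLimitedCache:
    """Hypothetical LRU bookkeeping for cached files, bounded by total bytes."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # name -> size, least recently used first

    def touch(self, name):
        """Mark an existing entry as most recently used."""
        self._entries.move_to_end(name)

    def add(self, name, size):
        """Register a cached file and return the names evicted to make room."""
        if name in self._entries:
            self.used_bytes -= self._entries.pop(name)
        self._entries[name] = size
        self.used_bytes += size
        evicted = []
        # Evict least recently used entries until the total fits the limit,
        # but never evict the entry that was just added.
        while self.used_bytes > self.max_bytes and len(self._entries) > 1:
            old_name, old_size = self._entries.popitem(last=False)
            self.used_bytes -= old_size
            evicted.append(old_name)
        return evicted


cache = SizeLimitedCache(max_bytes=100)
cache.add("HITRAN-CO", 30)
cache.add("HITEMP-CO", 50)
cache.touch("HITRAN-CO")  # HITEMP-CO is now the least recently used entry
evicted = cache.add("HITEMP-CO2", 40)
print(evicted, cache.used_bytes)
```

&lt;p&gt;Here adding the third entry pushes the total to 120, so the least recently used entry is dropped and the total falls back to 70.&lt;/p&gt;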

&lt;p&gt;&lt;strong&gt;Final Deliverables:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Production-ready DataFrameAdapter replacing all Vaex dependencies&lt;/li&gt;
&lt;li&gt;10-25x I/O reduction for large database queries&lt;/li&gt;
&lt;li&gt;45+ unit tests, 12+ integration tests, 90%+ code coverage&lt;/li&gt;
&lt;li&gt;User guide + performance comparison documentation&lt;/li&gt;
&lt;li&gt;Blog post series documenting the entire journey&lt;/li&gt;
&lt;/ul&gt;







&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Live Demo&lt;/strong&gt;: &lt;a href="https://colab.research.google.com/drive/1G69lCy8iHCX_Q2fxn-C6zOCFUhPQgAHf#scrollTo=hUMAT7zZk0V8" rel="noopener noreferrer"&gt;https://colab.research.google.com/drive/1G69lCy8iHCX_Q2fxn-C6zOCFUhPQgAHf#scrollTo=hUMAT7zZk0V8&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;RADIS is an incredible scientific tool — but it's sitting on an &lt;br&gt;
unmaintained dependency that could break it for the largest, most &lt;br&gt;
scientifically valuable databases. The solution isn't just replacing &lt;br&gt;
Vaex with Polars — it's building a clean abstraction layer that makes &lt;br&gt;
RADIS future-proof regardless of which DataFrame library wins &lt;br&gt;
the next 10 years.&lt;/p&gt;

&lt;p&gt;I'm excited about this project because it sits at the intersection of &lt;br&gt;
software engineering and real scientific impact. Every millisecond we &lt;br&gt;
shave off a HITEMP CO₂ query is a millisecond closer to understanding &lt;br&gt;
exoplanet atmospheres, improving combustion efficiency, and advancing &lt;br&gt;
climate science.&lt;/p&gt;

&lt;p&gt;If you're interested in following this project, you can find my work &lt;br&gt;
at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/aditya-pandey-dev" rel="noopener noreferrer"&gt;aditya-pandey-dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;RADIS PR #981: My core GSoC contribution&lt;/li&gt;
&lt;li&gt;RADIS Repository: &lt;a href="https://github.com/radis/radis" rel="noopener noreferrer"&gt;radis/radis&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;This post is part of my GSoC 2026 application to OpenAstronomy/RADIS.&lt;br&gt;
The project: "Integrate a Modern Lazy-Loading Alternative for &lt;br&gt;
Large-Scale Spectroscopic Database Processing."&lt;/em&gt;&lt;/p&gt;




</description>
      <category>gsoc</category>
      <category>python</category>
      <category>opensource</category>
      <category>radis</category>
    </item>
  </channel>
</rss>
