<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shridhar Pandey</title>
    <description>The latest articles on DEV Community by Shridhar Pandey (@shridey).</description>
    <link>https://dev.to/shridey</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1358144%2F5aed0bb1-4eb0-4a4d-ae26-cea38d0cf066.png</url>
      <title>DEV Community: Shridhar Pandey</title>
      <link>https://dev.to/shridey</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shridey"/>
    <language>en</language>
    <item>
      <title>The ‘Missing Middle’ of Data Processing in Java (10M Rows in ~40s)</title>
      <dc:creator>Shridhar Pandey</dc:creator>
      <pubDate>Thu, 16 Apr 2026 12:00:00 +0000</pubDate>
      <link>https://dev.to/shridey/10m-records-40-seconds-o1-memory-why-i-built-a-lightweight-etl-engine-for-java-l1n</link>
      <guid>https://dev.to/shridey/10m-records-40-seconds-o1-memory-why-i-built-a-lightweight-etl-engine-for-java-l1n</guid>
      <description>&lt;h2&gt;
  
  
  10M Records, 40s: Exploring the "Missing Middle" of Data Processing in Java
&lt;/h2&gt;

&lt;p&gt;I’ve always found it strange how quickly developers leave the Java ecosystem when dealing with data processing.&lt;/p&gt;

&lt;p&gt;If your data fits comfortably in memory, Java Streams work great. If you're processing massive datasets (50GB+), tools like Spark make sense. But what about everything in between?&lt;/p&gt;

&lt;p&gt;What about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a 300MB CSV&lt;/li&gt;
&lt;li&gt;a nested JSON file&lt;/li&gt;
&lt;li&gt;a one-off transformation you need to run locally&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not big enough for distributed systems. Too big for naive in-memory approaches.&lt;/p&gt;

&lt;p&gt;This is the space I think is underserved: the &lt;em&gt;"missing middle"&lt;/em&gt; of data processing.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem I Kept Running Into
&lt;/h2&gt;

&lt;p&gt;Every time I tried handling mid-sized datasets in Java, I hit the same wall:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Load everything into memory → &lt;code&gt;OutOfMemoryError&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Use Streams → elegant, but still memory-bound&lt;/li&gt;
&lt;li&gt;Switch to Python/Pandas → works, but now I’ve left the JVM ecosystem entirely&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That tradeoff didn’t sit right with me.&lt;/p&gt;

&lt;p&gt;So I started exploring a different approach:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What if we treated data processing as a streaming pipeline instead of an in-memory transformation?&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Goal
&lt;/h2&gt;

&lt;p&gt;I set a simple constraint:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Process ~10 million records (~300MB CSV) in under 40 seconds, on a single JVM, without blowing up memory.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Approach: Streaming + Lazy Evaluation
&lt;/h2&gt;

&lt;p&gt;Instead of loading data into a &lt;code&gt;List&amp;lt;Row&amp;gt;&lt;/code&gt;, I built a pipeline that processes data &lt;strong&gt;row-by-row&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;At a high level:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Operations are represented as a &lt;strong&gt;DAG (Directed Acyclic Graph)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Execution is &lt;strong&gt;lazy&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Data flows through a pipeline, not into memory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This keeps memory usage close to constant, O(1), for streaming transformations.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Important caveat: operations like &lt;code&gt;groupBy&lt;/code&gt; and &lt;code&gt;merge&lt;/code&gt; still require state and are not O(1) memory.)&lt;/em&gt;&lt;/p&gt;
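
&lt;p&gt;The core idea can be sketched with nothing but the standard library. This is &lt;em&gt;not&lt;/em&gt; PureStream's internals, just a minimal illustration of lazy, row-by-row processing using &lt;code&gt;Files.lines&lt;/code&gt;:&lt;/p&gt;

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class StreamingSketch {

    // Sums the second column of a headerless CSV, one line at a time.
    // Files.lines is lazy: each row is read, parsed, and discarded before
    // the next one is touched, so heap usage stays flat regardless of file size.
    static double sumAmounts(Path csv) {
        try (var lines = Files.lines(csv)) {
            return lines
                    .map(line -> line.split(","))
                    .filter(cols -> cols.length > 1)
                    .mapToDouble(cols -> Double.parseDouble(cols[1]))
                    .sum(); // the terminal operation is what actually pulls rows through
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

&lt;p&gt;The same shape generalizes to a DAG of operations: nothing runs until a terminal step asks for output.&lt;/p&gt;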




&lt;h2&gt;
  
  
  What Didn’t Work (and Why)
&lt;/h2&gt;

&lt;p&gt;This part took longer than expected.&lt;/p&gt;

&lt;p&gt;Some early approaches completely failed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Naive in-memory loading&lt;/strong&gt; → instant OOM on large files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Eager evaluation&lt;/strong&gt; → unnecessary intermediate objects, heavy GC pressure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simple grouping logic&lt;/strong&gt; → memory spikes that killed performance&lt;/li&gt;
&lt;/ul&gt;
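
&lt;p&gt;To make the eager-evaluation problem concrete, here is a simplified contrast (illustrative only, not the library's code): materializing a full intermediate collection per stage versus fusing the stages so each row flows end-to-end:&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.List;

public class EagerVsLazy {

    // Eager: each stage materializes a full intermediate list. On a 10M-row
    // input, 'parsed' and 'amounts' each hold 10M objects at the same time,
    // which is exactly the allocation and GC pressure described above.
    static double eagerTotal(List<String> lines) {
        List<String[]> parsed = new ArrayList<>();
        for (String line : lines) parsed.add(line.split(","));
        List<Double> amounts = new ArrayList<>();
        for (String[] cols : parsed) amounts.add(Double.parseDouble(cols[1]));
        double total = 0;
        for (double a : amounts) total += a;
        return total;
    }

    // Fused: the same stages applied per row; no intermediate collections
    // ever exist, so memory stays flat.
    static double fusedTotal(List<String> lines) {
        double total = 0;
        for (String line : lines) total += Double.parseDouble(line.split(",")[1]);
        return total;
    }
}
```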

&lt;p&gt;The biggest realization:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The problem isn’t just data size; it’s &lt;em&gt;when&lt;/em&gt; and &lt;em&gt;how&lt;/em&gt; you materialize it.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  A Small Experiment: PureStream
&lt;/h2&gt;

&lt;p&gt;This exploration led me to build a small library: &lt;strong&gt;PureStream&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Not as a Spark replacement. Not as a “framework”.&lt;/p&gt;

&lt;p&gt;Just a lightweight way to experiment with streaming-first data pipelines in Java.&lt;/p&gt;

&lt;h3&gt;
  
  
  What it focuses on:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Zero external dependencies (Java 17+)&lt;/li&gt;
&lt;li&gt;Streaming-first transformations&lt;/li&gt;
&lt;li&gt;Familiar, fluent API (inspired by Streams)&lt;/li&gt;
&lt;li&gt;Basic CSV and JSON handling (with JSON flattening via dot notation)&lt;/li&gt;
&lt;/ul&gt;
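
&lt;p&gt;As a rough sketch of what dot-notation flattening means (my own minimal version, not necessarily how PureStream implements it), nested keys collapse into path-style column names:&lt;/p&gt;

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class JsonFlatten {

    // Flattens nested maps into dot-notation keys,
    // e.g. {"user": {"name": "a"}} becomes {"user.name": "a"}.
    static Map<String, Object> flatten(Map<String, Object> node) {
        Map<String, Object> out = new LinkedHashMap<>();
        flattenInto("", node, out);
        return out;
    }

    @SuppressWarnings("unchecked")
    private static void flattenInto(String prefix, Map<String, Object> node,
                                    Map<String, Object> out) {
        for (var e : node.entrySet()) {
            String key = prefix.isEmpty() ? e.getKey() : prefix + "." + e.getKey();
            if (e.getValue() instanceof Map<?, ?> child) {
                flattenInto(key, (Map<String, Object>) child, out); // recurse into nested object
            } else {
                out.put(key, e.getValue()); // leaf value: emit under its full path
            }
        }
    }
}
```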




&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;PureStream&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromCsv&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"transactions.csv"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getDouble&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"amount"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;1000.0&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;groupBy&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"region"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;agg&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;sum&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"amount"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;count&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;orderBy&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sort&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;sort&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;descDouble&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"sum_amount"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toJsonFile&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"report.json"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Benchmark Context
&lt;/h2&gt;

&lt;p&gt;Tested on a machine with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Java 17&lt;/li&gt;
&lt;li&gt;16GB RAM&lt;/li&gt;
&lt;li&gt;256GB SSD storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Dataset:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~10 million rows (~300MB CSV)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~40 seconds end-to-end processing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn’t meant to be a rigorous benchmark, just a sanity check that the approach is viable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where It Breaks
&lt;/h2&gt;

&lt;p&gt;This approach isn’t perfect, and I’m still exploring its limits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;groupBy&lt;/code&gt; and &lt;code&gt;merge&lt;/code&gt; require memory (no magic here)&lt;/li&gt;
&lt;li&gt;Current joins use a hash-based approach → not scalable for very large datasets&lt;/li&gt;
&lt;li&gt;Performance depends heavily on disk I/O&lt;/li&gt;
&lt;li&gt;API is still evolving&lt;/li&gt;
&lt;/ul&gt;
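
&lt;p&gt;For context on the join limitation, a classic hash join looks roughly like this (a simplified sketch of the general technique, not the library's code): build a hash table over one side, then probe it while streaming the other. The build side must fit in memory, which is exactly the scalability ceiling:&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashJoinSketch {

    // Joins two row sets on their first column. The 'small' side is fully
    // materialized in a HashMap (the memory-bound step); the 'large' side
    // can be streamed and never needs to fit in memory.
    static List<String> join(List<String[]> small, List<String[]> large) {
        Map<String, String[]> built = new HashMap<>();
        for (String[] row : small) built.put(row[0], row); // build phase
        List<String> out = new ArrayList<>();
        for (String[] row : large) {                        // probe phase
            String[] match = built.get(row[0]);
            if (match != null) out.add(row[0] + ":" + match[1] + "," + row[1]);
        }
        return out;
    }
}
```

&lt;p&gt;An external sort-merge join avoids the in-memory build by sorting both sides on disk first, which is why it is the natural next step for large joins.&lt;/p&gt;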

&lt;p&gt;One area I’m particularly interested in next:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Implementing an external sort-merge join to handle large joins with limited memory&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Bigger Question
&lt;/h2&gt;

&lt;p&gt;I don’t think tools like Spark are overkill.&lt;/p&gt;

&lt;p&gt;I think we’re missing a simpler layer for everyday data tasks, something between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Java Streams&lt;/li&gt;
&lt;li&gt;and full distributed systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Maybe this idea already exists and I’ve missed it.&lt;br&gt;
Maybe it’s not as useful as I think.&lt;/p&gt;




&lt;h2&gt;
  
  
  If You’re Curious
&lt;/h2&gt;

&lt;p&gt;Code is here if you want to explore or break it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/shridey/purestream" rel="noopener noreferrer"&gt;https://github.com/shridey/purestream&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Maven: &lt;a href="https://central.sonatype.com/artifact/io.github.shridey/purestream" rel="noopener noreferrer"&gt;https://central.sonatype.com/artifact/io.github.shridey/purestream&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Open Question
&lt;/h2&gt;

&lt;p&gt;How do you currently handle mid-sized datasets in Java?&lt;/p&gt;

&lt;p&gt;Do you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;stick with Streams and hope it fits in memory&lt;/li&gt;
&lt;li&gt;switch ecosystems (Python, Spark, etc.)&lt;/li&gt;
&lt;li&gt;or use something else entirely&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I’m curious if this “missing middle” is a real problem, or just something I’ve personally run into.&lt;/p&gt;

</description>
      <category>java</category>
      <category>performance</category>
      <category>datascience</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
