<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shreyas</title>
    <description>The latest articles on DEV Community by Shreyas (@dev-shreyas).</description>
    <link>https://dev.to/dev-shreyas</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3980948%2F70a0cc57-95d7-4dd6-b667-220c54fa9f60.png</url>
      <title>DEV Community: Shreyas</title>
      <link>https://dev.to/dev-shreyas</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dev-shreyas"/>
    <language>en</language>
    <item>
      <title>I Built a Stable Sorting Algorithm That Beats Java's Dual-Pivot Quicksort</title>
      <dc:creator>Shreyas</dc:creator>
      <pubDate>Fri, 12 Jun 2026 09:38:33 +0000</pubDate>
      <link>https://dev.to/dev-shreyas/i-built-a-stable-sorting-algorithm-that-beats-javas-dual-pivot-quicksort-fck</link>
      <guid>https://dev.to/dev-shreyas/i-built-a-stable-sorting-algorithm-that-beats-javas-dual-pivot-quicksort-fck</guid>
      <description>&lt;p&gt;A few days ago I finished benchmarking something I've been building - a cache-aware, stable, histogram-based sorting algorithm I'm calling &lt;strong&gt;BusSort&lt;/strong&gt;. The results surprised even me.&lt;/p&gt;

&lt;p&gt;At 100 million elements, it runs &lt;strong&gt;~2x faster than Java's Dual-Pivot Quicksort&lt;/strong&gt; on random data - while being &lt;strong&gt;stable&lt;/strong&gt;. Dual-Pivot QS is not.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem With Quicksort at Scale
&lt;/h2&gt;

&lt;p&gt;Quicksort-based algorithms partition elements with random writes across the entire array. At large scales this causes &lt;strong&gt;cache thrashing&lt;/strong&gt; - elements are being written to memory locations all over the place, constantly missing L1 and L2 cache.&lt;/p&gt;

&lt;p&gt;The larger the array, the worse it gets.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Idea
&lt;/h2&gt;

&lt;p&gt;Instead of scattering elements globally, BusSort processes data in &lt;strong&gt;L1 cache-sized chunks&lt;/strong&gt; - 4096 integers (~16KB).&lt;/p&gt;

&lt;p&gt;For each chunk, it does 4 passes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PASS 1&lt;/strong&gt; - Scan left-to-right, compute bucket for each element, build a local histogram&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PASS 2&lt;/strong&gt; - Compute local prefix sums (bucket positions within the chunk)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PASS 3&lt;/strong&gt; - Scatter into a local grouped buffer - because this buffer is L1-sized, all random writes stay &lt;strong&gt;in cache&lt;/strong&gt; ✅&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PASS 4&lt;/strong&gt; - Copy each bucket's portion to its correct global position&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With &lt;strong&gt;128-way splitting&lt;/strong&gt;, recursion depth stays at just ~4 levels even for 100M elements.&lt;/p&gt;

&lt;p&gt;Base case: &lt;strong&gt;Insertion Sort&lt;/strong&gt; for ≤ 1024 elements.&lt;/p&gt;

&lt;p&gt;On the benchmark machine (i5-1135G7, 48KB L1 data cache):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;4096 × 3 × 4 bytes = 49,152 bytes ≈ 48KB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The three working arrays fit &lt;strong&gt;exactly&lt;/strong&gt; in L1. Not a coincidence.&lt;/p&gt;




&lt;h2&gt;
  
  
  Benchmark Results
&lt;/h2&gt;

&lt;p&gt;Tested against &lt;code&gt;Arrays.sort(int[])&lt;/code&gt; - Java's &lt;strong&gt;Dual-Pivot Quicksort&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;n = 100,000,000 | Java 17 | i5-1135G7 @ 2.40GHz&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Input Type&lt;/th&gt;
&lt;th&gt;BusSort&lt;/th&gt;
&lt;th&gt;Dual-Pivot QS&lt;/th&gt;
&lt;th&gt;Ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Random&lt;/td&gt;
&lt;td&gt;3991ms&lt;/td&gt;
&lt;td&gt;8604ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~2x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sorted&lt;/td&gt;
&lt;td&gt;57ms&lt;/td&gt;
&lt;td&gt;104ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~2x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reverse&lt;/td&gt;
&lt;td&gt;280ms&lt;/td&gt;
&lt;td&gt;166ms&lt;/td&gt;
&lt;td&gt;0.6x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nearly Sorted&lt;/td&gt;
&lt;td&gt;2452ms&lt;/td&gt;
&lt;td&gt;2789ms&lt;/td&gt;
&lt;td&gt;~1.1x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Duplicates&lt;/td&gt;
&lt;td&gt;712ms&lt;/td&gt;
&lt;td&gt;2242ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~2.4x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Few Duplicates&lt;/td&gt;
&lt;td&gt;1295ms&lt;/td&gt;
&lt;td&gt;3185ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~2.3x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;All Same&lt;/td&gt;
&lt;td&gt;51ms&lt;/td&gt;
&lt;td&gt;32ms&lt;/td&gt;
&lt;td&gt;0.6x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Clustered&lt;/td&gt;
&lt;td&gt;1419ms&lt;/td&gt;
&lt;td&gt;2242ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~1.6x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Consistently faster on most input types. Stable. Zero comparison overhead.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The two losses (Reverse, All Same) are where Dual-Pivot QS has structural advantages - run detection and early termination.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why It's Stable
&lt;/h2&gt;

&lt;p&gt;BusSort preserves relative order of equal elements because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chunks are processed &lt;strong&gt;left-to-right&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Within each chunk, elements are scattered &lt;strong&gt;left-to-right&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;globalNext[b]&lt;/code&gt; advances per chunk - earlier chunks always land before later chunks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes BusSort one of very few algorithms that is both &lt;strong&gt;faster than Dual-Pivot Quicksort&lt;/strong&gt; and &lt;strong&gt;stable&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Trade-offs
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Requires &lt;strong&gt;O(n) auxiliary space&lt;/strong&gt; - standard for stable sorts&lt;/li&gt;
&lt;li&gt;Slower than Dual-Pivot Quicksort on already-sorted and reverse-sorted input (improvement planned)&lt;/li&gt;
&lt;li&gt;Currently supports &lt;code&gt;int[]&lt;/code&gt; only - generic support coming soon&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Generic object support with &lt;code&gt;Comparator&amp;lt;T&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Benchmark against TimSort on &lt;code&gt;Integer[]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Parallel variant&lt;/li&gt;
&lt;li&gt;JMH benchmarks for more rigorous measurement&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  GitHub
&lt;/h2&gt;

&lt;p&gt;This is a work in progress. Theory, benchmarks, and current implementation are published.&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;&lt;a href="https://github.com/dev-shreyaspatil/BusSort" rel="noopener noreferrer"&gt;github.com/dev-shreyaspatil/BusSort&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;Curious what the Java community thinks - especially around the cache-chunking approach and whether there are similar published algorithms I should be comparing against.&lt;/p&gt;

</description>
      <category>algorithms</category>
      <category>java</category>
      <category>performance</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
