<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shreyas</title>
    <description>The latest articles on DEV Community by Shreyas (@dev-shreyas).</description>
    <link>https://dev.to/dev-shreyas</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3980948%2F70a0cc57-95d7-4dd6-b667-220c54fa9f60.png</url>
      <title>DEV Community: Shreyas</title>
      <link>https://dev.to/dev-shreyas</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dev-shreyas"/>
    <language>en</language>
    <item>
      <title>I Built a Stable Sorting Algorithm That Beats Java's Dual-Pivot Quicksort and TimSort</title>
      <dc:creator>Shreyas</dc:creator>
      <pubDate>Fri, 12 Jun 2026 09:38:33 +0000</pubDate>
      <link>https://dev.to/dev-shreyas/i-built-a-stable-sorting-algorithm-that-beats-javas-dual-pivot-quicksort-fck</link>
      <guid>https://dev.to/dev-shreyas/i-built-a-stable-sorting-algorithm-that-beats-javas-dual-pivot-quicksort-fck</guid>
      <description>&lt;p&gt;A few days ago I finished benchmarking something I've been building - a cache-aware, stable, histogram-based sorting algorithm I'm calling &lt;strong&gt;BusSort&lt;/strong&gt;. The results surprised even me.&lt;/p&gt;

&lt;p&gt;At 100 million elements, it runs &lt;strong&gt;~2x faster than Java's Dual-Pivot Quicksort&lt;/strong&gt; on random data — while being &lt;strong&gt;stable&lt;/strong&gt;. Dual-Pivot QS is not. And the generic variant beats &lt;strong&gt;TimSort by up to 8.9x&lt;/strong&gt; on duplicate-heavy data.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem With Quicksort at Scale
&lt;/h2&gt;

&lt;p&gt;Quicksort-based algorithms partition elements with random writes across the entire array. At large scales this causes &lt;strong&gt;cache thrashing&lt;/strong&gt; - elements are being written to memory locations all over the place, constantly missing L1 and L2 cache.&lt;/p&gt;

&lt;p&gt;The larger the array, the worse it gets.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Idea
&lt;/h2&gt;

&lt;p&gt;Instead of scattering elements globally, BusSort processes data in &lt;strong&gt;L1 cache-sized chunks&lt;/strong&gt; - 4096 integers (~16KB).&lt;/p&gt;

&lt;p&gt;For each chunk, it does 4 passes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PASS 1&lt;/strong&gt; - Scan left-to-right, compute bucket for each element, build a local histogram&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PASS 2&lt;/strong&gt; - Compute local prefix sums (bucket positions within the chunk)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PASS 3&lt;/strong&gt; - Scatter into a local grouped buffer - because this buffer is L1-sized, all random writes stay &lt;strong&gt;in cache&lt;/strong&gt; ✅&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PASS 4&lt;/strong&gt; - Copy each bucket's portion to its correct global position&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With &lt;strong&gt;128-way splitting&lt;/strong&gt;, recursion depth stays at just ~4 levels even for 100M elements.&lt;/p&gt;

&lt;p&gt;Base case: &lt;strong&gt;Insertion Sort&lt;/strong&gt; for ≤ 1024 elements.&lt;/p&gt;

&lt;p&gt;On the benchmark machine (i5-1135G7, 48KB L1 data cache):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;4096 × 3 × 4 bytes = 49,152 bytes ≈ 48KB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The three working arrays fit &lt;strong&gt;exactly&lt;/strong&gt; in L1. Not a coincidence.&lt;/p&gt;




&lt;h2&gt;
  
  
  Benchmark Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;int[]&lt;/code&gt; vs Dual-Pivot Quicksort
&lt;/h3&gt;

&lt;p&gt;Tested against &lt;code&gt;Arrays.sort(int[])&lt;/code&gt; - Java's &lt;strong&gt;Dual-Pivot Quicksort&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;n = 100,000,000 | Java 18 | i3-1115G4 @ 3.00GHz&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Methodology: JMH 1.37 — 5 warmup iterations, 10 measurement iterations, 3 JVM forks (30 data points per benchmark)&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Input Type&lt;/th&gt;
&lt;th&gt;BusSort (ms)&lt;/th&gt;
&lt;th&gt;± Error&lt;/th&gt;
&lt;th&gt;DPQ (ms)&lt;/th&gt;
&lt;th&gt;± Error&lt;/th&gt;
&lt;th&gt;Ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Random&lt;/td&gt;
&lt;td&gt;3864&lt;/td&gt;
&lt;td&gt;±130&lt;/td&gt;
&lt;td&gt;8344&lt;/td&gt;
&lt;td&gt;±268&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;2.2x&lt;/strong&gt; ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Duplicates&lt;/td&gt;
&lt;td&gt;890&lt;/td&gt;
&lt;td&gt;±165&lt;/td&gt;
&lt;td&gt;2266&lt;/td&gt;
&lt;td&gt;±31&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;2.5x&lt;/strong&gt; ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Few Duplicates&lt;/td&gt;
&lt;td&gt;2232&lt;/td&gt;
&lt;td&gt;±273&lt;/td&gt;
&lt;td&gt;3382&lt;/td&gt;
&lt;td&gt;±62&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;1.5x&lt;/strong&gt; ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Clustered&lt;/td&gt;
&lt;td&gt;1527&lt;/td&gt;
&lt;td&gt;±35&lt;/td&gt;
&lt;td&gt;2394&lt;/td&gt;
&lt;td&gt;±44&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;1.6x&lt;/strong&gt; ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sorted&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;td&gt;±0.7&lt;/td&gt;
&lt;td&gt;33&lt;/td&gt;
&lt;td&gt;±1.3&lt;/td&gt;
&lt;td&gt;0.9x ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reverse&lt;/td&gt;
&lt;td&gt;191&lt;/td&gt;
&lt;td&gt;±2.9&lt;/td&gt;
&lt;td&gt;89&lt;/td&gt;
&lt;td&gt;±2.2&lt;/td&gt;
&lt;td&gt;0.5x ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nearly Sorted&lt;/td&gt;
&lt;td&gt;2726&lt;/td&gt;
&lt;td&gt;±190&lt;/td&gt;
&lt;td&gt;2303&lt;/td&gt;
&lt;td&gt;±33&lt;/td&gt;
&lt;td&gt;0.8x ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;All Same&lt;/td&gt;
&lt;td&gt;53&lt;/td&gt;
&lt;td&gt;±9.9&lt;/td&gt;
&lt;td&gt;35&lt;/td&gt;
&lt;td&gt;±0.5&lt;/td&gt;
&lt;td&gt;0.6x ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;4/8 input types faster. Stable. Zero comparison overhead.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The ± Error columns confirm these gaps are statistically significant — not noise. Full benchmark code is in the &lt;a href="https://github.com/dev-shreyaspatil/BusSort/tree/main/benchmarks" rel="noopener noreferrer"&gt;&lt;code&gt;benchmarks/&lt;/code&gt;&lt;/a&gt; folder if you want to reproduce.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;code&gt;T[]&lt;/code&gt; vs TimSort (Generic)
&lt;/h3&gt;

&lt;p&gt;Tested against &lt;code&gt;Arrays.sort(T[])&lt;/code&gt; - Java's &lt;strong&gt;TimSort&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;n = 40,000,000 | Java 17 | i5-1135G7 @ 2.40GHz&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Methodology: JMH 1.37 — 5 warmup iterations, 10 measurement iterations, 3 JVM forks (30 data points per benchmark)&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Input Type&lt;/th&gt;
&lt;th&gt;BusSort (ms)&lt;/th&gt;
&lt;th&gt;± Error&lt;/th&gt;
&lt;th&gt;TimSort (ms)&lt;/th&gt;
&lt;th&gt;± Error&lt;/th&gt;
&lt;th&gt;Ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Random&lt;/td&gt;
&lt;td&gt;7357&lt;/td&gt;
&lt;td&gt;±162&lt;/td&gt;
&lt;td&gt;21691&lt;/td&gt;
&lt;td&gt;±277&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;2.9x&lt;/strong&gt; ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Duplicates&lt;/td&gt;
&lt;td&gt;619&lt;/td&gt;
&lt;td&gt;±8&lt;/td&gt;
&lt;td&gt;5510&lt;/td&gt;
&lt;td&gt;±55&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;8.9x&lt;/strong&gt; ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Few Duplicates&lt;/td&gt;
&lt;td&gt;3680&lt;/td&gt;
&lt;td&gt;±90&lt;/td&gt;
&lt;td&gt;7412&lt;/td&gt;
&lt;td&gt;±114&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;2.0x&lt;/strong&gt; ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Clustered&lt;/td&gt;
&lt;td&gt;1804&lt;/td&gt;
&lt;td&gt;±97&lt;/td&gt;
&lt;td&gt;5599&lt;/td&gt;
&lt;td&gt;±103&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;3.1x&lt;/strong&gt; ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;All Same&lt;/td&gt;
&lt;td&gt;34&lt;/td&gt;
&lt;td&gt;±0.6&lt;/td&gt;
&lt;td&gt;34&lt;/td&gt;
&lt;td&gt;±0.8&lt;/td&gt;
&lt;td&gt;~1x ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sorted&lt;/td&gt;
&lt;td&gt;63&lt;/td&gt;
&lt;td&gt;±0.2&lt;/td&gt;
&lt;td&gt;61&lt;/td&gt;
&lt;td&gt;±0.3&lt;/td&gt;
&lt;td&gt;~1x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nearly Sorted&lt;/td&gt;
&lt;td&gt;2541&lt;/td&gt;
&lt;td&gt;±103&lt;/td&gt;
&lt;td&gt;1835&lt;/td&gt;
&lt;td&gt;±160&lt;/td&gt;
&lt;td&gt;0.7x ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reverse&lt;/td&gt;
&lt;td&gt;652&lt;/td&gt;
&lt;td&gt;±11&lt;/td&gt;
&lt;td&gt;313&lt;/td&gt;
&lt;td&gt;±2&lt;/td&gt;
&lt;td&gt;0.5x ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;5/8 input types faster. Naturally stable. 8.9x on duplicates.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Full benchmark code is in the &lt;a href="https://github.com/dev-shreyaspatil/BusSort/tree/main/generics/benchmarks" rel="noopener noreferrer"&gt;&lt;code&gt;generics/benchmarks/&lt;/code&gt;&lt;/a&gt; folder if you want to reproduce.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why It's Stable
&lt;/h2&gt;

&lt;p&gt;BusSort preserves relative order of equal elements naturally — no extra bookkeeping needed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chunks are processed &lt;strong&gt;left-to-right&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Within each chunk, elements are scattered &lt;strong&gt;left-to-right&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;globalNext[b]&lt;/code&gt; advances per chunk - earlier chunks always land before later chunks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes BusSort one of very few algorithms that is both &lt;strong&gt;faster than Dual-Pivot Quicksort and TimSort&lt;/strong&gt; and &lt;strong&gt;stable&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For &lt;code&gt;int[]&lt;/code&gt;, equal integers are identical by value — stability is technically unobservable. However, the stable reverse path is intentionally preserved for consistency with the stability guarantee and to serve as a reference for future ports.&lt;/p&gt;




&lt;h2&gt;
  
  
  Trade-offs
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Requires &lt;strong&gt;O(n) auxiliary space&lt;/strong&gt; — standard for stable sorts&lt;/li&gt;
&lt;li&gt;Generic variant always allocates O(n) auxiliary space — TimSort is adaptive on structured inputs&lt;/li&gt;
&lt;li&gt;Slower on already-sorted, reverse-sorted, and nearly-sorted input — TimSort's run detection is unbeatable on these cases&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Dynamic stack sizing&lt;/li&gt;
&lt;li&gt;Auto-tune &lt;code&gt;BUS_SIZE&lt;/code&gt; based on runtime L1 cache size&lt;/li&gt;
&lt;li&gt;Parallel variant&lt;/li&gt;
&lt;li&gt;Formal write-up / paper&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  GitHub
&lt;/h2&gt;

&lt;p&gt;Full implementation, benchmarks, and generic variant published.&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;&lt;a href="https://github.com/dev-shreyaspatil/BusSort" rel="noopener noreferrer"&gt;github.com/dev-shreyaspatil/BusSort&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;Curious what the Java community thinks — especially around the cache-chunking approach and whether there are similar published algorithms I should be comparing against.&lt;/p&gt;

</description>
      <category>algorithms</category>
      <category>java</category>
      <category>performance</category>
      <category>sorting</category>
    </item>
  </channel>
</rss>
