<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Joel Dsouza</title>
    <description>The latest articles on DEV Community by Joel Dsouza (@jdsouza).</description>
    <link>https://dev.to/jdsouza</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3869959%2F187beb94-da13-46bd-9b61-80314e9bac45.jpeg</url>
      <title>DEV Community: Joel Dsouza</title>
      <link>https://dev.to/jdsouza</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jdsouza"/>
    <language>en</language>
    <item>
      <title>How One Field in a Sort Query Brought Down Our OpenSearch Cluster</title>
      <dc:creator>Joel Dsouza</dc:creator>
      <pubDate>Fri, 10 Apr 2026 11:48:26 +0000</pubDate>
      <link>https://dev.to/jdsouza/how-one-field-in-a-sort-query-brought-down-our-opensearch-cluster-3fb</link>
      <guid>https://dev.to/jdsouza/how-one-field-in-a-sort-query-brought-down-our-opensearch-cluster-3fb</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We added &lt;code&gt;_id&lt;/code&gt; as a sort tie-breaker to fix non-deterministic pagination&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;_id&lt;/code&gt; is a metadata field with no doc values. Sorting on it loads the IDs into JVM heap via fielddata&lt;/li&gt;
&lt;li&gt;At scale, this pushed JVM memory pressure to 98–99%, tripped circuit breakers, and flooded us with 429 errors&lt;/li&gt;
&lt;li&gt;The fix: use a properly mapped &lt;code&gt;keyword&lt;/code&gt; field with doc values instead&lt;/li&gt;
&lt;li&gt;Lesson: know what lives on heap and what doesn't before you sort on it&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Picture This&lt;/h2&gt;

&lt;p&gt;You've just deployed what looks like a routine fix. A two-line change. A sort tie-breaker to handle non-deterministic pagination. Nothing that would raise an eyebrow in code review.&lt;/p&gt;

&lt;p&gt;Minutes after the deploy, your monitoring lights up. JVM pressure is spiking. Errors are flooding in. Your OpenSearch cluster, which was perfectly healthy moments ago, is struggling to stay alive.&lt;/p&gt;

&lt;p&gt;You didn't change your data. You didn't change your infrastructure. You added one field to a sort query.&lt;/p&gt;

&lt;p&gt;That's exactly what happened to us.&lt;/p&gt;

&lt;h2&gt;The Change&lt;/h2&gt;

&lt;p&gt;We had a query sorting results by &lt;code&gt;@timestamp&lt;/code&gt; in descending order. The problem: when multiple documents share the same timestamp, their relative order is non-deterministic. Paginated results were inconsistent: the same document could appear on different pages across requests.&lt;/p&gt;
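
&lt;p&gt;To make the ambiguity concrete, here is a minimal in-memory sketch. The &lt;code&gt;doc&lt;/code&gt; type, the &lt;code&gt;sortedIDs&lt;/code&gt; helper, and the sample values are made up for illustration and are not taken from our codebase. It sorts the same three documents, arriving in two different orders, first by timestamp alone and then by timestamp plus a unique ID.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sort"
)

// doc is a hypothetical stand-in for a search hit: a unique ID plus a timestamp.
type doc struct {
	ID        string
	Timestamp int64 // epoch millis
}

// sortedIDs sorts a copy of docs by timestamp descending. When tieBreak is
// true, documents with equal timestamps are further ordered by ID ascending,
// which makes the result independent of the input order.
func sortedIDs(docs []doc, tieBreak bool) []string {
	out := append([]doc(nil), docs...)
	sort.SliceStable(out, func(i, j int) bool {
		if out[i].Timestamp != out[j].Timestamp {
			return out[i].Timestamp > out[j].Timestamp
		}
		if tieBreak {
			return out[j].ID > out[i].ID
		}
		return false // tied: keep whatever order the input had
	})
	ids := make([]string, 0, len(out))
	for _, d := range out {
		ids = append(ids, d.ID)
	}
	return ids
}

func main() {
	// The same three documents, seen in two different arrival orders.
	first := []doc{{"b", 100}, {"a", 100}, {"c", 200}}
	second := []doc{{"a", 100}, {"b", 100}, {"c", 200}}

	fmt.Println(sortedIDs(first, false), sortedIDs(second, false)) // prints: [c b a] [c a b]
	fmt.Println(sortedIDs(first, true), sortedIDs(second, true))   // prints: [c a b] [c a b]
}
```

&lt;p&gt;With the timestamp alone, the two tied documents come back in whatever order they arrived. Once the ID is part of the comparison, both inputs produce the same order.&lt;/p&gt;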

&lt;p&gt;The standard fix for this is a tie-breaker: a secondary sort on a unique field to guarantee consistent ordering. And &lt;code&gt;_id&lt;/code&gt;, the document's unique identifier, seemed like the obvious choice.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Primary sort&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"@timestamp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Order&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;reporting&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="c"&gt;// Tie breaker&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Order&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;reporting&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Clean. Logical. Merged. Deployed.&lt;/p&gt;

&lt;h2&gt;The Fallout&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;JVM memory pressure jumped from ~84% to 98–99% almost immediately.&lt;/strong&gt; The fielddata cache was consuming heap faster than GC could reclaim it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F68ec8qxb6ypqd2ci21jk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F68ec8qxb6ypqd2ci21jk.png" alt="Maximum JVM Pressure" width="800" height="491"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Indexing latency climbed sharply as the JVM spent more time garbage collecting than actually indexing documents.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F63ddkk42mblriyg02s39.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F63ddkk42mblriyg02s39.png" alt="Indexing latency" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then the circuit breakers tripped. At peak, we saw &lt;strong&gt;over 4,000 4XX errors in a single minute&lt;/strong&gt;, almost all of them 429s. Writes were being dropped. Queries were timing out.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe0fnghlj1h2jxoy86i46.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe0fnghlj1h2jxoy86i46.png" alt="HTTP Requests by Response Code" width="800" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The cluster hadn't changed. Traffic hadn't changed. One field in a sort query did all of this.&lt;/p&gt;

&lt;h2&gt;Why &lt;code&gt;_id&lt;/code&gt; Is the Wrong Field to Sort On&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;_id&lt;/code&gt; is a &lt;a href="https://docs.opensearch.org/latest/mappings/metadata-fields/id/" rel="noopener noreferrer"&gt;metadata field&lt;/a&gt;. It is managed by OpenSearch, not by you.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It exists purely for document identification. It is indexed so that documents can be looked up by ID, but it has no doc values. And that last part is what causes the problem.&lt;/p&gt;

&lt;p&gt;When OpenSearch sorts on a regular field, say a &lt;code&gt;keyword&lt;/code&gt; field, it reads from &lt;strong&gt;&lt;a href="https://docs.opensearch.org/latest/mappings/mapping-parameters/doc-values/" rel="noopener noreferrer"&gt;doc values&lt;/a&gt;&lt;/strong&gt;: an on-disk, column-oriented structure built at index time. Fast, efficient, and read through the filesystem cache rather than held in JVM heap.&lt;/p&gt;

&lt;p&gt;When OpenSearch needs to sort on &lt;code&gt;_id&lt;/code&gt;, there are no doc values to read from. So it falls back to &lt;strong&gt;&lt;a href="https://docs.opensearch.org/latest/search-plugins/caching/field-data-cache/" rel="noopener noreferrer"&gt;fielddata&lt;/a&gt;&lt;/strong&gt;, a data structure that gets built and held in &lt;strong&gt;JVM heap memory at query time&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;At small scale, you won't notice this. At production scale, every query that hits that sort is loading data into heap. The fielddata cache grows. GC pressure builds. The circuit breaker, which exists specifically to protect the cluster from memory exhaustion, eventually trips and starts rejecting requests.&lt;/p&gt;
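
&lt;p&gt;One way to watch this happen is the fielddata circuit breaker reported by &lt;code&gt;GET _nodes/stats/breaker&lt;/code&gt;. The sketch below is a rough illustration, not our monitoring code: it parses a response of that shape and reports how close each node's fielddata breaker is to its limit. The &lt;code&gt;fielddataUsage&lt;/code&gt; helper and the sample payload are hypothetical, and only the fields the sketch needs are decoded.&lt;/p&gt;

```go
package main

import (
	"encoding/json"
	"fmt"
)

// breaker holds the two per-breaker fields this sketch reads from the
// nodes stats response.
type breaker struct {
	LimitBytes     int64 `json:"limit_size_in_bytes"`
	EstimatedBytes int64 `json:"estimated_size_in_bytes"`
}

// nodesStats keeps only the parts of the response needed here.
type nodesStats struct {
	Nodes map[string]struct {
		Breakers map[string]breaker `json:"breakers"`
	} `json:"nodes"`
}

// fielddataUsage returns, per node ID, the fraction of the fielddata breaker
// limit currently in use (0.0 to 1.0). Nodes without that breaker are skipped.
func fielddataUsage(body []byte) (map[string]float64, error) {
	stats := new(nodesStats)
	if err := json.Unmarshal(body, stats); err != nil {
		return nil, err
	}
	usage := make(map[string]float64)
	for id, node := range stats.Nodes {
		b, ok := node.Breakers["fielddata"]
		if !ok || b.LimitBytes == 0 {
			continue
		}
		usage[id] = float64(b.EstimatedBytes) / float64(b.LimitBytes)
	}
	return usage, nil
}

func main() {
	// Made-up sample payload; real values come from your own cluster.
	sample := []byte(`{"nodes":{"node-1":{"breakers":{"fielddata":{"limit_size_in_bytes":1000,"estimated_size_in_bytes":900,"tripped":3}}}}}`)
	usage, err := fielddataUsage(sample)
	if err != nil {
		panic(err)
	}
	fmt.Printf("node-1 fielddata breaker at %.0f%% of its limit\n", usage["node-1"]*100)
}
```

&lt;p&gt;A fielddata breaker that climbs right after a query change is a strong hint that something is sorting or aggregating on a field without doc values.&lt;/p&gt;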

&lt;p&gt;OpenSearch's own documentation is explicit about this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"If you need to sort by document ID, consider duplicating the ID value into another field with doc values enabled."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We were hitting a documented footgun. The kind that doesn't show up in staging, doesn't fail in code review, and only surfaces when real traffic hits it at scale.&lt;/p&gt;

&lt;h2&gt;The Fix&lt;/h2&gt;

&lt;p&gt;We already stored an &lt;code&gt;id&lt;/code&gt; field in our documents. We just weren't using it for sorting. The fix was replacing &lt;code&gt;_id&lt;/code&gt; (metadata, no doc values) with &lt;code&gt;id.keyword&lt;/code&gt; (a properly mapped keyword subfield with doc values enabled by default):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"@timestamp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Order&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;reporting&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"id.keyword"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Order&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;reporting&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the mapping defined in our index template:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"keyword"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"index"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"doc_values"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"fields"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"keyword"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"keyword"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"ignore_above"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;id.keyword&lt;/code&gt; is a &lt;code&gt;keyword&lt;/code&gt; type. Keyword fields have doc values enabled by default. Sorting reads from disk, not heap. No fielddata. No memory explosion.&lt;/p&gt;
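
&lt;p&gt;For readers who build requests without a wrapper like the one above, here is a rough sketch of the equivalent query DSL body. It assumes the sort entries end up as a standard &lt;code&gt;sort&lt;/code&gt; array; the &lt;code&gt;sortClause&lt;/code&gt; and &lt;code&gt;searchBody&lt;/code&gt; helpers and the page size are hypothetical and are not part of our &lt;code&gt;reporting&lt;/code&gt; package.&lt;/p&gt;

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sortClause builds one entry of the query DSL "sort" array,
// for example {"@timestamp": {"order": "desc"}}.
func sortClause(field, order string) map[string]any {
	return map[string]any{field: map[string]string{"order": order}}
}

// searchBody returns a request body that sorts by timestamp (newest first)
// and breaks ties on a keyword subfield backed by doc values.
func searchBody(size int) ([]byte, error) {
	body := map[string]any{
		"size": size,
		"sort": []any{
			sortClause("@timestamp", "desc"),
			sortClause("id.keyword", "asc"),
		},
	}
	return json.Marshal(body)
}

func main() {
	b, err := searchBody(50)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
	// prints: {"size":50,"sort":[{"@timestamp":{"order":"desc"}},{"id.keyword":{"order":"asc"}}]}
}
```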

&lt;p&gt;We deployed. JVM came back down. Cluster stabilized.&lt;/p&gt;

&lt;h2&gt;What This Should Change About How You Think About Mappings&lt;/h2&gt;

&lt;p&gt;This incident is really about two things: a wrong field choice, and a broader lesson about how OpenSearch physically stores and accesses data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Never sort on &lt;code&gt;_id&lt;/code&gt; or other metadata fields&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Metadata fields like &lt;code&gt;_id&lt;/code&gt;, &lt;code&gt;_index&lt;/code&gt;, and &lt;code&gt;_type&lt;/code&gt; are managed internally by OpenSearch. They don't have doc values. Sorting on them forces fielddata, and fielddata at scale is a liability. If you need to sort by document ID, store it explicitly in your mapping as a &lt;code&gt;keyword&lt;/code&gt; field.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Doc values vs fielddata: know the difference&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Doc Values&lt;/th&gt;
&lt;th&gt;Fielddata&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Storage&lt;/td&gt;
&lt;td&gt;On disk&lt;/td&gt;
&lt;td&gt;JVM heap&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Built&lt;/td&gt;
&lt;td&gt;At index time&lt;/td&gt;
&lt;td&gt;At query time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance&lt;/td&gt;
&lt;td&gt;Efficient&lt;/td&gt;
&lt;td&gt;Expensive at scale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enabled by default&lt;/td&gt;
&lt;td&gt;
Yes, for &lt;code&gt;keyword&lt;/code&gt;, numeric, and date fields&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Any field you intend to sort or aggregate on needs doc values. If a field doesn't have them, OpenSearch either refuses to sort on it or falls back to fielddata. Neither is what you want in production.&lt;/p&gt;
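
&lt;p&gt;A check like this can be automated before a sort field ever reaches production. The sketch below is a simplified illustration, not a complete mapping parser: &lt;code&gt;sortableFields&lt;/code&gt; and &lt;code&gt;docValuesByDefault&lt;/code&gt; are hypothetical names, the type list only covers the types named in the table above, and nested &lt;code&gt;properties&lt;/code&gt; objects are not handled. It walks a mapping and lists the fields, including multi-fields, that still have doc values.&lt;/p&gt;

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// docValuesByDefault covers the types named in the table above; it is not exhaustive.
var docValuesByDefault = map[string]bool{
	"keyword": true, "date": true,
	"long": true, "integer": true, "short": true, "byte": true,
	"double": true, "float": true,
}

// sortableFields returns the names of fields (including multi-fields such as
// "id.keyword") whose mapping leaves doc values enabled, sorted alphabetically.
func sortableFields(prefix string, props map[string]any) []string {
	var names []string
	for name, raw := range props {
		field, ok := raw.(map[string]any)
		if !ok {
			continue
		}
		full := prefix + name
		typ, _ := field["type"].(string)
		enabled := docValuesByDefault[typ]
		// An explicit "doc_values": false switches them off for this field only.
		if v, set := field["doc_values"].(bool); set {
			if !v {
				enabled = false
			}
		}
		if enabled {
			names = append(names, full)
		}
		if sub, ok := field["fields"].(map[string]any); ok {
			names = append(names, sortableFields(full+".", sub)...)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	// The "id" mapping from earlier in this post, plus a made-up text field.
	mapping := []byte(`{
		"id": {
			"type": "keyword", "index": false, "doc_values": false,
			"fields": {"keyword": {"type": "keyword", "ignore_above": 256}}
		},
		"message": {"type": "text"}
	}`)
	props := new(map[string]any)
	if err := json.Unmarshal(mapping, props); err != nil {
		panic(err)
	}
	fmt.Println(sortableFields("", *props)) // prints: [id.keyword]
}
```

&lt;p&gt;Only &lt;code&gt;id.keyword&lt;/code&gt; passes: the parent &lt;code&gt;id&lt;/code&gt; field has doc values switched off, and the &lt;code&gt;text&lt;/code&gt; field never had them.&lt;/p&gt;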

&lt;p&gt;&lt;strong&gt;3. Use &lt;a href="https://docs.opensearch.org/latest/mappings/supported-field-types/keyword/" rel="noopener noreferrer"&gt;&lt;code&gt;keyword&lt;/code&gt;&lt;/a&gt; for fields that don't need full-text search&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Dynamic mappings are convenient, but they're a trap. By default, OpenSearch maps a new string field as &lt;code&gt;text&lt;/code&gt; with a &lt;code&gt;keyword&lt;/code&gt; subfield. The &lt;code&gt;text&lt;/code&gt; part is tokenized and analysed for full-text search, but it cannot be efficiently sorted or aggregated; only the &lt;code&gt;keyword&lt;/code&gt; subfield can.&lt;/p&gt;

&lt;p&gt;For fields like IDs, statuses, categories, and tags, anything where you don't need text search, define them explicitly as &lt;code&gt;keyword&lt;/code&gt; in your index templates. You get doc values for free, sorting works correctly, and there's no heap pressure.&lt;/p&gt;

&lt;p&gt;Explicit mappings are not overhead. They are the contract between your application and your cluster. Define them upfront, especially in high-volume environments.&lt;/p&gt;
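
&lt;p&gt;As a rough sketch of what that contract can look like, the snippet below builds the body for a composable index template (&lt;code&gt;PUT _index_template/{name}&lt;/code&gt;) that maps a list of fields explicitly as &lt;code&gt;keyword&lt;/code&gt;. The &lt;code&gt;keywordTemplate&lt;/code&gt; helper, the &lt;code&gt;events-*&lt;/code&gt; pattern, and the field names are assumptions for illustration, not our actual template.&lt;/p&gt;

```go
package main

import (
	"encoding/json"
	"fmt"
)

// keywordTemplate builds the body for PUT _index_template/{name}, mapping
// every listed field explicitly as keyword (doc values on by default).
func keywordTemplate(pattern string, fields ...string) ([]byte, error) {
	props := make(map[string]any, len(fields))
	for _, f := range fields {
		props[f] = map[string]string{"type": "keyword"}
	}
	body := map[string]any{
		"index_patterns": []string{pattern},
		"template": map[string]any{
			"mappings": map[string]any{"properties": props},
		},
	}
	return json.Marshal(body)
}

func main() {
	// Hypothetical index pattern and field names.
	b, err := keywordTemplate("events-*", "status", "category", "tags")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```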

&lt;h2&gt;Wrapping Up&lt;/h2&gt;

&lt;p&gt;A two-line change to a sort query caused a production OpenSearch cluster to spike to near 100% JVM pressure, trip circuit breakers, and drop writes, all because &lt;code&gt;_id&lt;/code&gt; doesn't have doc values.&lt;/p&gt;

&lt;p&gt;The fix was equally small. But making it required understanding that in OpenSearch, not all fields are equal. Some sort from disk, some sort from heap, and at scale that difference is everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Know your mappings. Know your fields. Know what lives on heap and what doesn't.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;About the Author&lt;/h2&gt;

&lt;p&gt;Hey, I'm Joel Dsouza, a software developer working at &lt;a href="https://aurva.io/" rel="noopener noreferrer"&gt;Aurva&lt;/a&gt;. I mostly work on backend systems and occasionally run into interesting problems worth writing about.&lt;/p&gt;

&lt;p&gt;If this resonated or you just want to connect, find me on &lt;a href="https://www.linkedin.com/in/joel-macklyn-dsouza/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>opensearch</category>
      <category>incident</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
