<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Prithvi S</title>
    <description>The latest articles on DEV Community by Prithvi S (@iprithv).</description>
    <link>https://dev.to/iprithv</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3869317%2Fe48d8dde-3457-4eca-881a-f414fac5b86e.jpg</url>
      <title>DEV Community: Prithvi S</title>
      <link>https://dev.to/iprithv</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/iprithv"/>
    <language>en</language>
    <item>
      <title>Credential Vending in Apache Polaris: Securing Data Access Without Sharing Keys</title>
      <dc:creator>Prithvi S</dc:creator>
      <pubDate>Thu, 30 Apr 2026 00:30:40 +0000</pubDate>
      <link>https://dev.to/iprithv/credential-vending-in-apache-polaris-securing-data-access-without-sharing-keys-3g6m</link>
      <guid>https://dev.to/iprithv/credential-vending-in-apache-polaris-securing-data-access-without-sharing-keys-3g6m</guid>
      <description>&lt;h1&gt;
  
  
  Credential Vending in Apache Polaris: Securing Data Access Without Sharing Keys
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;By Prithvi S – Staff Software Engineer at Cloudera&lt;/em&gt;  &lt;/p&gt;




&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In modern data architectures, managing who can access what data is a constant challenge. Traditional approaches rely on long‑lived access keys or service accounts that are difficult to rotate and can become a security liability. Apache Polaris tackles this problem head‑on with a built‑in &lt;strong&gt;credential vending&lt;/strong&gt; mechanism. Instead of distributing static keys, Polaris mints short‑lived, scoped credentials on demand, giving each request exactly the permissions it needs and expiring them after a few minutes.&lt;/p&gt;

&lt;p&gt;This post walks through the design, implementation, and benefits of credential vending in Polaris. It also shows how the feature integrates with the rest of the system, discusses best practices, and provides a practical example of using the API.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Credential Vending?
&lt;/h2&gt;

&lt;p&gt;Data engineers and scientists often need to read or write to cloud storage (S3, GCS, Azure) as part of their pipelines. Giving them permanent access keys creates several problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Key leakage&lt;/strong&gt; – a single compromised key can expose an entire bucket.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rotation overhead&lt;/strong&gt; – keys must be rotated regularly, which is operationally heavy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Over‑broad permissions&lt;/strong&gt; – static keys usually carry far more access than any single job needs, violating the principle of least privilege.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Credential vending solves these issues by generating &lt;strong&gt;short‑lived, scoped tokens&lt;/strong&gt; that are tied to a specific operation (read‑only, read‑write) and a narrow resource path. Tokens expire after a configurable period (default ~15 minutes) and can be revoked instantly if needed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;At a high level, the credential vending flow proceeds as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/image1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/image1.jpg" alt="Credential Vending Diagram" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Client Request&lt;/strong&gt; – An engine (Spark, Flink, Trino) sends an HTTP request to Polaris to perform an action on a table.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auth Check&lt;/strong&gt; – Polaris authorizes the request using its two‑tier RBAC model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage Lookup&lt;/strong&gt; – The system determines which cloud storage backend backs the catalog (S3, GCS, Azure).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Credential Minting&lt;/strong&gt; – Polaris calls the cloud provider’s token service (AWS STS, GCS token API, Azure AD) to create a temporary token with the exact permissions required.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Response&lt;/strong&gt; – The temporary credential is returned to the client, which uses it for the subsequent data operation.&lt;/li&gt;
&lt;/ol&gt;
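&lt;p&gt;To make the five steps concrete, here is a minimal Python sketch of a vending handler. It is purely illustrative – the role and catalog tables, &lt;code&gt;mint_token&lt;/code&gt;, and &lt;code&gt;vend_credential&lt;/code&gt; are hypothetical stand‑ins for Polaris internals, not real APIs:&lt;/p&gt;

```python
import time

# Hypothetical in-memory stand-ins for the components described above.
ROLE_PRIVILEGES = {"analyst": {"TABLE_READ_DATA"}}           # RBAC grants
CATALOG_BACKENDS = {"analytics": ("aws", "s3://my-bucket")}  # catalog -> storage

def mint_token(provider, scope, ttl_seconds=900):
    # In Polaris this step would call AWS STS, the GCS token API, or Azure AD.
    return {"provider": provider, "scope": scope,
            "expires_at": time.time() + ttl_seconds}

def vend_credential(principal_role, catalog, path, privilege):
    # Steps 1-2: authorize the request against the RBAC model.
    if privilege not in ROLE_PRIVILEGES.get(principal_role, set()):
        raise PermissionError(f"{principal_role} lacks {privilege}")
    # Step 3: look up which cloud backend serves this catalog.
    provider, base = CATALOG_BACKENDS[catalog]
    # Steps 4-5: mint a short-lived token scoped to the exact path and return it.
    return mint_token(provider, f"{base}/{path}")

cred = vend_credential("analyst", "analytics", "sales/transactions", "TABLE_READ_DATA")
print(cred["provider"], cred["scope"])
```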




&lt;h2&gt;
  
  
  Deep Dive: How Polaris Mints Credentials
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Authorization Layer
&lt;/h3&gt;

&lt;p&gt;Polaris first evaluates the request against its &lt;strong&gt;RBAC model&lt;/strong&gt;. The model consists of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Principal Roles&lt;/strong&gt; – assigned to users, service accounts, or automated agents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Catalog Roles&lt;/strong&gt; – define privileges on catalog objects (e.g., &lt;code&gt;TABLE_READ_DATA&lt;/code&gt;, &lt;code&gt;TABLE_WRITE_DATA&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PolarisAuthorizer&lt;/strong&gt; – resolves the effective privileges for the request.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Only if the request has the required privilege does Polaris proceed to credential vending.&lt;/p&gt;
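&lt;p&gt;As a rough sketch, the two tiers can be modelled as two lookup tables – principal roles map to catalog roles, and catalog roles map to privileges. All names below are hypothetical; the real &lt;code&gt;PolarisAuthorizer&lt;/code&gt; resolves far more context:&lt;/p&gt;

```python
# Hypothetical two-tier RBAC resolution: principal roles are granted
# catalog roles, and catalog roles carry privileges on catalog objects.
PRINCIPAL_TO_CATALOG_ROLES = {"etl_service": {"sales_reader"}}
CATALOG_ROLE_PRIVILEGES = {"sales_reader": {"TABLE_READ_DATA"}}

def effective_privileges(principal_role):
    # Union of privileges from every catalog role the principal holds.
    privs = set()
    for catalog_role in PRINCIPAL_TO_CATALOG_ROLES.get(principal_role, set()):
        privs |= CATALOG_ROLE_PRIVILEGES.get(catalog_role, set())
    return privs

def is_authorized(principal_role, required):
    return required in effective_privileges(principal_role)

print(is_authorized("etl_service", "TABLE_READ_DATA"))   # True
print(is_authorized("etl_service", "TABLE_WRITE_DATA"))  # False
```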

&lt;h3&gt;
  
  
  2. Storage Integration
&lt;/h3&gt;

&lt;p&gt;Polaris supports three major cloud storage providers via the &lt;strong&gt;PolarisStorageIntegration&lt;/strong&gt; interface. Each implementation knows how to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Translate a &lt;strong&gt;credential scope&lt;/strong&gt; (e.g., &lt;code&gt;s3://my-bucket/path/&lt;/code&gt;) into a provider‑specific request.&lt;/li&gt;
&lt;li&gt;Call the provider’s temporary credential service.&lt;/li&gt;
&lt;li&gt;Apply any additional constraints (IP allow‑list, expiration window).&lt;/li&gt;
&lt;/ul&gt;
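&lt;p&gt;A Python analogue of that interface might look like the following sketch (the class and method names are illustrative, not the actual Java &lt;code&gt;PolarisStorageIntegration&lt;/code&gt; API):&lt;/p&gt;

```python
from abc import ABC, abstractmethod

# Illustrative analogue of the storage-integration idea: each provider
# translates a credential scope into its own temporary-credential request.
class StorageIntegration(ABC):
    @abstractmethod
    def vend(self, scope: str, writable: bool, ttl_seconds: int) -> dict:
        ...

class S3Integration(StorageIntegration):
    def vend(self, scope, writable, ttl_seconds=900):
        # Narrow the action list to what the operation actually needs.
        actions = ["s3:GetObject"] + (["s3:PutObject"] if writable else [])
        # A real implementation would pass this policy to AWS STS.
        return {"scope": scope, "actions": actions, "ttl": ttl_seconds}

integ = S3Integration()
print(integ.vend("s3://my-bucket/path/", writable=False))
```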

&lt;h4&gt;
  
  
  AWS Example
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;AssumeRoleRequest&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AssumeRoleRequest&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;roleArn&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;storageConfig&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getAwsRoleArn&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;durationSeconds&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;900&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;// 15 minutes&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;policy&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scopedPolicy&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;// restrict to specific bucket/prefix&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="nc"&gt;Credentials&lt;/span&gt; &lt;span class="n"&gt;creds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stsClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;assumeRole&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;credentials&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  GCS Example
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;GoogleCredentials&lt;/span&gt; &lt;span class="n"&gt;scoped&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GoogleCredentials&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createFromSecret&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;storageConfig&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getServiceAccountJson&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createScoped&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"https://www.googleapis.com/auth/devstorage.read_write"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createDelegated&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;storageConfig&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getServiceAccountEmail&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;span class="nc"&gt;AccessToken&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scoped&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;refreshAccessToken&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Token Construction and Caching
&lt;/h3&gt;

&lt;p&gt;After receiving the provider token, Polaris wraps it in a &lt;strong&gt;PolarisCredential&lt;/strong&gt; object that includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provider name (aws, gcs, azure)&lt;/li&gt;
&lt;li&gt;Expiration timestamp&lt;/li&gt;
&lt;li&gt;Scoped resource path&lt;/li&gt;
&lt;li&gt;Original request ID for tracing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Polaris also caches tokens for a short window to reduce provider calls when identical scopes are requested repeatedly.&lt;/p&gt;
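&lt;p&gt;A simplified version of such a cache keys entries by provider and scope, and re‑mints only when a token is close to expiry. This is an illustrative sketch with an assumed refresh margin, not the Polaris implementation:&lt;/p&gt;

```python
import time

# Assumed safety margin: never hand out a token about to expire.
REFRESH_MARGIN = 60  # seconds

class CredentialCache:
    def __init__(self, minter):
        self._minter = minter
        self._cache = {}  # (provider, scope) -> credential dict

    def get(self, provider, scope):
        key = (provider, scope)
        cred = self._cache.get(key)
        # Re-mint on a cache miss or when inside the refresh margin.
        if cred is None or time.time() > cred["expires_at"] - REFRESH_MARGIN:
            cred = self._minter(provider, scope)
            self._cache[key] = cred
        return cred

calls = []
def fake_mint(provider, scope):
    calls.append(scope)
    return {"expires_at": time.time() + 900}

cache = CredentialCache(fake_mint)
cache.get("aws", "s3://b/p")
cache.get("aws", "s3://b/p")  # identical scope: served from cache
print(len(calls))  # 1
```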




&lt;h2&gt;
  
  
  Benefits in Real‑World Deployments
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benefit&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reduced Blast Radius&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Compromise of a short‑lived token limits exposure to a few minutes and a narrow path.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Automatic Revocation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tokens expire automatically; administrators can also invalidate the cache to force re‑minting.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Compliance Friendly&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Auditable token issuance logs simplify regulatory reporting.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Operational Simplicity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No need to rotate static keys; credential lifecycle is managed by Polaris.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Practical Example: Reading a Table from Spark
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.polaris.client.PolarisClient&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.sql.SparkSession&lt;/span&gt;

&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;polaris&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;PolarisClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"https://polaris.mycompany.com"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;authToken&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Bearer &amp;lt;user‑jwt&amp;gt;"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;build&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;cred&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;polaris&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getTemporaryCredential&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;catalog&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"analytics"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;namespace&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"sales"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"transactions"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;privilege&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"TABLE_READ_DATA"&lt;/span&gt;
&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// Spark can now read directly using the temporary S3 credentials&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;df&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;SparkSession&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;appName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"PolarisDemo"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getOrCreate&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;

&lt;span class="nv"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;read&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;format&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"iceberg"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"fs.s3a.access.key"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;cred&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;accessKey&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"fs.s3a.secret.key"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;cred&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;secretKey&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"fs.s3a.session.token"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;cred&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;sessionToken&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;load&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"s3://my‑bucket/analytics/sales/transactions"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="nv"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;show&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Spark job never sees a permanent AWS key; it receives a scoped token that expires after 15 minutes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Best Practices for Using Credential Vending
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Limit Scope Aggressively&lt;/strong&gt; – Scope each token to only the bucket and prefix the request truly needs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set Short Expiration&lt;/strong&gt; – A lifetime of 5‑15 minutes is usually sufficient for a data pipeline step.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache Wisely&lt;/strong&gt; – Enable short‑term caching to reduce provider latency, but ensure cache invalidation on role changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor Token Usage&lt;/strong&gt; – Polaris logs each token issuance; integrate with your observability stack to detect anomalies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rotate Underlying IAM Roles&lt;/strong&gt; – Even though tokens are short‑lived, the underlying IAM role should be rotated periodically.&lt;/li&gt;
&lt;/ol&gt;
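&lt;p&gt;Practices 1 and 2 come together in the session policy passed to the cloud token service. The sketch below builds such a policy for AWS; the bucket, prefix, and role ARN are placeholders, and the final &lt;code&gt;boto3&lt;/code&gt; call is shown only as a comment:&lt;/p&gt;

```python
import json

# Illustrative session policy that narrows an assumed role to a single
# bucket/prefix (best practice 1). All resource names are placeholders.
def scoped_read_policy(bucket, prefix):
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": [f"arn:aws:s3:::{bucket}/{prefix}*"],
        }],
    }

request = {
    "RoleArn": "arn:aws:iam::123456789012:role/polaris-vendor",  # placeholder
    "RoleSessionName": "polaris-vend",
    "Policy": json.dumps(scoped_read_policy("my-bucket", "analytics/sales/")),
    "DurationSeconds": 900,  # best practice 2: keep expiry short
}
# With boto3, this dict would be passed as:
#   boto3.client("sts").assume_role(**request)
print(request["DurationSeconds"])
```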








&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Apache Polaris’ credential vending mechanism provides a modern, secure alternative to static access keys. By issuing short‑lived, scoped tokens on demand, Polaris reduces the attack surface, simplifies compliance, and aligns with the principle of least privilege. As data pipelines continue to scale and integrate with multiple cloud providers, such dynamic credential management becomes a cornerstone of a robust data governance strategy.&lt;/p&gt;

&lt;p&gt;If you want to try it yourself, check out the Polaris GitHub repository and the official documentation. Feel free to reach out with questions or share your own experiences – secure data access is a community effort.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Author Bio&lt;/em&gt;: I'm Prithvi S, Staff Software Engineer at Cloudera and an open‑source enthusiast. Follow my work on GitHub: &lt;a href="https://github.com/iprithv" rel="noopener noreferrer"&gt;https://github.com/iprithv&lt;/a&gt;&lt;/p&gt;

</description>
      <category>polaris</category>
      <category>security</category>
      <category>api</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Improving Search Relevance in OpenSearch: A Practical Guide for Engineers</title>
      <dc:creator>Prithvi S</dc:creator>
      <pubDate>Mon, 27 Apr 2026 14:03:41 +0000</pubDate>
      <link>https://dev.to/iprithv/improving-search-relevance-in-opensearch-a-practical-guide-for-engineers-37k9</link>
      <guid>https://dev.to/iprithv/improving-search-relevance-in-opensearch-a-practical-guide-for-engineers-37k9</guid>
      <description>&lt;p&gt;OpenSearch powers search at scale for many organizations, but raw relevance scores often need fine‑tuning to match user expectations. The &lt;strong&gt;Search Relevance&lt;/strong&gt; plugin gives engineers a structured way to evaluate, experiment, and improve relevance without writing custom code for every change. In this post we walk through a complete workflow: from defining a query set to running experiments, measuring metrics, and applying the insights to boost search quality.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Why Search Relevance Matters
&lt;/h2&gt;

&lt;p&gt;Even the most powerful search engine can return results that feel irrelevant if the underlying scoring models are not aligned with the business problem. Users expect the most useful documents at the top, and a mismatch can increase bounce rates, reduce conversions, and erode trust. The Search Relevance plugin provides a repeatable process to &lt;strong&gt;measure&lt;/strong&gt; relevance, &lt;strong&gt;experiment&lt;/strong&gt; with configurations, and &lt;strong&gt;iterate&lt;/strong&gt; based on data‑driven metrics.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Core Concepts
&lt;/h2&gt;

&lt;h3&gt;
  
  
  2.1 Query Sets
&lt;/h3&gt;

&lt;p&gt;A &lt;em&gt;query set&lt;/em&gt; is a collection of representative user queries that you want to evaluate. Each entry includes the query text, optional filters, and a unique identifier. Building a good query set is critical: it should cover the most common intents, edge cases, and any domain‑specific terminology.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.2 Experiments
&lt;/h3&gt;

&lt;p&gt;An &lt;em&gt;experiment&lt;/em&gt; ties a query set to one or more &lt;em&gt;search configurations&lt;/em&gt;. A configuration may adjust analyzers, boosting rules, or ranking functions. Experiments run the queries against each configuration and collect judgments for every result.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.3 Judgments
&lt;/h3&gt;

&lt;p&gt;Judgments capture the perceived relevance of a document for a given query. They can be &lt;strong&gt;explicit&lt;/strong&gt; (human annotators rating relevance on a scale) or &lt;strong&gt;implicit&lt;/strong&gt; (click‑through, dwell time). The plugin stores judgments in internal system indexes, making them available for metric computation.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Building a Search Quality Evaluation Pipeline
&lt;/h2&gt;

&lt;p&gt;The following steps outline a practical pipeline that you can adapt to any OpenSearch cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.1 Create the Query Set
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;### Example JSON for a query set stored in index .search-relevance-query-set&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s2"&gt;"http://localhost:9200/.search-relevance-query-set/_doc"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "name": "ecommerce‑top‑queries",
    "queries": [
      {"id": "q1", "query": "wireless headphones"},
      {"id": "q2", "query": "best laptop for developers"},
      {"id": "q3", "query": "budget travel insurance"}
    ]
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3.2 Define Search Configurations
&lt;/h3&gt;

&lt;p&gt;Each configuration lives in index &lt;code&gt;.search-relevance-config&lt;/code&gt;. You can adjust analyzers, boost fields, or enable function scoring.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s2"&gt;"http://localhost:9200/.search-relevance-config/_doc"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "name": "baseline",
    "settings": {"boost": 1.0},
    "analyzer": "standard"
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create additional configs (e.g., &lt;code&gt;custom‑boost‑titles&lt;/code&gt;) to compare against the baseline.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.3 Launch the Experiment
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s2"&gt;"http://localhost:9200/.search-relevance-experiment/_doc"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "name": "Q1-baseline-vs-custom",
    "query_set": "ecommerce‑top‑queries",
    "configs": ["baseline", "custom‑boost‑titles"],
    "type": "PAIRWISE_COMPARISON"
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The experiment moves to &lt;strong&gt;RUNNING&lt;/strong&gt; state and the plugin orchestrates query execution across the selected configs.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Collecting Judgments
&lt;/h2&gt;

&lt;p&gt;The UI of the Search Relevance Dashboards presents side‑by‑side results for each config. Evaluators choose a relevance rating (e.g., 0‑3) for each document. Implicit signals such as click‑through are also recorded automatically.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; Use a small group of domain experts for explicit judgments and supplement with implicit data from production traffic.&lt;/p&gt;
&lt;/blockquote&gt;
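&lt;p&gt;Implicit signals can be folded into the same grading scale with a simple heuristic. The thresholds below are assumptions for illustration – the plugin's own conversion logic may differ:&lt;/p&gt;

```python
# Hypothetical heuristic: convert click/dwell signals into 0-3 relevance
# grades comparable to explicit annotator ratings.
def implicit_grade(clicked, dwell_seconds):
    if not clicked:
        return 0
    if dwell_seconds >= 60:
        return 3   # long engagement: highly relevant
    if dwell_seconds >= 10:
        return 2
    return 1       # click-and-bounce: marginally relevant

events = [
    {"query": "q1", "doc": "d1", "clicked": True,  "dwell_seconds": 75},
    {"query": "q1", "doc": "d2", "clicked": True,  "dwell_seconds": 4},
    {"query": "q1", "doc": "d3", "clicked": False, "dwell_seconds": 0},
]
judgments = [{"query": e["query"], "doc": e["doc"],
              "grade": implicit_grade(e["clicked"], e["dwell_seconds"])}
             for e in events]
print([j["grade"] for j in judgments])  # [3, 1, 0]
```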




&lt;h2&gt;
  
  
  5. Computing Metrics
&lt;/h2&gt;

&lt;p&gt;After the experiment completes, the plugin calculates common relevance metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;nDCG (Normalized Discounted Cumulative Gain)&lt;/strong&gt; – accounts for position bias and relevance grades.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Precision@k&lt;/strong&gt; – proportion of relevant results in the top &lt;em&gt;k&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recall@k&lt;/strong&gt; – coverage of relevant documents in the top &lt;em&gt;k&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MRR (Mean Reciprocal Rank)&lt;/strong&gt; – average of the reciprocal rank of the first relevant result.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The results are stored in index &lt;code&gt;.search-relevance-metrics&lt;/code&gt; and can be queried via the REST API.&lt;/p&gt;
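&lt;p&gt;For intuition, each of these metrics takes only a few lines to compute from graded judgments. The following self‑contained sketch mirrors the standard definitions (it is not the plugin's code):&lt;/p&gt;

```python
import math

# grades: relevance grades of the returned documents, in ranked order.
def dcg(grades, k):
    return sum(g / math.log2(i + 2) for i, g in enumerate(grades[:k]))

def ndcg(grades, k):
    # Normalize against the ideal (descending-grade) ordering.
    ideal = dcg(sorted(grades, reverse=True), k)
    return dcg(grades, k) / ideal if ideal else 0.0

def precision_at_k(grades, k):
    # A document counts as relevant if its grade is positive.
    return sum(1 for g in grades[:k] if g > 0) / k

def mrr(grades_per_query):
    # Mean of 1/rank of the first relevant result across queries.
    total = 0.0
    for grades in grades_per_query:
        for i, g in enumerate(grades, start=1):
            if g > 0:
                total += 1.0 / i
                break
    return total / len(grades_per_query)

print(round(ndcg([3, 0, 2], k=3), 3))  # 0.939
print(precision_at_k([3, 0, 2], k=3))
print(mrr([[0, 3], [1, 0]]))           # 0.75
```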




&lt;h2&gt;
  
  
  6. Interpreting the Results
&lt;/h2&gt;

&lt;p&gt;A typical output looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;config               nDCG   Precision@5   MRR
baseline             0.72   0.48           0.55
custom‑boost‑titles   0.79   0.55           0.62
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Higher scores indicate better alignment with the judged relevance. In this example, boosting titles improves nDCG by 0.07 and lifts Precision@5 by 7 percentage points.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Applying the Findings
&lt;/h2&gt;

&lt;h3&gt;
  
  
  7.1 Update Index Settings
&lt;/h3&gt;

&lt;p&gt;Based on the metrics, you might adjust the &lt;code&gt;search.analyzer&lt;/code&gt; or add custom scoring scripts. For the title‑boost example, you could add a field‑level boost in the query DSL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"function_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"match"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{{query}}"&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"field_value_factor"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"field"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"title_boost"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"factor"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  7.2 Re‑run the Experiment
&lt;/h3&gt;

&lt;p&gt;After applying changes, rerun the experiment to verify the impact. This iterative loop ensures that each tweak produces measurable gains.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Real‑World Case Study: Fixing Bad Results
&lt;/h2&gt;

&lt;p&gt;A media platform noticed that users searching for &lt;em&gt;"latest tech news"&lt;/em&gt; were frequently seeing older articles. The team:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Defined a query set focused on timeliness.&lt;/li&gt;
&lt;li&gt;Ran a baseline experiment and observed low nDCG (0.48).&lt;/li&gt;
&lt;li&gt;Added a &lt;strong&gt;recency boost&lt;/strong&gt; using a decay function.&lt;/li&gt;
&lt;li&gt;Re‑ran the experiment and saw nDCG rise to 0.73.&lt;/li&gt;
&lt;li&gt;Deployed the new config to production, resulting in a 15% increase in click‑through rate for the targeted queries.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  9. Visual Summary
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;End‑to‑end workflow from query set creation to metric analysis.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  10. Best Practices
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Keep query sets small but representative&lt;/strong&gt; – 30‑50 queries often provide enough signal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use both explicit and implicit judgments&lt;/strong&gt; – they complement each other.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version your search configurations&lt;/strong&gt; – the plugin stores them as immutable objects, making rollback trivial.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automate metric extraction&lt;/strong&gt; – integrate a CI step that fetches the latest metrics after each experiment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document decisions&lt;/strong&gt; – store rationale in the experiment description field for future reference.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  11. Conclusion
&lt;/h2&gt;

&lt;p&gt;Search relevance is an ongoing discipline, not a one‑time setting. The OpenSearch Search Relevance plugin gives you a repeatable, data‑driven process to measure and improve relevance. By defining clear query sets, running controlled experiments, and acting on concrete metrics, you can turn vague user complaints into quantifiable improvements.&lt;/p&gt;




&lt;h2&gt;
  
  
  About the author
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prithvi S&lt;/strong&gt; – Staff Software Engineer at Cloudera and Open‑source enthusiast. Follow my work on GitHub: &lt;a href="https://github.com/iprithv" rel="noopener noreferrer"&gt;https://github.com/iprithv&lt;/a&gt;&lt;/p&gt;

</description>
      <category>iceberg</category>
      <category>data</category>
      <category>architecture</category>
      <category>database</category>
    </item>
    <item>
      <title>The Metadata Tree: How Apache Iceberg Tracks Everything Without a Database</title>
      <dc:creator>Prithvi S</dc:creator>
      <pubDate>Fri, 24 Apr 2026 00:31:47 +0000</pubDate>
      <link>https://dev.to/iprithv/the-metadata-tree-how-apache-iceberg-tracks-everything-without-a-database-3111</link>
      <guid>https://dev.to/iprithv/the-metadata-tree-how-apache-iceberg-tracks-everything-without-a-database-3111</guid>
      <description>&lt;p&gt;If you've worked with petabyte-scale data lakes, you've felt the pain: tables fragment into thousands of small files, query engines can't find what they need, schema changes become nightmares, and concurrent writes collide silently. Apache Iceberg solves these problems with something deceptively simple: a metadata hierarchy that tracks every file, every change, and every snapshot without needing a centralized database.&lt;/p&gt;

&lt;p&gt;In this post, we'll walk through Iceberg's architecture from the ground up. By the end, you'll understand why Netflix built this, how it works, and why it's becoming the standard for analytics across Spark, Trino, Flink, and beyond.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Why Big Data Tables Are So Fragile
&lt;/h2&gt;

&lt;p&gt;Before Iceberg, data lakes used simple file layouts (Parquet/ORC on HDFS or S3). Sounds fine in theory. In practice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Atomicity fails.&lt;/strong&gt; A write dies halfway through. Are those partial files valid? Which ones should queries skip?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partition discovery is slow.&lt;/strong&gt; Query engines scan directory listings to find what data exists. With millions of files, this takes minutes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema changes break old files.&lt;/strong&gt; Add a column, and now old data files don't have it. Readers fail or invent NULLs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concurrent writes collide.&lt;/strong&gt; Two writers claim they "finished" the same table simultaneously. Which one won? Corruption.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pruning is guesswork.&lt;/strong&gt; Engines don't know column ranges within files. They scan everything.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hive, Delta, and early Iceberg prototypes tried band-aids (metadata services, loose conventions). They all had the same fundamental issue: &lt;strong&gt;the catalog was either centralized (slow, inconsistent) or decentralized (unreliable)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Iceberg took a different approach: &lt;em&gt;immutable snapshots and a metadata hierarchy, with a single atomic pointer as the source of truth&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Meet the Metadata Tree
&lt;/h2&gt;

&lt;p&gt;Here's the insight: instead of asking "what files exist?", Iceberg asks "what does this snapshot contain?". A snapshot is an immutable point-in-time view of the table. Every write creates a new snapshot. Every snapshot points to a set of files. And Iceberg tracks all of this through a beautifully layered hierarchy.&lt;/p&gt;

&lt;p&gt;Let's build it from the bottom up:&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: Data Files (Parquet/ORC/Avro)
&lt;/h3&gt;

&lt;p&gt;The base layer is simple: the actual data. Files contain table rows in columnar format (Parquet is the default; ORC and Avro are also supported).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s3://my-warehouse/my-table/data/
  00000-abc123.parquet  (1 GB, 10M rows, partition_date=2024-01-01)
  00001-def456.parquet  (950 MB, 9.5M rows, partition_date=2024-01-01)
  00002-ghi789.parquet  (1.1 GB, 11M rows, partition_date=2024-01-02)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key property:&lt;/strong&gt; These files are write-once. Once created, they never change. If you delete a row, you don't rewrite the file (we'll see how Iceberg handles deletes later). This makes them safe for concurrent reads and enables efficient caching.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 2: Manifest Files (Track Data Files and Stats)
&lt;/h3&gt;

&lt;p&gt;Now here's where Iceberg gets clever. Instead of listing files via S3 API calls (slow, inconsistent), Iceberg records a snapshot's contents in &lt;em&gt;manifest files&lt;/em&gt;. A manifest is an Avro file that lists a group of data files along with their metadata.&lt;/p&gt;

&lt;p&gt;For each data file, a manifest stores:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;File path:&lt;/strong&gt; s3://my-warehouse/my-table/data/00000-abc123.parquet&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File format:&lt;/strong&gt; parquet&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File size:&lt;/strong&gt; 1,073,741,824 bytes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partition info:&lt;/strong&gt; {partition_date: 2024-01-01} (as integers, not strings)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Record count:&lt;/strong&gt; 10,000,000&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Column stats:&lt;/strong&gt; min/max/null_count for each column

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;user_id&lt;/code&gt;: min=1000, max=9999999, null_count=0&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;amount&lt;/code&gt;: min=0.01, max=9999.99, null_count=1200&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;timestamp&lt;/code&gt;: min=2024-01-01 00:00:00, max=2024-01-01 23:59:59, null_count=0&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Why stats? Because query engines use them for &lt;em&gt;predicate pushdown&lt;/em&gt;. If you run "SELECT * WHERE amount &amp;gt; 5000", the engine checks each file's stats in the manifest: is this file's max amount above 5000? If not, it skips the entire file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s3://my-warehouse/my-table/metadata/
  manifests/
    20240501-000001-abc.avro  (manifest for snapshot 1, 500 KB)
    20240502-000002-def.avro  (manifest for snapshot 2, 510 KB)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Multiple snapshots = multiple manifests.&lt;/p&gt;
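&lt;p&gt;To make the skipping concrete, here is a toy version of that check in Python: a simplified illustration of the idea, not Iceberg's actual planner. The entries are hand-written stand-ins for what a real manifest records.&lt;/p&gt;

```python
# Simplified illustration of predicate pushdown using manifest column stats.
# Each entry mimics what a manifest records per data file; the planner skips
# any file whose max(amount) cannot possibly satisfy "amount > 5000".

manifest_entries = [
    {"path": "00000-abc123.parquet", "amount_min": 0.01, "amount_max": 4200.00},
    {"path": "00001-def456.parquet", "amount_min": 0.05, "amount_max": 9999.99},
    {"path": "00002-ghi789.parquet", "amount_min": 0.10, "amount_max": 3100.50},
]

def files_for_amount_greater_than(entries, threshold):
    """Keep only files whose stats show they might contain matching rows."""
    return [e["path"] for e in entries if e["amount_max"] > threshold]

# Only 00001-def456.parquet survives; the other two are pruned without
# reading a single byte of data.
print(files_for_amount_greater_than(manifest_entries, 5000))
```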

&lt;h3&gt;
  
  
  Layer 3: Manifest List (Snapshot-Level Index)
&lt;/h3&gt;

&lt;p&gt;A single snapshot might reference hundreds or even thousands of manifests. So Iceberg adds another layer: the &lt;em&gt;manifest list&lt;/em&gt;. This is a small Avro file that lists which manifests belong to a given snapshot.&lt;/p&gt;

&lt;p&gt;For each manifest, it stores:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Manifest file path:&lt;/strong&gt; s3://my-warehouse/my-table/metadata/manifests/20240501-000001-abc.avro&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manifest length:&lt;/strong&gt; 500,000 bytes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partition stats:&lt;/strong&gt; min/max partition values across all files in this manifest

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;partition_date&lt;/code&gt;: min=2024-01-01, max=2024-01-01&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Record count:&lt;/strong&gt; Sum of all files in this manifest&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;So if your snapshot includes 100 manifests, the manifest list is a ~50 KB file that summarizes all of them. Engines can scan the manifest list to decide which manifests to read.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s3://my-warehouse/my-table/metadata/
  snap-20240501000001.avro  (manifest list for snapshot 1, 25 KB)
  snap-20240502000002.avro  (manifest list for snapshot 2, 26 KB)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
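&lt;p&gt;The same trick works one level up: because the manifest list stores partition bounds per manifest, an engine can discard whole manifests before ever opening them. Another hand-rolled sketch, not the real implementation:&lt;/p&gt;

```python
# Sketch of manifest-level pruning: the manifest list records the min/max
# partition value covered by each manifest, so a query for one date can
# discard every manifest whose range does not include it.

manifest_list = [
    {"path": "20240501-000001-abc.avro", "date_min": "2024-01-01", "date_max": "2024-01-01"},
    {"path": "20240502-000002-def.avro", "date_min": "2024-01-02", "date_max": "2024-01-05"},
]

def manifests_for_date(entries, date):
    """A manifest is relevant only if the date falls inside its partition range."""
    return [
        e["path"]
        for e in entries
        if date >= e["date_min"] and e["date_max"] >= date
    ]

# Only the second manifest covers 2024-01-03; the first is skipped entirely.
print(manifests_for_date(manifest_list, "2024-01-03"))
```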



&lt;h3&gt;
  
  
  Layer 4: Metadata File (Table Schema, History, Current Snapshot)
&lt;/h3&gt;

&lt;p&gt;The metadata file is a JSON file that contains the &lt;em&gt;entire truth&lt;/em&gt; about the table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"format-version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"table-uuid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"12345-abcde"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"location"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"s3://my-warehouse/my-table"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"last-updated-ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1714605600000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"last-column-id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"schema"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"struct"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"fields"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"long"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"email"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"double"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"created_at"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"country"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"partition-spec"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"date_partition"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"transform"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"year"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"source-id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"current-snapshot-id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"snapshots"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"snapshot-id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"timestamp-ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1714519200000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"operation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"append"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"added-files"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"100"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"manifest-list"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"s3://my-warehouse/my-table/metadata/snap-20240501000001.avro"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"snapshot-id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"timestamp-ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1714605600000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"operation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"append"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"added-files"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"5"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"manifest-list"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"s3://my-warehouse/my-table/metadata/snap-20240502000002.avro"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"properties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"write.format.default"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"parquet"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See &lt;code&gt;current-snapshot-id: 2&lt;/code&gt;? That pointer tells engines: "If you want the latest table, load snapshot 2, which points to this manifest list, which lists these manifests, which contain these files."&lt;/p&gt;

&lt;p&gt;The metadata file also contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Schema:&lt;/strong&gt; Column definitions and IDs (more on this later)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partition spec:&lt;/strong&gt; How to partition new data (year, month, day, hour, bucket, truncate)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snapshot history:&lt;/strong&gt; Every snapshot ever created, its timestamp, and what manifest-list it uses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Properties:&lt;/strong&gt; Table config (compression, default format, etc.)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s3://my-warehouse/my-table/metadata/
  00000.json  (metadata file v0, created at table init)
  00001.json  (metadata file v1, created after first write)
  00002.json  (metadata file v2, created after second write)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each write creates a new metadata file.&lt;/p&gt;
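&lt;p&gt;Resolving "what is the current table state?" is just a couple of lookups over this JSON. The sketch below walks a trimmed-down version of the metadata file shown above:&lt;/p&gt;

```python
import json

# Walk a (trimmed) metadata file to find the manifest list for the current
# snapshot: the same resolution every query engine performs at plan time.
metadata_json = """
{
  "current-snapshot-id": 2,
  "snapshots": [
    {"snapshot-id": 1, "manifest-list": "s3://my-warehouse/my-table/metadata/snap-20240501000001.avro"},
    {"snapshot-id": 2, "manifest-list": "s3://my-warehouse/my-table/metadata/snap-20240502000002.avro"}
  ]
}
"""

def current_manifest_list(metadata):
    current_id = metadata["current-snapshot-id"]
    for snapshot in metadata["snapshots"]:
        if snapshot["snapshot-id"] == current_id:
            return snapshot["manifest-list"]
    raise ValueError("current snapshot not found in snapshot list")

metadata = json.loads(metadata_json)
print(current_manifest_list(metadata))
```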

&lt;h3&gt;
  
  
  Layer 5: Catalog Pointer (The Only Mutable Piece)
&lt;/h3&gt;

&lt;p&gt;Finally, the catalog. This is where Iceberg gets its reliability. The catalog is a simple mapping:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;table_name -&amp;gt; current_metadata_file_location
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my-table -&amp;gt; s3://my-warehouse/my-table/metadata/00002.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. One pointer. Everything else is immutable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where does the catalog live?&lt;/strong&gt; Not in S3. It lives in a &lt;em&gt;catalog service&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hive metastore:&lt;/strong&gt; Legacy, still widely used&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Glue:&lt;/strong&gt; Managed, but slower&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nessie:&lt;/strong&gt; Git-like, multi-branch table versioning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;REST catalog:&lt;/strong&gt; HTTP-based, engine-agnostic (Apache Polaris)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;In-memory catalog:&lt;/strong&gt; Ephemeral, for testing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you write to a table:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Load current metadata file (via catalog pointer)&lt;/li&gt;
&lt;li&gt;Generate new data files and manifests&lt;/li&gt;
&lt;li&gt;Create a new metadata file that points to the new manifests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Atomically update the catalog pointer&lt;/strong&gt; from old metadata to new metadata&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Step 4 is the secret sauce. Most catalog services support atomic CAS (compare-and-swap) operations. If two writers race, only one CAS succeeds. The loser retries with the new current state. This gives you &lt;strong&gt;serializable isolation&lt;/strong&gt; without distributed locks.&lt;/p&gt;
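&lt;p&gt;The commit protocol is easy to sketch with a toy in-memory catalog. The only assumption is that the catalog exposes a compare-and-swap primitive; real catalog services implement the same contract over a database row or an HTTP endpoint.&lt;/p&gt;

```python
# Toy catalog illustrating the commit protocol: the only mutable state is
# one pointer per table, updated via compare-and-swap (CAS). A writer whose
# CAS fails re-reads the pointer and retries on top of the new state.

class InMemoryCatalog:
    def __init__(self):
        self.pointers = {}

    def get(self, table):
        return self.pointers.get(table)

    def compare_and_swap(self, table, expected, new):
        """Atomically move the pointer only if it still equals `expected`."""
        if self.pointers.get(table) == expected:
            self.pointers[table] = new
            return True
        return False

def commit(catalog, table, build_new_metadata, max_retries=3):
    for _ in range(max_retries):
        current = catalog.get(table)
        new_location = build_new_metadata(current)  # write files, new metadata JSON
        if catalog.compare_and_swap(table, current, new_location):
            return new_location
    raise RuntimeError("too many concurrent commits, giving up")

catalog = InMemoryCatalog()
catalog.pointers["my-table"] = "metadata/00000.json"
print(commit(catalog, "my-table", lambda cur: "metadata/00001.json"))
```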

&lt;h2&gt;
  
  
  Putting It Together: A Write Operation
&lt;/h2&gt;

&lt;p&gt;Let's trace a real write to see all five layers in action.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Initial state:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Catalog: &lt;code&gt;my-table -&amp;gt; metadata/00000.json&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Metadata 00000: current-snapshot-id=1, 100 data files&lt;/li&gt;
&lt;li&gt;Snapshot 1: manifest-list points to 10 manifests&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Writer appends 5 new files:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Writer generates 5 new Parquet files in the data directory&lt;/li&gt;
&lt;li&gt;Writer creates a new manifest (manifest v1) listing these 5 files + stats&lt;/li&gt;
&lt;li&gt;Writer creates a new manifest-list (snap-2) that includes both old manifests and the new one&lt;/li&gt;
&lt;li&gt;Writer creates new metadata file (00001.json):

&lt;ul&gt;
&lt;li&gt;current-snapshot-id: 2&lt;/li&gt;
&lt;li&gt;snapshot 2 points to new manifest-list&lt;/li&gt;
&lt;li&gt;schema unchanged&lt;/li&gt;
&lt;li&gt;partition-spec unchanged&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Atomic CAS:&lt;/strong&gt; Writer tells catalog: "Update my-table to point to metadata/00001.json (only if it still points to 00000.json)"&lt;/li&gt;
&lt;li&gt;Catalog confirms: pointer updated&lt;/li&gt;
&lt;li&gt;Write succeeds. Old metadata file (00000.json) stays on disk for history/time travel&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Concurrent reader:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sees catalog still pointing to 00000.json (old snapshot)&lt;/li&gt;
&lt;li&gt;Reads that snapshot's manifest-list&lt;/li&gt;
&lt;li&gt;Reads 10 manifests, finds 100 files&lt;/li&gt;
&lt;li&gt;Scans those 100 files&lt;/li&gt;
&lt;li&gt;Write doesn't interfere; reader gets consistent view&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Later reader:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sees catalog now pointing to 00001.json&lt;/li&gt;
&lt;li&gt;Reads snapshot 2's manifest-list&lt;/li&gt;
&lt;li&gt;Finds 11 manifests (10 old + 1 new)&lt;/li&gt;
&lt;li&gt;Reads 105 files total&lt;/li&gt;
&lt;li&gt;Sees the 5 new rows&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why This Design Is Brilliant
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Atomicity without transactions:&lt;/strong&gt; Only the catalog pointer moves. Everything else is immutable. No metadata locks, no two-phase commit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Snapshot isolation:&lt;/strong&gt; Readers hold a reference to a metadata file (or snapshot ID). Old snapshots never change. Even if 10 new writes happen, that reader's view stays frozen.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Time travel:&lt;/strong&gt; Query engine wants rows from May 1st? Load the snapshot from May 1st, read &lt;em&gt;that&lt;/em&gt; manifest-list and files. No replaying transactions; the snapshot already exists.&lt;/p&gt;
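&lt;p&gt;Mechanically, time travel is a lookup over the snapshot history: take the newest snapshot whose timestamp is at or before the requested time. A minimal sketch:&lt;/p&gt;

```python
# Minimal time-travel resolution: given the snapshot history from the
# metadata file, select the latest snapshot at or before a target time.

snapshots = [
    {"snapshot-id": 1, "timestamp-ms": 1714519200000},
    {"snapshot-id": 2, "timestamp-ms": 1714605600000},
]

def snapshot_as_of(history, target_ms):
    eligible = [s for s in history if target_ms >= s["timestamp-ms"]]
    if not eligible:
        raise ValueError("no snapshot exists at or before the requested time")
    return max(eligible, key=lambda s: s["timestamp-ms"])

# A target time between the two commits resolves to snapshot 1.
print(snapshot_as_of(snapshots, 1714550000000)["snapshot-id"])
```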

&lt;p&gt;&lt;strong&gt;4. File pruning:&lt;/strong&gt; Manifest files store column stats. Query engines skip files without matching data before ever touching S3/HDFS. Orders of magnitude faster than directory scans.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Partition evolution:&lt;/strong&gt; Want to change from daily to hourly partitions? Create new metadata file with new partition-spec. Old data files keep old partition values (stored in manifest). New files use new partition values. Readers handle both transparently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Schema evolution:&lt;/strong&gt; Columns are identified by ID, not name. Rename a column? New metadata file with updated schema. Old files still have ID 1 for user_id; readers understand both.&lt;/p&gt;
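&lt;p&gt;A toy model shows why ID-based resolution makes renames safe. Here a data file conceptually stores values keyed by field ID, while each schema version maps names to IDs (a simplification of the real on-disk format):&lt;/p&gt;

```python
# Sketch of ID-based column resolution: data files effectively store values
# keyed by field ID, while the current schema maps names to IDs. Renaming a
# column only rewrites the name-to-ID mapping; old files need no change.

old_schema = {"user_id": 1, "email": 2}
new_schema = {"customer_id": 1, "email": 2}   # column 1 renamed, same ID

stored_row = {1: 42, 2: "user@example.com"}   # a row as an old file stores it

def read_column(row_by_id, schema, column_name):
    return row_by_id[schema[column_name]]

print(read_column(stored_row, old_schema, "user_id"))      # old name, value 42
print(read_column(stored_row, new_schema, "customer_id"))  # new name, same 42
```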

&lt;p&gt;&lt;strong&gt;7. Concurrent writes at scale:&lt;/strong&gt; Thousands of writers, zero coordination. Optimistic locking via metadata CAS. Even with a 10% collision rate, 90% of writers succeed on the first try, and the rest simply retry against the new state.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Performance Trade-off
&lt;/h2&gt;

&lt;p&gt;This architecture does have a cost: metadata reads. Every query must:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Load metadata file (JSON from S3)&lt;/li&gt;
&lt;li&gt;Load manifest-list (Avro)&lt;/li&gt;
&lt;li&gt;Load relevant manifests (Avro)&lt;/li&gt;
&lt;li&gt;Filter down to data files&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For a table with millions of files, even with parallelism, this adds 100-500ms to query startup. This is why Iceberg works best with a &lt;em&gt;warm metadata cache&lt;/em&gt; (many query engines cache manifest files in memory) and why &lt;em&gt;partition pruning is critical&lt;/em&gt; (manifest-list stores partition bounds, so you can skip entire manifest groups).&lt;/p&gt;
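&lt;p&gt;Because manifests are immutable, they can safely be cached by path indefinitely. A minimal sketch of such a cache, where the fetch function stands in for an S3 read:&lt;/p&gt;

```python
# Sketch of a warm metadata cache: manifests never change once written, so
# they can be cached by path forever. The counter shows that repeated query
# plans hit the (simulated) object store only once per manifest.

class CachingManifestLoader:
    def __init__(self, fetch):
        self.fetch = fetch          # e.g. a function that reads from S3
        self.cache = {}
        self.fetch_count = 0

    def load(self, path):
        if path not in self.cache:
            self.fetch_count += 1
            self.cache[path] = self.fetch(path)
        return self.cache[path]

loader = CachingManifestLoader(lambda path: f"contents of {path}")
loader.load("manifests/20240501-000001-abc.avro")
loader.load("manifests/20240501-000001-abc.avro")   # served from cache
print(loader.fetch_count)   # storage was read only once
```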

&lt;h2&gt;
  
  
  Next Steps in the Iceberg Journey
&lt;/h2&gt;

&lt;p&gt;Now that you understand the metadata tree, you're ready for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Schema evolution:&lt;/strong&gt; How field IDs make ALTER TABLE safe&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hidden partitioning:&lt;/strong&gt; How Iceberg auto-applies transforms (year, month, day, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delete semantics:&lt;/strong&gt; Position deletes vs equality deletes, Copy-on-Write vs Merge-on-Read&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimistic concurrency:&lt;/strong&gt; How CAS prevents write collisions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time travel:&lt;/strong&gt; Querying historical snapshots by timestamp&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Apache Iceberg's genius is in its simplicity: a five-layer metadata tree with one atomic pointer at the top. No central database. No complex consistency protocols. Just immutable files, clever indexing, and compare-and-swap semantics.&lt;/p&gt;

&lt;p&gt;If you're building data platforms at scale, this is the architecture you should understand. Whether you're using Spark, Trino, Flink, or any other engine, Iceberg's design enables correctness, performance, and flexibility that older table formats simply can't match.&lt;/p&gt;

&lt;p&gt;Start with a test table in your warehouse. Try time travel. Try schema evolution. Feel how different Iceberg is. Then you'll really get why it matters.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Want to dive deeper?&lt;/strong&gt; Check out the Apache Iceberg spec at &lt;a href="https://iceberg.apache.org/spec/" rel="noopener noreferrer"&gt;https://iceberg.apache.org/spec/&lt;/a&gt; and the source code at &lt;a href="https://github.com/apache/iceberg" rel="noopener noreferrer"&gt;https://github.com/apache/iceberg&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;About the Author&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I'm Prithvi S, Staff Software Engineer at Cloudera and Open Source Enthusiast. I work on data systems, LLM-powered applications, and large-scale architectures. Follow my work on GitHub: &lt;a href="https://github.com/iprithv" rel="noopener noreferrer"&gt;https://github.com/iprithv&lt;/a&gt;&lt;/p&gt;

</description>
      <category>iceberg</category>
      <category>data</category>
      <category>architecture</category>
      <category>database</category>
    </item>
    <item>
      <title>How OpenSearch Plugins Really Work: Architecture &amp; Extension Points</title>
      <dc:creator>Prithvi S</dc:creator>
      <pubDate>Thu, 23 Apr 2026 15:05:38 +0000</pubDate>
      <link>https://dev.to/iprithv/how-opensearch-plugins-really-work-architecture-extension-points-21nd</link>
      <guid>https://dev.to/iprithv/how-opensearch-plugins-really-work-architecture-extension-points-21nd</guid>
      <description>&lt;p&gt;OpenSearch is powerful out of the box, but its true flexibility comes from plugins. Yet most developers treat plugins as black boxes: you install them, they work, and you move on. But what if you need to build one? Or understand why a plugin broke after an upgrade? Or design a system that integrates with OpenSearch's plugin ecosystem?&lt;/p&gt;

&lt;p&gt;In this post, I'll walk you through how plugins actually work: compilation, packaging, installation, and the extension points that make customization possible. By the end, you'll understand the mechanics well enough to build your own.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Plugin Lifecycle: From Source to Running Code
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Writing and Compiling a Plugin
&lt;/h3&gt;

&lt;p&gt;A plugin is a Java project with dependencies on OpenSearch core. At minimum, you need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gradle"&gt;&lt;code&gt;&lt;span class="k"&gt;dependencies&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;compileOnly&lt;/span&gt; &lt;span class="s2"&gt;"org.opensearch:opensearch:${opensearch_version}"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;compileOnly&lt;/code&gt; is critical: your plugin compiles against OpenSearch, but doesn't bundle it. The plugin will run inside the OpenSearch JVM, using the host's core libraries.&lt;/p&gt;

&lt;p&gt;Your plugin entry point is a class that extends &lt;code&gt;Plugin&lt;/code&gt;. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyCustomPlugin&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;Plugin&lt;/span&gt; &lt;span class="kd"&gt;implements&lt;/span&gt; &lt;span class="nc"&gt;SearchPlugin&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;QuerySpec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;?&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;getQueries&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Collections&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;singletonList&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;QuerySpec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="nc"&gt;MyCustomQuery&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;NAME&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nl"&gt;MyCustomQuery:&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nc"&gt;MyCustomQuery&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromXContent&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
        &lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This simple declaration tells OpenSearch: "I provide a custom query type called &lt;code&gt;my_custom_query&lt;/code&gt;."&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Building the Plugin Artifact
&lt;/h3&gt;

&lt;p&gt;When you run &lt;code&gt;gradle build&lt;/code&gt;, you produce a .zip file containing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my-plugin-1.0.0.zip
├── opensearch-plugin-descriptor.properties
├── lib/
│   ├── my-plugin-1.0.0.jar
│   └── my-dependencies.jar (if any third-party libs needed)
├── bin/ (optional: scripts)
└── config/ (optional: default settings)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;opensearch-plugin-descriptor.properties&lt;/code&gt; file is the plugin manifest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;name&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;my-custom-plugin&lt;/span&gt;
&lt;span class="py"&gt;description&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;My custom query plugin&lt;/span&gt;
&lt;span class="py"&gt;version&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;1.0.0&lt;/span&gt;
&lt;span class="py"&gt;opensearch.version&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;2.13.0&lt;/span&gt;
&lt;span class="py"&gt;java.version&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;11&lt;/span&gt;
&lt;span class="py"&gt;classname&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;com.example.MyCustomPlugin&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This manifest declares: which OpenSearch version the plugin targets, what Java version it needs, and crucially, the entry point class name.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Installation via the opensearch-plugin Tool
&lt;/h3&gt;

&lt;p&gt;You install the resulting zip with the &lt;code&gt;opensearch-plugin&lt;/code&gt; CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./bin/opensearch-plugin &lt;span class="nb"&gt;install &lt;/span&gt;file:///path/to/my-plugin-1.0.0.zip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tool does several things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Verifies the manifest&lt;/strong&gt; — reads &lt;code&gt;opensearch-plugin-descriptor.properties&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Checks versions&lt;/strong&gt; — ensures the plugin targets the installed OpenSearch version&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extracts&lt;/strong&gt; — unpacks the archive to &lt;code&gt;plugins/my-custom-plugin/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Checks permissions&lt;/strong&gt; — prompts for confirmation if the plugin requests additional security permissions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Waits for a restart&lt;/strong&gt; — the node must be restarted before the plugin code is loaded&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After restart, your plugin code is live.&lt;/p&gt;

&lt;h2&gt;
  
  
  Class Loader Isolation and Bootstrap
&lt;/h2&gt;

&lt;p&gt;Here's where it gets interesting. Your plugin code runs in the same JVM as OpenSearch core. How does OpenSearch prevent your plugin from accidentally (or maliciously) breaking core?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Class Loader Isolation:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;OpenSearch uses a custom &lt;code&gt;PluginClassLoader&lt;/code&gt; for each plugin. This loader is a child of the core class loader, but has its own namespace:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Core classes (&lt;code&gt;org.opensearch.*&lt;/code&gt;) resolve from the main class loader&lt;/li&gt;
&lt;li&gt;Plugin classes resolve from the plugin's class loader first&lt;/li&gt;
&lt;li&gt;If a class isn't found in the plugin loader, it falls back to core&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This prevents version conflicts. If your plugin wants to use a specific version of a library, it can bundle it, and its class loader will find that version first without conflicting with core.&lt;/p&gt;
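&lt;p&gt;That child-first lookup order can be sketched with plain JDK class loaders. This is an illustrative sketch, not the actual &lt;code&gt;PluginClassLoader&lt;/code&gt; source — the real loader also deals with security and package sealing — but the delegation logic is the same idea:&lt;/p&gt;

```java
import java.net.URL;
import java.net.URLClassLoader;

// Illustrative child-first loader: try the plugin's own jars before
// delegating to the core (parent) loader. Not the real PluginClassLoader.
public class ChildFirstClassLoader extends URLClassLoader {

    public ChildFirstClassLoader(URL[] pluginJars, ClassLoader coreLoader) {
        super(pluginJars, coreLoader);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                try {
                    c = findClass(name);              // plugin's bundled classes first
                } catch (ClassNotFoundException e) {
                    c = super.loadClass(name, false); // fall back to core
                }
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }

    public static void main(String[] args) throws Exception {
        // With no plugin jars, every lookup falls through to the parent:
        ChildFirstClassLoader loader =
            new ChildFirstClassLoader(new URL[0], ClassLoader.getSystemClassLoader());
        Class<?> list = loader.loadClass("java.util.ArrayList");
        System.out.println(list == java.util.ArrayList.class); // prints "true"
    }
}
```

&lt;p&gt;Because each plugin gets its own loader, two plugins can bundle different versions of the same library without either seeing the other's copy.&lt;/p&gt;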

&lt;p&gt;&lt;strong&gt;Bootstrap Contract:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When OpenSearch starts, it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Discovers all plugins in &lt;code&gt;plugins/&lt;/code&gt; directory&lt;/li&gt;
&lt;li&gt;Reads each plugin's descriptor&lt;/li&gt;
&lt;li&gt;Creates a &lt;code&gt;PluginClassLoader&lt;/code&gt; for each&lt;/li&gt;
&lt;li&gt;Instantiates each plugin's entry point class via reflection&lt;/li&gt;
&lt;li&gt;Calls lifecycle methods: &lt;code&gt;onIndexModule()&lt;/code&gt;, &lt;code&gt;onNodeStarted()&lt;/code&gt;, etc.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If a plugin fails to load, OpenSearch will refuse to start. This is intentional: it's safer to fail loudly than to silently omit a plugin that applications might depend on.&lt;/p&gt;
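&lt;p&gt;The bootstrap steps above can be sketched with plain JDK APIs. Everything here is illustrative — the descriptor is inlined and &lt;code&gt;java.util.ArrayList&lt;/code&gt; stands in for a plugin entry point, and the real &lt;code&gt;PluginsService&lt;/code&gt; does far more — but the read-descriptor, version-check, reflective-instantiation flow is the core of the contract:&lt;/p&gt;

```java
import java.io.StringReader;
import java.util.Properties;

// Sketch of the bootstrap contract: read the descriptor, check versions,
// instantiate the entry point reflectively. Illustrative only -- the real
// PluginsService also builds a PluginClassLoader per plugin.
public class BootstrapSketch {

    static Object loadPlugin(String descriptorText, String nodeVersion) throws Exception {
        Properties descriptor = new Properties();
        descriptor.load(new StringReader(descriptorText));

        // Fail loudly on a version mismatch, as OpenSearch does at startup
        if (!nodeVersion.equals(descriptor.getProperty("opensearch.version"))) {
            throw new IllegalStateException(
                "plugin " + descriptor.getProperty("name") + " targets OpenSearch "
                    + descriptor.getProperty("opensearch.version")
                    + ", node is " + nodeVersion);
        }

        // Instantiate the declared entry point via reflection
        return Class.forName(descriptor.getProperty("classname"))
            .getDeclaredConstructor()
            .newInstance();
    }

    public static void main(String[] args) throws Exception {
        // On a real node this file lives under plugins/my-custom-plugin/;
        // here java.util.ArrayList stands in for the plugin class.
        String descriptor = "name=my-custom-plugin\n"
            + "classname=java.util.ArrayList\n"
            + "opensearch.version=2.13.0\n";
        Object plugin = loadPlugin(descriptor, "2.13.0");
        System.out.println("loaded " + plugin.getClass().getName()); // prints "loaded java.util.ArrayList"
    }
}
```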

&lt;h2&gt;
  
  
  Extension Points: How Plugins Hook Into OpenSearch
&lt;/h2&gt;

&lt;p&gt;A plugin doesn't have direct access to internal OpenSearch code. Instead, it implements well-defined &lt;strong&gt;extension point interfaces&lt;/strong&gt;. OpenSearch discovers these implementations and calls them at the right moments.&lt;/p&gt;

&lt;h3&gt;
  
  
  SearchPlugin: Custom Query Types and Aggregations
&lt;/h3&gt;

&lt;p&gt;The most common extension point for search-focused plugins:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MySearchPlugin&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;Plugin&lt;/span&gt; &lt;span class="kd"&gt;implements&lt;/span&gt; &lt;span class="nc"&gt;SearchPlugin&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;QuerySpec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;?&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;getQueries&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Register custom query types&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Collections&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;singletonList&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;QuerySpec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="nc"&gt;MyQuery&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;NAME&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nl"&gt;MyQuery:&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nc"&gt;MyQuery&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromXContent&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
        &lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;AggregationSpec&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;getAggregations&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Register custom aggregations&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Collections&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;singletonList&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;AggregationSpec&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;MyAggregation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;NAME&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nl"&gt;MyAggregation:&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nc"&gt;MyAggregation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;parse&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
        &lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;ScoreFunctionSpec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;?&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;getScoreFunctions&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Register custom scoring functions&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Collections&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;singletonList&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ScoreFunctionSpec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="nc"&gt;MyScoreFunction&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;NAME&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nl"&gt;MyScoreFunction:&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nc"&gt;MyScoreFunction&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;parse&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
        &lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once registered, your custom query is available via the REST API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;GET&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/my-index/_search&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"my_custom_query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"field"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"boost"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  ActionPlugin: Custom REST and Transport Actions
&lt;/h3&gt;

&lt;p&gt;For plugins that need custom REST endpoints or transport operations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyActionPlugin&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;Plugin&lt;/span&gt; &lt;span class="kd"&gt;implements&lt;/span&gt; &lt;span class="nc"&gt;ActionPlugin&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;ActionHandler&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;?,&lt;/span&gt; &lt;span class="o"&gt;?&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;getActions&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Collections&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;singletonList&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ActionHandler&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="nc"&gt;MyAction&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;INSTANCE&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;TransportMyAction&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;RestHandler&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;getRestHandlers&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Settings&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;RestController&lt;/span&gt; &lt;span class="n"&gt;restController&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; 
            &lt;span class="nc"&gt;ClusterSettings&lt;/span&gt; &lt;span class="n"&gt;clusterSettings&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;IndexScopedSettings&lt;/span&gt; &lt;span class="n"&gt;indexScopedSettings&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
            &lt;span class="nc"&gt;SettingsFilter&lt;/span&gt; &lt;span class="n"&gt;settingsFilter&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;NamedWriteableRegistry&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;namedWriteableRegistries&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
            &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;NamedXContentRegistry&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;namedXContentRegistries&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Supplier&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;DiscoveryNodes&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;nodesInCluster&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
            &lt;span class="nc"&gt;Supplier&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;ClusterState&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;clusterStateSupplier&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Collections&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;singletonList&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;RestMyHandler&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
        &lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can hit whatever endpoint your &lt;code&gt;RestMyHandler&lt;/code&gt; registers, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;POST /_plugin/my-action
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"param1"&lt;/span&gt;: &lt;span class="s2"&gt;"value"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  MapperPlugin: Custom Field Types
&lt;/h3&gt;

&lt;p&gt;If you need a new field type (beyond standard text, keyword, numeric, etc.):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyMapperPlugin&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;Plugin&lt;/span&gt; &lt;span class="kd"&gt;implements&lt;/span&gt; &lt;span class="nc"&gt;MapperPlugin&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Mapper&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;TypeParser&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;getMappers&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Collections&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;singletonMap&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="s"&gt;"my_custom_field"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
            &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parserContext&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;MyCustomFieldMapper&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parserContext&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can use it in mappings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;PUT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/my-index&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mappings"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"properties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"custom_field"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my_custom_field"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"analyzer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"standard"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  EnginePlugin: Custom Lucene Behavior
&lt;/h3&gt;

&lt;p&gt;For advanced use cases, you can hook into the Lucene engine itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyEnginePlugin&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;Plugin&lt;/span&gt; &lt;span class="kd"&gt;implements&lt;/span&gt; &lt;span class="nc"&gt;EnginePlugin&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;Optional&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;EngineFactory&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;getEngineFactory&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;IndexSettings&lt;/span&gt; &lt;span class="n"&gt;indexSettings&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Optional&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;MyCustomEngine&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  IngestPlugin: Custom Processors
&lt;/h3&gt;

&lt;p&gt;For plugins that process documents during ingestion:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyIngestPlugin&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;Plugin&lt;/span&gt; &lt;span class="kd"&gt;implements&lt;/span&gt; &lt;span class="nc"&gt;IngestPlugin&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Processor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Factory&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;getProcessors&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Processor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Parameters&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Collections&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;singletonMap&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="s"&gt;"my_processor"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
            &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;factories&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;MyIngestProcessor&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then reference it in an ingest pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;PUT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/_ingest/pipeline/my_pipeline&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"processors"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"my_processor"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"field"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"content"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Real-World Example: The Search Relevance Plugin
&lt;/h2&gt;

&lt;p&gt;OpenSearch's own &lt;strong&gt;search-relevance plugin&lt;/strong&gt; demonstrates these concepts in action. It provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Custom query types for A/B testing search relevance&lt;/li&gt;
&lt;li&gt;Custom aggregations for metrics collection&lt;/li&gt;
&lt;li&gt;REST endpoints to manage experiments&lt;/li&gt;
&lt;li&gt;System indexes (prefixed with &lt;code&gt;.plugins-search-rel-&lt;/code&gt;) to store experiment state&lt;/li&gt;
&lt;li&gt;Concurrent search request deciders (OpenSearch 2.17+) for custom query execution strategies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The plugin is battle-tested in production, used by teams optimizing ranking and relevance across massive datasets.&lt;/p&gt;

&lt;h2&gt;
  
  
  System Indexes: How Plugins Store Their Own State
&lt;/h2&gt;

&lt;p&gt;Most non-trivial plugins need to persist data. Rather than requiring external storage, they use &lt;strong&gt;system indexes&lt;/strong&gt; within OpenSearch itself.&lt;/p&gt;

&lt;p&gt;System indexes are prefixed with &lt;code&gt;.plugins-&lt;/code&gt; or &lt;code&gt;.opendistro-&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.plugins-search-rel-&amp;lt;version&amp;gt;-experiments
.plugins-search-rel-&amp;lt;version&amp;gt;-notes
.plugins-ml-config
.opendistro-job-scheduler-lock
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The challenge: how do you evolve the schema without breaking existing deployments?&lt;/p&gt;

&lt;p&gt;OpenSearch plugins use a &lt;strong&gt;schema versioning pattern&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="no"&gt;SCHEMA_VERSION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"1"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;ensureIndexInitialized&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;indexExists&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;createIndex&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Object&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;indexMeta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;getIndexMeta&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;currentVersion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;indexMeta&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOrDefault&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"schema_version"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"0"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;currentVersion&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;equals&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;SCHEMA_VERSION&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;migrateSchema&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;currentVersion&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="no"&gt;SCHEMA_VERSION&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;migrateSchema&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;fromVersion&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;toVersion&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Use Put Mapping API to add new fields (additive only)&lt;/span&gt;
    &lt;span class="c1"&gt;// Never remove or change existing field types&lt;/span&gt;
    &lt;span class="n"&gt;putMapping&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;newFields&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Old documents coexist with new schema&lt;/li&gt;
&lt;li&gt;Upgrades are backwards compatible&lt;/li&gt;
&lt;li&gt;No downtime required for schema evolution&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Performance and Reliability Considerations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Startup Time
&lt;/h3&gt;

&lt;p&gt;Each plugin adds to startup time. Large plugins or plugins that do heavy initialization can slow cluster startup. Monitor this in production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Class Loader Memory
&lt;/h3&gt;

&lt;p&gt;Each plugin gets its own class loader, holding copies of loaded classes in memory. Many plugins = higher memory footprint. Keep plugin count reasonable.&lt;/p&gt;

&lt;h3&gt;
  
  
  API Stability
&lt;/h3&gt;

&lt;p&gt;OpenSearch's plugin APIs are versioned with OpenSearch itself. When a new OpenSearch version ships, plugins must be rebuilt and retested against it; the version declared in a plugin's descriptor must match the running OpenSearch version. This is by design: it ensures plugins stay compatible with core.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security
&lt;/h3&gt;

&lt;p&gt;Plugins run in the same JVM as OpenSearch core. A malicious or buggy plugin can crash the entire node. Only install plugins from trusted sources. In multi-tenant environments, consider network isolation or separate clusters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building Your Own Plugin: Where to Start
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Clone the plugin template:&lt;/strong&gt; OpenSearch provides a &lt;code&gt;plugin-template&lt;/code&gt; repository&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement your extension point&lt;/strong&gt; (SearchPlugin, ActionPlugin, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write tests&lt;/strong&gt; — use OpenSearch's testing framework&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build the .zip&lt;/strong&gt; — &lt;code&gt;gradle build&lt;/code&gt; produces the artifact&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Install locally&lt;/strong&gt; — &lt;code&gt;./bin/opensearch-plugin install file://...&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test end-to-end&lt;/strong&gt; — verify your REST endpoint/query/aggregation works&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Publish&lt;/strong&gt; — host on artifact repository or GitHub Releases&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;OpenSearch plugins are not magic. They're well-structured Java code that hooks into OpenSearch via extension points. Understanding this architecture demystifies plugin behavior, helps you troubleshoot issues, and opens the door to building custom extensions.&lt;/p&gt;

&lt;p&gt;Whether you're optimizing search relevance, integrating with custom systems, or building observability tooling, the plugin architecture gives you the hooks you need without compromising core stability.&lt;/p&gt;

&lt;p&gt;The next time a plugin breaks after an upgrade, you'll know exactly where to look. And when you need to build one, you'll have a mental model of how the pieces fit together.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;About the Author&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I'm Prithvi S, Staff Software Engineer at Cloudera and Open Source Enthusiast. I work on data systems, LLM-powered applications, and large-scale architectures. Follow my work on GitHub: &lt;a href="https://github.com/iprithv" rel="noopener noreferrer"&gt;https://github.com/iprithv&lt;/a&gt;&lt;/p&gt;

</description>
      <category>iceberg</category>
      <category>data</category>
      <category>architecture</category>
      <category>database</category>
    </item>
    <item>
      <title>The Credential Vending Revolution: How Polaris Eliminates Long-Lived Keys</title>
      <dc:creator>Prithvi S</dc:creator>
      <pubDate>Thu, 23 Apr 2026 09:24:08 +0000</pubDate>
      <link>https://dev.to/iprithv/the-credential-vending-revolution-how-polaris-eliminates-long-lived-keys-4h6n</link>
      <guid>https://dev.to/iprithv/the-credential-vending-revolution-how-polaris-eliminates-long-lived-keys-4h6n</guid>
      <description>&lt;h2&gt;
  
  
  The Problem Nobody Wants to Talk About
&lt;/h2&gt;

&lt;p&gt;You're a data engineer at a mid-sized company. Your team needs access to production data for analytics, ML pipelines, and ad-hoc queries. So you do what everyone does: you create long-lived AWS credentials (access key + secret), store them in a vault (or worse, environment variables), and distribute them to your team.&lt;/p&gt;

&lt;p&gt;Then you pray.&lt;/p&gt;

&lt;p&gt;You pray nobody copies them to Slack. You pray an engineer doesn't accidentally commit them to GitHub. You pray that when someone leaves, you remember to rotate them. You pray a compromised machine doesn't expose them to attackers.&lt;/p&gt;

&lt;p&gt;This is the status quo. And it's broken.&lt;/p&gt;

&lt;p&gt;For years, data catalogs have accepted this reality: want to access data? Here's a credential. Use it however you want. Rotation? Access control? Audit trails? Maybe in the next version.&lt;/p&gt;

&lt;p&gt;Apache Polaris just threw that playbook in the trash.&lt;/p&gt;

&lt;p&gt;Instead of handing out credentials, Polaris &lt;em&gt;mints them on demand&lt;/em&gt;. Every request for data gets a fresh, short-lived token scoped to exactly what's needed. No long-lived keys. No distribution. No prayer.&lt;/p&gt;

&lt;p&gt;This is credential vending. And it's about to change how we think about data security.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is Credential Vending?
&lt;/h2&gt;

&lt;p&gt;Credential vending is simple in concept but elegant in execution: instead of pre-issuing static credentials, a system dynamically generates temporary, scoped credentials when you request access to data.&lt;/p&gt;

&lt;p&gt;Here's how it works in Polaris:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;You request access:&lt;/strong&gt; A Spark engine asks Polaris for permission to read a specific table&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Polaris authorizes:&lt;/strong&gt; It checks your identity, verifies your role, confirms you have &lt;code&gt;TABLE_READ_DATA&lt;/code&gt; privilege&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Polaris mints a credential:&lt;/strong&gt; It calls AWS STS, GCS token service, or Azure token service and gets a temporary credential&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Polaris scopes it:&lt;/strong&gt; That credential is locked to the specific table path, read-only, and expires in ~15 minutes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Your engine uses it:&lt;/strong&gt; Spark gets the scoped token, reads exactly what it needs, then the credential expires automatically&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No long-lived keys. No distribution. No rotation headaches.&lt;/p&gt;

&lt;p&gt;The genius is in the details.&lt;/p&gt;
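&lt;p&gt;The five steps above can be sketched in a few lines of Java. This is an illustrative simulation, not Polaris code: &lt;code&gt;CredentialVendor&lt;/code&gt; and &lt;code&gt;VendedCredential&lt;/code&gt; are hypothetical names, and the cloud token call is faked.&lt;/p&gt;

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the vending flow; not an actual Polaris API.
public class CredentialVendor {
    // Step 2's authorization data: principal -> privileges it holds
    private final Map<String, Set<String>> grants;

    public CredentialVendor(Map<String, Set<String>> grants) {
        this.grants = grants;
    }

    /** Steps 2-4: authorize, then mint a short-lived, scoped credential. */
    public VendedCredential vend(String principal, String tablePath, String privilege) {
        Set<String> held = grants.getOrDefault(principal, Set.of());
        if (!held.contains(privilege)) {
            throw new SecurityException(principal + " lacks " + privilege);
        }
        // In real Polaris this is an STS/GCS/Azure token-service call; faked here.
        return new VendedCredential(tablePath, privilege,
                Instant.now().plus(Duration.ofMinutes(15)));
    }

    public record VendedCredential(String scopePath, String privilege, Instant expiresAt) {
        public boolean expired() { return Instant.now().isAfter(expiresAt); }
    }
}
```

&lt;p&gt;Step 5 falls out for free: the engine simply uses the token until &lt;code&gt;expired()&lt;/code&gt; is true, and nothing needs to be cleaned up.&lt;/p&gt;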




&lt;h2&gt;
  
  
  Why This Matters: The Security Cascade
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. No Long-Lived Credentials to Steal
&lt;/h3&gt;

&lt;p&gt;Traditional approach: You create an AWS access key with read+write permissions to your S3 bucket. You give it to 20 engineers. It lives in vaults, notebooks, CI/CD pipelines. Anywhere there's a copy, there's a vulnerability.&lt;/p&gt;

&lt;p&gt;Attack surface = number of copies × time each copy exists.&lt;/p&gt;

&lt;p&gt;Polaris approach: Your team never touches long-lived credentials. The only keys that exist are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Polaris' own cloud provider credentials (stored securely, rotated regularly)&lt;/li&gt;
&lt;li&gt;Temporary tokens minted per-request, valid for 15 minutes, then deleted&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Attack surface = 1 system × 15 minutes at a time.&lt;/p&gt;

&lt;p&gt;That's an orders-of-magnitude reduction in exposure.&lt;/p&gt;
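&lt;p&gt;With illustrative numbers (say, 20 copies of a key that lives for 90 days, versus one vending system issuing 15-minute tokens), the gap is easy to quantify:&lt;/p&gt;

```java
// Back-of-envelope math for the attack-surface comparison above.
// The inputs are illustrative, not measurements.
public class AttackSurface {
    /** Exposure window = copies x minutes each copy is live. */
    static long exposureMinutes(long copies, long minutesLive) {
        return copies * minutesLive;
    }

    public static void main(String[] args) {
        // Traditional: 20 copies of a long-lived key, live ~90 days each.
        long legacy = exposureMinutes(20, 90L * 24 * 60);
        // Vending: one system, tokens live 15 minutes at a time.
        long vended = exposureMinutes(1, 15);
        System.out.println(legacy / vended); // ratio of exposure windows
    }
}
```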

&lt;h3&gt;
  
  
  2. Instant Revocation
&lt;/h3&gt;

&lt;p&gt;With long-lived credentials, revoking access means updating IAM policies, rotating keys, or waiting for vault secrets to refresh. By then, a compromised key might already be in use.&lt;/p&gt;

&lt;p&gt;With credential vending, revocation is near-instant: Polaris stops issuing credentials for that principal, and their next data request fails immediately. The only residual exposure is a token already in flight, which expires on its own within its remaining lifetime of at most 15 minutes.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Path-Level Scoping
&lt;/h3&gt;

&lt;p&gt;Your Spark job needs to read &lt;code&gt;s3://data-lake/customers/transactions&lt;/code&gt;. With traditional credentials, you'd give it broad S3 access. Polaris? It mints a credential valid only for that exact table path.&lt;/p&gt;

&lt;p&gt;Even if that credential leaks, an attacker can only read transactions, not your employee records, financial data, or anything else in the bucket.&lt;/p&gt;

&lt;p&gt;Write operations get the same treatment: &lt;code&gt;TABLE_WRITE_DATA&lt;/code&gt; privilege generates a credential that can &lt;em&gt;only&lt;/em&gt; write to that specific table, not drop it, not truncate it, not write to other tables.&lt;/p&gt;

&lt;p&gt;Privilege mapped to cloud permissions. Boundaries enforced at the storage layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Compliance &amp;amp; Audit Trail
&lt;/h3&gt;

&lt;p&gt;Regulators (GDPR, HIPAA, SOX) love paper trails. Credential vending creates one automatically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every credential vend is logged with who requested it, what table, when it was issued, when it expired&lt;/li&gt;
&lt;li&gt;You can correlate data access with user identity without relying on IAM logs that are often delayed or incomplete&lt;/li&gt;
&lt;li&gt;Breaches are traceable: "Which credentials were active when this data left the building?" Answer: the ones issued to that specific user for that specific 15-minute window&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No more "we distributed keys, so we don't know who accessed what" handwaving.&lt;/p&gt;




&lt;h2&gt;
  
  
  Under the Hood: How Polaris Mints Credentials
&lt;/h2&gt;

&lt;p&gt;Let's get technical. Here's the flow for an S3-backed Polaris catalog:&lt;/p&gt;

&lt;h3&gt;
  
  
  The Request
&lt;/h3&gt;

&lt;p&gt;A Trino query hits your Polaris catalog asking to read table &lt;code&gt;prod.warehouse.users&lt;/code&gt;. Polaris receives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your identity (service principal or user)&lt;/li&gt;
&lt;li&gt;The table you want to access&lt;/li&gt;
&lt;li&gt;The operation (read, write, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Authorization Check
&lt;/h3&gt;

&lt;p&gt;Polaris checks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Does this principal have a role assigned?&lt;/li&gt;
&lt;li&gt;Does that role have &lt;code&gt;TABLE_READ_DATA&lt;/code&gt; privilege for this table (or its parent namespace)?&lt;/li&gt;
&lt;li&gt;If yes, proceed. If no, fail fast.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is enforced by Polaris' two-tier RBAC model: Principal Roles (identity) separate from Catalog Roles (permissions). More on that in a future post, but the key insight: authorization happens before any credential is minted.&lt;/p&gt;
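&lt;p&gt;A toy model of that two-tier lookup, with hypothetical names (&lt;code&gt;RbacCheck&lt;/code&gt; is not a Polaris class): the principal resolves to principal roles, principal roles resolve to catalog roles, and catalog roles carry the privileges.&lt;/p&gt;

```java
import java.util.Map;
import java.util.Set;

// Illustrative two-tier RBAC lookup; not the actual Polaris data model.
public class RbacCheck {
    private final Map<String, Set<String>> principalRoles;    // principal -> principal roles
    private final Map<String, Set<String>> catalogRoleGrants; // principal role -> catalog roles
    private final Map<String, Set<String>> privileges;        // catalog role -> privileges

    public RbacCheck(Map<String, Set<String>> principalRoles,
                     Map<String, Set<String>> catalogRoleGrants,
                     Map<String, Set<String>> privileges) {
        this.principalRoles = principalRoles;
        this.catalogRoleGrants = catalogRoleGrants;
        this.privileges = privileges;
    }

    /** True if any catalog role reachable from the principal carries the privilege. */
    public boolean isAuthorized(String principal, String privilege) {
        for (String pRole : principalRoles.getOrDefault(principal, Set.of()))
            for (String cRole : catalogRoleGrants.getOrDefault(pRole, Set.of()))
                if (privileges.getOrDefault(cRole, Set.of()).contains(privilege))
                    return true;
        return false;
    }
}
```

&lt;p&gt;Only if this returns true does any credential get minted; a failed check never touches the cloud provider.&lt;/p&gt;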

&lt;h3&gt;
  
  
  Credential Minting
&lt;/h3&gt;

&lt;p&gt;Assuming authorization passes, Polaris looks up the storage configuration for this catalog. For S3, it has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An AWS account ID&lt;/li&gt;
&lt;li&gt;A role ARN (e.g., &lt;code&gt;arn:aws:iam::123456789:role/polaris-data-lake&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;An external ID (for added security)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Polaris calls &lt;code&gt;STS:AssumeRole&lt;/code&gt; with parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Role ARN&lt;/li&gt;
&lt;li&gt;Duration: 900 seconds (15 minutes)&lt;/li&gt;
&lt;li&gt;Session policy: A JSON policy restricting the token to &lt;code&gt;s3:GetObject&lt;/code&gt; on paths matching &lt;code&gt;s3://data-lake/prod/warehouse/users/*&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AWS returns a temporary security credential (access key, secret key, session token).&lt;/p&gt;

&lt;h3&gt;
  
  
  Scoping the Credential
&lt;/h3&gt;

&lt;p&gt;Here's where it gets clever. The session policy Polaris attaches to the &lt;code&gt;AssumeRole&lt;/code&gt; call means the returned token carries only a narrow slice of the role's permissions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"s3:GetObject"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"s3:ListBucket"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:s3:::data-lake/prod/warehouse/users/*"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
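&lt;p&gt;Assembling such a session policy can be sketched with plain string templating (illustrative only; real code would use the AWS SDK's policy types). One subtlety worth encoding: &lt;code&gt;s3:ListBucket&lt;/code&gt; applies to the bucket ARN, not object ARNs, so it is granted bucket-wide but constrained to the table's prefix.&lt;/p&gt;

```java
// Illustrative session-policy builder; not Polaris' actual code.
public class SessionPolicy {
    /** Read-only policy scoped to one table prefix in one bucket. */
    static String readOnly(String bucket, String tablePrefix) {
        return """
            {
              "Version": "2012-10-17",
              "Statement": [
                {
                  "Effect": "Allow",
                  "Action": ["s3:GetObject"],
                  "Resource": ["arn:aws:s3:::%1$s/%2$s/*"]
                },
                {
                  "Effect": "Allow",
                  "Action": ["s3:ListBucket"],
                  "Resource": ["arn:aws:s3:::%1$s"],
                  "Condition": {"StringLike": {"s3:prefix": ["%2$s/*"]}}
                }
              ]
            }""".formatted(bucket, tablePrefix);
    }
}
```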



&lt;p&gt;For &lt;code&gt;TABLE_WRITE_DATA&lt;/code&gt;, the policy includes &lt;code&gt;s3:PutObject&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"s3:GetObject"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"s3:ListBucket"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"s3:PutObject"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:s3:::data-lake/prod/warehouse/users/*"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice: &lt;code&gt;s3:DeleteObject&lt;/code&gt; is never included. You can write to the table, but you can't delete it or its backing files. Polaris itself controls deletes through atomic metadata operations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Response to Engine
&lt;/h3&gt;

&lt;p&gt;Polaris returns the temporary credential to your Spark/Trino/Flink engine. The engine uses it to read data. After 15 minutes, the token expires automatically.&lt;/p&gt;

&lt;p&gt;No revocation needed. No cleanup. No leftover keys.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Federation Game Changer (v1.3.0)
&lt;/h2&gt;

&lt;p&gt;Here's where it gets even more interesting.&lt;/p&gt;

&lt;p&gt;Many organizations don't run pure Iceberg. You have Snowflake for analytics, Delta Lake in Databricks, Hudi on Hadoop, and Iceberg in your data lake. Managing credentials across all of them is a nightmare.&lt;/p&gt;

&lt;p&gt;Polaris v1.3.0 introduces federated credential vending.&lt;/p&gt;

&lt;p&gt;Instead of each external system managing its own credentials, Polaris can mint credentials &lt;em&gt;on behalf&lt;/em&gt; of external catalogs. Your Snowflake-to-Iceberg migration? Polaris handles credential vending for both. Your Databricks Delta table accessed through Polaris? Same story.&lt;/p&gt;

&lt;p&gt;This is huge for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data mesh architectures:&lt;/strong&gt; One source of truth for credential vending across multiple catalog types&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Migrations:&lt;/strong&gt; Seamlessly bridge old systems (Snowflake, Glue) with new ones (Iceberg) without credential sprawl&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-cloud setups:&lt;/strong&gt; Mint GCS credentials for BigQuery, S3 credentials for Iceberg, all from a single Polaris instance&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Performance: The Caching Question
&lt;/h2&gt;

&lt;p&gt;"But doesn't minting a credential for every request add latency?"&lt;/p&gt;

&lt;p&gt;Yes. Each STS call typically takes 100-200 ms.&lt;/p&gt;

&lt;p&gt;Polaris solves this with intelligent caching:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For repeated requests from the same principal to the same table, Polaris reuses the cached credential (until near expiration)&lt;/li&gt;
&lt;li&gt;This reduces cloud provider API calls significantly&lt;/li&gt;
&lt;li&gt;The tradeoff: earlier revocation (e.g., if permissions change mid-session) requires a cache flush&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For most workloads (batch jobs, dashboards, recurring queries), this is a net win. Latency is imperceptible; security is massively improved.&lt;/p&gt;
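&lt;p&gt;A sketch of that caching idea, keyed by principal and table and re-minting only when a token is near expiry. The names are hypothetical; Polaris' real caching internals may differ.&lt;/p&gt;

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Illustrative credential cache; not Polaris' actual implementation.
public class CredentialCache {
    record Key(String principal, String table) {}
    record Token(String value, Instant expiresAt) {}

    private final Map<Key, Token> cache = new ConcurrentHashMap<>();
    // Re-mint a little early so callers never receive an about-to-expire token.
    private final Duration refreshMargin = Duration.ofMinutes(2);

    /** Reuse the cached token unless it is within the refresh margin of expiring. */
    public Token get(String principal, String table, Supplier<Token> mint) {
        Key k = new Key(principal, table);
        Token t = cache.get(k);
        if (t == null || Instant.now().plus(refreshMargin).isAfter(t.expiresAt())) {
            t = mint.get(); // one STS round-trip (typically 100-200 ms)
            cache.put(k, t);
        }
        return t;
    }
}
```

&lt;p&gt;The tradeoff from the list above lives in &lt;code&gt;refreshMargin&lt;/code&gt;: a cached token keeps working until it nears expiry, so a permission change takes effect only at the next mint unless the cache is flushed.&lt;/p&gt;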




&lt;h2&gt;
  
  
  Implementing Credential Vending in Your Catalog
&lt;/h2&gt;

&lt;p&gt;If you're building a catalog or evaluating options, here's what to look for:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cloud-native credential generation:&lt;/strong&gt; Does the system call your cloud provider's token service (STS, GCS, Azure), or does it generate its own tokens? Cloud-native is better (leverages existing IAM, auditable).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scoping mechanism:&lt;/strong&gt; Are credentials scoped to table paths, operations, or both? Path + operation = maximum security.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Expiration:&lt;/strong&gt; How short can tokens be? 15 minutes is ideal for security; anything longer risks exposing stale credentials.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Caching strategy:&lt;/strong&gt; How does the system balance revocation latency with performance? Intelligent caching (by principal + table) is the sweet spot.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-cloud support:&lt;/strong&gt; Do you need GCS, S3, and Azure all at once? Credential vending should work across cloud providers.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Polaris nails all five. That's why it's becoming the standard for open-source Iceberg deployments.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;Credential vending is the foundation. On top of it, Polaris builds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Two-tier RBAC for fine-grained access control&lt;/li&gt;
&lt;li&gt;OPA integration for externalized policies&lt;/li&gt;
&lt;li&gt;Metrics reporting for observability&lt;/li&gt;
&lt;li&gt;Generic table support for non-Iceberg formats&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But credential vending is the security core. It's why Polaris is uniquely positioned for zero-trust data architectures.&lt;/p&gt;

&lt;p&gt;If your organization is scaling data access, if compliance is a concern, or if you're tired of rotating long-lived credentials, Polaris' credential vending approach is worth the migration.&lt;/p&gt;

&lt;p&gt;No more praying.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Questions? Thoughts on credential vending, Polaris, or data security architecture?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Drop a comment below or find me on GitHub:&lt;/strong&gt; &lt;a href="https://github.com/iprithv" rel="noopener noreferrer"&gt;https://github.com/iprithv&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;About the author:&lt;/strong&gt; I'm Prithvi S, Staff Software Engineer at Cloudera and Open Source Enthusiast. Follow my work on &lt;a href="https://github.com/iprithv" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>iceberg</category>
      <category>data</category>
      <category>architecture</category>
      <category>database</category>
    </item>
    <item>
      <title>Elasticsearch Cluster Health 101: Understanding Your Distributed System's Vital Signs</title>
      <dc:creator>Prithvi S</dc:creator>
      <pubDate>Tue, 21 Apr 2026 14:29:18 +0000</pubDate>
      <link>https://dev.to/iprithv/elasticsearch-cluster-health-101-understanding-your-distributed-systems-vital-signs-1kl6</link>
      <guid>https://dev.to/iprithv/elasticsearch-cluster-health-101-understanding-your-distributed-systems-vital-signs-1kl6</guid>
      <description>&lt;p&gt;You ship your Elasticsearch cluster to production. Traffic hits it. Three hours later, your monitoring dashboard flashes yellow. Your heart sinks. &lt;em&gt;What does that mean? Are you in trouble? Should you wake up the on-call engineer at 2 AM?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This post teaches you to read your cluster's health like a doctor reads vital signs. By the end, you'll understand what GREEN, YELLOW, and RED actually mean, why your cluster sometimes needs time to heal itself, and how to spot real problems before they become disasters.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Cluster Health? The Three States
&lt;/h2&gt;

&lt;p&gt;Every Elasticsearch cluster has a health status. It's not a guess. It's a concrete signal that tells you whether your data is safe.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;GET&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/_cluster/health&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"cluster_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"production"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"green"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timed_out"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"number_of_nodes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"number_of_data_nodes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"active_primary_shards"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"active_shards"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;36&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"relocating_shards"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"initializing_shards"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"unassigned_shards"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"delayed_unassigned_shards"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"number_of_pending_tasks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"number_of_in_flight_fetch"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"task_max_waiting_in_queue_millis"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"active_shards_percent_as_number"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;100.0&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;GREEN&lt;/strong&gt; means all is well. Every primary shard has a replica. Every index is fully replicated. You can sleep soundly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;YELLOW&lt;/strong&gt; means something is missing, but it's not critical &lt;em&gt;yet&lt;/em&gt;. Your data is still readable and searchable. Every primary shard exists. But not all replicas are assigned. This usually happens when you lose a node, and Elasticsearch hasn't finished rebalancing yet. You have time to fix it, but replicas are your safety net: if another node fails while you're yellow, you can lose data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RED&lt;/strong&gt; means you're in trouble. At least one primary shard is missing. Data is gone or unreachable. Your cluster cannot fully serve requests. This is the emergency light. Time to act.&lt;/p&gt;
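&lt;p&gt;The three states boil down to a simple rule over shard counts, which can be written down directly (a simplified model, not Elasticsearch's actual implementation):&lt;/p&gt;

```java
// Simplified mapping from shard counts to the three health states above.
public class ClusterHealth {
    enum Status { GREEN, YELLOW, RED }

    static Status status(int expectedPrimaries, int activePrimaries,
                         int expectedReplicas, int activeReplicas) {
        if (activePrimaries < expectedPrimaries) return Status.RED;    // data missing or unreachable
        if (activeReplicas < expectedReplicas) return Status.YELLOW;   // redundancy missing
        return Status.GREEN;                                           // fully replicated
    }
}
```

&lt;p&gt;Note the asymmetry: replicas only affect YELLOW vs GREEN, while a single missing primary is enough for RED.&lt;/p&gt;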

&lt;h2&gt;
  
  
  The Architecture Behind Health: Cluster Coordination
&lt;/h2&gt;

&lt;p&gt;To understand why your cluster gets sick, you need to understand how it stays healthy.&lt;/p&gt;

&lt;p&gt;Elasticsearch is fundamentally distributed. Your data is split across multiple nodes. Each node is independent. But they need to agree on one critical thing: &lt;em&gt;where is my data?&lt;/em&gt; This is the job of the cluster coordinator (master node).&lt;/p&gt;

&lt;h3&gt;
  
  
  The Master Node: The Orchestrator
&lt;/h3&gt;

&lt;p&gt;One node in your cluster is elected master. This node makes all the big decisions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where do shards live?&lt;/li&gt;
&lt;li&gt;Is node X still alive, or did it fail?&lt;/li&gt;
&lt;li&gt;When a node joins, where do its shards go?&lt;/li&gt;
&lt;li&gt;Which indices can be created or deleted?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The master maintains the cluster state, a constantly updated map of the cluster. This map says: "Shard 0 of index-2026-04 is on node-1 (primary), node-2 (replica), and node-3 (replica)."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why does this matter?&lt;/strong&gt; Because if the master dies, the cluster needs to elect a new one. And if you don't have enough nodes to reach a quorum, the cluster freezes to prevent split-brain (where two masters disagree and corrupt your data).&lt;/p&gt;

&lt;h3&gt;
  
  
  Master Election: The Quorum Rule
&lt;/h3&gt;

&lt;p&gt;Elasticsearch uses quorum voting. Of your N master-eligible nodes, a majority (floor(N/2) + 1) must be available to elect and keep a working master.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1 master-eligible node: quorum = 1, works but is a single point of failure&lt;/li&gt;
&lt;li&gt;2 master-eligible nodes: quorum = 2, so losing either node leaves no majority and no master&lt;/li&gt;
&lt;li&gt;3 master-eligible nodes: quorum = 2 (safe, survives 1 failure)&lt;/li&gt;
&lt;li&gt;5 master-eligible nodes: quorum = 3 (safe, survives 2 failures)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why production clusters run 3 or 5 master-eligible nodes, not 2. A 2-node cluster can't form a quorum if either node fails: with no elected master, the cluster stops accepting writes and metadata changes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Recommended Production Setup:
- 3 master-eligible nodes (dedicated, small machines)
- 3+ data nodes (store and search data, large machines)
- 1+ coordinating nodes (optional, route queries, aggregate results)

This setup survives any single node failure.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Shard Allocation: How Data Spreads
&lt;/h2&gt;

&lt;p&gt;Your index has 3 primary shards and 2 replicas per primary. That's 9 shards total (3 primary + 6 replica). Elasticsearch's job is to spread these 9 shards across your nodes so that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;No primary and replica on the same node&lt;/strong&gt; (otherwise a single node failure loses data)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replicas spread across different nodes&lt;/strong&gt; (fault tolerance)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load balanced&lt;/strong&gt; (roughly equal shard count per node)&lt;/li&gt;
&lt;/ol&gt;
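&lt;p&gt;The three rules above can be sketched as a toy allocator. This is a deliberate simplification of Elasticsearch's real balancer (which also weighs disk usage, allocation filters, and awareness attributes); it only shows the "no two copies of a shard on one node, least-loaded node first" idea:&lt;br&gt;
&lt;/p&gt;

```python
def allocate(num_primaries, replicas_per_primary, nodes):
    """Place shard copies so no node holds two copies of the same shard,
    preferring the least-loaded eligible node for each copy."""
    placement = {n: [] for n in nodes}  # node -> list of (shard_id, role)
    for shard in range(num_primaries):
        for copy in range(1 + replicas_per_primary):
            # nodes not already holding a copy of this shard, least loaded first
            eligible = sorted(
                (n for n in nodes if shard not in [s for s, _ in placement[n]]),
                key=lambda n: len(placement[n]),
            )
            if not eligible:
                break  # not enough nodes: the copy stays unassigned (YELLOW)
            role = "primary" if copy == 0 else "replica"
            placement[eligible[0]].append((shard, role))
    return placement

# 3 primaries, 2 replicas each, 3 nodes -> every node holds one copy of every shard
print(allocate(3, 2, ["node-1", "node-2", "node-3"]))
```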

&lt;p&gt;When everything works, this happens automatically. When a node fails, Elasticsearch:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Detects the failure (by default, after roughly 30 seconds of failed health checks)&lt;/li&gt;
&lt;li&gt;Marks the node as dead&lt;/li&gt;
&lt;li&gt;Reassigns its shards to other nodes&lt;/li&gt;
&lt;li&gt;Creates new replicas to restore redundancy&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This recovery process is often loosely called rebalancing. It takes time: a large index might need minutes or hours to fully recover. During this window your cluster is YELLOW (replicas missing) but still operational.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Health Scenarios
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Scenario 1: New Node Joins
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before: 2 nodes, 6 shards each (fully replicated, GREEN)
New node joins
Action: Master rebalances, shards move to new node
During: YELLOW (shards initializing on new node)
After: GREEN (shards reassigned, balanced)
Timeline: Minutes to hours depending on shard size
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scenario 2: Node Failure
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before: 3 nodes, GREEN (all shards have replicas)
Node 2 crashes (network partition, power failure)
Immediately: YELLOW (node 2's shards gone, replicas missing)
Action: Master promotes replicas on nodes 1 and 3 to primary
       Creates new replicas on nodes 1 and 3
During: YELLOW (replicas initializing)
After: GREEN (all shards have replicas again)
Timeline: Seconds (replica promotion) + minutes (replica creation)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scenario 3: Disk Full
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before: 3 nodes, GREEN
Node 1 disk reaches 85% capacity
Action: Elasticsearch refuses to assign new shards to node 1
Symptom: Some shards can't be assigned to node 1, cluster goes YELLOW
Fix: Delete old indices, or add disk space
After: Cluster rebalances, goes GREEN
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Reading the Health Endpoint: What Each Field Means
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;GET /_cluster/health&lt;/code&gt; API is your primary diagnostic tool. Here's what each field tells you:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;status&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;GREEN (all good), YELLOW (missing replicas), RED (missing primary)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;number_of_nodes&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Total nodes in cluster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;number_of_data_nodes&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Nodes that store data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;active_primary_shards&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Primary shards assigned and healthy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;active_shards&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Primary + replica shards assigned and healthy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;relocating_shards&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Shards currently moving to another node&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;initializing_shards&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Shards being created or recovered&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;unassigned_shards&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Shards that haven't been assigned to a node&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;delayed_unassigned_shards&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Shards waiting to be assigned (temporary delay)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;number_of_pending_tasks&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Master tasks waiting to be executed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Example: Degraded Cluster&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"yellow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"number_of_nodes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"active_primary_shards"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"active_shards"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"unassigned_shards"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"relocating_shards"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"initializing_shards"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Translation: 3 nodes, 12 primary shards assigned, but only 24 total shards assigned. That means 12 replicas are missing (unassigned). Also, 2 shards are moving, 4 are initializing. The cluster is rebalancing from a recent failure or node addition.&lt;/p&gt;
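&lt;p&gt;That arithmetic is easy to automate. A sketch that derives the same one-line summary from a health response (the field names are the real API's; the helper itself is illustrative):&lt;br&gt;
&lt;/p&gt;

```python
def summarize_health(h: dict) -> str:
    """One-line summary of a /_cluster/health response body."""
    active_replicas = h["active_shards"] - h["active_primary_shards"]
    return (
        f"{h['status'].upper()}: {h['number_of_nodes']} nodes, "
        f"{h['active_primary_shards']} primaries, {active_replicas} replicas active, "
        f"{h['unassigned_shards']} unassigned, "
        f"{h['relocating_shards']} relocating, {h['initializing_shards']} initializing"
    )

# the degraded-cluster example from above
health = {
    "status": "yellow", "number_of_nodes": 3,
    "active_primary_shards": 12, "active_shards": 24,
    "unassigned_shards": 12, "relocating_shards": 2, "initializing_shards": 4,
}
print(summarize_health(health))
```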

&lt;h2&gt;
  
  
  Diagnosing RED: Data Is Missing
&lt;/h2&gt;

&lt;p&gt;A RED cluster means at least one primary shard has no home. This is an emergency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Find the problematic shard:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;GET&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;_cat&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;shards&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="n"&gt;health&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;red&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shows you which shards are unassigned. Look for entries with no node assignment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common causes:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Node failure with insufficient replicas&lt;/strong&gt; - If a node fails and you had zero replicas, the primary shard is lost&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fix: Restore from snapshot&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Disk full on all nodes&lt;/strong&gt; - Elasticsearch won't assign shards to nodes &amp;gt;85% full&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fix: Delete old indices, add disk space, or adjust disk threshold setting&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Allocation disabled&lt;/strong&gt; - Someone (usually during disaster recovery) disabled shard allocation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fix: Re-enable with &lt;code&gt;PUT /_cluster/settings {"transient": {"cluster.routing.allocation.enable": "all"}}&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Too many relocating shards&lt;/strong&gt; - Master is overloaded trying to rebalance&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fix: Wait, or reduce concurrent recoveries with cluster settings&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Diagnosing YELLOW: Replicas Are Missing
&lt;/h2&gt;

&lt;p&gt;YELLOW is a warning, not a failure. You can still read and write. But you're one node failure away from RED.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check which indices are yellow:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;GET&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;_cat&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;indices&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="n"&gt;health&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;yellow&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Check if it's stuck or still rebalancing:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;GET&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;_cluster&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;health&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="n"&gt;wait_for_status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;green&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This waits up to 5 minutes for the cluster to reach GREEN. If it times out, you're stuck yellow.&lt;/p&gt;
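&lt;p&gt;&lt;code&gt;wait_for_status&lt;/code&gt; blocks server-side, but the same pattern is easy to wrap client-side. A generic polling sketch; the HTTP call itself is left out, so plug in any health check you like:&lt;br&gt;
&lt;/p&gt;

```python
import time

def wait_for(check, timeout_s=300, interval_s=5):
    """Poll check() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval_s)
    return False

# e.g. check = lambda: fetch_health()["status"] == "green"
# (fetch_health is a hypothetical function that GETs /_cluster/health)
```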

&lt;p&gt;&lt;strong&gt;Why you might be stuck yellow:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Insufficient nodes&lt;/strong&gt; - You have 1 data node but 2 replicas per shard. No place to put the replicas&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fix: Add more nodes&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Allocation disabled&lt;/strong&gt; - Replicas won't be assigned if allocation is off&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fix: &lt;code&gt;PUT /_cluster/settings {"transient": {"cluster.routing.allocation.enable": "all"}}&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Allocation filters blocking replicas&lt;/strong&gt; - You set a filter that prevents replicas on certain nodes&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fix: Review allocation filtering rules&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Monitoring: Don't Just React, Anticipate
&lt;/h2&gt;

&lt;p&gt;Cluster health is reactive. It tells you what happened, not what will happen. For reliability, monitor proactively:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alert on these:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Status == RED (obvious, immediate incident)&lt;/li&gt;
&lt;li&gt;Status == YELLOW for &amp;gt;5 minutes (stuck rebalancing, investigate)&lt;/li&gt;
&lt;li&gt;Unassigned shards &amp;gt; 0 for &amp;gt;10 minutes&lt;/li&gt;
&lt;li&gt;Disk usage &amp;gt;85% on any data node&lt;/li&gt;
&lt;li&gt;Heap usage &amp;gt;80% on any node&lt;/li&gt;
&lt;li&gt;Relocating shards &amp;gt; 5 (recovery is slow)&lt;/li&gt;
&lt;/ol&gt;
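&lt;p&gt;Those six rules translate directly into code. A sketch of an alert evaluator; the per-node field names (&lt;code&gt;heap_pct&lt;/code&gt;, &lt;code&gt;disk_pct&lt;/code&gt;) are assumptions you would map from your own metrics source:&lt;br&gt;
&lt;/p&gt;

```python
def evaluate_alerts(health, nodes, yellow_for_s=0, unassigned_for_s=0):
    """Apply the alert rules above; returns a list of alert strings.

    nodes entries are dicts like {"name": ..., "heap_pct": ..., "disk_pct": ...}
    (hypothetical field names; adapt to your monitoring stack).
    """
    alerts = []
    if health["status"] == "red":
        alerts.append("cluster RED: at least one primary shard is unassigned")
    if health["status"] == "yellow" and yellow_for_s > 5 * 60:
        alerts.append("stuck YELLOW for more than 5 minutes")
    if health.get("unassigned_shards", 0) > 0 and unassigned_for_s > 10 * 60:
        alerts.append("unassigned shards for more than 10 minutes")
    if health.get("relocating_shards", 0) > 5:
        alerts.append("more than 5 relocating shards: recovery is slow")
    for n in nodes:
        if n["disk_pct"] > 85:
            alerts.append(f"{n['name']}: disk above 85%")
        if n["heap_pct"] > 80:
            alerts.append(f"{n['name']}: heap above 80%")
    return alerts

print(evaluate_alerts(
    {"status": "yellow", "unassigned_shards": 6, "relocating_shards": 1},
    [{"name": "node-1", "heap_pct": 91, "disk_pct": 70}],
    yellow_for_s=900, unassigned_for_s=900,
))
```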

&lt;p&gt;&lt;strong&gt;Useful dashboard queries:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;GET&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;_cat&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;ip&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;heap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;percent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;disk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;percent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;cpu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;load_1m&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shows node-by-node health: heap usage, disk usage, CPU, load. Red flags: heap &amp;gt;80%, disk &amp;gt;85%.&lt;br&gt;
&lt;/p&gt;
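&lt;p&gt;The &lt;code&gt;_cat&lt;/code&gt; APIs return whitespace-delimited tables meant for humans; for a dashboard you usually want them as records. A minimal parsing sketch (it assumes no column value contains spaces, which holds for the header set shown above):&lt;br&gt;
&lt;/p&gt;

```python
def parse_cat_table(text):
    """Parse a _cat API response requested with ?v (header row first)."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    headers = lines[0].split()
    return [dict(zip(headers, ln.split())) for ln in lines[1:]]

# sample output shaped like GET /_cat/nodes?v&h=name,ip,heap.percent,disk.percent,cpu,load_1m
sample = """name   ip       heap.percent disk.percent cpu load_1m
node-1 10.0.0.1 72           81           14  0.9
node-2 10.0.0.2 45           60           9   0.3
"""
rows = parse_cat_table(sample)
print(rows[0]["heap.percent"])  # 72
```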

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;GET&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;_nodes&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;jvm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;indices&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deep dive: garbage collection pauses, segment count, cache hit rates. Useful for performance issues hiding behind a GREEN cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes That Destroy Reliability
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Mistake 1: Single master-eligible node&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You think it works fine until that one node fails&lt;/li&gt;
&lt;li&gt;With that node gone, no master can be elected; the cluster stops accepting writes and cluster-state changes&lt;/li&gt;
&lt;li&gt;Fix: Always run 3+ master-eligible nodes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Mistake 2: Ignoring YELLOW for days&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"It's yellow, but traffic is fine!" you say&lt;/li&gt;
&lt;li&gt;Then a second node fails, cluster goes RED&lt;/li&gt;
&lt;li&gt;Fix: Investigate YELLOW immediately, restore replicas&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Mistake 3: All shards on one node&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You didn't specify replicas or shard allocation rules&lt;/li&gt;
&lt;li&gt;One node failure = RED cluster&lt;/li&gt;
&lt;li&gt;Fix: Use allocation awareness, rack awareness, or zone awareness&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Mistake 4: Disabling shard allocation and forgetting to re-enable&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You disabled it during maintenance and moved on&lt;/li&gt;
&lt;li&gt;Weeks later, replicas are still unassigned&lt;/li&gt;
&lt;li&gt;Fix: Audit allocation settings regularly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Mistake 5: Not understanding recovery time&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You expect replicas to be created instantly&lt;/li&gt;
&lt;li&gt;But recovery depends on network bandwidth, index size, merge rate&lt;/li&gt;
&lt;li&gt;You panic and manually delete/recreate indices, making it worse&lt;/li&gt;
&lt;li&gt;Fix: Understand recovery SLOs for your cluster size&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Putting It Together: Your First Cluster Health Audit
&lt;/h2&gt;

&lt;p&gt;Here's what to do right now:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check overall health&lt;/span&gt;
curl &lt;span class="s2"&gt;"http://localhost:9200/_cluster/health?pretty"&lt;/span&gt;

&lt;span class="c"&gt;# If not GREEN, check which indices are affected (quote the URL: an unquoted &amp;amp; backgrounds the command)&lt;/span&gt;
curl &lt;span class="s2"&gt;"http://localhost:9200/_cat/indices?health=yellow&amp;amp;v"&lt;/span&gt;

&lt;span class="c"&gt;# Check which shards are unassigned&lt;/span&gt;
curl &lt;span class="s2"&gt;"http://localhost:9200/_cat/shards?health=yellow&amp;amp;v"&lt;/span&gt;

&lt;span class="c"&gt;# Check node status&lt;/span&gt;
curl &lt;span class="s2"&gt;"http://localhost:9200/_cat/nodes?v&amp;amp;h=name,ip,heap.percent,disk.percent,cpu"&lt;/span&gt;

&lt;span class="c"&gt;# Check allocation settings&lt;/span&gt;
curl &lt;span class="s2"&gt;"http://localhost:9200/_cluster/settings?pretty"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you see YELLOW and unassigned shards, it's usually one of these:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A node is recovering (wait 5-10 min, check again)&lt;/li&gt;
&lt;li&gt;You don't have enough nodes for your replica count (add nodes)&lt;/li&gt;
&lt;li&gt;Disk is full (delete old data)&lt;/li&gt;
&lt;li&gt;Allocation is disabled (re-enable it)&lt;/li&gt;
&lt;/ol&gt;
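&lt;p&gt;The four causes above can be encoded as a first-pass triage. Ordering matters: check the cheap, definitive causes before assuming a transient recovery. All names here are illustrative:&lt;br&gt;
&lt;/p&gt;

```python
def triage(status, allocation_enabled, max_disk_pct, data_nodes, replicas_per_shard):
    """First-pass diagnosis for a non-GREEN cluster, checked in priority order."""
    if status == "green":
        return "healthy"
    if not allocation_enabled:
        return "re-enable shard allocation"
    if max_disk_pct >= 85:
        return "free disk space or add capacity (disk watermark reached)"
    if data_nodes < replicas_per_shard + 1:
        return "add data nodes: too few for the configured replica count"
    return "likely still recovering: wait 5-10 minutes and re-check"

# 1 data node but 2 replicas per shard -> the replicas can never be placed
print(triage("yellow", True, 60, 1, 2))
```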

&lt;h2&gt;
  
  
  Conclusion: Health Is Visibility
&lt;/h2&gt;

&lt;p&gt;Cluster health is not a number to ignore. It's your window into the distributed system running underneath your search and analytics.&lt;/p&gt;

&lt;p&gt;GREEN means you're safe. YELLOW means you're vulnerable. RED means you have a real problem.&lt;/p&gt;

&lt;p&gt;The key insight: &lt;strong&gt;Elasticsearch recovers automatically most of the time.&lt;/strong&gt; Your job is to understand what's happening, monitor proactively, and know when to intervene.&lt;/p&gt;

&lt;p&gt;Next step: Learn about shard allocation strategies and how to scale your cluster without triggering cascading failures.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;About the Author&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I'm Prithvi S, Staff Software Engineer at Cloudera and Open Source Enthusiast. Follow my work on GitHub: &lt;a href="https://github.com/iprithv" rel="noopener noreferrer"&gt;https://github.com/iprithv&lt;/a&gt;&lt;/p&gt;

</description>
      <category>elasticsearch</category>
      <category>search</category>
      <category>database</category>
      <category>analytics</category>
    </item>
    <item>
      <title>Building a Search Quality Evaluation Pipeline with OpenSearch Search Relevance</title>
      <dc:creator>Prithvi S</dc:creator>
      <pubDate>Mon, 20 Apr 2026 00:34:27 +0000</pubDate>
      <link>https://dev.to/iprithv/building-a-search-quality-evaluation-pipeline-with-opensearch-search-relevance-42n</link>
      <guid>https://dev.to/iprithv/building-a-search-quality-evaluation-pipeline-with-opensearch-search-relevance-42n</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1551288049-bebda4e38f71%3Fw%3D800" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1551288049-bebda4e38f71%3Fw%3D800" alt="Search Metrics Dashboard" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;You've built a search engine. Users can query your data. Results come back. But here's the uncomfortable question: are the results &lt;em&gt;good&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;This isn't about whether the feature works. It works. But does it rank the most relevant documents first? Do users find what they're looking for? Are you optimizing for the right metrics?&lt;/p&gt;

&lt;p&gt;If you're operating OpenSearch at scale, you've probably felt this pain. Search quality isn't a one-time configuration problem. It's an ongoing optimization challenge. You need to measure it, experiment with it, and improve it systematically.&lt;/p&gt;

&lt;p&gt;That's where the OpenSearch Search Relevance plugin comes in.&lt;/p&gt;

&lt;p&gt;In this post, I'll walk you through building an end-to-end search quality evaluation pipeline using the Search Relevance plugin. By the end, you'll understand how to create representative query sets, run controlled experiments, collect human judgments, compute relevance metrics, and iterate on your search configuration until results actually matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Search Quality Problem
&lt;/h2&gt;

&lt;p&gt;Before diving into the solution, let's frame the problem clearly.&lt;/p&gt;

&lt;p&gt;Search quality has multiple dimensions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Relevance&lt;/strong&gt; - Does the top result match what the user searched for?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Completeness&lt;/strong&gt; - Are all relevant documents in the top-K results?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ranking&lt;/strong&gt; - Are more relevant docs ranked higher than less relevant ones?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Precision&lt;/strong&gt; - What fraction of returned results are actually useful?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recall&lt;/strong&gt; - What fraction of useful documents did we find?&lt;/li&gt;
&lt;/ol&gt;
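&lt;p&gt;Precision and recall at a cutoff K are one-liners once you have relevance labels. A sketch (the document IDs and labels are made up for illustration):&lt;br&gt;
&lt;/p&gt;

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents found in the top-k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

retrieved = ["d1", "d7", "d3", "d9"]   # ranked result list for one query
relevant = {"d1", "d3", "d5"}          # judged-relevant documents
print(precision_at_k(retrieved, relevant, 4))  # 0.5
print(recall_at_k(retrieved, relevant, 4))
```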

&lt;p&gt;The BM25 algorithm (OpenSearch's default) is good out of the box. But "good" isn't "perfect." And what's perfect for one use case (e-commerce product search) might be terrible for another (medical research papers).&lt;/p&gt;

&lt;p&gt;You need a way to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Define what "good" means for your domain&lt;/li&gt;
&lt;li&gt;Measure it quantitatively&lt;/li&gt;
&lt;li&gt;Test changes before deploying&lt;/li&gt;
&lt;li&gt;Track improvements over time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is exactly what a search quality evaluation pipeline does.&lt;/p&gt;

&lt;h2&gt;
  
  
  Meet the OpenSearch Search Relevance Plugin
&lt;/h2&gt;

&lt;p&gt;The Search Relevance plugin is part of the opensearch-project ecosystem. It's designed specifically to solve this problem.&lt;/p&gt;

&lt;p&gt;At its core, it orchestrates five key components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Query Sets&lt;/strong&gt; - Representative questions or search terms for your domain&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Search Configurations&lt;/strong&gt; - Different index analyzers, query types, and boosting settings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Experiments&lt;/strong&gt; - Controlled comparisons between two search configurations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Judgments&lt;/strong&gt; - Human-provided relevance labels for query-document pairs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metrics&lt;/strong&gt; - Computed evaluation scores: nDCG, precision, recall, MRR&lt;/li&gt;
&lt;/ol&gt;
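&lt;p&gt;The two rank-aware metrics in that list, nDCG and MRR, are also short to define. A sketch using graded relevance gains (0 = irrelevant, higher = better); this mirrors the standard definitions, not any plugin-internal code:&lt;br&gt;
&lt;/p&gt;

```python
import math

def dcg(gains):
    """Discounted cumulative gain: position i is discounted by log2(i + 2)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(gains):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0

def mrr(first_relevant_ranks):
    """Mean reciprocal rank of the first relevant result, across queries."""
    return sum(1.0 / r for r in first_relevant_ranks) / len(first_relevant_ranks)

print(ndcg([3, 2, 0, 1]))  # below 1.0: a gain-1 doc is ranked under a gain-0 doc
print(mrr([1, 2, 4]))
```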

&lt;p&gt;The beauty is that it connects all of these together into a coherent workflow. You don't need to glue together five different tools. It's all built in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Build Your Query Set
&lt;/h2&gt;

&lt;p&gt;Everything starts with queries.&lt;/p&gt;

&lt;p&gt;A query set is a collection of representative search terms or questions for your domain. The quality of your query set directly affects the quality of your evaluation.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Design a Query Set
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Think like your users.&lt;/strong&gt; What would they actually search for?&lt;/p&gt;

&lt;p&gt;For an e-commerce search engine, examples might be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"blue running shoes size 10"&lt;/li&gt;
&lt;li&gt;"wireless headphones under 100"&lt;/li&gt;
&lt;li&gt;"coffee maker for 2 people"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a documentation search:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"how to configure SSL certificates"&lt;/li&gt;
&lt;li&gt;"what is a shard"&lt;/li&gt;
&lt;li&gt;"troubleshooting connection timeouts"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a code search:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"implement BFS algorithm"&lt;/li&gt;
&lt;li&gt;"parse JSON to object"&lt;/li&gt;
&lt;li&gt;"handle file not found error"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Include variety.&lt;/strong&gt; Your query set should cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Short queries (1-2 terms) and long queries (5+ terms)&lt;/li&gt;
&lt;li&gt;Common searches and edge cases&lt;/li&gt;
&lt;li&gt;Different intent types (navigation, informational, transactional)&lt;/li&gt;
&lt;li&gt;Different domains if your search spans multiple categories&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Size matters.&lt;/strong&gt; A good query set has 50 to a few hundred queries, depending on your domain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;50-100: Initial evaluation (fast iteration)&lt;/li&gt;
&lt;li&gt;100-200: Standard evaluation for most teams&lt;/li&gt;
&lt;li&gt;200+: Large-scale benchmarking across multiple configurations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Create it incrementally.&lt;/strong&gt; Start small, run experiments, learn what queries are most impactful, expand.&lt;/p&gt;

&lt;h3&gt;
  
  
  Storing Your Query Set
&lt;/h3&gt;

&lt;p&gt;In the Search Relevance plugin, query sets are stored as OpenSearch documents. Here's a conceptual example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ecommerce-base-v1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"E-commerce Base Query Set"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"100 representative queries for product search"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"queries"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"q001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"blue running shoes"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"q002"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"wireless headphones"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each query has an ID and text. Simple. You can create query sets via the OpenSearch API or the Dashboards UI.&lt;/p&gt;
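&lt;p&gt;Generating that document shape is easy to script. A sketch that builds a query set from a plain list of query strings; the field names mirror the conceptual example above, and the exact schema may differ by plugin version:&lt;br&gt;
&lt;/p&gt;

```python
def make_query_set(set_id, name, description, query_texts):
    """Build a query-set document with sequential zero-padded query IDs."""
    return {
        "_id": set_id,
        "name": name,
        "description": description,
        "queries": [
            {"id": f"q{i:03d}", "text": text}
            for i, text in enumerate(query_texts, start=1)
        ],
    }

qs = make_query_set(
    "ecommerce-base-v1", "E-commerce Base Query Set",
    "Representative queries for product search",
    ["blue running shoes", "wireless headphones"],
)
print(qs["queries"][0])  # {'id': 'q001', 'text': 'blue running shoes'}
```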

&lt;h2&gt;
  
  
  Step 2: Define Search Configurations
&lt;/h2&gt;

&lt;p&gt;A search configuration is a snapshot of how you want to search: analyzers, query types, field boosting, synonym expansion, and more.&lt;/p&gt;

&lt;p&gt;Think of it as: "This is one way to search."&lt;/p&gt;

&lt;h3&gt;
  
  
  What Goes Into a Configuration
&lt;/h3&gt;

&lt;p&gt;Here are common things you might tune:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analyzers&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Standard analyzer vs. custom analyzer with stemming&lt;/li&gt;
&lt;li&gt;Language-specific analyzers (English, French, etc.)&lt;/li&gt;
&lt;li&gt;Phonetic analysis for typo tolerance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Query Types&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;BM25 query (default, term frequency + IDF)&lt;/li&gt;
&lt;li&gt;Match phrase query (exact phrase matching)&lt;/li&gt;
&lt;li&gt;Bool query with SHOULD/MUST/FILTER clauses&lt;/li&gt;
&lt;li&gt;Multi-match across multiple fields&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Field Boosting&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Title field: 2x boost (more important)&lt;/li&gt;
&lt;li&gt;Description field: 1x boost (baseline)&lt;/li&gt;
&lt;li&gt;Tags field: 0.5x boost (less important)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Query Parameters&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fuzziness (tolerating typos)&lt;/li&gt;
&lt;li&gt;Operator (AND vs. OR semantics)&lt;/li&gt;
&lt;li&gt;Minimum should match (for multi-clause queries)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Creating a Configuration
&lt;/h3&gt;

&lt;p&gt;Via the API (a conceptual example; the exact endpoint and schema may differ by plugin version):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;PUT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/search-config/_doc/config-v&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Default BM25"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Standard BM25 with field boosting"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"analyzer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"standard"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"query_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"multi_match"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"field_weights"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You typically create 2-3 configurations to compare:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Your baseline (current production)&lt;/li&gt;
&lt;li&gt;A proposed improvement (new analyzer or boosting strategy)&lt;/li&gt;
&lt;li&gt;(Optional) A radically different approach&lt;/li&gt;
&lt;/ol&gt;
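&lt;p&gt;As a sketch, a proposed-improvement configuration following the same illustrative schema as the example above (the endpoint and field names mirror this article's example, not necessarily the plugin's exact API) might look like:&lt;/p&gt;

```json
PUT /search-config/_doc/config-v2
{
  "name": "English analyzer + heavier title boost",
  "description": "Proposed improvement: stemming analyzer, stronger title weighting",
  "analyzer": "english",
  "query_type": "multi_match",
  "field_weights": {
    "title": 3.0,
    "description": 1.0,
    "tags": 0.5
  }
}
```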

&lt;h2&gt;
  
  
  Step 3: Run an Experiment
&lt;/h2&gt;

&lt;p&gt;Now the magic happens. You tell the plugin: "Compare config A vs. config B using my query set. Execute all queries against both and show me which is better."&lt;/p&gt;

&lt;p&gt;The plugin does this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Iterate through each query&lt;/strong&gt; in your query set&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execute the query&lt;/strong&gt; against your OpenSearch index using config A&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execute the same query&lt;/strong&gt; against your index using config B&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capture the results&lt;/strong&gt; (top-K documents for each)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Store everything&lt;/strong&gt; indexed and ready for judgment&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is an experiment: its status moves from CREATED -&amp;gt; RUNNING -&amp;gt; COMPLETED.&lt;/p&gt;

&lt;p&gt;Once completed, you have paired results: for each query, you can see side-by-side what config A returned vs. what config B returned.&lt;/p&gt;
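&lt;p&gt;The five steps above can be sketched in plain Python. Here &lt;code&gt;run_search&lt;/code&gt; is a hypothetical stand-in for whatever client call executes a configured query against your index:&lt;/p&gt;

```python
def run_experiment(query_set, config_a, config_b, run_search, k=10):
    """Execute every query under both configs and keep paired top-k results.

    run_search(config, query, k) is a hypothetical callable returning a
    ranked list of document IDs; swap in your real OpenSearch client.
    """
    results = []
    for query in query_set:
        results.append({
            "query": query,
            "config_a": run_search(config_a, query, k),
            "config_b": run_search(config_b, query, k),
        })
    return results  # stored pairs, ready for side-by-side judgment

# Toy search function so the sketch runs end to end without a cluster.
def fake_search(config, query, k):
    return [f"{config['name']}:{query}:doc{i}" for i in range(k)]
```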

&lt;h3&gt;
  
  
  Why This Matters
&lt;/h3&gt;

&lt;p&gt;You now have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reproducible, deterministic comparisons&lt;/li&gt;
&lt;li&gt;No randomness (same queries, same configs, same results every time)&lt;/li&gt;
&lt;li&gt;Side-by-side results for human evaluation&lt;/li&gt;
&lt;li&gt;A complete audit trail of what changed&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 4: Collect Judgments
&lt;/h2&gt;

&lt;p&gt;Here's where humans come in.&lt;/p&gt;

&lt;p&gt;Judgments are relevance labels. A human expert looks at a query and says: "This document is highly relevant" or "This document is not relevant."&lt;/p&gt;

&lt;p&gt;The Search Relevance plugin supports two types of judgments:&lt;/p&gt;

&lt;h3&gt;
  
  
  Explicit Judgments
&lt;/h3&gt;

&lt;p&gt;A human grades each query-document pair on a scale. Most common:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;0: Not relevant (wrong topic entirely)&lt;/li&gt;
&lt;li&gt;1: Somewhat relevant (tangentially related)&lt;/li&gt;
&lt;li&gt;2: Relevant (answers the query)&lt;/li&gt;
&lt;li&gt;3: Highly relevant (perfect match)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The plugin UI presents your experiment results (config A vs. B side-by-side) and lets judges assign these grades.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implicit Judgments
&lt;/h3&gt;

&lt;p&gt;Collect signals from user behavior:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Click-through rate (user clicked this result)&lt;/li&gt;
&lt;li&gt;Dwell time (user spent time reading this)&lt;/li&gt;
&lt;li&gt;Skip rate (user skipped this and clicked something lower)&lt;/li&gt;
&lt;/ul&gt;
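&lt;p&gt;Turning these signals into graded judgments can be as simple as bucketing click-through rates. The sketch below is a naive CTR-threshold model with made-up cut-offs; production click models (COEC, DBN, and friends) also correct for position bias:&lt;/p&gt;

```python
from bisect import bisect_right

def implicit_grades(click_log, min_impressions=10):
    """Map (query, doc) click statistics to rough 0-3 relevance grades."""
    grades = {}
    for (query, doc), stats in click_log.items():
        if stats["impressions"] in range(min_impressions):
            continue  # too few impressions to trust the CTR
        ctr = stats["clicks"] / stats["impressions"]
        # Hypothetical CTR cut-offs; bisect maps a ctr into grade 0..3.
        grades[(query, doc)] = bisect_right([0.02, 0.10, 0.30], ctr)
    return grades
```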

&lt;p&gt;For many teams, explicit judgments from a small pool of domain experts (5-20 people) are enough to get a strong signal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Compute Metrics
&lt;/h2&gt;

&lt;p&gt;Once you have judgments, the plugin computes relevance metrics:&lt;/p&gt;

&lt;h3&gt;
  
  
  nDCG (Normalized Discounted Cumulative Gain)
&lt;/h3&gt;

&lt;p&gt;This measures ranking quality. The intuition:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Relevant documents should rank high&lt;/li&gt;
&lt;li&gt;Position matters (higher positions worth more)&lt;/li&gt;
&lt;li&gt;Perfect ranking gets nDCG = 1.0&lt;/li&gt;
&lt;li&gt;A random ranking scores much lower (how much lower depends on the mix of relevant and irrelevant documents)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Formula (simplified):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;nDCG&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;IDCG&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;relevance_grade&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;log2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;position&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;When to use:&lt;/strong&gt; Almost always. This is the gold standard for ranking quality.&lt;/p&gt;
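&lt;p&gt;A minimal, runnable version of the computation (using the linear-gain variant shown above; many implementations use 2^grade - 1 gains instead):&lt;/p&gt;

```python
from math import log2

def ndcg(grades, k=10):
    """nDCG@k for a single query. grades are relevance labels (0-3)
    listed in the order the system ranked the documents."""
    def dcg(gs):
        # position is 1-based, so the discount is log2(position + 1)
        return sum(g / log2(pos + 2) for pos, g in enumerate(gs[:k]))
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal else 0.0
```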

&lt;h3&gt;
  
  
  Precision and Recall
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Precision:&lt;/strong&gt; What fraction of top-K results were relevant?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Precision@10 = (# relevant in top 10) / 10&lt;/li&gt;
&lt;li&gt;Good for: User experience (are the top results useful?)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Recall:&lt;/strong&gt; What fraction of relevant documents did we find?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Recall@100 = (# relevant in top 100) / (# relevant total)&lt;/li&gt;
&lt;li&gt;Good for: Comprehensiveness (did we find everything?)&lt;/li&gt;
&lt;/ul&gt;
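&lt;p&gt;Both metrics are one-liners once you have a ranked list and a set of known-relevant documents:&lt;/p&gt;

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return len(set(ranked[:k]).intersection(relevant)) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents found within the top-k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]).intersection(relevant)) / len(relevant)
```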

&lt;h3&gt;
  
  
  MRR (Mean Reciprocal Rank)
&lt;/h3&gt;

&lt;p&gt;The average, across queries, of the reciprocal rank (1/position) of the first relevant document.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MRR = average over queries of (1 / position of first relevant result)&lt;/li&gt;
&lt;li&gt;Good for: Cases where only the first result matters (navigation queries)&lt;/li&gt;
&lt;/ul&gt;
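&lt;p&gt;In code, MRR averages the reciprocal ranks (1/position of the first relevant hit) across queries:&lt;/p&gt;

```python
def mean_reciprocal_rank(runs):
    """runs is a list of (ranked_doc_ids, relevant_doc_id_set) pairs.
    Each query contributes 1/position of its first relevant result,
    or 0 if nothing relevant was returned."""
    def reciprocal_rank(ranked, relevant):
        for pos, doc in enumerate(ranked, start=1):
            if doc in relevant:
                return 1.0 / pos
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
```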

&lt;h3&gt;
  
  
  Which Metric to Track
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;nDCG&lt;/strong&gt;: Primary metric. Use nDCG@10 or nDCG@20 for most cases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Precision@K&lt;/strong&gt;: Secondary. Shows top-K quality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recall@K&lt;/strong&gt;: If comprehensiveness matters (e.g., search across entire document corpus).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MRR&lt;/strong&gt;: Only if navigational queries are critical.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 6: Iterate
&lt;/h2&gt;

&lt;p&gt;The experiment outputs metrics. You see that config B's nDCG is 0.78 while config A's is 0.72. Config B wins.&lt;/p&gt;

&lt;p&gt;Now what?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deploy config B.&lt;/strong&gt; Monitor search quality in production. But don't stop.&lt;/p&gt;

&lt;p&gt;Run the next experiment. Try:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A different analyzer&lt;/li&gt;
&lt;li&gt;Different field boosting&lt;/li&gt;
&lt;li&gt;Additional query logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Treat search quality like any other product: continuous improvement, guided by metrics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Putting It All Together: The Pipeline
&lt;/h2&gt;

&lt;p&gt;Here's the complete workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Design Query Set
   |
   v
2. Create Search Configurations (Baseline + Proposed)
   |
   v
3. Run Experiment (Execute queries for both configs)
   |
   v
4. Collect Judgments (Humans grade results)
   |
   v
5. Compute Metrics (nDCG, precision, recall, MRR)
   |
   v
6. Analyze Results (Which config wins? By how much?)
   |
   v
7. Deploy Winner &amp;amp; Monitor
   |
   v
8. Repeat (Design next experiment)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each cycle takes days to weeks (depending on judgment collection speed). But you're grounded in data, not guesses.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Tips
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start small.&lt;/strong&gt; 50 queries, 2 configurations, 10 judges. Run the pipeline end-to-end. You'll learn what works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Judge consistently.&lt;/strong&gt; Train your judges. Create a judgment guide. Have judges re-evaluate a subset for inter-rater agreement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Track over time.&lt;/strong&gt; Keep historical metrics. Did nDCG improve? By how much? This builds confidence that you're moving in the right direction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Combine signals.&lt;/strong&gt; Use nDCG as your primary metric, but also check precision, recall, and MRR. Sometimes improvements in one metric hurt another.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automate where possible.&lt;/strong&gt; Explicit judgments require humans, but experiment execution, metric computation, and result analysis should all be automated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Version everything.&lt;/strong&gt; Query sets, configurations, experiments, judgments. Treat them like code: track versions, enable reproducibility, enable rollback.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges You'll Face
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Judgment burden.&lt;/strong&gt; Grading 100 queries x 10 results per query = 1,000 judgments. At 30 seconds per judgment, that's over 8 hours of focused work. Parallelize across judges. Use inter-rater agreement to validate quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query set quality.&lt;/strong&gt; A bad query set produces meaningless metrics. Spend time upfront building representative queries. Validate with users or domain experts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Config comparison fairness.&lt;/strong&gt; Make sure both configs query the same data, same index, same relevance judgments. Isolate variables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Metric interpretation.&lt;/strong&gt; A 0.02 improvement in nDCG might be noise or might be significant. Track confidence intervals. Run multiple rounds.&lt;/p&gt;
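&lt;p&gt;One way to check whether a small nDCG delta is noise is a paired bootstrap over per-query differences. A minimal sketch (the 2,000-round resample count and fixed seed are arbitrary choices):&lt;/p&gt;

```python
import random

def bootstrap_ci(deltas, rounds=2000, alpha=0.05, seed=42):
    """Bootstrap confidence interval for the mean per-query metric
    delta (config B minus config A). If the interval excludes 0, the
    improvement is unlikely to be noise."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(deltas, k=len(deltas))) / len(deltas)
        for _ in range(rounds)
    )
    return means[int(rounds * alpha / 2)], means[int(rounds * (1 - alpha / 2))]
```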

&lt;h2&gt;
  
  
  Real-World Example
&lt;/h2&gt;

&lt;p&gt;Let's say you run e-commerce search. Your current baseline achieves nDCG@10 = 0.68. Users complain that size/color variants aren't matching well.&lt;/p&gt;

&lt;p&gt;You hypothesize: "If I boost the size and color fields more, users will find exact matches faster."&lt;/p&gt;

&lt;p&gt;You create a new config with aggressive field boosting on size and color. Run an experiment with 100 queries and 15 judges.&lt;/p&gt;

&lt;p&gt;Results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Config A (baseline): nDCG@10 = 0.68, Precision@10 = 0.72&lt;/li&gt;
&lt;li&gt;Config B (boosted): nDCG@10 = 0.71, Precision@10 = 0.75&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Config B wins. You deploy it. Users are happier. Search quality improved by 4.4%.&lt;/p&gt;

&lt;p&gt;Next experiment: Can we improve recall without hurting precision? Try a different query operator. Repeat.&lt;/p&gt;

&lt;p&gt;Over 6 months, you've compounded these improvements. Your nDCG went from 0.68 to 0.79. That's real impact, measured and reproducible.&lt;/p&gt;
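&lt;p&gt;Those percentages are plain relative deltas against the baseline:&lt;/p&gt;

```python
def relative_improvement(baseline, candidate):
    """Improvement of candidate over baseline, as a percentage."""
    return (candidate - baseline) / baseline * 100

first_experiment = relative_improvement(0.68, 0.71)  # the single A/B win
six_month_total = relative_improvement(0.68, 0.79)   # the compounded wins
```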

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Install OpenSearch Search Relevance plugin.&lt;/strong&gt; Follow the official docs: &lt;a href="https://github.com/opensearch-project/search-relevance" rel="noopener noreferrer"&gt;https://github.com/opensearch-project/search-relevance&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create your first query set.&lt;/strong&gt; Start with 50-100 queries. Validate with users.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set up 2 configurations.&lt;/strong&gt; Baseline (current) and one proposed improvement.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run an experiment.&lt;/strong&gt; Let it complete. Study the results.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collect judgments.&lt;/strong&gt; Have 5-10 domain experts grade a subset of results.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute metrics.&lt;/strong&gt; Let the plugin calculate nDCG, precision, recall, MRR.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analyze.&lt;/strong&gt; Which config won? By how much? Is it statistically significant?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iterate.&lt;/strong&gt; Deploy the winner, then design the next experiment.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Search quality doesn't happen by accident. It's engineered, measured, and continuously improved.&lt;/p&gt;

&lt;p&gt;The OpenSearch Search Relevance plugin gives you the infrastructure to do this systematically. Query sets, configurations, experiments, judgments, metrics. All connected. All reproducible.&lt;/p&gt;

&lt;p&gt;If you're operating search at scale, you owe it to your users to build this pipeline. Start small. Run your first experiment this week. Measure. Iterate. Improve.&lt;/p&gt;

&lt;p&gt;Your search results will thank you. So will your users.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;About the Author&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I'm Prithvi S, Staff Software Engineer at Cloudera and Open Source Enthusiast. Follow my work on GitHub: &lt;a href="https://github.com/iprithv" rel="noopener noreferrer"&gt;https://github.com/iprithv&lt;/a&gt;&lt;/p&gt;

</description>
      <category>opensearch</category>
      <category>search</category>
      <category>database</category>
      <category>data</category>
    </item>
    <item>
      <title>How Lucene Executes a Boolean Query: The Hidden Optimization Layer</title>
      <dc:creator>Prithvi S</dc:creator>
      <pubDate>Sat, 18 Apr 2026 00:32:12 +0000</pubDate>
      <link>https://dev.to/iprithv/how-lucene-executes-a-boolean-query-the-hidden-optimization-layer-546f</link>
      <guid>https://dev.to/iprithv/how-lucene-executes-a-boolean-query-the-hidden-optimization-layer-546f</guid>
      <description>&lt;p&gt;When you run a search query against Elasticsearch, Solr, or any Lucene-powered system, something remarkable happens under the hood. That simple boolean query you wrote gets transformed, optimized, and executed through a sophisticated pipeline designed to return results in milliseconds. But most developers never see inside that black box. Let's change that.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Journey: From Query to Results
&lt;/h2&gt;

&lt;p&gt;Every time you execute a boolean query like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(status:active AND (category:tech OR category:science)) NOT spam:true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lucene doesn't just naively search for documents matching each condition. Instead, it performs a series of intelligent transformations and optimizations before a single byte is read from disk. Understanding this journey is the key to writing efficient queries and debugging slow search performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 1: Boolean Query Structure - The Basics
&lt;/h2&gt;

&lt;p&gt;A BooleanQuery in Lucene is built from clauses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;BooleanQuery&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Builder&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;BooleanQuery&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Builder&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;TermQuery&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Term&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"status"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"active"&lt;/span&gt;&lt;span class="o"&gt;)),&lt;/span&gt; &lt;span class="nc"&gt;BooleanClause&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Occur&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;MUST&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;TermQuery&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Term&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"category"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"tech"&lt;/span&gt;&lt;span class="o"&gt;)),&lt;/span&gt; &lt;span class="nc"&gt;BooleanClause&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Occur&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;SHOULD&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;TermQuery&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Term&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"spam"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"true"&lt;/span&gt;&lt;span class="o"&gt;)),&lt;/span&gt; &lt;span class="nc"&gt;BooleanClause&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Occur&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;MUST_NOT&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;BooleanQuery&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The four clause types create different semantics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MUST&lt;/strong&gt;: Document must match (hard constraint)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SHOULD&lt;/strong&gt;: Document may match (soft constraint, boosts score)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MUST_NOT&lt;/strong&gt;: Document must not match (hard constraint)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FILTER&lt;/strong&gt;: Document must match, but scoring is skipped&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The FILTER clause is often overlooked, but it's one of Lucene's most powerful optimizations for production workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 2: Query Rewriting - The Invisible Optimization
&lt;/h2&gt;

&lt;p&gt;Before anything else happens, BooleanQuery goes through a &lt;strong&gt;rewriting phase&lt;/strong&gt;. This is where Lucene applies transformations to simplify and optimize the query tree.&lt;/p&gt;

&lt;p&gt;Example transformations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A MUST_NOT clause with nothing else becomes impossible (returns no results)&lt;/li&gt;
&lt;li&gt;Single SHOULD clauses in a pure-SHOULD query become OR logic&lt;/li&gt;
&lt;li&gt;Nested BooleanQueries get flattened&lt;/li&gt;
&lt;li&gt;Queries with all FILTER clauses get optimized for constant-score execution
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Query&lt;/span&gt; &lt;span class="n"&gt;rewritten&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;rewrite&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;searcher&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getIndexReader&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The rewritten query may look completely different from the original. This is where Lucene applies domain knowledge about query structures to reduce complexity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 3: Weight Creation - Binding Query to Index Statistics
&lt;/h2&gt;

&lt;p&gt;After rewriting, a Weight object is created. This is the marriage between your query and the actual index data. The Weight object:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Examines index statistics&lt;/strong&gt; - How many documents contain each term? What's the frequency distribution?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Computes normalization factors&lt;/strong&gt; - These will be used during scoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prepares scoring parameters&lt;/strong&gt; - BM25 k1, b values; query boosts; field boosts
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Weight&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createWeight&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;searcher&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;ScoreMode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;TOP_SCORES&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0f&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is where Lucene decides: "Given that 'active' appears in 50% of documents but 'quantum-physics' appears in only 12 documents, how should I weight these terms when scoring?"&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 4: The Scorer Tree - Building the Execution Plan
&lt;/h2&gt;

&lt;p&gt;Now comes the magic. For each segment in the index, Lucene constructs a &lt;strong&gt;Scorer tree&lt;/strong&gt; - a hierarchical structure that knows how to efficiently find matching documents.&lt;/p&gt;

&lt;p&gt;For a query like &lt;code&gt;(status:active AND category:tech) OR category:science&lt;/code&gt;, Lucene builds something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;          DisjunctionScorer (OR)
         /                    \
   ConjunctionScorer (AND)   TermScorer(category:science)
   /                      \
 TermScorer(status:active)  TermScorer(category:tech)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each node in this tree knows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to advance to the next matching document&lt;/li&gt;
&lt;li&gt;How to compute scores for matching documents&lt;/li&gt;
&lt;li&gt;How to skip documents that can't possibly match&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This tree structure is critical. The order of operations and the types of scorers used determine query performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 5: Conjunction Optimization - The AND is Expensive
&lt;/h2&gt;

&lt;p&gt;Here's a critical insight: finding documents that match &lt;strong&gt;both&lt;/strong&gt; terms (AND/MUST) is expensive. You need to find the intersection of two postings lists.&lt;/p&gt;

&lt;p&gt;Lucene uses several optimizations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Selectivity-Aware Ordering&lt;/strong&gt;&lt;br&gt;
Lucene reorders conjunctions to start with the most selective term first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Instead of: status:active AND rare_term:value&lt;/span&gt;
&lt;span class="c1"&gt;// Lucene does: rare_term:value AND status:active&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;rare_term:value&lt;/code&gt; matches only 5 documents, why waste time iterating through the 50% of documents matching &lt;code&gt;status:active&lt;/code&gt;? Start with the rare term.&lt;/p&gt;
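&lt;p&gt;The effect of selectivity-aware ordering plus skip pointers can be illustrated with a toy Python intersection. This is a sketch of the idea, not Lucene's actual implementation; binary search plays the role of the skip list:&lt;/p&gt;

```python
from bisect import bisect_left

def intersect(rare, common):
    """Intersect two sorted postings lists of doc IDs, driving iteration
    with the rarer term and skipping through the longer list."""
    hits = []
    for doc in rare:                      # few iterations: selective term
        j = bisect_left(common, doc)      # skip ahead, not a linear scan
        if j != len(common) and common[j] == doc:
            hits.append(doc)
    return hits
```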

&lt;p&gt;&lt;strong&gt;2. Two-Phase Iteration&lt;/strong&gt;&lt;br&gt;
For expensive predicates (like positions or more complex matching), Lucene uses a two-phase approach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Phase 1: Approximate match - "Could this doc match?"
Phase 2: Exact match - "Does this doc actually match?"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This avoids computing expensive term positions for documents that will be filtered out anyway.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Skipping Strategy&lt;/strong&gt;&lt;br&gt;
Postings lists store skip pointers that allow jumping over blocks of documents. If you're looking for docID 10000 and the current posting is at docID 1000, you can skip forward without examining every document in between.&lt;/p&gt;
&lt;h2&gt;
  
  
  Part 6: Disjunction and WAND - The OR Optimization
&lt;/h2&gt;

&lt;p&gt;Disjunctions (OR/SHOULD) are fundamentally different from conjunctions. You don't need to match all clauses; you just need to match at least one.&lt;/p&gt;

&lt;p&gt;For simple disjunctions on small result sets, Lucene just iterates through all matching documents. But when you're returning top-10 results from millions of documents, Lucene applies the &lt;strong&gt;Weak AND (WAND)&lt;/strong&gt; algorithm.&lt;/p&gt;

&lt;p&gt;The WAND algorithm is elegant:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Maintain a threshold&lt;/strong&gt; - The current minimum score of documents we're keeping&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skip high-cost scorers&lt;/strong&gt; - If a scorer can't possibly produce a score above our threshold, skip it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Update the threshold&lt;/strong&gt; - As we collect results, raise it&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This means for a query like &lt;code&gt;(category:tech OR category:science OR category:business)&lt;/code&gt;, Lucene doesn't score every document in those three categories. It:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Starts with documents from the most selective category&lt;/li&gt;
&lt;li&gt;Uses block-level maximum scores to skip large blocks that can't reach the threshold&lt;/li&gt;
&lt;li&gt;Only evaluates expensive scoring logic when necessary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For queries with 100+ SHOULD clauses (common in recommendation systems), WAND can reduce scoring work by 90%.&lt;/p&gt;
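&lt;p&gt;The threshold mechanics can be shown with a toy simulation. This sketch skips a whole scorer at once; Lucene's block-max WAND skips at block granularity inside each postings list, but the principle is the same:&lt;/p&gt;

```python
import heapq
from operator import lt  # "less than" as a function

def wand_top_scores(scorers, k):
    """scorers is a list of (max_possible_score, list_of_doc_scores),
    best ceiling first. A min-heap of the k best scores seen so far acts
    as the threshold; a scorer whose ceiling is below it is skipped
    without scoring a single document."""
    top, scored = [], 0
    for max_score, scores in scorers:
        if len(top) == k and lt(max_score, top[0]):
            continue                       # ceiling can't beat threshold
        for s in scores:
            scored += 1
            if len(top) == k:
                heapq.heappushpop(top, s)  # keep only the best k
            else:
                heapq.heappush(top, s)
    return sorted(top, reverse=True), scored
```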
&lt;h2&gt;
  
  
  Part 7: The FILTER Clause Secret - Performance Magic
&lt;/h2&gt;

&lt;p&gt;This is where many developers miss an optimization opportunity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Slow - scoring happens for all status:active documents&lt;/span&gt;
&lt;span class="nc"&gt;BooleanQuery&lt;/span&gt; &lt;span class="n"&gt;slow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;BooleanQuery&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;TermQuery&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Term&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"status"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"active"&lt;/span&gt;&lt;span class="o"&gt;)),&lt;/span&gt; &lt;span class="nc"&gt;BooleanClause&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Occur&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;MUST&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;TermQuery&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Term&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"content"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"machine learning"&lt;/span&gt;&lt;span class="o"&gt;)),&lt;/span&gt; &lt;span class="nc"&gt;BooleanClause&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Occur&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;MUST&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// Fast - filtering skips scoring for the status check&lt;/span&gt;
&lt;span class="nc"&gt;BooleanQuery&lt;/span&gt; &lt;span class="n"&gt;fast&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;BooleanQuery&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;TermQuery&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Term&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"status"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"active"&lt;/span&gt;&lt;span class="o"&gt;)),&lt;/span&gt; &lt;span class="nc"&gt;BooleanClause&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Occur&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;FILTER&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;TermQuery&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Term&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"content"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"machine learning"&lt;/span&gt;&lt;span class="o"&gt;)),&lt;/span&gt; &lt;span class="nc"&gt;BooleanClause&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Occur&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;MUST&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The difference: FILTER clauses are executed as &lt;strong&gt;constant-score queries&lt;/strong&gt;. Lucene doesn't compute BM25 scores for the filtering predicate - it just checks "matches or doesn't match" and moves on. All the BM25 scoring effort goes to your actual ranked clause.&lt;/p&gt;

&lt;p&gt;If you have a clause that's just a hard constraint (like &lt;code&gt;status:active&lt;/code&gt; for filtering), use FILTER instead of MUST.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 8: Early Termination - Stopping Before You Read Everything
&lt;/h2&gt;

&lt;p&gt;Here's a subtle but important optimization. When you ask for top-10 results, does Lucene really score every document in the index?&lt;/p&gt;

&lt;p&gt;No. It uses &lt;strong&gt;early termination&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The trick: Lucene tracks the "worst of the best" scores it's collected so far. Once a scorer indicates "I cannot produce any score higher than X for the remaining documents", Lucene can stop.&lt;/p&gt;

&lt;p&gt;This works because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Postings lists are sorted by document ID (not score)&lt;/li&gt;
&lt;li&gt;Block-level statistics tell Lucene the maximum possible score in each block&lt;/li&gt;
&lt;li&gt;If that maximum is lower than our threshold, skip the entire block&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For large indexes with millions of documents, early termination can cut the fraction of documents actually scored from 100% down to a few percent.&lt;/p&gt;
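
&lt;p&gt;The mechanics can be sketched in a few lines of plain Java. This is a toy model of block-max skipping, not Lucene's actual implementation - the block layout, scores, and method names are invented for illustration:&lt;/p&gt;

```java
// Toy sketch of block-max skipping, not real Lucene internals: each postings
// block records the best score it could contribute. Once the top-k set is
// full, blocks whose best score cannot beat the current k-th score are
// skipped without scoring any of their documents.
public class BlockMaxSketch {

    // Scores blocks of documents, skipping blocks that cannot improve the
    // top-k. blockMax[i] is the max possible score in block i; blockScores[i]
    // holds that block's per-document scores; topK must have length k.
    // Returns the number of documents that were actually scored.
    public static int collectTopK(float[] blockMax, float[][] blockScores, float[] topK) {
        java.util.Arrays.fill(topK, Float.NEGATIVE_INFINITY);
        int scored = 0;
        for (int i = 0; i != blockMax.length; i++) {
            // "Worst of the best": smallest score currently in topK.
            float threshold = topK[0];
            if (threshold >= blockMax[i]) continue; // skip the whole block
            for (float s : blockScores[i]) {
                scored++;
                if (s > topK[0]) {               // beats the current worst
                    topK[0] = s;
                    java.util.Arrays.sort(topK); // keep worst at index 0
                }
            }
        }
        return scored;
    }

    public static void main(String[] args) {
        float[] blockMax = { 9.0f, 2.0f, 3.0f };
        float[][] blockScores = {
            { 9.0f, 8.5f, 7.0f },
            { 2.0f, 1.5f },   // skipped: max 2.0 can't beat the threshold 7.0
            { 3.0f, 0.5f }    // skipped: max 3.0 can't beat the threshold 7.0
        };
        float[] topK = new float[3];
        int scored = collectTopK(blockMax, blockScores, topK);
        System.out.println("documents scored: " + scored); // prints 3, not 7
    }
}
```

&lt;p&gt;Only the first block's documents are ever scored; the other two are rejected on their block metadata alone, which is exactly the effect described above.&lt;/p&gt;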

&lt;h2&gt;
  
  
  Part 9: Real Code Example - Understanding explain()
&lt;/h2&gt;

&lt;p&gt;Let's trace a real query using Lucene's explain() API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Query&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;BooleanQuery&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;TermQuery&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Term&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"title"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"machine learning"&lt;/span&gt;&lt;span class="o"&gt;)),&lt;/span&gt; &lt;span class="nc"&gt;BooleanClause&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Occur&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;MUST&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;TermQuery&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Term&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"tags"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"python"&lt;/span&gt;&lt;span class="o"&gt;)),&lt;/span&gt; &lt;span class="nc"&gt;BooleanClause&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Occur&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;SHOULD&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

&lt;span class="nc"&gt;Explanation&lt;/span&gt; &lt;span class="n"&gt;explanation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;searcher&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;explain&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;docID&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;explanation&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;10.5 = score(doc=42)
  10.2 = BM25 score from title:machine learning
    title occurrence frequency (2 times)
    inverse document frequency
    field length normalization
  0.3 = BM25 score from tags:python
    tags occurrence frequency (1 time)
    lower IDF (python is common)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reading this output shows you exactly how Lucene computed the score. You'll see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which clauses contributed to the score&lt;/li&gt;
&lt;li&gt;Whether term frequency or field length dominated&lt;/li&gt;
&lt;li&gt;How IDF affected ranking&lt;/li&gt;
&lt;li&gt;Whether MUST vs SHOULD affected the calculation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is your debugging tool when a query returns unexpected ordering.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 10: Production Pitfalls and Debugging
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Pitfall 1: Inefficient Query Ordering&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Bad: rare term second&lt;/span&gt;
&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nx"&gt;active&lt;/span&gt; &lt;span class="nc"&gt;AND &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;rarely_updated&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// Good: rare term first  &lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;rarely_updated&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;AND&lt;/span&gt; &lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nx"&gt;active&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lucene tries to optimize this automatically, but explicit ordering in your application layer helps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pitfall 2: Too Many SHOULD Clauses Without Minimum Match&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Bad: 100 SHOULD clauses, each one scores every document&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nl"&gt;category:&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="no"&gt;OR&lt;/span&gt; &lt;span class="nl"&gt;category:&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="no"&gt;OR&lt;/span&gt; &lt;span class="nl"&gt;category:&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="n"&gt;x100&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// Good: use minimum match or filtering&lt;/span&gt;
&lt;span class="n"&gt;setMinimumNumberShouldMatch&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pitfall 3: Scoring When You Only Need Filtering&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Bad: scoring overhead for a filter&lt;/span&gt;
&lt;span class="nl"&gt;status:&lt;/span&gt;&lt;span class="n"&gt;active&lt;/span&gt; &lt;span class="nf"&gt;AND&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nl"&gt;premium:&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// Good: use FILTER&lt;/span&gt;
&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="nl"&gt;status:&lt;/span&gt;&lt;span class="n"&gt;active&lt;/span&gt; &lt;span class="o"&gt;+(&lt;/span&gt;&lt;span class="nl"&gt;premium:&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pitfall 4: Expensive Predicates in Conjunctions&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Bad: phrase queries are expensive, put them last&lt;/span&gt;
&lt;span class="s"&gt;"exact phrase match"&lt;/span&gt; &lt;span class="no"&gt;AND&lt;/span&gt; &lt;span class="nl"&gt;common_term:&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;

&lt;span class="c1"&gt;// Good: filter first, then expensive query&lt;/span&gt;
&lt;span class="nl"&gt;common_term:&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="no"&gt;AND&lt;/span&gt; &lt;span class="s"&gt;"exact phrase match"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Part 11: Visualizing the Execution
&lt;/h2&gt;

&lt;p&gt;When you understand query execution, you start thinking about queries differently:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Most selective terms first&lt;/strong&gt; - Filter out documents early&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hard constraints as FILTER&lt;/strong&gt; - Avoid wasted scoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expensive predicates last&lt;/strong&gt; - Only evaluate when necessary&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch clause count&lt;/strong&gt; - scoring cost grows with every SHOULD clause, and Lucene rejects queries over its clause limit (1024 by default)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Use tools like Elasticsearch's &lt;code&gt;profile&lt;/code&gt; API to see exactly what happened:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"profile"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"shards"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"searches"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"BooleanQuery"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"(status:active AND content:machine learning)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"time_in_nanos"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1250000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"breakdown"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;450000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"build_scorer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;350000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"create_weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"advance"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;250000&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion: Think Like Lucene
&lt;/h2&gt;

&lt;p&gt;Understanding how Lucene executes boolean queries transforms you from someone who writes queries to someone who can reason about their performance.&lt;/p&gt;

&lt;p&gt;Key takeaways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Rewriting is invisible but powerful&lt;/strong&gt; - Your query gets transformed before execution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scorer trees represent execution plans&lt;/strong&gt; - More selective terms should be first&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WAND optimizes disjunctions&lt;/strong&gt; - even queries with dozens of SHOULD clauses can skip most non-competitive documents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FILTER beats MUST for hard constraints&lt;/strong&gt; - Skip unnecessary scoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Early termination matters at scale&lt;/strong&gt; - Top-10 doesn't mean score all million documents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools like explain() are your window inside&lt;/strong&gt; - Always inspect queries that surprise you&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The next time you write a search query, picture the scorer tree Lucene will build. Ask yourself: "Is this tree optimized for my workload?" &lt;/p&gt;

&lt;p&gt;That's thinking like Lucene.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Further Reading:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lucene Query Parser syntax&lt;/li&gt;
&lt;li&gt;BM25 scoring model deep dive&lt;/li&gt;
&lt;li&gt;Elasticsearch query profiling guide&lt;/li&gt;
&lt;li&gt;Custom Query implementations in Lucene&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Have you debugged a slow query by understanding its scorer tree? Share your experience in the comments.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;About the Author&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I'm Prithvi S, Staff Software Engineer at Cloudera and Open Source Enthusiast. I work on data systems, LLM-powered applications, and large-scale architectures. Follow my work on GitHub: &lt;a href="https://github.com/iprithv" rel="noopener noreferrer"&gt;https://github.com/iprithv&lt;/a&gt;&lt;/p&gt;

</description>
      <category>lucene</category>
      <category>search</category>
      <category>indexing</category>
      <category>performance</category>
    </item>
    <item>
      <title>The Metadata Tree: How Apache Iceberg Finds the Right Files Without a Database</title>
      <dc:creator>Prithvi S</dc:creator>
      <pubDate>Fri, 17 Apr 2026 00:32:40 +0000</pubDate>
      <link>https://dev.to/iprithv/the-metadata-tree-how-apache-iceberg-finds-the-right-files-without-a-database-hb1</link>
      <guid>https://dev.to/iprithv/the-metadata-tree-how-apache-iceberg-finds-the-right-files-without-a-database-hb1</guid>
      <description>&lt;p&gt;When you're managing petabytes of data across hundreds of machines, every millisecond matters. Most data engineers assume you need a separate metadata system to keep track of what's where. Iceberg proves you don't. Instead, it bakes metadata intelligence directly into the file format itself, creating a self-describing hierarchy that can scale to massive datasets without external dependencies.&lt;/p&gt;

&lt;p&gt;This is the story of how Iceberg's metadata architecture works, why Netflix designed it this way, and what it means for your data platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Metadata Hell at Scale
&lt;/h2&gt;

&lt;p&gt;Let's start with what happens in traditional data lakes.&lt;/p&gt;

&lt;p&gt;You have files on HDFS or cloud storage. Lots of them. Thousands. Millions. Every time someone runs a query, the query engine needs to answer three questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Which files contain data relevant to this query?&lt;/li&gt;
&lt;li&gt;What schema does each file have?&lt;/li&gt;
&lt;li&gt;What values are in each column (for filtering)?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In Hive or standard Parquet setups, you usually solve this with an external metadata store. Maybe it's a Hive metastore (which is just a relational database). You track table locations, schemas, partitions, and statistics in that database. When you write new data, you update the database. When you read, you query the database first.&lt;/p&gt;

&lt;p&gt;This approach has real problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;External dependency&lt;/strong&gt;: Your data isn't independent. The metadata store becomes a single point of failure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistency issues&lt;/strong&gt;: What happens when a write succeeds on HDFS but the metadata update fails? You now have files with no catalog entry.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: At petabyte scale, metadata queries become bottlenecks. A single database can't efficiently track billions of files.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time travel is hard&lt;/strong&gt;: Historical metadata isn't naturally preserved. Rollback operations require custom logic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema evolution breaks&lt;/strong&gt;: Renaming columns means updating the entire catalog and potentially invalidating downstream queries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Iceberg's answer: What if metadata lived with the data, not in a separate system?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Metadata Hierarchy: Five Layers of Intelligence
&lt;/h2&gt;

&lt;p&gt;Iceberg organizes metadata in a strict hierarchy. Each layer builds on the one below it, creating a chain from the catalog pointer all the way down to individual data files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Catalog (mutable pointer)
    ↓
Metadata File (JSON, immutable)
    ↓
Manifest List (immutable)
    ↓
Manifests (immutable)
    ↓
Data Files (immutable)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's walk through each layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: The Catalog (The Single Mutable Piece)
&lt;/h3&gt;

&lt;p&gt;The catalog is the entry point to your table. It's a simple pointer: "The current state of this table is defined by this metadata file."&lt;/p&gt;

&lt;p&gt;That's it. Just a reference. And this reference is the only thing that ever changes.&lt;/p&gt;

&lt;p&gt;When you execute a write operation, Iceberg doesn't modify the existing metadata. Instead, it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Creates a new metadata file describing the new table state&lt;/li&gt;
&lt;li&gt;Creates new manifest files describing new snapshots&lt;/li&gt;
&lt;li&gt;Writes new data files&lt;/li&gt;
&lt;li&gt;Atomically updates the catalog pointer using compare-and-swap (CAS) logic&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If two writers collide, one's update is rejected, it retries, and the conflict resolves cleanly. This is optimistic concurrency control, and it's elegant because there's only one pointer to update.&lt;/p&gt;
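
&lt;p&gt;The whole protocol fits in a few lines of plain Java using an &lt;code&gt;AtomicReference&lt;/code&gt;. This is a toy model of the commit flow, not Iceberg's catalog code - the metadata file names are invented:&lt;/p&gt;

```java
import java.util.concurrent.atomic.AtomicReference;

// Toy sketch of the commit protocol, not Iceberg's real implementation: the
// catalog is a single mutable pointer to the current metadata file, and a
// commit swaps it with compare-and-swap. A writer whose CAS fails re-reads
// the new state and retries on top of it.
public class CatalogPointer {
    private final AtomicReference pointer; // current metadata file path

    public CatalogPointer(String initial) { pointer = new AtomicReference(initial); }

    // A commit succeeds only if the pointer still equals the metadata file
    // the writer based its new snapshot on.
    public boolean commit(String basedOn, String newMetadataFile) {
        return pointer.compareAndSet(basedOn, newMetadataFile);
    }

    public String current() { return (String) pointer.get(); }

    public static void main(String[] args) {
        CatalogPointer catalog = new CatalogPointer("metadata/v1.json");
        // Two writers both read v1, then race to commit.
        boolean w1 = catalog.commit("metadata/v1.json", "metadata/v2-writer1.json");
        boolean w2 = catalog.commit("metadata/v1.json", "metadata/v2-writer2.json");
        System.out.println(w1 + " " + w2); // true false: writer 2 lost the race
        // Writer 2 re-reads the new state and retries on top of it.
        boolean retry = catalog.commit(catalog.current(), "metadata/v3-writer2.json");
        System.out.println(retry);         // true
    }
}
```

&lt;p&gt;Because the pointer is the only mutable piece, the losing writer's data and manifest files are simply unreferenced; nothing half-written ever becomes visible.&lt;/p&gt;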

&lt;p&gt;The catalog itself is usually a simple file or an HTTP endpoint. For file-based catalogs, it's literally a pointer to a location in cloud storage. For REST catalogs (used by distributed systems like Polaris), it's an HTTP service that handles the pointer updates.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 2: The Metadata File (The Table Schema)
&lt;/h3&gt;

&lt;p&gt;Every commit writes exactly one new metadata file: a JSON document that describes the entire table at that point in time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Schema&lt;/strong&gt;: Column names, data types, field IDs (more on field IDs later)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partition spec&lt;/strong&gt;: How data is partitioned (by year? month? day?)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Current snapshot ID&lt;/strong&gt;: Which snapshot is currently active&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snapshot history&lt;/strong&gt;: All previous snapshots and their timestamps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Table properties&lt;/strong&gt;: Custom metadata, format version, sort order&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This metadata file is immutable. Once written, it never changes. Every time you write data, a new metadata file is created with an updated snapshot list.&lt;/p&gt;

&lt;p&gt;Why immutability matters: Your entire table history is preserved. You can query any previous snapshot by ID. You can time-travel to last Tuesday. You can audit exactly what changed and when.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 3: The Manifest List (The Snapshot View)
&lt;/h3&gt;

&lt;p&gt;A manifest list is a file that describes all the manifests for a single snapshot.&lt;/p&gt;

&lt;p&gt;Think of it as a table of contents: "This snapshot contains these manifests, which collectively describe all the data files in the table at this point in time."&lt;/p&gt;

&lt;p&gt;The manifest list includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;References to all manifest files&lt;/li&gt;
&lt;li&gt;Partition specs for each manifest&lt;/li&gt;
&lt;li&gt;Min/max values for each partition (for pruning)&lt;/li&gt;
&lt;li&gt;File counts and row counts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When a query engine wants to scan a snapshot, it reads the manifest list first. This allows it to prune entire manifests before reading individual data files.&lt;/p&gt;

&lt;p&gt;Why this level of indirection? It allows Iceberg to handle large tables elegantly. Instead of one giant list of all files, you have a list of manifests. If you have 10 million data files, you might have 1000 manifest files, and one manifest list. The query engine can prune at the manifest level before drilling into individual files.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 4: Manifests (The File Inventory)
&lt;/h3&gt;

&lt;p&gt;Each manifest is a file that lists a set of data files. It includes metadata about each file:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;File path&lt;/strong&gt;: Location of the data file&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File format&lt;/strong&gt;: Parquet, ORC, Avro&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Row count&lt;/strong&gt;: How many rows in this file&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Byte size&lt;/strong&gt;: File size&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Column statistics&lt;/strong&gt;: Min, max, null count per column&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partition values&lt;/strong&gt;: What partition this file belongs to&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Manifests are where the real filtering magic happens. Before a query engine reads a single data file, it can:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Check partition values: "Do I even need this file for this query?"&lt;/li&gt;
&lt;li&gt;Check column statistics: "Are all values in this column outside my filter range?"&lt;/li&gt;
&lt;li&gt;Prune aggressively: Skip entire files without touching them&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For example, if you query &lt;code&gt;WHERE year = 2024 AND user_id &amp;gt; 1000000&lt;/code&gt;, the manifest can tell you instantly which files have relevant data. No scanning required.&lt;/p&gt;
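
&lt;p&gt;That pruning step can be sketched in plain Java. This is a toy model of manifest-style min/max filtering, not Iceberg's actual API - the &lt;code&gt;FileEntry&lt;/code&gt; shape and file names are invented:&lt;/p&gt;

```java
// Toy sketch of manifest pruning, not Iceberg's real data structures: each
// manifest entry carries per-file min/max stats for one column, so files
// whose value range cannot satisfy the filter are skipped without being opened.
public class ManifestPruning {

    // Hypothetical manifest entry: file path plus min/max for one column.
    public record FileEntry(String path, long min, long max) {}

    // Keep only files whose [min, max] range can contain values above bound
    // (i.e. the pruning step for a predicate like "user_id > bound").
    public static java.util.List keepFilesWithValuesAbove(FileEntry[] entries, long bound) {
        java.util.List kept = new java.util.ArrayList();
        for (FileEntry e : entries) {
            if (e.max() > bound) kept.add(e.path()); // range may overlap, must read
        }
        return kept;
    }

    public static void main(String[] args) {
        FileEntry[] manifest = {
            new FileEntry("data/f1.parquet", 1, 500_000),
            new FileEntry("data/f2.parquet", 900_000, 2_000_000),
            new FileEntry("data/f3.parquet", 100, 999_999)
        };
        // WHERE user_id > 1000000: only f2 can possibly match.
        System.out.println(keepFilesWithValuesAbove(manifest, 1_000_000));
    }
}
```

&lt;p&gt;Two of the three files are eliminated from the scan using statistics alone - the engine never opens them.&lt;/p&gt;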

&lt;h3&gt;
  
  
  Layer 5: Data Files (The Actual Data)
&lt;/h3&gt;

&lt;p&gt;Finally, at the bottom, you have the data files themselves. These are standard Parquet, ORC, or Avro files. They contain the actual table data.&lt;/p&gt;

&lt;p&gt;The key point: They're immutable. Once written, they never change. All mutations (updates, deletes) are handled higher in the stack through delete files and new snapshot creation.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the Hierarchy Enables Architecture Without Databases
&lt;/h2&gt;

&lt;p&gt;Now you see the full picture. Let's connect it back to the original problem.&lt;/p&gt;

&lt;p&gt;With this hierarchy, Iceberg can answer all three questions without an external metadata store:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Which files are relevant?&lt;/strong&gt; Walk down from the catalog pointer through the manifest list to the manifests. File-level statistics do the filtering.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What's the schema?&lt;/strong&gt; It's in the metadata file, part of the immutable snapshot.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What values are in each column?&lt;/strong&gt; Statistics are stored in manifests alongside each file reference.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The entire metadata layer is self-contained. You don't need a separate database. You don't need to manage catalog consistency separately from data consistency. It's all atomic, all together.&lt;/p&gt;

&lt;p&gt;And because each snapshot is immutable, your entire table history is naturally preserved. Time travel queries aren't a special feature; they're just reading an older snapshot. Schema changes don't break the system; old data files continue to work with old schemas (thanks to field IDs, which we'll touch on next).&lt;/p&gt;

&lt;h2&gt;
  
  
  Field IDs: Why Names Are Dangerous
&lt;/h2&gt;

&lt;p&gt;This is the secret weapon that makes schema evolution work.&lt;/p&gt;

&lt;p&gt;In traditional tables, columns are identified by position or name. If you rename a column, downstream systems break. If you reorder columns, file readers get confused.&lt;/p&gt;

&lt;p&gt;Iceberg uses field IDs instead. Every column has a unique numeric ID that never changes. When you ALTER TABLE to rename a column, the field ID stays the same. The data file format doesn't care about the name; it reads by ID.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Original table&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;              &lt;span class="c1"&gt;-- field ID 1&lt;/span&gt;
  &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;         &lt;span class="c1"&gt;-- field ID 2&lt;/span&gt;
  &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;         &lt;span class="c1"&gt;-- field ID 3&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;-- Years later, rename a column&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;RENAME&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;email_address&lt;/span&gt;

&lt;span class="c1"&gt;-- Old data files still have field ID 3 pointing to email data&lt;/span&gt;
&lt;span class="c1"&gt;-- New queries use field ID 3, which now maps to email_address&lt;/span&gt;
&lt;span class="c1"&gt;-- Everything works transparently&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is why Iceberg tables can safely evolve their schemas over years without rewriting historical data.&lt;/p&gt;
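
&lt;p&gt;The resolution logic behind this can be sketched with two maps. This is a toy model of field-ID resolution, not Iceberg's reader - the maps and numeric IDs are assumptions for illustration:&lt;/p&gt;

```java
import java.util.HashMap;

// Toy sketch of field-ID resolution, not Iceberg's implementation: the table
// schema maps current column names to stable field IDs, and each data file
// maps field IDs to its own column positions. A rename touches only the
// name map; the field ID and every existing file stay untouched.
public class FieldIdResolution {
    // Table schema: current name -> stable field ID (IDs assumed for illustration).
    static HashMap nameToId = new HashMap();
    // Old data file: field ID -> column position inside that file.
    static HashMap idToFileColumn = new HashMap();

    // Resolve a query's column name to a position in the old data file.
    public static int resolveColumn(String currentName) {
        Object id = nameToId.get(currentName);
        return (Integer) idToFileColumn.get(id);
    }

    public static void main(String[] args) {
        nameToId.put("id", 1);
        nameToId.put("name", 2);
        nameToId.put("email", 3);
        idToFileColumn.put(1, 0);
        idToFileColumn.put(2, 1);
        idToFileColumn.put(3, 2);

        // ALTER TABLE ... RENAME COLUMN email TO email_address:
        // only the name map changes; field ID 3 and the file are unchanged.
        nameToId.remove("email");
        nameToId.put("email_address", 3);

        System.out.println(resolveColumn("email_address")); // prints 2
    }
}
```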

&lt;h2&gt;
  
  
  Partition Evolution: Changing Strategy Without Rewrites
&lt;/h2&gt;

&lt;p&gt;The same principle applies to partitioning.&lt;/p&gt;

&lt;p&gt;Let's say you originally partitioned your data by year:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/year=2020/...
/year=2021/...
/year=2022/...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Later, you realize you need monthly partitions for performance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/year=2023/month=01/...
/year=2023/month=02/...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With traditional data lakes, you'd have to rewrite all the old yearly partitioned data to the new monthly scheme. That's hours of work for large tables.&lt;/p&gt;

&lt;p&gt;With Iceberg, you change the partition spec and move on. New writes use the new partition layout. Old writes keep the old layout. The manifest files track which partition spec applies to which data files. When you query, Iceberg handles both seamlessly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hidden Partitioning: The Query Engine Doesn't Know or Care
&lt;/h2&gt;

&lt;p&gt;Here's another elegance: Your query doesn't explicitly reference partitions.&lt;/p&gt;

&lt;p&gt;In Hive, you might write:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;year&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2024&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hive sees the partition column and uses it for partition pruning.&lt;/p&gt;

&lt;p&gt;In Iceberg, you write:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;event_date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-01'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You don't mention partitioning at all. You just query the actual column. Iceberg's hidden partitioning handles partition transforms internally. If the table is partitioned by &lt;code&gt;year(event_date)&lt;/code&gt;, Iceberg applies the transform, prunes the right partitions, and returns the answer. The query engine never knows partitioning happened.&lt;/p&gt;

&lt;p&gt;This is powerful because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Queries are simpler and more portable across engines&lt;/li&gt;
&lt;li&gt;You can change partition transforms without rewriting queries&lt;/li&gt;
&lt;li&gt;The partition strategy is an implementation detail, not part of the query contract&lt;/li&gt;
&lt;/ol&gt;
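
&lt;p&gt;To make the transform step concrete, here is a toy model of how a predicate on the raw column prunes year-partitioned data. The method names are invented; Iceberg's real transform machinery is richer than this:&lt;/p&gt;

```java
// Toy sketch of hidden partitioning, not Iceberg's transform API: the table
// stores a transform such as year(event_date); the engine's predicate on the
// raw column is rewritten into a predicate on the partition value.
public class HiddenPartitioning {
    // year() transform applied to an ISO date string such as "2024-03-17".
    public static int yearTransform(String isoDate) {
        return Integer.parseInt(isoDate.substring(0, 4));
    }

    // event_date >= lowerBound implies year(event_date) >= year(lowerBound),
    // so partitions below that year are pruned without touching any data.
    public static boolean partitionMayMatch(int partitionYear, String lowerBound) {
        return partitionYear >= yearTransform(lowerBound);
    }

    public static void main(String[] args) {
        int[] partitions = { 2021, 2022, 2023, 2024, 2025 };
        StringBuilder kept = new StringBuilder();
        // WHERE event_date >= '2024-01-01' never mentions the partition column.
        for (int p : partitions) {
            if (partitionMayMatch(p, "2024-01-01")) kept.append(p).append(" ");
        }
        System.out.println(kept.toString().trim()); // prints: 2024 2025
    }
}
```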

&lt;h2&gt;
  
  
  Snapshots: Immutable Points in Time
&lt;/h2&gt;

&lt;p&gt;Every write operation creates a new snapshot. A snapshot is an immutable view of the table at a point in time.&lt;/p&gt;

&lt;p&gt;The catalog points to the current snapshot. Previous snapshots remain accessible. You can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query snapshot N: &lt;code&gt;SELECT * FROM table VERSION AS OF snapshot_id_12345&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Query by timestamp: &lt;code&gt;SELECT * FROM table FOR SYSTEM_TIME AS OF '2026-04-10'&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Audit history: Inspect the metadata file to see all snapshot timestamps&lt;/li&gt;
&lt;li&gt;Rollback: Point the catalog back to an earlier snapshot&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because snapshots are immutable and complete, time travel isn't a performance problem. You're not scanning all history; you're just reading a different snapshot.&lt;/p&gt;
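
&lt;p&gt;The snapshot model reduces to an append-only list plus one movable pointer. This is a toy sketch of that idea, not Iceberg's API - the class and method names are invented:&lt;/p&gt;

```java
// Toy sketch of snapshot history and rollback, not Iceberg's implementation:
// every commit appends an immutable snapshot, and rollback just repoints the
// current-snapshot reference at an older entry. Nothing is deleted or rewritten.
public class SnapshotHistory {
    private final java.util.ArrayList snapshots = new java.util.ArrayList();
    private int currentIndex = -1;

    // A commit appends a snapshot and advances the pointer; returns its ID.
    public int commit(String description) {
        snapshots.add(description);
        currentIndex = snapshots.size() - 1;
        return currentIndex;
    }

    // Time travel: read any snapshot by ID without moving the pointer.
    public String readSnapshot(int id) { return (String) snapshots.get(id); }

    // Rollback: move the pointer; all snapshots remain intact and readable.
    public void rollbackTo(int id) { currentIndex = id; }

    public String current() { return (String) snapshots.get(currentIndex); }

    public static void main(String[] args) {
        SnapshotHistory table = new SnapshotHistory();
        int s0 = table.commit("initial load");
        table.commit("append tuesday batch");
        table.commit("bad backfill");
        table.rollbackTo(s0 + 1);
        System.out.println(table.current());       // append tuesday batch
        System.out.println(table.readSnapshot(2)); // bad backfill, still readable
    }
}
```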

&lt;h2&gt;
  
  
  Copy-on-Write and Merge-on-Read: Two Paths to Mutation
&lt;/h2&gt;

&lt;p&gt;Iceberg supports two different strategies for handling updates and deletes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Copy-on-Write (CoW):&lt;/strong&gt; When you delete a row, Iceberg reads the entire data file, filters out the deleted row, and writes a new file. The old file is marked as deleted in the manifest.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pros: Clean reads, no delete reconciliation needed&lt;/li&gt;
&lt;li&gt;Cons: Every mutation rewrites files (slow for write-heavy workloads)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Merge-on-Read (MoR):&lt;/strong&gt; When you delete a row, Iceberg writes a separate delete file that marks rows as deleted. At read time, the query engine merges the data files and delete files to reconstruct the current state.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pros: Fast writes, no file rewrites&lt;/li&gt;
&lt;li&gt;Cons: Reads have to do reconciliation work&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The manifest tracks both data files and delete files, so readers know which delete files apply to which data files.&lt;/p&gt;
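&lt;p&gt;Here's a small Python sketch of the merge-on-read reconciliation step, using a toy positional delete file. The real files are Parquet with a defined schema; this only shows the logic:&lt;br&gt;
&lt;/p&gt;

```python
# Merge-on-read sketch: readers combine data files with positional delete
# files at query time; the data file itself is never rewritten.
data_file = ["alice", "bob", "carol", "dave"]                  # rows by position
delete_file = {("data-1.parquet", 1), ("data-1.parquet", 3)}   # (file, row pos)

def merge_on_read(file_name, rows, deletes):
    """Yield only rows whose (file, position) is not marked deleted."""
    return [row for pos, row in enumerate(rows)
            if (file_name, pos) not in deletes]

print(merge_on_read("data-1.parquet", data_file, delete_file))
# bob and dave are filtered out at read time
```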

&lt;h2&gt;
  
  
  Metadata Compaction: Keeping History Manageable
&lt;/h2&gt;

&lt;p&gt;Over months and years, you accumulate many snapshots. The metadata files grow. The manifest lists grow.&lt;/p&gt;

&lt;p&gt;Iceberg provides maintenance operations to clean this up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Snapshot expiration&lt;/strong&gt;: Snapshots older than a retention period are removed from table metadata&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manifest compaction&lt;/strong&gt;: Small manifest files are merged into larger ones&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orphan file cleanup&lt;/strong&gt;: Files with no references in active snapshots are removed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can also manually trigger compaction to optimize for query performance.&lt;/p&gt;
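&lt;p&gt;A toy version of the expiration and orphan-detection logic might look like this. It's illustrative only; in practice you run Iceberg's built-in maintenance procedures rather than rolling your own:&lt;br&gt;
&lt;/p&gt;

```python
# Toy snapshot expiration: drop snapshots past retention, then report files
# that no surviving snapshot references (orphans, safe to delete).
from datetime import datetime, timedelta

snapshots = [
    {"id": 1, "committed": datetime(2026, 1, 1), "files": {"a.parquet"}},
    {"id": 2, "committed": datetime(2026, 3, 1), "files": {"b.parquet"}},
    {"id": 3, "committed": datetime(2026, 4, 1), "files": {"b.parquet", "c.parquet"}},
]

def expire(snaps, now, retention):
    live = [s for s in snaps if now - s["committed"] <= retention]
    referenced = set().union(*(s["files"] for s in live))
    every_file = set().union(*(s["files"] for s in snaps))
    return live, every_file - referenced   # (surviving snapshots, orphans)

live, orphans = expire(snapshots, datetime(2026, 4, 10), timedelta(days=60))
print([s["id"] for s in live])  # [2, 3]: snapshot 1 is past retention
print(orphans)                  # {'a.parquet'}: no live snapshot needs it
```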

&lt;h2&gt;
  
  
  Why This Architecture Matters
&lt;/h2&gt;

&lt;p&gt;Let's zoom out. Why is this design so important?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scalability without infrastructure&lt;/strong&gt;: Beyond a lightweight catalog pointer, you don't need a separate metadata store. The files themselves carry the metadata. This means Iceberg scales to petabytes without additional complexity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ACID correctness&lt;/strong&gt;: The atomic catalog pointer ensures that either a write succeeds completely or fails completely. There's no partial success, no consistency holes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Engine independence&lt;/strong&gt;: The metadata hierarchy is format-agnostic. Spark reads it the same way Trino does. The metadata is the contract between engines, not some database-specific schema.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;History preservation&lt;/strong&gt;: Because snapshots are immutable, you get time travel and audit trails for free. No special feature; just a natural consequence of the design.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Evolution without friction&lt;/strong&gt;: Schema and partition evolution don't require data rewrites. Your table grows and changes safely.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Concurrent writes&lt;/strong&gt;: Optimistic concurrency control at the metadata level means multiple writers can work simultaneously without locking. Conflicts resolve cleanly at the atomic point.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is why Iceberg has become the standard for large-scale analytics. It's not that it's flashy or new. It's that it solves real problems at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bringing It All Together
&lt;/h2&gt;

&lt;p&gt;The next time you write a query against an Iceberg table, remember the architecture beneath it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The catalog pointer found your table&lt;/li&gt;
&lt;li&gt;The metadata file described its schema&lt;/li&gt;
&lt;li&gt;The manifest list for that snapshot identified the relevant manifests&lt;/li&gt;
&lt;li&gt;The manifests pruned unnecessary files using statistics&lt;/li&gt;
&lt;li&gt;The data files provided the actual rows&lt;/li&gt;
&lt;li&gt;Delete files (if any) marked rows as removed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No external database needed. No consistency problems. No performance cliffs at scale.&lt;/p&gt;

&lt;p&gt;Just files, organized intelligently, describing themselves.&lt;/p&gt;

&lt;p&gt;That's Iceberg.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://iceberg.apache.org" rel="noopener noreferrer"&gt;Apache Iceberg Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberg.apache.org/spec/" rel="noopener noreferrer"&gt;Iceberg Specification&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/iceberg" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://research.netflix.com/publication/iceberg-an-open-table-format-for-analytic-datasets" rel="noopener noreferrer"&gt;"Iceberg: An Open Table Format for Huge Analytic Datasets" (Netflix paper)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;About the Author&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I'm Prithvi S, Staff Software Engineer at Cloudera and Open Source Enthusiast. I work on data systems, LLM-powered applications, and large-scale architectures. Follow my work on GitHub: &lt;a href="https://github.com/iprithv" rel="noopener noreferrer"&gt;https://github.com/iprithv&lt;/a&gt;&lt;/p&gt;

</description>
      <category>iceberg</category>
      <category>data</category>
      <category>architecture</category>
      <category>database</category>
    </item>
    <item>
      <title>Why Polaris Never Touches Your Cloud Credentials: Storage Config Internals</title>
      <dc:creator>Prithvi S</dc:creator>
      <pubDate>Thu, 16 Apr 2026 08:16:23 +0000</pubDate>
      <link>https://dev.to/iprithv/why-polaris-never-touches-your-cloud-credentials-storage-config-internals-45en</link>
      <guid>https://dev.to/iprithv/why-polaris-never-touches-your-cloud-credentials-storage-config-internals-45en</guid>
      <description>&lt;p&gt;Every data engineer has a nightmare: discovering that a credential spreadsheet with AWS keys got committed to git. Or worse, finding that production credentials are sitting in a YAML file on 50 developer laptops.&lt;/p&gt;

&lt;p&gt;Most data platforms solve this by asking you to trust them with your cloud credentials. Snowflake stores them. Hive stores them. Glue stores them. Then they promise really hard not to leak them.&lt;/p&gt;

&lt;p&gt;Apache Polaris takes a different approach entirely. It never asks for your cloud credentials at all.&lt;/p&gt;

&lt;p&gt;Instead, it establishes &lt;em&gt;trust relationships&lt;/em&gt; with your cloud provider, then mints temporary, scoped credentials on the fly whenever an engine needs to read or write data. You set it up once. Polaris handles the rest.&lt;/p&gt;

&lt;p&gt;This is the foundation of Polaris's entire security model, and it's worth understanding deeply. Not just because it's clever, but because it fundamentally changes what's possible in multi-tenant, regulated, or security-conscious environments.&lt;/p&gt;

&lt;p&gt;Let's dig into how it works.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Traditional Problem: Credential Storage
&lt;/h2&gt;

&lt;p&gt;When you set up Snowflake to read from S3, you provide your AWS credentials. Snowflake stores them (encrypted, they promise). When a query runs, Snowflake uses those credentials to access S3.&lt;/p&gt;

&lt;p&gt;This creates several problems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Long-lived credentials in the system.&lt;/strong&gt; If Snowflake's database gets compromised, those credentials are exposed for months or years until someone notices and rotates them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. One set of credentials for many operations.&lt;/strong&gt; The same credential can be used to read, write, delete, or modify anything in your S3 account. There's no granularity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Difficult audit trails.&lt;/strong&gt; When suspicious S3 access happens, you can't pinpoint which Snowflake query or which user triggered it. The logs just show "snowflake_service_account accessed this bucket."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Compliance friction.&lt;/strong&gt; Regulated organizations (healthcare, finance) have strict rules about where credentials can live. Storing them in Snowflake often violates those policies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Credential rotation is manual and risky.&lt;/strong&gt; You have to update credentials in Snowflake, hope nothing breaks mid-rotation, and coordinate with other systems.&lt;/p&gt;

&lt;p&gt;Polaris was designed to solve all of these at once.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Polaris Does It: The Trust Model
&lt;/h2&gt;

&lt;p&gt;Instead of storing credentials, Polaris stores a &lt;em&gt;configuration&lt;/em&gt; that establishes trust with your cloud provider. Let's walk through S3 as the example.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Register Your Cloud Storage
&lt;/h3&gt;

&lt;p&gt;When you create a catalog in Polaris, you provide a storage configuration. For S3, that looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"storageType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"S3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"externalId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"polaris-prod-7f92ac"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"roleArn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:iam::123456789012:role/polaris-catalog-role"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"bucket"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my-company-data-lake"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice what's &lt;em&gt;not&lt;/em&gt; here: no AWS access key. No secret key. No credentials of any kind.&lt;/p&gt;

&lt;p&gt;What &lt;em&gt;is&lt;/em&gt; here is a &lt;strong&gt;reference to an IAM role&lt;/strong&gt; that you've already created in AWS, plus an external ID that makes the trust relationship unique to this Polaris instance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Set Up the Trust Relationship in AWS
&lt;/h3&gt;

&lt;p&gt;Before Polaris can mint credentials, you need to create that IAM role and configure it to trust Polaris. Here's the trust policy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Principal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"AWS"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:iam::AWS_ACCOUNT_ID:user/polaris-service"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sts:AssumeRole"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"StringEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"sts:ExternalId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"polaris-prod-7f92ac"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This says: "The Polaris service can assume this role, but &lt;em&gt;only if&lt;/em&gt; it provides the external ID &lt;code&gt;polaris-prod-7f92ac&lt;/code&gt;."&lt;/p&gt;

&lt;p&gt;The external ID is crucial. It prevents confused deputy attacks. Even if an attacker compromises Polaris, they can't assume random IAM roles in other AWS accounts without the correct external ID.&lt;/p&gt;
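&lt;p&gt;The effect of the external ID can be sketched as a toy policy check. This isn't AWS's policy engine, just the rule it enforces for this trust policy:&lt;br&gt;
&lt;/p&gt;

```python
# Toy evaluation of the trust policy above: AssumeRole succeeds only when the
# caller is the trusted principal AND presents the configured external ID.
trust_policy = {
    "principal": "arn:aws:iam::AWS_ACCOUNT_ID:user/polaris-service",
    "external_id": "polaris-prod-7f92ac",
}

def can_assume_role(caller, external_id, policy):
    return caller == policy["principal"] and external_id == policy["external_id"]

# The legitimate Polaris instance succeeds:
print(can_assume_role("arn:aws:iam::AWS_ACCOUNT_ID:user/polaris-service",
                      "polaris-prod-7f92ac", trust_policy))   # True
# A caller without the right external ID fails, even as the trusted principal:
print(can_assume_role("arn:aws:iam::AWS_ACCOUNT_ID:user/polaris-service",
                      "guessed-id", trust_policy))            # False
```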

&lt;h3&gt;
  
  
  Step 3: Attach Policies to the IAM Role
&lt;/h3&gt;

&lt;p&gt;You then attach an S3 policy to that IAM role that limits what it can do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"s3:GetObject"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"s3:ListBucket"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:s3:::my-company-data-lake"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:s3:::my-company-data-lake/*"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This role can only read from your data lake bucket. It can't write, delete, or access anything else.&lt;/p&gt;

&lt;p&gt;Now Polaris is set up. It has a configuration (not credentials) that points to this IAM role. It has an external ID. And the trust relationship is wired up in AWS.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Credential Vending Flow
&lt;/h2&gt;

&lt;p&gt;Here's where the magic happens. When a Spark engine wants to read data from a Polaris-managed table:&lt;/p&gt;

&lt;h3&gt;
  
  
  Request Phase
&lt;/h3&gt;

&lt;p&gt;The Spark engine calls the Polaris REST API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET /v1/catalogs/my_catalog/namespaces/raw/tables/customers/data
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Polaris receives this request and extracts the context: who is asking, what are they trying to do, and what table do they want to access.&lt;/p&gt;

&lt;h3&gt;
  
  
  Authorization Phase
&lt;/h3&gt;

&lt;p&gt;Polaris checks its RBAC model. Does this principal have TABLE_READ_DATA permission on the customers table? It consults its role hierarchy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The user's identity is bound to a principal role (e.g., "analytics_engineers")&lt;/li&gt;
&lt;li&gt;That principal role is granted a catalog role (e.g., "read_raw_data")&lt;/li&gt;
&lt;li&gt;That catalog role has TABLE_READ_DATA on the customers table&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the authorization check passes, Polaris moves to the next phase.&lt;/p&gt;
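&lt;p&gt;That chain of grants can be sketched as a simple lookup. All names here are made up for illustration; Polaris's real model is richer, but the walk is the same:&lt;br&gt;
&lt;/p&gt;

```python
# Toy RBAC walk: user -> principal roles -> catalog roles -> privileges.
principal_roles = {"prithvi": {"analytics_engineers"}}
catalog_role_grants = {"analytics_engineers": {"read_raw_data"}}
privileges = {"read_raw_data": {("raw.customers", "TABLE_READ_DATA")}}

def is_authorized(user, table, privilege):
    for p_role in principal_roles.get(user, set()):
        for c_role in catalog_role_grants.get(p_role, set()):
            if (table, privilege) in privileges.get(c_role, set()):
                return True
    return False

print(is_authorized("prithvi", "raw.customers", "TABLE_READ_DATA"))   # True
print(is_authorized("prithvi", "raw.customers", "TABLE_WRITE_DATA"))  # False
```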

&lt;h3&gt;
  
  
  Credential Minting Phase
&lt;/h3&gt;

&lt;p&gt;Polaris looks up the storage configuration for this table. It sees:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;roleArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::123456789012:role/polaris-catalog-role&lt;/span&gt;
&lt;span class="na"&gt;externalId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;polaris-prod-7f92ac&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It then calls AWS STS AssumeRole:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws sts assume-role &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role-arn&lt;/span&gt; arn:aws:iam::123456789012:role/polaris-catalog-role &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--external-id&lt;/span&gt; polaris-prod-7f92ac &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--duration-seconds&lt;/span&gt; 900  &lt;span class="c"&gt;# 15 minutes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AWS validates the external ID, checks the trust policy, and returns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Credentials"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"AccessKeyId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ASIA..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"SecretAccessKey"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"SessionToken"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Expiration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-04-16T09:28:00Z"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are &lt;em&gt;temporary&lt;/em&gt; credentials. They expire in 15 minutes. They can only do what the IAM role allows (in this case, read S3).&lt;/p&gt;

&lt;h3&gt;
  
  
  Scope Restriction Phase
&lt;/h3&gt;

&lt;p&gt;Polaris could stop here, but it doesn't. It further restricts these credentials to just the table being accessed, typically by passing an inline session policy at AssumeRole time, so the credential can only touch &lt;code&gt;s3://my-company-data-lake/raw/customers/&lt;/code&gt;, not &lt;code&gt;s3://my-company-data-lake/sensitive/&lt;/code&gt;.&lt;/p&gt;
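&lt;p&gt;The net effect is that every path the credential touches must fall under the table's prefix. A toy check (illustrative; in AWS the restriction is enforced server-side by the policy, not by client code):&lt;br&gt;
&lt;/p&gt;

```python
# Sketch of path-level scoping: the vended credential is bound to one prefix,
# and any access outside it is rejected.
ALLOWED_PREFIX = "s3://my-company-data-lake/raw/customers/"

def check_access(path: str) -> bool:
    return path.startswith(ALLOWED_PREFIX)

print(check_access("s3://my-company-data-lake/raw/customers/part-0.parquet"))  # True
print(check_access("s3://my-company-data-lake/sensitive/salaries.parquet"))    # False
```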

&lt;h3&gt;
  
  
  Response Phase
&lt;/h3&gt;

&lt;p&gt;Polaris returns the temporary credentials to Spark:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"credentials"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"aws_access_key_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ASIA..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"aws_secret_access_key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"aws_session_token"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"expires_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-04-16T09:28:00Z"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"s3://my-company-data-lake/raw/customers/"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Spark now has everything it needs. It can read data for 15 minutes. After that, the credential is useless.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters: The Security Benefits
&lt;/h2&gt;

&lt;p&gt;This design has profound implications:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. No long-lived credentials stored anywhere.&lt;/strong&gt; Polaris doesn't store AWS keys. Your laptop doesn't have them. They're generated on-demand and expire quickly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Instant revocation.&lt;/strong&gt; If you need to immediately revoke a user's access, you update their Polaris role. The next credential mint fails. There's no delay.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Audit trails.&lt;/strong&gt; AWS logs show exactly which Polaris instance, with which external ID, assumed the role. You can trace every data access back to a specific user and query.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Fine-grained access control.&lt;/strong&gt; Different tables can have different IAM roles with different permissions. Read-only tables get read-only roles. Write-enabled tables get write roles. A user's access to each table is independently controlled.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Multi-cloud compatibility.&lt;/strong&gt; Polaris supports the same pattern for GCS (using service account tokens) and Azure (using managed identities). The mechanism changes, but the principle is the same: temporary, scoped credentials.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Compliance-friendly.&lt;/strong&gt; Regulated organizations can enforce policies like "credentials must expire in under 30 minutes" or "all access must be auditable." Polaris handles both automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  The GCS and Azure Equivalents
&lt;/h2&gt;

&lt;p&gt;The S3 pattern generalizes to other clouds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Google Cloud Storage
&lt;/h3&gt;

&lt;p&gt;With GCS, you don't provide a credential. Instead, you provide a service account:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"storageType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GCS"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"projectId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my-gcp-project"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"serviceAccount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"polaris@my-gcp-project.iam.gserviceaccount.com"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You configure GCS IAM so that Polaris's service account can impersonate a restricted role. When a credential is needed, Polaris calls GCS APIs to get a short-lived access token. Same pattern, different mechanism.&lt;/p&gt;

&lt;h3&gt;
  
  
  Azure
&lt;/h3&gt;

&lt;p&gt;With Azure, you use managed identities:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"storageType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AZURE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tenantId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"12345678-..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"storageAccount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mycompanydatalake"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"containerId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"raw"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Polaris (running in a managed identity or service principal) gets a short-lived token from Azure AD. Again, the principle is identical: temporary, scoped, revocable credentials.&lt;/p&gt;

&lt;h2&gt;
  
  
  Credential Caching and Performance
&lt;/h2&gt;

&lt;p&gt;One question you might have: doesn't minting a new credential for every request add latency?&lt;/p&gt;

&lt;p&gt;Yes, but Polaris optimizes for this. It caches credentials locally. If the same user asks for credentials for the same table within a few minutes, Polaris returns the cached credential instead of calling the cloud provider again. This reduces latency to under 10ms in most cases.&lt;/p&gt;

&lt;p&gt;The tradeoff is acceptable: an extra 100-200ms on the first request for a credential is well worth the security benefits of never storing cloud credentials.&lt;/p&gt;
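&lt;p&gt;A cache along these lines can be sketched in Python. Here &lt;code&gt;mint()&lt;/code&gt; stands in for the real STS round-trip, and the refresh margin is an assumed tuning knob, not a documented Polaris setting:&lt;br&gt;
&lt;/p&gt;

```python
# Toy credential cache: entries keyed by (principal, table), re-minted only
# when close to expiry.
import time

CACHE = {}
TTL_SECONDS = 900     # 15-minute credentials
REFRESH_MARGIN = 60   # re-mint when less than a minute remains

def mint(principal, table):
    """Placeholder for the STS AssumeRole round-trip (the slow path)."""
    return {"token": f"cred-for-{principal}-{table}",
            "expires_at": time.time() + TTL_SECONDS}

def get_credentials(principal, table):
    key = (principal, table)
    cred = CACHE.get(key)
    if cred is None or cred["expires_at"] - time.time() < REFRESH_MARGIN:
        cred = mint(principal, table)   # slow path: call the cloud provider
        CACHE[key] = cred
    return cred                         # fast path: cached, sub-millisecond

a = get_credentials("prithvi", "raw.customers")
b = get_credentials("prithvi", "raw.customers")
print(a is b)  # True: the second call hits the cache
```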

&lt;h2&gt;
  
  
  Deployment Implications
&lt;/h2&gt;

&lt;p&gt;How do you actually deploy this? Polaris itself needs to run somewhere, and it needs to be able to call AWS STS (or GCS, or Azure AD).&lt;/p&gt;

&lt;p&gt;Typically, you run Polaris in a Kubernetes cluster with a Kubernetes service account. You configure IRSA (IAM Roles for Service Accounts) to bind that service account to an IAM role. Polaris then inherits permissions to call STS.&lt;/p&gt;

&lt;p&gt;The configuration looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;serviceAccount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;eks.amazonaws.com/role-arn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::123456789012:role/polaris-service-role&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means Polaris gets credentials to call AWS APIs, but those credentials are also temporary and scoped. You've just nested trust relationships.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Gets Stored in Polaris
&lt;/h2&gt;

&lt;p&gt;Since Polaris doesn't store cloud credentials, what &lt;em&gt;does&lt;/em&gt; it store?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Storage configurations&lt;/strong&gt; (roleArn, externalId, bucket, project, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Entity metadata&lt;/strong&gt; (table names, schemas, partitions)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RBAC definitions&lt;/strong&gt; (which roles have which privileges)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit logs&lt;/strong&gt; (who accessed what, when)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All of this lives in Polaris's metadata store, typically a PostgreSQL database. The metadata store itself should be encrypted at rest and in transit, but it doesn't contain cloud credentials. Even if the metadata store is compromised, an attacker can't access your data lake.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Example: A Multi-Tenant SaaS
&lt;/h2&gt;

&lt;p&gt;Imagine you're building a data platform SaaS. You have 100 customers, each with their own S3 bucket. You can't ask each customer for their AWS credentials (security nightmare for them). Instead:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Each customer creates an IAM role in their AWS account and trusts your Polaris instance&lt;/li&gt;
&lt;li&gt;They register that role ARN in Polaris during onboarding&lt;/li&gt;
&lt;li&gt;Your single Polaris instance now manages access to 100 buckets securely&lt;/li&gt;
&lt;li&gt;Each customer's queries get credentials scoped to their bucket only&lt;/li&gt;
&lt;li&gt;You can audit which customer accessed what, when&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This model is impractical with traditional credential storage: you'd be holding and rotating 100 sets of customer keys, each one a standing liability.&lt;/p&gt;
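&lt;p&gt;Concretely, the trust each customer establishes is an ordinary IAM role trust policy in their own account. A minimal sketch (your Polaris service-role ARN and the per-customer ExternalId are placeholders):&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111122223333:role/polaris-service-role"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": { "sts:ExternalId": "customer-42-external-id" }
      }
    }
  ]
}
```

&lt;p&gt;The ExternalId condition is the standard guard against the confused-deputy problem: even if another tenant learns the role ARN, Polaris will only assume the role with the ExternalId registered for that customer.&lt;/p&gt;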

&lt;h2&gt;
  
  
  Looking Forward: Polaris v1.3.0 Enhancements
&lt;/h2&gt;

&lt;p&gt;Polaris v1.3.0 extends this pattern to &lt;em&gt;federated catalogs&lt;/em&gt;. You can now register external catalogs (Snowflake, AWS Glue, Databricks) with Polaris, and Polaris will vend credentials for them too.&lt;/p&gt;

&lt;p&gt;This means you could have a single Polaris instance managing access across Iceberg catalogs, Glue catalogs, and Snowflake, all without storing credentials for any of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Polaris never stores your cloud credentials because it doesn't have to. By establishing trust relationships upfront and minting temporary credentials on-demand, Polaris achieves something that traditional data platforms can't: security without storing secrets.&lt;/p&gt;

&lt;p&gt;This is why enterprises are moving to Polaris. Not just because Iceberg is open-source, but because the entire access control model is built for environments where credentials are liabilities, not assets.&lt;/p&gt;

&lt;p&gt;If you're building data infrastructure at scale, this pattern is worth understanding. It might change how you think about credential management in your own systems.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Want to learn more?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://polaris.apache.org/" rel="noopener noreferrer"&gt;Apache Polaris Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberg.apache.org/spec/" rel="noopener noreferrer"&gt;Iceberg REST Spec&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html" rel="noopener noreferrer"&gt;AWS STS AssumeRole&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;I'm Prithvi S, Staff Software Engineer at Cloudera and an open-source enthusiast. Follow my work on GitHub: &lt;a href="https://github.com/iprithv" rel="noopener noreferrer"&gt;https://github.com/iprithv&lt;/a&gt;&lt;/p&gt;

</description>
      <category>polaris</category>
      <category>security</category>
      <category>api</category>
      <category>cloud</category>
    </item>
    <item>
      <title>How OpenSearch Plugins Really Work: Architecture &amp; Extension Points</title>
      <dc:creator>Prithvi S</dc:creator>
      <pubDate>Wed, 15 Apr 2026 00:32:11 +0000</pubDate>
      <link>https://dev.to/iprithv/how-opensearch-plugins-really-work-architecture-extension-points-59k1</link>
      <guid>https://dev.to/iprithv/how-opensearch-plugins-really-work-architecture-extension-points-59k1</guid>
      <description>&lt;h1&gt;
  
  
  How OpenSearch Plugins Really Work: Architecture &amp;amp; Extension Points
&lt;/h1&gt;

&lt;p&gt;OpenSearch is powerful out of the box, but its true flexibility comes from plugins. Most developers treat them as black boxes: you install them, they work, and you move on. But what if you need to build one? Or understand why a plugin broke after an upgrade? Or design a system that integrates with OpenSearch's plugin ecosystem?&lt;/p&gt;

&lt;p&gt;In this post, I'll walk you through how plugins actually work: compilation, packaging, installation, and the extension points that make customization possible. By the end, you'll understand the mechanics well enough to build your own.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Plugin Lifecycle: From Source to Running Code
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Writing and Compiling a Plugin
&lt;/h3&gt;

&lt;p&gt;A plugin is a Java project with dependencies on OpenSearch core. At minimum, you need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gradle"&gt;&lt;code&gt;&lt;span class="k"&gt;dependencies&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;compileOnly&lt;/span&gt; &lt;span class="s2"&gt;"org.opensearch:opensearch:${opensearch_version}"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;compileOnly&lt;/code&gt; is critical: your plugin compiles against OpenSearch, but doesn't bundle it. The plugin will run inside the OpenSearch JVM, using the host's core libraries.&lt;/p&gt;

&lt;p&gt;Your plugin entry point is a class that extends &lt;code&gt;Plugin&lt;/code&gt;. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyCustomPlugin&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;Plugin&lt;/span&gt; &lt;span class="kd"&gt;implements&lt;/span&gt; &lt;span class="nc"&gt;SearchPlugin&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;QuerySpec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;?&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;getQueries&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Collections&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;singletonList&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;QuerySpec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="nc"&gt;MyCustomQuery&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;NAME&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nl"&gt;MyCustomQuery:&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nc"&gt;MyCustomQuery&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromXContent&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
        &lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This simple declaration tells OpenSearch: "I provide a custom query type called &lt;code&gt;my_custom_query&lt;/code&gt;."&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Building the Plugin Artifact
&lt;/h3&gt;

&lt;p&gt;When you run &lt;code&gt;gradle build&lt;/code&gt;, you produce a .zip file containing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my-plugin-1.0.0.zip
├── plugin-descriptor.properties
├── lib/
│   ├── my-plugin-1.0.0.jar
│   └── my-dependencies.jar (if any third-party libs needed)
├── bin/ (optional: scripts)
└── config/ (optional: default settings)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;plugin-descriptor.properties&lt;/code&gt; file is the plugin manifest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;name&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;my-custom-plugin&lt;/span&gt;
&lt;span class="py"&gt;description&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;My custom query plugin&lt;/span&gt;
&lt;span class="py"&gt;version&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;1.0.0&lt;/span&gt;
&lt;span class="py"&gt;opensearch.version&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;2.13.0&lt;/span&gt;
&lt;span class="py"&gt;java.version&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;11&lt;/span&gt;
&lt;span class="py"&gt;classname&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;com.example.MyCustomPlugin&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This manifest declares: which OpenSearch version the plugin targets, what Java version it needs, and crucially, the entry point class name.&lt;/p&gt;
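&lt;p&gt;The version check the install tool performs amounts to an exact-match comparison against the descriptor (plugins are generally required to target the node's OpenSearch version exactly). A toy sketch in plain Java, not the real tool's code:&lt;/p&gt;

```java
import java.io.StringReader;
import java.util.Properties;

// Toy model of the install-time version gate, not the real opensearch-plugin
// tool: a plugin must be built for the exact OpenSearch version of the node.
public class DescriptorCheck {

    /** True when the descriptor's target version matches the node's version exactly. */
    static boolean compatible(Properties descriptor, String nodeVersion) {
        return nodeVersion.equals(descriptor.getProperty("opensearch.version"));
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the manifest read out of the plugin zip.
        Properties descriptor = new Properties();
        descriptor.load(new StringReader(
            "name=my-custom-plugin\n" +
            "opensearch.version=2.13.0\n" +
            "classname=com.example.MyCustomPlugin\n"));

        System.out.println(compatible(descriptor, "2.13.0")); // true: install proceeds
        System.out.println(compatible(descriptor, "2.14.0")); // false: install is refused
    }
}
```

&lt;p&gt;This strictness is why plugins routinely "break" after an upgrade: the code may be fine, but the descriptor pins it to the old version.&lt;/p&gt;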

&lt;h3&gt;
  
  
  Step 3: Installation via the opensearch-plugin Tool
&lt;/h3&gt;

&lt;p&gt;You install via CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./bin/opensearch-plugin &lt;span class="nb"&gt;install &lt;/span&gt;file:///path/to/my-plugin-1.0.0.zip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tool does several things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Verifies the manifest&lt;/strong&gt; - reads &lt;code&gt;plugin-descriptor.properties&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version checks&lt;/strong&gt; - ensures the plugin targets the installed OpenSearch version&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Checks for conflicts&lt;/strong&gt; - refuses a plugin whose name is already installed and rejects classpath ("jar hell") collisions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extracts&lt;/strong&gt; - unpacks the zip to &lt;code&gt;plugins/my-custom-plugin/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Leaves loading to the node&lt;/strong&gt; - the tool doesn't restart anything; the plugin's classes are only picked up when you restart the node&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After that restart, your plugin code is live.&lt;/p&gt;

&lt;h2&gt;
  
  
  Class Loader Isolation and Bootstrap
&lt;/h2&gt;

&lt;p&gt;Here's where it gets interesting. Your plugin code runs in the same JVM as OpenSearch core. How does OpenSearch prevent your plugin from accidentally (or maliciously) breaking core?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Class Loader Isolation:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;OpenSearch builds a dedicated class loader for each plugin. Each one is a child of the core class loader, so every plugin gets its own namespace:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Core classes (org.opensearch.*) always resolve from the main class loader&lt;/li&gt;
&lt;li&gt;Classes that core doesn't provide resolve from the plugin's own bundled jars&lt;/li&gt;
&lt;li&gt;Sibling plugins never see each other's classes, because each has its own loader&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This prevents version conflicts between plugins. If your plugin needs a specific version of a library, it can bundle it, and its class loader will serve that version without clashing with any other plugin. (Bundling a jar that duplicates classes already on the core classpath is rejected by the "jar hell" check.)&lt;/p&gt;
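&lt;p&gt;To make the isolation concrete, here is a toy model of the lookup, with maps standing in for class loaders (an illustration, not OpenSearch's actual implementation):&lt;/p&gt;

```java
import java.util.Map;

// Toy model of per-plugin class resolution: every plugin has its own
// namespace backed by the shared core loader, and sibling plugins can
// bundle different versions of the same library without conflict.
public class LoaderModel {
    final Map<String, String> own;  // class name -> jar that provides it
    final LoaderModel parent;       // the core loader for plugins, null for core itself

    LoaderModel(LoaderModel parent, Map<String, String> own) {
        this.parent = parent;
        this.own = own;
    }

    /** Resolve a class name: core classes win; anything else comes from the plugin's own jars. */
    String load(String className) {
        if (parent != null) {
            String fromCore = parent.load(className);
            if (fromCore != null) return fromCore;
        }
        return own.get(className);  // null if nobody provides it
    }

    public static void main(String[] args) {
        LoaderModel core = new LoaderModel(null,
            Map.of("org.opensearch.search.SearchModule", "opensearch-2.13.0.jar"));
        LoaderModel pluginA = new LoaderModel(core, Map.of("com.google.gson.Gson", "gson-2.8.9.jar"));
        LoaderModel pluginB = new LoaderModel(core, Map.of("com.google.gson.Gson", "gson-2.10.1.jar"));

        // Both plugins resolve core classes from the shared loader...
        System.out.println(pluginA.load("org.opensearch.search.SearchModule")); // opensearch-2.13.0.jar
        // ...but each resolves its own bundled library version, without conflict.
        System.out.println(pluginA.load("com.google.gson.Gson")); // gson-2.8.9.jar
        System.out.println(pluginB.load("com.google.gson.Gson")); // gson-2.10.1.jar
    }
}
```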

&lt;p&gt;&lt;strong&gt;Bootstrap Contract:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When OpenSearch starts, it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Discovers all plugins in &lt;code&gt;plugins/&lt;/code&gt; directory&lt;/li&gt;
&lt;li&gt;Reads each plugin's descriptor&lt;/li&gt;
&lt;li&gt;Creates a &lt;code&gt;PluginClassLoader&lt;/code&gt; for each&lt;/li&gt;
&lt;li&gt;Instantiates each plugin's entry point class via reflection&lt;/li&gt;
&lt;li&gt;Calls lifecycle methods: &lt;code&gt;onIndexModule()&lt;/code&gt;, &lt;code&gt;onNodeStarted()&lt;/code&gt;, etc.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If a plugin fails to load, OpenSearch will refuse to start. This is intentional: it's safer to fail loudly than to silently omit a plugin that applications might depend on.&lt;/p&gt;
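&lt;p&gt;The bootstrap sequence can be pictured as a small discovery loop: parse each descriptor, then instantiate the declared entry-point class reflectively. A simplified, self-contained sketch (not OpenSearch's actual startup code; &lt;code&gt;java.util.ArrayList&lt;/code&gt; stands in for a plugin entry point so it runs anywhere):&lt;/p&gt;

```java
import java.io.StringReader;
import java.util.Properties;

// Simplified sketch of plugin bootstrap. The real node also builds a
// dedicated class loader per plugin before the reflective instantiation.
public class BootstrapSketch {

    /** Instantiate the entry-point class named in a plugin descriptor. */
    static Object instantiate(Properties descriptor) throws Exception {
        String classname = descriptor.getProperty("classname");
        // In OpenSearch this lookup goes through the plugin's own class loader.
        return Class.forName(classname).getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in descriptor; a real one is read from plugins/<name>/plugin-descriptor.properties.
        Properties descriptor = new Properties();
        descriptor.load(new StringReader("name=demo\nclassname=java.util.ArrayList\n"));

        Object plugin = instantiate(descriptor);
        System.out.println("loaded entry point: " + plugin.getClass().getName());
        // A failure here (ClassNotFoundException, bad constructor, etc.)
        // is what aborts node startup in the real system.
    }
}
```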

&lt;h2&gt;
  
  
  Extension Points: How Plugins Hook Into OpenSearch
&lt;/h2&gt;

&lt;p&gt;A plugin doesn't have direct access to internal OpenSearch code. Instead, it implements well-defined &lt;strong&gt;extension point interfaces&lt;/strong&gt;. OpenSearch discovers these implementations and calls them at the right moments.&lt;/p&gt;
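&lt;p&gt;The discovery side can be pictured as core filtering its loaded plugins for a given extension interface and collecting their contributions. A toy sketch (the interfaces here are stand-ins, not OpenSearch's real types):&lt;/p&gt;

```java
import java.util.List;
import java.util.stream.Collectors;

// Toy sketch of extension-point discovery: core holds the loaded plugins
// and, for each extension interface, collects the ones that implement it.
public class ExtensionPoints {
    interface Plugin {}
    interface SearchPlugin extends Plugin { List<String> getQueries(); }

    /** Gather every custom query name contributed by search-capable plugins. */
    static List<String> collectQueries(List<Plugin> loaded) {
        return loaded.stream()
            .filter(p -> p instanceof SearchPlugin)
            .flatMap(p -> ((SearchPlugin) p).getQueries().stream())
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Plugin plain = new Plugin() {};                              // contributes nothing
        SearchPlugin search = () -> List.of("my_custom_query");      // contributes one query
        System.out.println(collectQueries(List.of(plain, search)));  // [my_custom_query]
    }
}
```

&lt;p&gt;This is why the interfaces matter: core never calls your classes directly, it only calls the extension-point methods you chose to implement.&lt;/p&gt;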

&lt;h3&gt;
  
  
  SearchPlugin: Custom Query Types and Aggregations
&lt;/h3&gt;

&lt;p&gt;The most common extension point for search-focused plugins:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MySearchPlugin&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;Plugin&lt;/span&gt; &lt;span class="kd"&gt;implements&lt;/span&gt; &lt;span class="nc"&gt;SearchPlugin&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;QuerySpec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;?&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;getQueries&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Register custom query types&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Collections&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;singletonList&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;QuerySpec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="nc"&gt;MyQuery&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;NAME&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nl"&gt;MyQuery:&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nc"&gt;MyQuery&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromXContent&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
        &lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;AggregationSpec&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;getAggregations&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Register custom aggregations&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Collections&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;singletonList&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;AggregationSpec&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;MyAggregation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;NAME&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nl"&gt;MyAggregation:&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nc"&gt;MyAggregation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;parse&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
        &lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;ScoreFunctionSpec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;?&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;getScoreFunctions&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Register custom scoring functions&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Collections&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;singletonList&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ScoreFunctionSpec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="nc"&gt;MyScoreFunction&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;NAME&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nl"&gt;MyScoreFunction:&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nc"&gt;MyScoreFunction&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;parse&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
        &lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once registered, your custom query is available via the REST API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;GET&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/my-index/_search&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"my_custom_query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"field"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"boost"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  ActionPlugin: Custom REST and Transport Actions
&lt;/h3&gt;

&lt;p&gt;For plugins that need custom REST endpoints or transport operations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyActionPlugin&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;Plugin&lt;/span&gt; &lt;span class="kd"&gt;implements&lt;/span&gt; &lt;span class="nc"&gt;ActionPlugin&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;ActionHandler&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;?,&lt;/span&gt; &lt;span class="o"&gt;?&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;getActions&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Collections&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;singletonList&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ActionHandler&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="nc"&gt;MyAction&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;INSTANCE&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;TransportMyAction&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;RestHandler&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;getRestHandlers&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Settings&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;RestController&lt;/span&gt; &lt;span class="n"&gt;restController&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; 
            &lt;span class="nc"&gt;ClusterSettings&lt;/span&gt; &lt;span class="n"&gt;clusterSettings&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;IndexScopedSettings&lt;/span&gt; &lt;span class="n"&gt;indexScopedSettings&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
            &lt;span class="nc"&gt;SettingsFilter&lt;/span&gt; &lt;span class="n"&gt;settingsFilter&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;NamedWriteableRegistry&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;namedWriteableRegistries&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
            &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;NamedXContentRegistry&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;namedXContentRegistries&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Supplier&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;DiscoveryNodes&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;nodesInCluster&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
            &lt;span class="nc"&gt;Supplier&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;ClusterState&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;clusterStateSupplier&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Collections&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;singletonList&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;RestMyHandler&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
        &lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can hit a custom endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;POST /_plugin/my-action
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"param1"&lt;/span&gt;: &lt;span class="s2"&gt;"value"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  MapperPlugin: Custom Field Types
&lt;/h3&gt;

&lt;p&gt;If you need a new field type (beyond standard text, keyword, numeric, etc.):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyMapperPlugin&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;Plugin&lt;/span&gt; &lt;span class="kd"&gt;implements&lt;/span&gt; &lt;span class="nc"&gt;MapperPlugin&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Mapper&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;TypeParser&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;getMappers&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Collections&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;singletonMap&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="s"&gt;"my_custom_field"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
            &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parserContext&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;MyCustomFieldMapper&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parserContext&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can use it in mappings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;PUT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/my-index&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mappings"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"properties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"custom_field"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my_custom_field"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"analyzer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"standard"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  EnginePlugin: Custom Lucene Behavior
&lt;/h3&gt;

&lt;p&gt;For advanced use cases, you can hook into the Lucene engine itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyEnginePlugin&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;Plugin&lt;/span&gt; &lt;span class="kd"&gt;implements&lt;/span&gt; &lt;span class="nc"&gt;EnginePlugin&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;Optional&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;EngineFactory&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;getEngineFactory&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;IndexSettings&lt;/span&gt; &lt;span class="n"&gt;indexSettings&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Optional&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;MyCustomEngine&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  IngestPlugin: Custom Processors
&lt;/h3&gt;

&lt;p&gt;For plugins that process documents during ingestion:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyIngestPlugin&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;Plugin&lt;/span&gt; &lt;span class="kd"&gt;implements&lt;/span&gt; &lt;span class="nc"&gt;IngestPlugin&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Processor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Factory&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;getProcessors&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Processor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Parameters&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Collections&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;singletonMap&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="s"&gt;"my_processor"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
            &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;factories&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;MyIngestProcessor&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use it in a pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;PUT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/_ingest/pipeline/my_pipeline&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"processors"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"my_processor"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"field"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"content"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Real-World Example: The Search Relevance Plugin
&lt;/h2&gt;

&lt;p&gt;OpenSearch's own &lt;strong&gt;search-relevance plugin&lt;/strong&gt; demonstrates these concepts in action. It provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Custom query types for A/B testing search relevance&lt;/li&gt;
&lt;li&gt;Custom aggregations for metrics collection&lt;/li&gt;
&lt;li&gt;REST endpoints to manage experiments&lt;/li&gt;
&lt;li&gt;System indexes (prefixed with &lt;code&gt;.plugins-search-rel-&lt;/code&gt;) to store experiment state&lt;/li&gt;
&lt;li&gt;Concurrent search request deciders (OpenSearch 2.17+) for custom query execution strategies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The plugin is battle-tested in production, used by teams optimizing ranking and relevance across massive datasets.&lt;/p&gt;

&lt;h2&gt;
  
  
  System Indexes: How Plugins Store Their Own State
&lt;/h2&gt;

&lt;p&gt;Most non-trivial plugins need to persist data. Rather than requiring external storage, they use &lt;strong&gt;system indexes&lt;/strong&gt; within OpenSearch itself.&lt;/p&gt;

&lt;p&gt;System indexes are prefixed with &lt;code&gt;.plugins-&lt;/code&gt; or &lt;code&gt;.opendistro-&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.plugins-search-rel-&amp;lt;version&amp;gt;-experiments
.plugins-search-rel-&amp;lt;version&amp;gt;-notes
.plugins-ml-config
.opendistro-job-scheduler-lock
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The challenge: how do you evolve the schema without breaking existing deployments?&lt;/p&gt;

&lt;p&gt;OpenSearch plugins use a &lt;strong&gt;schema versioning pattern&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="no"&gt;SCHEMA_VERSION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"1"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;ensureIndexInitialized&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;indexExists&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;createIndex&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Object&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;indexMeta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;getIndexMeta&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;currentVersion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;indexMeta&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOrDefault&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"schema_version"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"0"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;currentVersion&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;equals&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;SCHEMA_VERSION&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;migrateSchema&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;currentVersion&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="no"&gt;SCHEMA_VERSION&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;migrateSchema&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;fromVersion&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;toVersion&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Use Put Mapping API to add new fields (additive only)&lt;/span&gt;
    &lt;span class="c1"&gt;// Never remove or change existing field types&lt;/span&gt;
    &lt;span class="n"&gt;putMapping&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;newFields&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Old documents coexist with new schema&lt;/li&gt;
&lt;li&gt;Upgrades are backwards compatible&lt;/li&gt;
&lt;li&gt;No downtime required for schema evolution&lt;/li&gt;
&lt;/ul&gt;
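&lt;p&gt;The migration logic above can be modeled outside the JVM. Here is a minimal Python sketch of the additive-only rule; the field names and per-version contents are invented for illustration, and plain dicts stand in for the index metadata and mappings:&lt;/p&gt;

```python
# Conceptual model of additive-only schema migration for a system index.
# Plain dicts stand in for index metadata and mappings; no client is used.

SCHEMA_VERSION = "2"

# Only additions are allowed between versions; existing fields are never
# removed or retyped, so documents written under old versions stay readable.
FIELDS_BY_VERSION = {
    "1": {"query": "text", "created_at": "date"},
    "2": {"query": "text", "created_at": "date", "judgment_score": "float"},
}

def migrate(index_meta, mappings):
    """Bring mappings up to SCHEMA_VERSION, adding new fields only."""
    current = index_meta.get("schema_version", "1")
    if current == SCHEMA_VERSION:
        return mappings
    for name, ftype in FIELDS_BY_VERSION[SCHEMA_VERSION].items():
        existing = mappings.get(name)
        if existing is not None and existing != ftype:
            raise ValueError(f"refusing to retype field {name!r}")
        mappings.setdefault(name, ftype)  # additive, like the Put Mapping API
    index_meta["schema_version"] = SCHEMA_VERSION
    return mappings

meta = {"schema_version": "1"}
maps = dict(FIELDS_BY_VERSION["1"])
migrate(meta, maps)
print(sorted(maps))  # judgment_score was added, nothing was removed
```

&lt;p&gt;The refusal to retype an existing field is the key invariant: it is what lets old and new documents coexist in the same index.&lt;/p&gt;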

&lt;h2&gt;
  
  
  Performance and Reliability Considerations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Startup Time
&lt;/h3&gt;

&lt;p&gt;Each plugin adds to startup time. Large plugins or plugins that do heavy initialization can slow cluster startup. Monitor this in production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Class Loader Memory
&lt;/h3&gt;

&lt;p&gt;Each plugin gets its own class loader, which holds its loaded classes in memory. More plugins mean a higher memory footprint, so keep the plugin count reasonable.&lt;/p&gt;

&lt;h3&gt;
  
  
  API Stability
&lt;/h3&gt;

&lt;p&gt;OpenSearch's plugin APIs are versioned with OpenSearch itself. When OpenSearch releases a major version, plugins must be recompiled and retested against it. This is by design: it ensures plugins stay compatible with core.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security
&lt;/h3&gt;

&lt;p&gt;Plugins run in the same JVM as OpenSearch core. A malicious or buggy plugin can crash the entire node. Only install plugins from trusted sources. In multi-tenant environments, consider network isolation or separate clusters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building Your Own Plugin: Where to Start
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Clone the plugin template:&lt;/strong&gt; OpenSearch provides the &lt;code&gt;plugin-template&lt;/code&gt; repository as a starting point&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement your extension point&lt;/strong&gt; (SearchPlugin, ActionPlugin, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write tests&lt;/strong&gt; - use OpenSearch's testing framework&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build the .zip&lt;/strong&gt; - &lt;code&gt;gradle build&lt;/code&gt; produces the artifact&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Install locally&lt;/strong&gt; - &lt;code&gt;./bin/opensearch-plugin install file://...&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test end-to-end&lt;/strong&gt; - verify your REST endpoint/query/aggregation works&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Publish&lt;/strong&gt; - host on artifact repository or GitHub Releases&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;OpenSearch plugins are not magic. They're well-structured Java code that hooks into OpenSearch via extension points. Understanding this architecture demystifies plugin behavior, helps you troubleshoot issues, and opens the door to building custom extensions.&lt;/p&gt;

&lt;p&gt;Whether you're optimizing search relevance, integrating with custom systems, or building observability tooling, the plugin architecture gives you the hooks you need without compromising core stability.&lt;/p&gt;

&lt;p&gt;The next time a plugin breaks after an upgrade, you'll know exactly where to look. And when you need to build one, you'll have a mental model of how the pieces fit together.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Want to explore further?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenSearch Plugin Developer Guide: &lt;a href="https://opensearch.org/docs/latest/plugins/intro/" rel="noopener noreferrer"&gt;https://opensearch.org/docs/latest/plugins/intro/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Plugin Template Repository: &lt;a href="https://github.com/opensearch-project/plugin-template" rel="noopener noreferrer"&gt;https://github.com/opensearch-project/plugin-template&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;I'm Prithvi S, Staff Software Engineer at Cloudera and open-source enthusiast. Follow my work on GitHub: &lt;a href="https://github.com/iprithv" rel="noopener noreferrer"&gt;https://github.com/iprithv&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>opensearch</category>
      <category>search</category>
      <category>database</category>
      <category>data</category>
    </item>
    <item>
      <title>Inverted Index Explained: How Elasticsearch Achieves Sub-Millisecond Search on Billions of Documents</title>
      <dc:creator>Prithvi S</dc:creator>
      <pubDate>Tue, 14 Apr 2026 00:32:45 +0000</pubDate>
      <link>https://dev.to/iprithv/inverted-index-explained-how-elasticsearch-achieves-sub-millisecond-search-on-billions-of-documents-3la6</link>
      <guid>https://dev.to/iprithv/inverted-index-explained-how-elasticsearch-achieves-sub-millisecond-search-on-billions-of-documents-3la6</guid>
      <description>&lt;p&gt;Imagine you're building a search feature for your product catalog. You have 10 million products, and you need to return relevant results in under 100 milliseconds. You decide to use PostgreSQL's full-text search, so you write:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;to_tsvector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;@@&lt;/span&gt; &lt;span class="n"&gt;plainto_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'wireless headphones'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It works. But then you get 100 million products. Then a billion. The queries crawl from 100ms to 5 seconds. Your users leave. Your boss asks why.&lt;/p&gt;

&lt;p&gt;The answer isn't "use a bigger database." The answer is "use a different data structure."&lt;/p&gt;

&lt;p&gt;Elasticsearch doesn't store data the way PostgreSQL does. It uses something called an &lt;strong&gt;inverted index&lt;/strong&gt;, and that one difference is why Elasticsearch can search a billion documents in 2-5 milliseconds while traditional databases take seconds.&lt;/p&gt;

&lt;p&gt;This post dives into how that magic works.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is an Inverted Index?
&lt;/h2&gt;

&lt;p&gt;Think of a book. At the back, there's an index:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Elasticsearch ... pages 45, 78, 120, 156
Performance ... pages 45, 89, 203
Database ... pages 12, 78, 200
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The index maps &lt;strong&gt;words to page numbers&lt;/strong&gt;. When you want to find information about "Performance," you look it up once and jump directly to those pages. You don't read every single page.&lt;/p&gt;

&lt;p&gt;That's the core idea of an inverted index.&lt;/p&gt;

&lt;p&gt;Now imagine instead of a book, you have documents. Your "index" maps &lt;strong&gt;terms to document IDs&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;elasticsearch&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;doc&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;doc&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;doc&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;doc&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;performance&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;doc&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;doc&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;doc&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;database&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;doc&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;doc&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;doc&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's "inverted" because it flips the relationship. A &lt;strong&gt;forward index&lt;/strong&gt; says "doc1 contains terms: elasticsearch, performance, scalability." An &lt;strong&gt;inverted index&lt;/strong&gt; says "term elasticsearch is in documents: 1, 3, 5, 8."&lt;/p&gt;

&lt;p&gt;Why does this matter? Because searching becomes trivially fast.&lt;/p&gt;

&lt;p&gt;When someone searches for "elasticsearch," Elasticsearch doesn't scan all documents. It looks up "elasticsearch" in the term dictionary once and gets back a list of document IDs. Done. One term lookup plus a single postings list traversal.&lt;/p&gt;
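&lt;p&gt;The whole idea fits in a few lines of Python. A toy inverted index, purely illustrative (a real engine stores compressed postings on disk):&lt;/p&gt;

```python
from collections import defaultdict

docs = {
    1: "elasticsearch is powerful",
    2: "postgres is a database",
    3: "elasticsearch is a search database",
}

# Build: map each term to the sorted list of doc IDs that contain it.
index = defaultdict(list)
for doc_id in sorted(docs):
    for term in set(docs[doc_id].split()):
        index[term].append(doc_id)

# Search: one dictionary lookup returns the postings list. No document
# is ever scanned at query time.
print(index["elasticsearch"])  # [1, 3]
print(index["database"])       # [2, 3]
```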

&lt;h2&gt;
  
  
  Under the Hood: How Elasticsearch Builds and Uses Inverted Indices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Text Analysis (Before Indexing)
&lt;/h3&gt;

&lt;p&gt;Before a document gets indexed, its text goes through an analyzer pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"settings"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"analysis"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"analyzer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"custom_analyzer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"custom"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"char_filter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"html_strip"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"tokenizer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"standard"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"filter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"lowercase"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"stop"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"porter_stem"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Removes HTML tags&lt;/strong&gt; (character filter)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Splits text into tokens&lt;/strong&gt; (tokenizer): "Elasticsearch is powerful" becomes ["Elasticsearch", "is", "powerful"]&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lowercases tokens&lt;/strong&gt; (filter): ["elasticsearch", "is", "powerful"]&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Removes stop words&lt;/strong&gt; (filter): ["elasticsearch", "powerful"]&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stems words&lt;/strong&gt; (filter): ["elasticsearch", "power"]&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now "powerful" and "powers" both map to the same root "power," so a search for "power" finds both.&lt;/p&gt;

&lt;p&gt;The analyzer is completely customizable. For medical documents, you might preserve technical terms. For e-commerce, you might add synonym expansion (so "laptop" matches "notebook").&lt;/p&gt;
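&lt;p&gt;A stripped-down version of that pipeline in Python. The stop list and suffix rules here are tiny stand-ins for the real &lt;code&gt;stop&lt;/code&gt; and &lt;code&gt;porter_stem&lt;/code&gt; filters, and the &lt;code&gt;html_strip&lt;/code&gt; char filter is omitted for brevity:&lt;/p&gt;

```python
import re

STOP_WORDS = {"is", "a", "the", "and"}

def stem(token):
    """Toy stemmer: two suffix rules standing in for the full Porter algorithm."""
    for suffix in ("ful", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    tokens = re.findall(r"\w+", text)                    # tokenizer
    tokens = [t.lower() for t in tokens]                 # lowercase filter
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word filter
    return [stem(t) for t in tokens]                     # stemming filter

print(analyze("Elasticsearch is powerful"))  # ['elasticsearch', 'power']
```

&lt;p&gt;Both indexing and querying run through the same pipeline, which is why a query for "power" can match a document that only ever said "powerful."&lt;/p&gt;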

&lt;h3&gt;
  
  
  Step 2: Segment Creation (The Immutable Index)
&lt;/h3&gt;

&lt;p&gt;Here's where Elasticsearch gets clever. Instead of maintaining one large mutable index, it creates immutable &lt;strong&gt;segments&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When documents arrive:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;They sit in an in-memory buffer&lt;/li&gt;
&lt;li&gt;Every ~1 second (refresh interval), the buffer flushes to disk as a new segment&lt;/li&gt;
&lt;li&gt;Each segment is an inverted index, but immutable&lt;/li&gt;
&lt;li&gt;Multiple segments are searched in parallel&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Why immutable? Because it's fast. You never have to lock or rebalance. You just append new segments. If a crash happens mid-write, you have the translog to recover from.&lt;/p&gt;

&lt;p&gt;Here's what a tiny two-document segment looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;INVERTED INDEX (Segment 1):

Term          | Postings List (Doc IDs)
--------------|----------------------
elasticsearch | [1, 2]
powerful      | [1]
scales        | [2]
horizontally  | [2]

DOCUMENT STORE:
Doc 1: "elasticsearch is powerful"
Doc 2: "elasticsearch scales horizontally"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Term Lookup (Lightning Fast)
&lt;/h3&gt;

&lt;p&gt;When you search for "elasticsearch," here's what happens:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The query arrives at a coordinator node&lt;/li&gt;
&lt;li&gt;It broadcasts the query to all relevant shards&lt;/li&gt;
&lt;li&gt;Each shard performs a &lt;strong&gt;binary search&lt;/strong&gt; on the sorted terms in its segments&lt;/li&gt;
&lt;li&gt;Found "elasticsearch"? Return the postings list: [1, 2]&lt;/li&gt;
&lt;li&gt;Fetch those documents from the document store&lt;/li&gt;
&lt;li&gt;Return to coordinator, which merges results from all shards&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The magic: &lt;strong&gt;binary search on sorted terms is O(log N)&lt;/strong&gt;. On a million terms, that's ~20 comparisons. Then you get the postings list and you're done.&lt;/p&gt;
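&lt;p&gt;The per-segment lookup in step 3 is ordinary binary search. A sketch in Python with the term dictionary as a plain sorted list (Lucene's actual term index is a more compact FST structure, but the complexity argument is the same):&lt;/p&gt;

```python
import bisect

# One segment: a sorted term dictionary plus parallel postings lists.
terms =    ["database", "elasticsearch", "horizontally", "powerful", "scales"]
postings = [[3, 4, 9],  [1, 2],          [2],            [1],        [2]]

def lookup(term):
    """O(log N) binary search over the sorted term dictionary."""
    i = bisect.bisect_left(terms, term)
    if i != len(terms) and terms[i] == term:
        return postings[i]
    return []

print(lookup("elasticsearch"))  # [1, 2]
print(lookup("missing"))        # []
```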

&lt;h3&gt;
  
  
  Step 4: Segment Merging (Background Optimization)
&lt;/h3&gt;

&lt;p&gt;Over time, you accumulate many small segments. Searching 100 segments is slower than searching 1 large segment. So Elasticsearch periodically merges them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Segment 1 (1000 docs) + Segment 2 (1000 docs) -&amp;gt; Merged Segment (2000 docs)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The merge is invisible to you. It happens in the background. Old segments are deleted. The new merged segment is searched going forward.&lt;/p&gt;
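&lt;p&gt;Conceptually, a merge is just a sorted-list union that also drops tombstoned documents for good. A Python sketch (the segment contents are invented for illustration):&lt;/p&gt;

```python
import heapq

# Two immutable segments: term -> sorted postings list.
seg1 = {"elasticsearch": [1, 2], "powerful": [1]}
seg2 = {"elasticsearch": [5, 8], "scales": [5]}
deleted = {2}  # doc 2 was marked deleted; the merge physically removes it

def merge_segments(a, b, tombstones):
    merged = {}
    for term in sorted(set(a) | set(b)):
        # heapq.merge walks both sorted postings lists in one pass.
        combined = heapq.merge(a.get(term, []), b.get(term, []))
        merged[term] = [d for d in combined if d not in tombstones]
    return merged

print(merge_segments(seg1, seg2, deleted))
# {'elasticsearch': [1, 5, 8], 'powerful': [1], 'scales': [5]}
```

&lt;p&gt;Because both inputs are immutable and sorted, the merge needs no locks: it streams through the old segments and writes a fresh one.&lt;/p&gt;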

&lt;h2&gt;
  
  
  Why Inverted Index Beats Traditional Databases
&lt;/h2&gt;

&lt;p&gt;Let's compare searching 1 billion documents for "elasticsearch":&lt;/p&gt;

&lt;h3&gt;
  
  
  PostgreSQL with Full-Text Search
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;to_tsvector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;@@&lt;/span&gt; &lt;span class="n"&gt;plainto_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'elasticsearch'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the problem:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;PostgreSQL's default B-tree indexes are built for point lookups and range scans over ordered keys, not for "which rows contain this word?"&lt;/li&gt;
&lt;li&gt;Without a dedicated GIN index, the query above recomputes &lt;code&gt;to_tsvector&lt;/code&gt; for every row in a sequential scan&lt;/li&gt;
&lt;li&gt;Even with a GIN index (itself a form of inverted index), for "elasticsearch" the database must:

&lt;ul&gt;
&lt;li&gt;Find all occurrences of the term&lt;/li&gt;
&lt;li&gt;Fetch and recheck the matching rows from the table heap&lt;/li&gt;
&lt;li&gt;Compute relevance (&lt;code&gt;ts_rank&lt;/code&gt;) row by row, with no help from the index&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;On a billion products, this takes &lt;strong&gt;3-10 seconds&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Elasticsearch
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;GET /products/_search
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"query"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"match"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="s2"&gt;"title"&lt;/span&gt;: &lt;span class="s2"&gt;"elasticsearch"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Elasticsearch:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Looks up "elasticsearch" in the inverted index (binary search, ~30 comparisons)&lt;/li&gt;
&lt;li&gt;Gets back a postings list&lt;/li&gt;
&lt;li&gt;Fetches the top 10 documents&lt;/li&gt;
&lt;li&gt;Returns results&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Time: &lt;strong&gt;2-5 milliseconds&lt;/strong&gt; on a properly sized cluster.&lt;/p&gt;

&lt;p&gt;The difference: inverted indices are designed specifically for text search. B-trees are not.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Trade-Off: Updates vs Queries
&lt;/h2&gt;

&lt;p&gt;But inverted indices have a cost: updates are expensive.&lt;/p&gt;

&lt;p&gt;When you update a document in Elasticsearch:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The old document is marked for deletion&lt;/li&gt;
&lt;li&gt;A new document is indexed (goes through analysis, creates new segment)&lt;/li&gt;
&lt;li&gt;A merge eventually removes the deleted document&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This takes milliseconds to seconds, not microseconds. Search visibility is near-real-time: the updated document shows up after the next refresh, not instantly.&lt;/p&gt;

&lt;p&gt;In PostgreSQL, you just UPDATE a row; the change is visible as soon as the transaction commits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;So when do you use each?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL:&lt;/strong&gt; Transactional workloads, frequent updates, complex joins&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Elasticsearch:&lt;/strong&gt; Text search, logs, analytics, observability&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Real-World Performance Numbers
&lt;/h2&gt;

&lt;p&gt;Here are representative numbers from production-scale clusters (exact figures vary with hardware, mappings, and query mix):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Elasticsearch&lt;/th&gt;
&lt;th&gt;PostgreSQL&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Search 1M docs&lt;/td&gt;
&lt;td&gt;2-3ms&lt;/td&gt;
&lt;td&gt;100-200ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Search 1B docs&lt;/td&gt;
&lt;td&gt;5-10ms&lt;/td&gt;
&lt;td&gt;3-10s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aggregation (cardinality)&lt;/td&gt;
&lt;td&gt;5-20ms&lt;/td&gt;
&lt;td&gt;500ms-2s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Index throughput&lt;/td&gt;
&lt;td&gt;100K docs/sec&lt;/td&gt;
&lt;td&gt;10K-50K docs/sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory per 50GB data&lt;/td&gt;
&lt;td&gt;8-16GB (compressed)&lt;/td&gt;
&lt;td&gt;50GB+ (uncompressed)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The compression factor is huge: Elasticsearch's inverted index compresses 3-5x tighter than raw JSON because terms are deduplicated and encoded efficiently.&lt;/p&gt;
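&lt;p&gt;Part of that compression comes from how postings lists are stored: doc IDs are sorted, so the engine can store the gaps between them and encode each gap in as few bytes as possible. A sketch of delta plus variable-byte encoding (a simplification; Lucene's codecs also apply block-level bit packing):&lt;/p&gt;

```python
def vbyte_encode(doc_ids):
    """Delta-encode sorted doc IDs, then variable-byte encode each gap."""
    out = bytearray()
    prev = 0
    for n in doc_ids:
        gap = n - prev
        prev = n
        while gap >= 128:        # 7 payload bits per byte, low bytes first
            out.append(gap % 128)
            gap //= 128
        out.append(gap + 128)    # high bit set marks the final byte of a gap
    return bytes(out)

postings = [5, 117, 118, 2000, 2100]
encoded = vbyte_encode(postings)
# Five 64-bit integers would take 40 bytes raw; the gaps fit in 6 bytes here.
print(len(encoded))  # 6
```

&lt;p&gt;Small gaps are the common case in real postings lists, which is exactly why gap-plus-varint encoding pays off.&lt;/p&gt;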

&lt;h2&gt;
  
  
  Relevance Scoring with BM25
&lt;/h2&gt;

&lt;p&gt;Now that we can find documents fast, the next question is: which results should be first?&lt;/p&gt;

&lt;p&gt;Elasticsearch uses &lt;strong&gt;BM25&lt;/strong&gt;, a ranking function from the probabilistic relevance framework:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BM25&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;term_frequency&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;document_length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inverse_document_frequency&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In plain English:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Term frequency:&lt;/strong&gt; how many times does "elasticsearch" appear in the document? (more = higher score)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inverse document frequency:&lt;/strong&gt; how rare is "elasticsearch"? (rare terms like "llama-index" rank higher than common terms like "the")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document length normalization:&lt;/strong&gt; prevent long documents from always ranking highest&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So if you search "elasticsearch performance," a document mentioning "elasticsearch" 5 times and "performance" 3 times ranks higher than a document mentioning each once.&lt;/p&gt;
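&lt;p&gt;The formula is compact enough to write out. A self-contained sketch of Lucene-style BM25 scoring for a single term, using the standard defaults &lt;code&gt;k1=1.2&lt;/code&gt; and &lt;code&gt;b=0.75&lt;/code&gt; (a model of the scoring behavior, not Elasticsearch's exact internals):&lt;/p&gt;

```python
import math

def bm25(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    # Rarer terms get a larger idf; length normalization penalizes long docs.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1)) / (tf + norm)

# Same corpus stats, different term frequencies: more mentions score higher,
# but with diminishing returns (tf saturation).
one  = bm25(tf=1, doc_len=100, avg_doc_len=100, n_docs=1_000_000, doc_freq=50)
five = bm25(tf=5, doc_len=100, avg_doc_len=100, n_docs=1_000_000, doc_freq=50)
print(five > one)      # True: repeated terms help...
print(five > 5 * one)  # False: ...but sub-linearly
```

&lt;p&gt;The saturation is deliberate: a page that repeats "elasticsearch" fifty times shouldn't bury a page that mentions it five times in useful context.&lt;/p&gt;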

&lt;p&gt;You can customize this with field boosting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"bool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"must"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"match"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"elasticsearch"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"boost"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"match"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"body"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"elasticsearch"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now matches in the title count twice as much. Perfect for building relevant search experiences.&lt;/p&gt;
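&lt;p&gt;If you want to verify how a boost changes the math, the search API can return a scoring breakdown via the &lt;code&gt;explain&lt;/code&gt; parameter. A quick sketch (the index name is just an illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;POST /articles/_search
{
  "explain": true,
  "query": {
    "match": { "title": { "query": "elasticsearch", "boost": 2 } }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Each hit then carries an &lt;code&gt;_explanation&lt;/code&gt; tree showing the BM25 components and where the boost multiplier was applied.&lt;/p&gt;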

&lt;h2&gt;
  
  
  Common Mistakes (And How to Avoid Them)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Mistake 1: Too Many Shards
&lt;/h3&gt;

&lt;p&gt;You have 1GB of data and create 100 shards. Each shard has 10MB.&lt;/p&gt;

&lt;p&gt;Problem: search latency goes through the roof because you're coordinating across 100 shards, and overhead dominates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule of thumb:&lt;/strong&gt; aim for 10-50GB per shard. If you have 1TB of data, 20-100 shards is reasonable.&lt;/p&gt;
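&lt;p&gt;Shard count is fixed at index creation (short of reindexing or the shrink/split APIs), so size it up front. A sketch for roughly 1TB at ~50GB per shard (the index name is just an illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PUT /logs-2026
{
  "settings": {
    "index": {
      "number_of_shards": 20,
      "number_of_replicas": 1
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;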

&lt;h3&gt;
  
  
  Mistake 2: Ignoring Refresh Interval
&lt;/h3&gt;

&lt;p&gt;You index a document and try to search it immediately. Nothing.&lt;/p&gt;

&lt;p&gt;That's because the default refresh interval is 1 second. Your data sits in the buffer for up to 1 second before becoming searchable.&lt;/p&gt;

&lt;p&gt;For near-real-time search, you might lower this to 100ms. But each refresh creates a new segment, and merging costs CPU.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Balance:&lt;/strong&gt; 500ms-1s covers most use cases. Lower it only for truly real-time systems, and consider raising it (30s or more) for write-heavy workloads like log ingestion, where indexing throughput matters more than instant visibility.&lt;/p&gt;
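&lt;p&gt;&lt;code&gt;refresh_interval&lt;/code&gt; is a dynamic setting, so you can tune it per index at any time without reindexing. A sketch (the index name is just an illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PUT /logs-2026/_settings
{
  "index": {
    "refresh_interval": "1s"
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;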

&lt;h3&gt;
  
  
  Mistake 3: Bad Analyzer Configuration
&lt;/h3&gt;

&lt;p&gt;You don't configure an analyzer, so Elasticsearch uses the default standard analyzer.&lt;/p&gt;

&lt;p&gt;Now when users search "machine learning", they miss every document that only says "ML", because the standard analyzer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;lowercases and splits on word boundaries, so "AWS S3" becomes "aws" and "s3" (fine)&lt;/li&gt;
&lt;li&gt;has no idea that "machine learning" and "ml" mean the same thing (the real problem)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A custom analyzer with synonym expansion fixes this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"filter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"synonyms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"synonym"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"synonyms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"AWS S3 =&amp;gt; s3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"machine learning =&amp;gt; ml"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
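&lt;p&gt;The snippet above is only the filter definition; to take effect it has to be referenced from an analyzer and mapped onto a field. A minimal sketch (the index, analyzer, and field names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PUT /articles
{
  "settings": {
    "analysis": {
      "filter": {
        "synonyms": {
          "type": "synonym",
          "synonyms": ["AWS S3 =&amp;gt; s3", "machine learning =&amp;gt; ml"]
        }
      },
      "analyzer": {
        "synonym_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "synonyms"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "body": { "type": "text", "analyzer": "synonym_analyzer" }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;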



&lt;h2&gt;
  
  
  Conclusion: Why Inverted Index Matters
&lt;/h2&gt;

&lt;p&gt;The inverted index is deceptively simple: a mapping from terms to document IDs. But this simple data structure enables Elasticsearch to do what traditional databases struggle with: search billions of documents in milliseconds.&lt;/p&gt;

&lt;p&gt;The key insights:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Inverted index is designed for text search&lt;/strong&gt;, not general-purpose queries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Immutable segments&lt;/strong&gt; enable fast, lockless indexing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Binary search on terms&lt;/strong&gt; makes lookup blazing fast&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BM25 scoring&lt;/strong&gt; automatically ranks results by relevance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The trade-off:&lt;/strong&gt; fast reads, slower updates. Worth it for search workloads&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you're building a search feature, a logging system, or an observability platform, understanding how Elasticsearch works under the hood will save you from common mistakes and help you build systems that scale.&lt;/p&gt;

&lt;p&gt;Next step? Learn how to scale Elasticsearch horizontally with sharding, tune refresh and flush intervals for your workload, and customize analyzers for your domain.&lt;/p&gt;

&lt;p&gt;Happy searching.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html" rel="noopener noreferrer"&gt;Elasticsearch Inverted Index Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Okapi_BM25" rel="noopener noreferrer"&gt;BM25 Algorithm Explained&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/elastic/elasticsearch" rel="noopener noreferrer"&gt;GitHub: Elasticsearch Source Code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-search-speed.html" rel="noopener noreferrer"&gt;How to Tune Elasticsearch for Your Workload&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;About the author: I'm Prithvi S, Staff Software Engineer at Cloudera and an open-source enthusiast. Follow my work on GitHub: &lt;a href="https://github.com/iprithv" rel="noopener noreferrer"&gt;https://github.com/iprithv&lt;/a&gt;&lt;/p&gt;

</description>
      <category>elasticsearch</category>
      <category>search</category>
      <category>database</category>
      <category>analytics</category>
    </item>
  </channel>
</rss>
