Alexey Vidanov for AWS Community Builders

Posted on Nov 26 • Originally published at Medium

Language Aggregation in OpenSearch: Selecting One Document Per Group by Language Preference

#opensearch #elasticsearch #search #tutorial

Multilingual content is common in documentation systems, product catalogs, and knowledge bases. When the same item exists in several languages, search results often become cluttered with multiple versions of the same document.

A typical requirement is to return one document per content group, chosen using a language preference order such as de > en > fr.

This blog post presents a practical pattern for this kind of language-aware selection. It relies only on features that ship with the open-source OpenSearch project and are supported in Amazon OpenSearch Service, so it works on both self-managed clusters and AWS-managed environments.

The Problem

If an article exists in German, English, and French, a standard search will return all three. You want:

One hit per crossLanguageGroup
The language with the highest user preference
Deterministic, predictable selection

Simple deduplication does not work because it keeps an arbitrary member of each group. Here, a ranking rule applied across the group must decide which translation survives.

Solution Overview

The solution relies on three capabilities:

Field Collapse: groups all translations of the same document.
Scripted Sort: applies an explicit language ranking.
Keyword Fields: enable efficient sorting and scripting on language arrays.

Workflow

Figure: Multi-language document search workflow using collapse functionality. The process reduces 6 duplicate documents across German, English, and French to 3 results by grouping cross-language versions and applying language preference ranking.

Index Setup

PUT tmp_multi_lang
{
  "mappings": {
    "properties": {
      "crossLanguageGroup": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword" }}
      },
      "languages": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword" }}
      },
      "title": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword" }}
      },
      "content": { "type": "text" }
    }
  }
}

Notes:

crossLanguageGroup stores the logical ID shared by all translations of the same item. Every language variant uses the same value, so collapse can group them reliably.
languages is an array of ISO language codes (e.g., ["de", "en"]). Using an array lets the script evaluate multiple languages if needed.
.keyword fields are essential because text fields are analyzed. The analyzer tokenizes and lowercases values, which breaks exact matching, and text fields do not support sorting, collapsing, or doc-value access in scripts by default.
The .keyword subfield stores the raw, untouched value, enabling:
- deterministic sorting
- exact matches (e.g., term queries)
- using values inside Painless scripts

Without .keyword, collapse and scripted sorting would not work correctly.
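To make the difference between the analyzed and the raw value more concrete, here is a small Python sketch. It is only a toy model: toy_analyze is a rough stand-in for what an analyzer does to a text field (real analyzers do more), and the ID "Group-ABC-123" is a made-up value, not one from the sample data.

```python
import re

def toy_analyze(value):
    # Rough stand-in for an analyzed text field: lowercase the value and
    # split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", value.lower())

def toy_keyword(value):
    # A keyword field keeps the value exactly as it was indexed.
    return value

print(toy_analyze("Group-ABC-123"))  # ['group', 'abc', '123']
print(toy_keyword("Group-ABC-123"))  # Group-ABC-123

# Two different (hypothetical) group IDs share tokens once analyzed...
shared = sorted(set(toy_analyze("Group-ABC-123")) & set(toy_analyze("Group-ABC-456")))
print(shared)  # ['abc', 'group']

# ...but stay distinct as keywords, which is what collapse needs.
print(toy_keyword("Group-ABC-123") == toy_keyword("Group-ABC-456"))  # False
```

The point is only that a single, untouched value per document gives collapse and sorting something stable to compare.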

Sample Data

Load with POST tmp_multi_lang/_bulk using NDJSON:

{ "index": {} }
{ "crossLanguageGroup": "abc123", "languages": ["de"], "title": "Willkommen", "content": "Dies ist eine Einführung. Source: Reply" }
{ "index": {} }
{ "crossLanguageGroup": "abc123", "languages": ["en"], "title": "Welcome", "content": "This is an introduction. Source: Reply" }
{ "index": {} }
{ "crossLanguageGroup": "abc123", "languages": ["fr"], "title": "Bienvenue", "content": "Ceci est une introduction. Source: Reply" }

{ "index": {} }
{ "crossLanguageGroup": "xyz789", "languages": ["en"], "title": "Search", "content": "How to search in OpenSearch. Source: Reply" }
{ "index": {} }
{ "crossLanguageGroup": "xyz789", "languages": ["fr"], "title": "Recherche", "content": "Comment chercher dans OpenSearch. Source: Reply" }

{ "index": {} }
{ "crossLanguageGroup": "def456", "languages": ["de"], "title": "Produktübersicht", "content": "Dies ist eine Produktbeschreibung." }
{ "index": {} }
{ "crossLanguageGroup": "def456", "languages": ["en"], "title": "Product Overview", "content": "This is a product description." }
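If the documents live in application code rather than in a file, the NDJSON body can be generated. The following is a minimal sketch using only the Python standard library; build_bulk_body is a hypothetical helper name, not part of any OpenSearch client, and only two of the sample documents are shown.

```python
import json

def build_bulk_body(docs):
    """Hypothetical helper: render documents as an NDJSON _bulk body."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {}}))            # action line
        lines.append(json.dumps(doc, ensure_ascii=False))  # document line
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"

docs = [
    {"crossLanguageGroup": "abc123", "languages": ["de"], "title": "Willkommen",
     "content": "Dies ist eine Einführung. Source: Reply"},
    {"crossLanguageGroup": "abc123", "languages": ["en"], "title": "Welcome",
     "content": "This is an introduction. Source: Reply"},
]

body = build_bulk_body(docs)
print(body)
```

The resulting string can be sent as the request body of POST tmp_multi_lang/_bulk with whatever HTTP client you already use.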

The Core Query

POST tmp_multi_lang/_search
{
  "query": {
    "match": { "content": "Reply" }
  },
  "collapse": {
    "field": "crossLanguageGroup.keyword"
  },
  "sort": [
    {
      "_script": {
        "type": "number",
        "order": "asc",
        "script": {
          "lang": "painless",
          "params": {
            "lang_order": { "de": 0, "en": 1, "fr": 2 }
          },
          "source": """
            int best = 100;
            def order = params.lang_order;
            if (doc.containsKey('languages.keyword')) {
              for (def l : doc['languages.keyword']) {
                if (order.containsKey(l)) {
                  int ord = (int) order.get(l);
                  if (ord < best) { best = ord; }
                }
              }
            }
            return best;
          """
        }
      }
    },
    { "_score": "desc" }
  ]
}
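Because the preference order is passed in through params, the same request can be generated for any user. Here is a sketch of one way to do that in Python; build_language_query is a hypothetical helper, and it assumes the preference list is ordered from most to least preferred.

```python
# Same Painless source as in the query above.
PAINLESS_SOURCE = """
int best = 100;
def order = params.lang_order;
if (doc.containsKey('languages.keyword')) {
  for (def l : doc['languages.keyword']) {
    if (order.containsKey(l)) {
      int ord = (int) order.get(l);
      if (ord < best) { best = ord; }
    }
  }
}
return best;
"""

def build_language_query(text, preferences):
    """Hypothetical helper: build the search body for a preference list."""
    # Position in the list becomes the rank: the first entry gets 0.
    lang_order = {lang: rank for rank, lang in enumerate(preferences)}
    return {
        "query": {"match": {"content": text}},
        "collapse": {"field": "crossLanguageGroup.keyword"},
        "sort": [
            {"_script": {
                "type": "number",
                "order": "asc",
                "script": {
                    "lang": "painless",
                    "params": {"lang_order": lang_order},
                    "source": PAINLESS_SOURCE,
                },
            }},
            {"_score": "desc"},
        ],
    }

body = build_language_query("Reply", ["de", "en", "fr"])
print(body["sort"][0]["_script"]["script"]["params"])
# {'lang_order': {'de': 0, 'en': 1, 'fr': 2}}
```

Keeping the script source constant and changing only params also means the script text stays identical across requests, whatever the user's preferences are.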

How It Works

Query Stage

Matches all documents containing "Reply".

Collapse Stage

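Groups documents by crossLanguageGroup.keyword and keeps only the top-sorted hit of each group, so the sort order decides which translation is returned.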

Script Sort Stage

Iterates the languages array
Checks each language in the priority map
Selects the lowest value (best match)
Uses 100 as fallback
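The four steps above can be mirrored in a few lines of Python. This is a port for illustration only, not code that runs inside OpenSearch; best_rank is a name chosen here, and the language code "it" is just an example of a language missing from the priority map.

```python
FALLBACK_RANK = 100  # same fallback value as in the Painless script

def best_rank(languages, lang_order, fallback=FALLBACK_RANK):
    """Return the lowest (best) rank among a document's languages."""
    best = fallback
    for lang in languages:
        # Only languages present in the priority map can improve the rank.
        if lang in lang_order and lang_order[lang] < best:
            best = lang_order[lang]
    return best

lang_order = {"de": 0, "en": 1, "fr": 2}
print(best_rank(["en"], lang_order))        # 1
print(best_rank(["fr", "de"], lang_order))  # 0
print(best_rank(["it"], lang_order))        # 100
print(best_rank([], lang_order))            # 100
```

A document tagged with several languages is ranked by its best one, and documents with no known language sort last because of the fallback.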

Tie Breaking

If two documents share the same priority, _score decides.

Example Result

With the sample data above, the query matches five documents (the def456 documents do not contain "Reply"):

abc123/de, abc123/en, abc123/fr, xyz789/en, xyz789/fr

After collapse with de > en > fr:

abc123/de, xyz789/en
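To check the selection logic without a cluster, the three stages can be simulated over the sample data from this post. The sketch below is a rough approximation: the match query is replaced by a case-insensitive substring test, there is no relevance scoring, and collapse is modeled as "keep the first hit per group after sorting".

```python
# Sample documents from this post as (group, languages, content).
DOCS = [
    ("abc123", ["de"], "Dies ist eine Einführung. Source: Reply"),
    ("abc123", ["en"], "This is an introduction. Source: Reply"),
    ("abc123", ["fr"], "Ceci est une introduction. Source: Reply"),
    ("xyz789", ["en"], "How to search in OpenSearch. Source: Reply"),
    ("xyz789", ["fr"], "Comment chercher dans OpenSearch. Source: Reply"),
    ("def456", ["de"], "Dies ist eine Produktbeschreibung."),
    ("def456", ["en"], "This is a product description."),
]
LANG_ORDER = {"de": 0, "en": 1, "fr": 2}

def rank(languages):
    # Lowest rank among the document's languages, 100 if none is known.
    return min((LANG_ORDER.get(lang, 100) for lang in languages), default=100)

def search(term):
    # Query stage: crude stand-in for the match query (no scoring).
    hits = [d for d in DOCS if term.lower() in d[2].lower()]
    # Sort stage: ascending language rank. Python's sort is stable, so ties
    # keep input order here; the real query breaks ties by _score.
    hits.sort(key=lambda d: rank(d[1]))
    # Collapse stage: keep only the first (top-sorted) hit per group.
    seen, results = set(), []
    for group, languages, _ in hits:
        if group not in seen:
            seen.add(group)
            results.append(f"{group}/{languages[0]}")
    return results

print(search("Reply"))  # ['abc123/de', 'xyz789/en']
```

The xyz789 group has no German version, so its English document wins, which is exactly the fallback behavior the preference order is meant to provide.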

Why This Works

Keyword fields expose doc values, making sorting fast and predictable.
Scripted sort applies a strict language hierarchy, not a soft boost.
Collapse returns one hit per matching crossLanguageGroup: the top-sorted document of that group.
Painless correctly handles multi-value arrays like languages.

Everything works together to deliver deterministic, language-aware selection.

Common Errors and Fixes

1. "Text fields are not optimised for operations"

Use .keyword:

doc['languages.keyword']

2. "unknown field [lang]"

lang must be inside the script object, next to source and params, not directly under _script.

3. Casting errors

Use explicit casting:

int ord = (int) order.get(l);

4. "Illegal list shortcut value [values]"

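This typically appears when the script accesses doc['languages.keyword'].values. Iterate over the doc values directly instead: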

for (def l : doc['languages.keyword']) { ... }

Alternative Approach: Score-Based Selection

This method uses query-time boosts to favor certain languages rather than enforcing a strict order. Each language gets a different boost on an optional should clause: German 3, English 2, French 1.

When OpenSearch calculates the score, documents that match higher-boosted languages naturally rise to the top.

After scoring, collapse picks the highest-scoring document per crossLanguageGroup.

What this means in practice

If a group has de, en, and fr, the German version usually wins because it has the highest boost.
But if the English document has stronger text relevance, its score may exceed the German one.
The boosts add to the full-text score, so the effect is soft preference, not a strict ranking.

Good fit: simple setups where speed matters and minor inconsistencies are acceptable.

Not ideal: cases requiring deterministic de > en > fr without exceptions.

POST tmp_multi_lang/_search
{
  "query": {
    "bool": {
      "must": { "match": { "content": "Reply" } },
      "should": [
        { "term": { "languages.keyword": { "value": "de", "boost": 3 } } },
        { "term": { "languages.keyword": { "value": "en", "boost": 2 } } },
        { "term": { "languages.keyword": { "value": "fr", "boost": 1 } } }
      ]
    }
  },
  "collapse": { "field": "crossLanguageGroup.keyword" },
  "sort": [ { "_score": "desc" } ]
}

Pros: fast and simple

Cons: boosts are additive, not strict priority
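The soft-preference effect can be shown with a toy calculation. The model below is a deliberate simplification and an assumption made for illustration: it treats the final score as text relevance plus a fixed per-language bonus, which is not OpenSearch's actual scoring formula, and the relevance numbers are invented.

```python
# Toy model (assumption): final score = text relevance + language bonus.
BOOSTS = {"de": 3.0, "en": 2.0, "fr": 1.0}

def pick_by_score(candidates):
    """candidates: list of (language, text_score) within one group."""
    return max(candidates, key=lambda c: c[1] + BOOSTS.get(c[0], 0.0))[0]

# Similar text relevance: the German bonus decides.
print(pick_by_score([("de", 1.0), ("en", 1.2), ("fr", 0.9)]))  # de

# Much stronger English relevance outweighs the one-point bonus gap.
print(pick_by_score([("de", 1.0), ("en", 2.5), ("fr", 0.9)]))  # en
```

In the second case the English document wins even though German is preferred, which is the kind of exception the scripted sort rules out.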

Conclusion

Language-aware document aggregation in OpenSearch is solved cleanly by combining collapse, a Painless script-based sort, and keyword-backed language fields. Script sorting provides reliable, deterministic selection, while score-based boosting offers a faster but less strict alternative.

This pattern is useful anywhere multilingual data creates noise in search results. By grouping documents, applying a clear language hierarchy, and keeping sorting deterministic, teams can deliver cleaner UX across documentation portals, product catalogs, and knowledge bases. It also works well with personalization and dynamic language preferences.

If you want to build cleaner multilingual search or explore how to apply language-aware ranking in Amazon OpenSearch Service, feel free to reach out.

At Reply, we help teams design scalable, predictable, and user-centred search workflows — from proof of concept to fully aligned production deployments.
