Multilingual content is common in documentation systems, product catalogs, and knowledge bases. When the same item exists in several languages, search results often become cluttered with multiple versions of the same document.
A typical requirement is to return one document per content group, chosen using a language preference order such as de > en > fr.
This blog post presents a practical pattern for handling language aggregation. The approach relies only on features of the open-source OpenSearch project that are also supported in Amazon OpenSearch Service, making it suitable for both self-managed clusters and AWS-managed environments.
The Problem
If an article exists in German, English, and French, a standard search will return all three. You want:
- One hit per crossLanguageGroup
- The language with the highest user preference
- Deterministic, predictable selection
Simple deduplication does not work because you must apply a ranking rule across the group.
Solution Overview
The solution relies on three capabilities:
- Field Collapse: Groups all translations of the same document.
- Scripted Sort: Applies an explicit language ranking.
- Keyword Fields: Enable efficient sorting and scripting on language arrays.
Workflow
Figure: Multi-language document search workflow using collapse functionality. The process reduces 6 duplicate documents across German, English, and French to 3 results by grouping cross-language versions and applying language preference ranking.
Index Setup
PUT tmp_multi_lang
{
  "mappings": {
    "properties": {
      "crossLanguageGroup": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword" }}
      },
      "languages": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword" }}
      },
      "title": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword" }}
      },
      "content": { "type": "text" }
    }
  }
}
Notes:
- crossLanguageGroup stores the logical ID shared by all translations of the same item. Every language variant uses the same value, so collapse can group them reliably.
- languages is an array of ISO language codes (e.g., ["de", "en"]). Using an array lets the script evaluate multiple languages if needed.
- .keyword fields are essential because text fields are analyzed. The analyzer splits or lowercases values, which breaks exact matching and makes sorting impossible.
- The .keyword subfield stores the raw, untouched value, enabling:
  - deterministic sorting
  - exact matches (e.g., term queries)
  - using values inside Painless scripts
Without .keyword, collapse and scripted sorting would not work correctly.
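To make the difference between an analyzed text field and its .keyword subfield concrete, here is a small Python sketch. It is only a toy approximation: toy_standard_analyzer is a hypothetical stand-in that lowercases and splits on non-word characters, not the real OpenSearch analyzer, and the group ID is made up for illustration.

```python
import re


def toy_standard_analyzer(value: str) -> list[str]:
    """Toy stand-in for a text analyzer: lowercase, then split on non-word characters."""
    return [token for token in re.split(r"\W+", value.lower()) if token]


def keyword_value(value: str) -> str:
    """A keyword field keeps the whole value as one unmodified term."""
    return value


group_id = "Guide-ABC 123"  # hypothetical group ID, not from the sample data

print(toy_standard_analyzer(group_id))  # ['guide', 'abc', '123']
print(keyword_value(group_id))          # Guide-ABC 123
```

The analyzed form no longer contains the original value as a single term, which is why grouping, exact matching, and sorting are done on the keyword form.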
Sample Data
Load with POST tmp_multi_lang/_bulk using NDJSON:
{ "index": {} }
{ "crossLanguageGroup": "abc123", "languages": ["de"], "title": "Willkommen", "content": "Dies ist eine Einführung. Source: Reply" }
{ "index": {} }
{ "crossLanguageGroup": "abc123", "languages": ["en"], "title": "Welcome", "content": "This is an introduction. Source: Reply" }
{ "index": {} }
{ "crossLanguageGroup": "abc123", "languages": ["fr"], "title": "Bienvenue", "content": "Ceci est une introduction. Source: Reply" }
{ "index": {} }
{ "crossLanguageGroup": "xyz789", "languages": ["en"], "title": "Search", "content": "How to search in OpenSearch. Source: Reply" }
{ "index": {} }
{ "crossLanguageGroup": "xyz789", "languages": ["fr"], "title": "Recherche", "content": "Comment chercher dans OpenSearch. Source: Reply" }
{ "index": {} }
{ "crossLanguageGroup": "def456", "languages": ["de"], "title": "Produktübersicht", "content": "Dies ist eine Produktbeschreibung." }
{ "index": {} }
{ "crossLanguageGroup": "def456", "languages": ["en"], "title": "Product Overview", "content": "This is a product description." }
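If the documents live in application code rather than in a file, the NDJSON body can be generated. The following is a minimal sketch using only the Python standard library; build_bulk_body is a hypothetical helper name, and sending the body to the cluster is left out.

```python
import json


def build_bulk_body(docs):
    """Return an NDJSON bulk body: one action line plus one source line per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {}}))
        # ensure_ascii=False keeps characters such as "ü" readable in the payload.
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The bulk API requires the body to end with a newline character.
    return "\n".join(lines) + "\n"


docs = [
    {"crossLanguageGroup": "abc123", "languages": ["de"], "title": "Willkommen",
     "content": "Dies ist eine Einführung. Source: Reply"},
    {"crossLanguageGroup": "abc123", "languages": ["en"], "title": "Welcome",
     "content": "This is an introduction. Source: Reply"},
]

body = build_bulk_body(docs)
print(body)
```

Each document becomes exactly two lines, matching the action/source pairs shown above.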
The Core Query
POST tmp_multi_lang/_search
{
  "query": {
    "match": { "content": "Reply" }
  },
  "collapse": {
    "field": "crossLanguageGroup.keyword"
  },
  "sort": [
    {
      "_script": {
        "type": "number",
        "order": "asc",
        "script": {
          "lang": "painless",
          "params": {
            "lang_order": { "de": 0, "en": 1, "fr": 2 }
          },
          "source": """
            int best = 100;
            def order = params.lang_order;
            if (doc.containsKey('languages.keyword')) {
              for (def l : doc['languages.keyword']) {
                if (order.containsKey(l)) {
                  int ord = (int) order.get(l);
                  if (ord < best) { best = ord; }
                }
              }
            }
            return best;
          """
        }
      }
    },
    { "_score": "desc" }
  ]
}
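Because the language ranking is passed in through params.lang_order, the same query can serve different users by changing only that map. Here is a minimal Python sketch that builds the request body from an ordered preference list. build_search_body is a hypothetical helper name, and the sketch only constructs the body, it does not send it.

```python
import json

# Same Painless logic as in the query above, kept as one string.
PAINLESS_SOURCE = """
int best = 100;
def order = params.lang_order;
if (doc.containsKey('languages.keyword')) {
  for (def l : doc['languages.keyword']) {
    if (order.containsKey(l)) {
      int ord = (int) order.get(l);
      if (ord < best) { best = ord; }
    }
  }
}
return best;
"""


def build_search_body(query_text, preferred_languages):
    """Build the collapse + script-sort body for an ordered list of language codes."""
    lang_order = {}
    for lang in preferred_languages:
        # Keep the first occurrence if a language is listed twice.
        lang_order.setdefault(lang, len(lang_order))
    return {
        "query": {"match": {"content": query_text}},
        "collapse": {"field": "crossLanguageGroup.keyword"},
        "sort": [
            {
                "_script": {
                    "type": "number",
                    "order": "asc",
                    "script": {
                        "lang": "painless",
                        "params": {"lang_order": lang_order},
                        "source": PAINLESS_SOURCE,
                    },
                }
            },
            {"_score": "desc"},
        ],
    }


body = build_search_body("Reply", ["fr", "en", "de"])
print(json.dumps(body["sort"][0]["_script"]["script"]["params"]))
```

Keeping the ranking in params rather than hard-coding it in the script source means the script text stays identical across requests.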
How It Works
Query Stage
Matches all documents containing "Reply".
Collapse Stage
Groups documents by crossLanguageGroup.keyword and returns only the top-sorted document of each group.
Script Sort Stage
- Iterates the languages array
- Checks each language in the priority map
- Selects the lowest value (best match)
- Uses 100 as fallback
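The steps above can be mirrored in plain Python to check the logic outside the cluster. This is a sketch of the same ranking rule, not the Painless script itself; best_language_rank is a hypothetical name.

```python
FALLBACK_RANK = 100  # same fallback value as in the Painless script


def best_language_rank(languages, lang_order, fallback=FALLBACK_RANK):
    """Return the lowest (best) rank among a document's languages."""
    best = fallback
    for lang in languages:
        rank = lang_order.get(lang)
        # Languages missing from the priority map are ignored.
        if rank is not None and rank < best:
            best = rank
    return best


lang_order = {"de": 0, "en": 1, "fr": 2}

print(best_language_rank(["fr", "en"], lang_order))  # 1
print(best_language_rank(["it"], lang_order))        # 100
print(best_language_rank([], lang_order))            # 100
```

Documents whose languages are all unknown, or that have no languages at all, receive the fallback and therefore sort after every ranked language.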
Tie Breaking
If two documents share the same priority, _score decides.
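Example Result
Matching documents before collapse (group/language):
abc123/de, abc123/en, abc123/fr, xyz789/en, xyz789/fr, klm654/de
After collapse with de > en > fr:
abc123/de, xyz789/en, klm654/de
Here klm654 stands for a group that only has a German version; it is not part of the sample data above. With the sample data as loaded, the query returns abc123/de and xyz789/en, because the def456 documents do not contain "Reply".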
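The selection can be reproduced with a short Python simulation: sort all matches by language rank (then by score), and keep the first document seen for each group. This is a sketch of the behavior, not of OpenSearch internals. The score values are invented for illustration, and collapse_by_language is a hypothetical name.

```python
def collapse_by_language(hits, lang_order, fallback=100):
    """Sort by (language rank asc, score desc), then keep the first hit per group."""
    def rank(hit):
        return min((lang_order.get(l, fallback) for l in hit["languages"]),
                   default=fallback)

    ordered = sorted(hits, key=lambda h: (rank(h), -h["score"]))
    seen_groups = set()
    collapsed = []
    for hit in ordered:
        if hit["group"] not in seen_groups:
            seen_groups.add(hit["group"])
            collapsed.append(hit)
    return collapsed


# Scores are made up; only their relative order matters for tie breaking.
hits = [
    {"group": "abc123", "languages": ["de"], "score": 1.0},
    {"group": "abc123", "languages": ["en"], "score": 1.2},
    {"group": "abc123", "languages": ["fr"], "score": 0.9},
    {"group": "xyz789", "languages": ["en"], "score": 1.1},
    {"group": "xyz789", "languages": ["fr"], "score": 1.3},
    {"group": "klm654", "languages": ["de"], "score": 0.8},
]

result = collapse_by_language(hits, {"de": 0, "en": 1, "fr": 2})
print([(h["group"], h["languages"][0]) for h in result])
# [('abc123', 'de'), ('klm654', 'de'), ('xyz789', 'en')]
```

The simulation selects the same three documents. They appear in sort order, so both German hits come before the English one, and the higher score of xyz789/fr does not override the language rank.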
Why This Works
- Keyword fields expose doc values, making sorting fast and predictable.
- Scripted sort applies a strict language hierarchy, not a soft boost.
- Collapse guarantees exactly one document per crossLanguageGroup.
- Painless correctly handles multi-value arrays like languages.
Everything works together to deliver deterministic, language-aware selection.
Common Errors and Fixes
1. "Text fields are not optimised for operations"
Use .keyword:
doc['languages.keyword']
2. "unknown field [lang]"
lang must be inside the script object.
3. Casting errors
Use explicit casting:
int ord = (int) order.get(l);
4. "Illegal list shortcut value [values]"
Iterate normally:
for (def l : doc['languages.keyword']) { ... }
Alternative Approach: Score-Based Selection
This method uses query-time boosts to encourage certain languages rather than enforcing a strict order. Each language clause gets a different boost: German 3, English 2, French 1.
When OpenSearch calculates the score, documents that match higher-boosted languages naturally rise to the top.
After scoring, collapse picks the highest-scoring document per crossLanguageGroup.
What this means in practice
- If a group has de, en, and fr, the German version usually wins because it has the highest boost.
- But if the English document has stronger text relevance, its score may exceed the German one.
- The boosts add to the full-text score, so the effect is soft preference, not a strict ranking.
Good fit: simple setups where speed matters and minor inconsistencies are acceptable.
Not ideal: cases requiring deterministic de > en > fr without exceptions.
POST tmp_multi_lang/_search
{
  "query": {
    "bool": {
      "must": { "match": { "content": "Reply" } },
      "should": [
        { "term": { "languages.keyword": { "value": "de", "boost": 3 } } },
        { "term": { "languages.keyword": { "value": "en", "boost": 2 } } },
        { "term": { "languages.keyword": { "value": "fr", "boost": 1 } } }
      ]
    }
  },
  "collapse": { "field": "crossLanguageGroup.keyword" },
  "sort": [ { "_score": "desc" } ]
}
Pros: fast and simple
Cons: boosts are additive, not strict priority
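The soft nature of this preference is easy to see with a few numbers. The Python sketch below assumes, for simplicity, that each matching should clause contributes exactly its boost value on top of the text score. In a real cluster the boost scales the term query's own score, so the actual contribution will differ. All scores are invented, and the function names are hypothetical.

```python
BOOSTS = {"de": 3.0, "en": 2.0, "fr": 1.0}


def boosted_score(text_score, languages, boosts=BOOSTS):
    """Simplified model: text relevance plus the boost of each matching language."""
    return text_score + sum(boosts.get(lang, 0.0) for lang in languages)


def pick_winner(candidates):
    """Return the candidate with the highest simplified score within one group."""
    return max(candidates, key=lambda c: boosted_score(c["text_score"], c["languages"]))


# Similar text relevance: the German version wins (1.0 + 3 > 1.5 + 2).
close_match = [
    {"languages": ["de"], "text_score": 1.0},
    {"languages": ["en"], "text_score": 1.5},
]
print(pick_winner(close_match)["languages"])   # ['de']

# Much stronger English relevance: English overtakes German (2.5 + 2 > 1.0 + 3).
strong_english = [
    {"languages": ["de"], "text_score": 1.0},
    {"languages": ["en"], "text_score": 2.5},
]
print(pick_winner(strong_english)["languages"])  # ['en']
```

The second case is exactly the kind of exception that the script-based sort rules out.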
Conclusion
Language-aware document aggregation in OpenSearch is solved cleanly by combining collapse, a Painless script-based sort, and keyword-backed language fields. Script sorting provides reliable, deterministic selection, while score-based boosting offers a faster but less strict alternative.
This pattern is useful anywhere multilingual data creates noise in search results. By grouping documents, applying a clear language hierarchy, and keeping sorting deterministic, teams can deliver cleaner UX across documentation portals, product catalogs, and knowledge bases. It also works well with personalization and dynamic language preferences.
If you want to build cleaner multilingual search or explore how to apply language-aware ranking in Amazon OpenSearch Service, feel free to reach out.
At Reply, we help teams design scalable, predictable, and user-centred search workflows — from proof of concept to fully aligned production deployments.
