DEV Community

Abhishek Gupta
Abhishek Gupta

Posted on • Originally published at devblogs.microsoft.com

Azure Cosmos DB vNext Emulator: Query and Observability Enhancements

The Azure Cosmos DB Linux-based vNext emulator (preview) is a local version of the Azure Cosmos DB service that runs as a Docker container on Linux, macOS, and Windows. It provides a cost-effective way to develop and test applications locally without requiring an Azure subscription or network connectivity.

The latest release brings improvements in two key areas:

  1. Enhanced Query engine: Addresses limitations in JOINs, operators, subdocument handling, and more.
  2. Observability: OpenTelemetry support to collect and correlate metrics, logs, and traces for deeper insight into applications.

Query Improvements

This emulator release enables several query patterns that were previously unsupported. In this post, we'll focus on the following enhancements to query capabilities:

  • Improved JOIN Operations – Support for nested JOINs across multiple array levels, cross-product JOINs, and self-JOINs on primitive arrays
  • Enhanced Operator Support – Support for string manipulation functions (CONCAT, LENGTH) and array operations (ARRAY_LENGTH, direct array indexing)
  • Better Subdocument Handling – Improved querying of deeply nested object properties and proper handling of missing properties

Let's explore these with practical examples.

Improved JOINs

In Azure Cosmos DB, JOINs enable you to flatten and traverse data within documents, allowing you to work with nested and hierarchical data structures.

Let's explore this using a sample family dataset where each document represents a family with nested arrays. This includes three families (Andersen, Wakefield, and Miller) with varying numbers of children and pets, demonstrating how JOINs handle different data structures. Notice the hierarchical nesting: each family has a children array, each child has a pets array, and families also have a separate tags array.

[
    {
        "id": "AndersenFamily",
        "lastName": "Andersen",
        "parents": [
            {"firstName": "Thomas", "relationship": "father"},
            {"firstName": "Mary Kay", "relationship": "mother"}
        ],
        "children": [
            {
                "firstName": "Henriette",
                "grade": 5,
                "pets": [{"name": "Fluffy", "type": "Rabbit"}]
            }
        ],
        "tags": ["seattle", "active", "family-friendly"],
        "address": {"state": "WA", "city": "Seattle"}
    },
    {
        "id": "WakefieldFamily",
        "lastName": "Wakefield",
        "parents": [
            {"firstName": "Robin", "relationship": "mother"},
            {"firstName": "Ben", "relationship": "father"}
        ],
        "children": [
            {
                "firstName": "Jesse",
                "grade": 8,
                "pets": [
                    {"name": "Goofy", "type": "Dog"},
                    {"name": "Shadow", "type": "Horse"}
                ]
            },
            {"firstName": "Lisa", "grade": 1, "pets": []}
        ],
        "tags": ["newyork", "urban"],
        "address": {"state": "NY", "city": "New York"}
    },
    {
        "id": "MillerFamily",
        "lastName": "Miller",
        "parents": [{"firstName": "David", "relationship": "father"}],
        "children": [
            {
                "firstName": "Emma",
                "grade": 6,
                "pets": [{"name": "Whiskers", "type": "Cat"}]
            }
        ],
        "tags": ["boston", "academic"],
        "address": {"state": "MA", "city": "Boston"}
    }
]
Enter fullscreen mode Exit fullscreen mode

Basic JOIN – Flattening Arrays

For example, if you need to generate a list of all children across multiple families for a school roster or activity planning, use the JOIN operator to flatten the children array, creating one result row per child across all family documents.

SELECT f.id, f.lastName, c.firstName, c.grade 
FROM f 
JOIN c IN f.children
Enter fullscreen mode Exit fullscreen mode

This returns one row per child (giving you a flattened view of all children with their family information) – Henriette from the Andersen family, Jesse and Lisa from the Wakefield family, and Emma from the Miller family.

JOIN with Filter

To find children within specific grade ranges, add a WHERE clause to JOIN. This lets you filter results based on properties of the array elements, like selecting only middle school students (grade 6 and above).

SELECT f.id, f.lastName, c.firstName, c.grade 
FROM f 
JOIN c IN f.children 
WHERE c.grade >= 6
Enter fullscreen mode Exit fullscreen mode

The result set includes only Jesse (grade 8) and Emma (grade 6), automatically excluding Henriette and Lisa who are in lower grades.

Nested JOIN – Traversing Hierarchical Data

If you need to traverse multiple levels of your data structure to identify all pets and their owners, use chained JOINs to drill down through nested arrays, moving from families to children to pets in a single query.

SELECT f.id, c.firstName AS child, p.name AS pet, p.type 
FROM f 
JOIN c IN f.children 
JOIN p IN c.pets
Enter fullscreen mode Exit fullscreen mode

You'll see each pet listed with its owner and family: Henriette's rabbit Fluffy, Jesse's dog Goofy and horse Shadow, and Emma's cat Whiskers. Notice that Lisa doesn't appear in the results since her pets array is empty, demonstrating the INNER JOIN behavior.

Cross Product JOIN – Combining Independent Arrays

To associate every child with all of their family's characteristics or tags to understand patterns and preferences, join two arrays from the same document to create a cross product pairing each child with each family tag.

SELECT f.id, c.firstName AS child, t AS tag 
FROM f 
JOIN c IN f.children 
JOIN t IN f.tags
Enter fullscreen mode Exit fullscreen mode

This creates every possible child-tag pairing within each family: Henriette appears three times (paired with "seattle", "active", and "family-friendly"), Jesse and Lisa each appear twice (with "newyork" and "urban"), and Emma appears twice (with "boston" and "academic").

JOIN on Primitive Arrays

Analyzing common characteristics across your entire dataset requires flattening simple value arrays. JOINs work on arrays of primitive values like strings or numbers, in addition to complex objects.

SELECT f.id, f.lastName, t AS tag 
FROM f 
JOIN t IN f.tags
Enter fullscreen mode Exit fullscreen mode

Each tag becomes its own row associated with its family: the Andersen family contributes three tags ("seattle", "active", "family-friendly"), while Wakefield and Miller each contribute two tags.

Self-JOIN – Finding Combinations

Analyzing which characteristics tend to appear together in your data (such as discovering that families tagged "urban" are also often tagged "active") requires finding unique pairs within the same array. You achieve this by joining an array with itself using different aliases.

SELECT f.id, t1 AS tag1, t2 AS tag2 
FROM f 
JOIN t1 IN f.tags 
JOIN t2 IN f.tags 
WHERE t1 < t2
Enter fullscreen mode Exit fullscreen mode

The Andersen family yields three unique pairs from their three tags ("active"+"seattle", "active"+"family-friendly", "family-friendly"+"seattle"), while Wakefield and Miller each produce one pair from their two tags – perfect for discovering which characteristics commonly appear together.

Complex Filters – Multi-Level Conditions

Real-world queries often require filtering on multiple criteria at different nesting levels, like finding all rabbits owned by families in Washington state for a local pet adoption program. You can combine conditions on both root document properties (location) and deeply nested array elements (pet type) in a single query.

SELECT f.id, f.address.state, c.firstName, p.name AS pet 
FROM f 
JOIN c IN f.children 
JOIN p IN c.pets 
WHERE f.address.state = 'WA' AND p.type = 'Rabbit'
Enter fullscreen mode Exit fullscreen mode

The query pinpoints exactly what you're looking for: Henriette's rabbit Fluffy in the Andersen family from Washington state, filtering out all other pets and families in one efficient operation.

Code sample

A Python example demonstrating these JOIN scenarios is available in this GitHub Gist: cosmosdb_join_examples.py

To run the example, start the Azure Cosmos DB Emulator:

docker run -p 8081:8081 -p 8080:8080 -p 1234:1234 mcr.microsoft.com/cosmosdb/linux/azure-cosmos-emulator:vnext-preview --protocol http
Enter fullscreen mode Exit fullscreen mode

Download and run the script from the gist – it automatically sets up the database, loads the sample data, and runs each query scenario:

python cosmosdb_join_examples.py
Enter fullscreen mode Exit fullscreen mode

Enhanced Operator Support

The emulator now supports operators related to string manipulation, array operations and more. Let's explore these capabilities using an employee dataset where each document contains personal information, skill tags, and performance scores.

[
    {
        "id": "1",
        "firstName": "Alice",
        "email": "alice@company.com",
        "tags": ["senior", "manager", "remote"],
        "scores": [85, 92, 78, 95]
    },
    {
        "id": "2",
        "firstName": "Bob",
        "email": "bob@company.com",
        "tags": ["junior", "developer"],
        "scores": [72, 88, 65, 91]
    },
    {
        "id": "3",
        "firstName": "Charlie",
        "email": "charlie@company.com",
        "tags": ["senior", "architect", "onsite"],
        "scores": [95, 87, 92, 89]
    }
]
Enter fullscreen mode Exit fullscreen mode

String Operations

Build formatted contact lists by combining multiple fields:

SELECT CONCAT(e.firstName, ' - ', e.email) as contact FROM e
Enter fullscreen mode Exit fullscreen mode

This produces formatted strings like "Alice – alice@company.com", useful for generating reports or user-facing displays.

Filter records based on text length for data validation:

SELECT e.firstName FROM e WHERE LENGTH(e.firstName) > 5
Enter fullscreen mode Exit fullscreen mode

This returns only "Charlie" since it's the only name exceeding 5 characters.

Array Operations

Count elements in nested arrays to identify employees with multiple skills:

SELECT e.firstName, ARRAY_LENGTH(e.tags) as skillCount FROM e
Enter fullscreen mode Exit fullscreen mode

This returns the number of tags each employee has – Alice and Charlie each have 3 skills, while Bob has 2.

Query specific array positions to filter by criteria like first performance score:

SELECT e.firstName FROM e WHERE e.scores[0] > 85
Enter fullscreen mode Exit fullscreen mode

This returns "Charlie" whose first score (95) exceeds the threshold, enabling queries on specific elements without flattening the entire array.

Code sample

A Python example demonstrating these operator scenarios above is available in this GitHub Gist: cosmosdb_operator_examples.py

To run the example, start the Azure Cosmos DB Emulator:

docker run -p 8081:8081 -p 8080:8080 -p 1234:1234 mcr.microsoft.com/cosmosdb/linux/azure-cosmos-emulator:vnext-preview --protocol http
Enter fullscreen mode Exit fullscreen mode

Download and run the script from the gist – it automatically sets up the database, loads the sample data, and runs each query scenario:

python cosmosdb_operator_examples.py
Enter fullscreen mode Exit fullscreen mode

Improved Subdocument Queries

The emulator now provides better support for querying nested object properties and subdocuments. Let's explore these capabilities using a user profile dataset where each document contains nested profile information, contact details, and settings.

[
    {
        "id": "1",
        "username": "user1",
        "profile": {
            "fullName": "John Doe",
            "age": 30,
            "contact": {
                "email": "john@example.com",
                "phone": "555-0101"
            }
        },
        "settings": {
            "theme": "dark",
            "notifications": true
        }
    },
    {
        "id": "2",
        "username": "user2",
        "profile": {
            "fullName": "Jane Smith",
            "age": 25
        },
        "settings": {
            "theme": "light",
            "notifications": false
        }
    }
]
Enter fullscreen mode Exit fullscreen mode

Selecting Nested Properties with IS_DEFINED

This query selects nested properties while ensuring the parent object exists. It returns all usernames with their full names, correctly handling documents with defined profile objects:

SELECT c.username, c.profile.fullName 
FROM c 
WHERE IS_DEFINED(c.profile)
Enter fullscreen mode Exit fullscreen mode

Checking Nested Properties

To identify users with complete contact information before processing, use this query that returns only users who have the profile.contact defined:

SELECT c.id, c.username 
FROM c 
WHERE IS_DEFINED(c.profile.contact)
Enter fullscreen mode Exit fullscreen mode

Handling Missing Properties

Cases where certain nested properties don't exist need to be handled gracefully. This query accesses a property that may not be present in all documents. It returns the document with the phone property omitted from results when it doesn't exist.

SELECT c.id, c.profile.contact.phone 
FROM c
Enter fullscreen mode Exit fullscreen mode

Code sample

A Python example demonstrating these subdocument query scenarios above is available in this GitHub Gist: cosmosdb_subdocument_examples.py

To run the example, start the Azure Cosmos DB Emulator:

docker run -p 8081:8081 -p 8080:8080 -p 1234:1234 mcr.microsoft.com/cosmosdb/linux/azure-cosmos-emulator:vnext-preview --protocol http
Enter fullscreen mode Exit fullscreen mode

Download and run the script from the gist – it automatically sets up the database, loads the sample data, and runs each query scenario:

python cosmosdb_subdocument_examples.py
Enter fullscreen mode Exit fullscreen mode

OpenTelemetry for Observability

OpenTelemetry is an open-source observability framework that provides a collection of tools, APIs, and SDKs for instrumenting, generating, collecting, and exporting telemetry data.

To activate OpenTelemetry in the Azure Cosmos DB Emulator, set the "--enable-otlp" flag to "true" (or ENABLE_OTLP_EXPORTER environment variable) when starting the Docker container. For example:

docker run -p 8081:8081 -p 8080:8080 -p 1234:1234 mcr.microsoft.com/cosmosdb/linux/azure-cosmos-emulator:vnext-preview --protocol http --enable-otlp true
Enter fullscreen mode Exit fullscreen mode

You can also use the "--enable-console" flag (or use the ENABLE_CONSOLE_EXPORTER environment variable) to enable console output of telemetry data that can be useful for debugging purposes.

Use Docker Compose to simplify OpenTelemetry setup with the Azure Cosmos DB Emulator. The configuration below connects the emulator with Jaeger (distributed tracing) and Prometheus (metrics collection):

Copy and save the following in a docker-compose.yml file:

services:
  jaeger:
    image: jaegertracing/jaeger:latest
    container_name: jaeger
    ports:
      - "16686:16686"
      - "4317:4317"
      - "4318:4318"
    networks:
      - cosmosdb-network

  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    networks:
      - cosmosdb-network
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--web.enable-otlp-receiver'
      - '--storage.tsdb.path=/prometheus'

  cosmosdb_emulator:
    image: mcr.microsoft.com/cosmosdb/linux/azure-cosmos-emulator:vnext-preview
    container_name: cosmosdb_emulator
    ports:
      - "8081:8081"
      - "1234:1234"
      - "9712:9712"
      - "8889:8889"
    environment:
      - ENABLE_OTLP_EXPORTER=true
      - ENABLE_CONSOLE_EXPORTER=true
    networks:
      - cosmosdb-network

networks:
  cosmosdb-network:
Enter fullscreen mode Exit fullscreen mode

Also, make sure to create a prometheus.yml file in the same directory as your docker-compose.yml with the following content – this will ensure that Prometheus can scrape metrics from the Azure Cosmos DB Emulator:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'cosmosdb-emulator-metrics'
    scrape_interval: 5s
    static_configs:
      - targets: ['cosmosdb_emulator:8889']
Enter fullscreen mode Exit fullscreen mode

If the emulator is already running, stop it first. Then start the entire stack (Azure Cosmos DB Emulator, Jaeger, and Prometheus) using Docker Compose:

docker-compose up -d
Enter fullscreen mode Exit fullscreen mode

Wait for the services to start, and run the sample Python script from the previous sections (a few times) – this will generate some telemetry data.

Check the Jaeger UI at http://localhost:16686 to view distributed traces, search for specific operations, analyze request latency, and inspect the execution flow of your queries through the Azure Cosmos DB Emulator:

Jaeger portal

Check the Prometheus UI at http://localhost:9090 to query and visualize metrics, monitor request rates, track query performance statistics, and analyze resource utilization patterns from the Azure Cosmos DB Emulator.

Prometheus portal

For details, refer to the documentation

Health Probe

The emulator exposes health check endpoints on port "8080" that return a HTTP 200 OK response when all components (Gateway, Data Explorer, etc.) are fully operational.

This is particularly valuable for integration testing frameworks like Testcontainers, which provide waiting strategies to ensure containers are ready before tests run. Instead of using arbitrary sleep delays or log messages (which are unreliable since they may change), you can configure a waiting strategy that will check if the emulator Docker container is listening to the "8080" health check port.

Here's a Go example using testcontainers-go library:

func TestMain(m *testing.M) {
    req := testcontainers.ContainerRequest{
        Image:        emulatorImage,
        ExposedPorts: []string{"8081:8081", "8080:8080"},
        WaitingFor:   wait.ForListeningPort(nat.Port("8080")), // Wait for health check port
        // WaitingFor: wait.ForLog("PostgreSQL=OK, Gateway=OK, Explorer=OK, Ready=YES"), //this works but is less reliable, hence not recommended

    }

    _, err := testcontainers.GenericContainer(context.Background(), testcontainers.GenericContainerRequest{
        ContainerRequest: req,
        Started:          true,
    })

    if err != nil {
        log.Fatal(err)
    }

    // Create credential and client
    cred, err := azcosmos.NewKeyCredential(emulatorKey)
    if err != nil {
        log.Fatal(err)
    }

    client, err = azcosmos.NewClientWithKey(emulatorEndpoint, cred, nil)
    if err != nil {
        log.Fatal(err)
    }

    m.Run()
}
Enter fullscreen mode Exit fullscreen mode

Available endpoints:

Endpoint Purpose Behavior
http://localhost:8080/alive Liveness probe Returns 503 if PostgreSQL and Gateway are unhealthy
http://localhost:8080/ready Readiness probe Returns 503 if any component is not operational (disabled components like Explorer are excluded)
http://localhost:8080/status Status information Always returns 200 with detailed JSON status for all components

Conclusion

The latest Azure Cosmos DB Emulator release eliminates key query limitations and adds observability features, enabling you to develop and test more complex scenarios locally. You can now implement advanced data access patterns with nested JOINs, leverage operator support for string and array manipulation, and monitor your application's behavior with OpenTelemetry. You can start using these capabilities to accelerate your development workflow and catch issues earlier in your development cycle.

Top comments (0)