<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nathan Fries</title>
    <description>The latest articles on DEV Community by Nathan Fries (@npfries).</description>
    <link>https://dev.to/npfries</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F225106%2Fa42e4c42-35fd-420e-bf06-d7c82aec7752.jpeg</url>
      <title>DEV Community: Nathan Fries</title>
      <link>https://dev.to/npfries</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/npfries"/>
    <language>en</language>
    <item>
      <title>Neural Search Quickstart for OpenSearch</title>
      <dc:creator>Nathan Fries</dc:creator>
      <pubDate>Thu, 03 Aug 2023 00:30:00 +0000</pubDate>
      <link>https://dev.to/npfries/neural-search-quickstart-for-opensearch-1jgc</link>
      <guid>https://dev.to/npfries/neural-search-quickstart-for-opensearch-1jgc</guid>
      <description>&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;Neural Search is now generally available as of OpenSearch version 2.9.0. The Neural Search plugin allows the use of pre-trained models for indexing text segments to existing k-NN enabled indices, as well as transforming queries into the same vector space. You don't need to train any models yourself, and you don't need any specialized hardware. All you need is Docker (well, at least in order to get up and running in five minutes following this guide).&lt;/p&gt;

&lt;p&gt;Dependencies for this project&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Docker&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Npfries/opensearch-neural-search"&gt;https://github.com/Npfries/opensearch-neural-search&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Neural Search
&lt;/h2&gt;

&lt;p&gt;Neural Search is OpenSearch's end-to-end vector embedding and search solution, leveraging existing k-NN search features and off-the-shelf NLP models for vector encoding. This example uses the &lt;code&gt;huggingface/sentence-transformers/all-MiniLM-L12-v2&lt;/code&gt; model, which maps sentences and paragraphs into a 384-dimensional vector space.&lt;/p&gt;

&lt;p&gt;Suppose we have a list of movie titles, and would like to provide some basic semantic search capability. Historically, this would be achieved by having high-quality metadata available for searching based on relevance. Descriptions, lists of actors, genres, etc. would be needed to ensure high-quality search results. In this example, we are working with a &lt;em&gt;lot&lt;/em&gt; less data, yet the semantic search capability is still quite usable for many applications.&lt;/p&gt;

&lt;p&gt;For example, the following list of movies is inserted into the &lt;code&gt;movies&lt;/code&gt; index. Note that the only information we have available is the movie title.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;POST /movies/_bulk
{ "index": {} }
{ "title": "Star Wars" }
{ "index": {} }
{ "title": "Lord of the Rings" }
{ "index": {} }
{ "title": "Spiderman" }
{ "index": {} }
{ "title": "Indiana Jones" }

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is an example search performed using the "neural_search_template" search template.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GET /movies/_search/template
{
  "id": "neural_search_template",
  "params": {
    "query": "lightsaber",
    "from": 0,
    "size": 1
  }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the result, we can see that "Star Wars" is the top hit. If you are familiar with Elasticsearch or OpenSearch, but haven't used Neural Search or another means of NLP and k-NN search, this might be surprising.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "took": 40,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 7,
      "relation": "eq"
    },
    "max_score": 0.5106329,
    "hits": [
      {
        "_index": "movies",
        "_id": "ZluZuIkBNHby32LgCK2-",
        "_score": 0.5106329,
        "fields": {
          "title": [
            "Star Wars"
          ]
        }
      }
    ]
  }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Starting OpenSearch
&lt;/h2&gt;

&lt;p&gt;Here is an example Docker Compose file that configures one cluster manager node that also acts as a data node, and one dedicated ML node. This configuration is for development purposes only; it is not production ready. The file is provided in the &lt;a href="https://github.com/Npfries/opensearch-neural-search"&gt;example repo&lt;/a&gt; if you prefer to pull it instead of creating your own docker-compose.yml file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: '3.9'
services:
  opensearch-node1:
    image: opensearchproject/opensearch:2.9.0
    container_name: opensearch-data-node
    environment:
      - cluster.name=opensearch-cluster
      - node.name=opensearch-node1
      - discovery.seed_hosts=opensearch-node1,opensearch-node2
      - cluster.initial_cluster_manager_nodes=opensearch-node1
      - bootstrap.memory_lock=true
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m"
      - "DISABLE_INSTALL_DEMO_CONFIG=true"
      - "DISABLE_SECURITY_PLUGIN=true"
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    volumes:
      - opensearch-data1:/usr/share/opensearch/data
    ports:
      - 9200:9200
      - 9600:9600
    networks:
      - opensearch-net

  opensearch-node2:
    image: opensearchproject/opensearch:2.9.0
    container_name: opensearch-ml-node
    environment:
      - cluster.name=opensearch-cluster
      - node.name=opensearch-node2
      - node.roles=ml
      - discovery.seed_hosts=opensearch-node1,opensearch-node2
      - cluster.initial_cluster_manager_nodes=opensearch-node1
      - bootstrap.memory_lock=true
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m"
      - "DISABLE_INSTALL_DEMO_CONFIG=true"
      - "DISABLE_SECURITY_PLUGIN=true"
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    volumes:
      - opensearch-data2:/usr/share/opensearch/data
    networks:
      - opensearch-net
  opensearch-dashboards:
    image: opensearchproject/opensearch-dashboards:latest
    container_name: opensearch-dashboards
    ports:
      - 5601:5601
    expose:
      - "5601"
    environment:
      - 'OPENSEARCH_HOSTS=["http://opensearch-node1:9200","http://opensearch-node2:9200"]'
      - "DISABLE_SECURITY_DASHBOARDS_PLUGIN=true"
    networks:
      - opensearch-net

volumes:
  opensearch-data1:
  opensearch-data2:

networks:
  opensearch-net:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Assuming Docker is installed on your system, the cluster should start with the following command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker compose up -d

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This should start three containers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;opensearch-dashboards&lt;/li&gt;
&lt;li&gt;opensearch-data-node&lt;/li&gt;
&lt;li&gt;opensearch-ml-node&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Uploading an ML Model
&lt;/h2&gt;

&lt;p&gt;For this example, we will be using an off-the-shelf model provided on Hugging Face: &lt;code&gt;huggingface/sentence-transformers/all-MiniLM-L12-v2&lt;/code&gt;. We need to upload the model to OpenSearch. Open OpenSearch Dashboards at &lt;code&gt;http://localhost:5601&lt;/code&gt; and navigate to Dev Tools. All queries are also listed in &lt;code&gt;queries.txt&lt;/code&gt; in the &lt;a href="https://github.com/Npfries/opensearch-neural-search"&gt;example repo&lt;/a&gt; for convenience.&lt;/p&gt;

&lt;p&gt;We can use the following request to upload the model to our OpenSearch host.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;POST /_plugins/_ml/models/_upload
{
  "name": "huggingface/sentence-transformers/all-MiniLM-L12-v2",
  "version": "1.0.1",
  "model_format": "TORCH_SCRIPT"
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response should look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "task_id": "-97CuIkBWLo6ADqAT1Tn",
  "status": "CREATED"
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note your task ID! Uploading is an asynchronous operation, and we will use the task ID to check on its status; we need to wait for it to complete before proceeding. Your task ID will be different, so replace &lt;code&gt;&amp;lt;task_id&amp;gt;&lt;/code&gt; with yours. Run the following command to check the status of the upload.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GET /_plugins/_ml/tasks/&amp;lt;task_id&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the &lt;code&gt;state&lt;/code&gt; of the task is &lt;code&gt;COMPLETED&lt;/code&gt;, copy the &lt;code&gt;model_id&lt;/code&gt; from the response; it is going to be used several times.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "model_id": "JHrCuIkBkyiJw75fUf-n",
  "task_type": "REGISTER_MODEL",
  "function_name": "TEXT_EMBEDDING",
  "state": "COMPLETED",
  "worker_node": [
    "VF5J2Z00QNGLfL4l6c5e1w"
  ],
  "create_time": 1691021889469,
  "last_update_time": 1691021945252,
  "is_async": true
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model now needs to be loaded. Run the following command, replacing &lt;code&gt;&amp;lt;model_id&amp;gt;&lt;/code&gt; with the &lt;code&gt;model_id&lt;/code&gt; from the resulting task.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;POST /_plugins/_ml/models/&amp;lt;model_id&amp;gt;/_load

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is another asynchronous operation, with another &lt;code&gt;task_id&lt;/code&gt;. Repeat the task command with the new &lt;code&gt;task_id&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GET /_plugins/_ml/tasks/&amp;lt;task_id&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When that task is completed, we can proceed with creating an ingest pipeline that will use this model. The &lt;code&gt;model_id&lt;/code&gt; will not change after uploading.&lt;/p&gt;
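
&lt;p&gt;Before going further, you can optionally sanity-check the loaded model with the ML Commons predict API. The request shape below is a sketch based on the ML Commons text embedding API; replace &lt;code&gt;&amp;lt;model_id&amp;gt;&lt;/code&gt; with yours. A successful response should contain a 384-element vector for each input string.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;POST /_plugins/_ml/_predict/text_embedding/&amp;lt;model_id&amp;gt;
{
  "text_docs": ["Star Wars"],
  "return_number": true,
  "target_response": ["sentence_embedding"]
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;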

&lt;h2&gt;
  
  
  Creating the Ingest Pipeline
&lt;/h2&gt;

&lt;p&gt;The following command will create an ingest pipeline named &lt;code&gt;title-embedding&lt;/code&gt; that uses the model uploaded above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PUT _ingest/pipeline/title-embedding
{
  "description": "Title embedding pipeline",
  "processors" : [
    {
      "text_embedding": {
        "model_id": "&amp;lt;model_id&amp;gt;",
        "field_map": {
           "title": "title_embedding"
        }
      }
    }
  ]
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;field_map&lt;/code&gt; property maps each source text field to the field where its embedding will be stored. In this case the source is a movie title, so I am using the &lt;code&gt;title&lt;/code&gt; field, and &lt;code&gt;title_embedding&lt;/code&gt; is where the ingest pipeline will place the output of the model. The &lt;code&gt;field_map&lt;/code&gt; can contain as many mappings as needed, and multiple models can be used by including more than one entry in &lt;code&gt;processors&lt;/code&gt;.&lt;/p&gt;
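
&lt;p&gt;For illustration, a pipeline that embeds two fields might look like the following. Note that the &lt;code&gt;description&lt;/code&gt; field and the pipeline name here are hypothetical; this guide uses only &lt;code&gt;title&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PUT _ingest/pipeline/multi-field-embedding
{
  "description": "Embedding pipeline for multiple fields",
  "processors" : [
    {
      "text_embedding": {
        "model_id": "&amp;lt;model_id&amp;gt;",
        "field_map": {
           "title": "title_embedding",
           "description": "description_embedding"
        }
      }
    }
  ]
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;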

&lt;p&gt;The &lt;code&gt;title-embedding&lt;/code&gt; pipeline can be used for more than one index, provided the &lt;code&gt;field_map&lt;/code&gt; is applicable to each. It is not used anywhere yet, though; we need to tell OpenSearch to use this pipeline when indexing to one or more indices, which is defined in the index settings.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating an Index
&lt;/h2&gt;

&lt;p&gt;Use the following command to create an index that uses the &lt;code&gt;title-embedding&lt;/code&gt; ingest pipeline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PUT /movies
{
    "settings": {
        "index.knn": true,
        "default_pipeline": "title-embedding"
    },
    "mappings": {
        "properties": {
            "title_embedding": {
                "type": "knn_vector",
                "dimension": 384,
                "method": {
                    "name": "hnsw",
                    "engine": "lucene"
                }
            },
            "title": {
                "type": "text"
            }
        }
    }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have to supply some information in the index settings and field mapping for OpenSearch to understand how to process and index documents.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;index.knn: true&lt;/code&gt; enables k-NN search for this index.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;default_pipeline: title-embedding&lt;/code&gt; specifies the ingest pipeline which should be used.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;type: knn_vector&lt;/code&gt; is the field mapping type for &lt;code&gt;title_embedding&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dimension: 384&lt;/code&gt; refers to the &lt;em&gt;dimensionality&lt;/em&gt; of the model chosen. In this case, it is 384.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;method&lt;/code&gt; specifies how the k-NN structure is built, here &lt;code&gt;hnsw&lt;/code&gt; with the &lt;code&gt;lucene&lt;/code&gt; engine; other methods may be added in the future, but these are currently supported by OpenSearch.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Technically, we could stop here, index some movies, and search by specifying the model using a &lt;code&gt;neural&lt;/code&gt; query, but that would require providing the &lt;code&gt;model_id&lt;/code&gt; in each search request. This is less than ideal, as in most cases, we will want to ensure we are using the same model that was used for indexing. We will create a search template to make this easier.&lt;/p&gt;
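
&lt;p&gt;For comparison, here is what a direct &lt;code&gt;neural&lt;/code&gt; query looks like without a template; note that &lt;code&gt;&amp;lt;model_id&amp;gt;&lt;/code&gt; must be supplied in every request.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GET /movies/_search
{
  "query": {
    "neural": {
      "title_embedding": {
        "query_text": "lightsaber",
        "model_id": "&amp;lt;model_id&amp;gt;",
        "k": 200
      }
    }
  }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;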

&lt;h2&gt;
  
  
  Creating a Search Template
&lt;/h2&gt;

&lt;p&gt;The following command will create a search template named &lt;code&gt;neural_search_template&lt;/code&gt;. Make sure to replace the &lt;code&gt;&amp;lt;model_id&amp;gt;&lt;/code&gt; with the &lt;code&gt;model_id&lt;/code&gt; from earlier.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;POST _scripts/neural_search_template
{
  "script": {
    "lang": "mustache",
    "source": {
      "from": "{{from}}{{^from}}0{{/from}}",
      "size": "{{size}}{{^size}}10{{/size}}",
      "query": {
        "neural": {
          "title_embedding": {
            "query_text": "{{query}}",
            "model_id": "&amp;lt;model_id&amp;gt;",
            "k": 200
          }
        }
      },
      "fields": [
        "title"
      ],
      "_source": false
    }
  }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the &lt;code&gt;neural&lt;/code&gt; query. This is a new type of query which allows inline use of an ML model for encoding the &lt;code&gt;query_text&lt;/code&gt; to be used when searching using k-NN on the &lt;code&gt;title_embedding&lt;/code&gt; field.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;: If you'd like to learn about managing index mappings and search templates in production, you can read more about how I &lt;a href="https://nathanfries.com/posts/opensearch-migrations/"&gt;do it using migrations&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adding Documents
&lt;/h2&gt;

&lt;p&gt;As promised above, populate the movies index with some data. Note that the documents contain only the &lt;code&gt;title&lt;/code&gt; field.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;POST /movies/_bulk
{ "index": {} }
{ "title": "Star Wars" }
{ "index": {} }
{ "title": "Lord of the Rings" }
{ "index": {} }
{ "title": "Spiderman" }
{ "index": {} }
{ "title": "Indiana Jones" }

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If everything worked as expected, querying this index using a &lt;code&gt;neural&lt;/code&gt; query or utilizing the search template above should return results based on the semantic similarity of the search term and the title.&lt;/p&gt;

&lt;h2&gt;
  
  
  Querying
&lt;/h2&gt;

&lt;p&gt;Here is an example query using the search template created above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GET movies/_search/template
{
  "id": "neural_search_template",
  "params": {
    "query": "lightsaber",
    "from": 0,
    "size": 1
  }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A single hit should be returned: the one that most closely relates to the search term.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "took": 46,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 4,
      "relation": "eq"
    },
    "max_score": 0.5106329,
    "hits": [
      {
        "_index": "movies",
        "_id": "_d5tuYkBWLo6ADqA4lQb",
        "_score": 0.5106329,
        "fields": {
          "title": [
            "Star Wars"
          ]
        }
      }
    ]
  }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result makes sense. "Lightsaber" is most closely related to "Star Wars" in the list of movies. We can continue querying to confirm it is working. Searching "web" returns "Spiderman", "dragon" returns "Lord of the Rings" and so on.&lt;/p&gt;

&lt;p&gt;Hopefully, this is enough information to understand the basics of the Neural Search feature, and whether or not it would be useful for your applications. This is not intended to be an in-depth guide to k-NN search. If you are looking for more information on that topic, I highly recommend &lt;a href="https://www.statlearning.com"&gt;An Introduction to Statistical Learning&lt;/a&gt; published by Springer.&lt;/p&gt;

</description>
      <category>opensearch</category>
      <category>machinelearning</category>
      <category>ai</category>
      <category>knn</category>
    </item>
    <item>
      <title>Using npm workspaces with Docker</title>
      <dc:creator>Nathan Fries</dc:creator>
      <pubDate>Wed, 19 Jul 2023 00:00:00 +0000</pubDate>
      <link>https://dev.to/npfries/using-npm-workspaces-with-docker-ib9</link>
      <guid>https://dev.to/npfries/using-npm-workspaces-with-docker-ib9</guid>
      <description>&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;Whether starting from an existing containerized application that is being brought into an npm workspaces environment, or containerizing an existing node monorepo using npm workspaces, you might encounter some idiosyncrasies relating to how &lt;code&gt;node_modules&lt;/code&gt; and &lt;code&gt;package-lock.json&lt;/code&gt; are handled by npm when using workspaces. This post will demonstrate the principles needed to ensure best practices are followed.&lt;/p&gt;

&lt;p&gt;Incorrect configuration can result in dependencies not being shared where possible, unnecessarily large bundle sizes, or inconsistent package versions.&lt;/p&gt;

&lt;p&gt;Dependencies for this project&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Docker&lt;/li&gt;
&lt;li&gt;npm &amp;gt;= 7.0&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Npfries/docker-npm-workspaces"&gt;https://github.com/Npfries/docker-npm-workspaces&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Example Code
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://github.com/Npfries/docker-npm-workspaces"&gt;example repo&lt;/a&gt; discussed in this post contains two applications in the &lt;code&gt;apps&lt;/code&gt; directory. Other applications would also be included in the &lt;code&gt;apps&lt;/code&gt; directory if there were any more. Both applications have their own Dockerfile, as each application should be independently deployable. The two applications are simple http servers. The only notable difference between them is that one uses Express, and one uses Fastify. The Dockerfiles for both of them are identical in this case.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM node:18 AS base

WORKDIR /app

COPY package*.json ./

RUN npm ci

COPY . .

FROM base AS start

ENTRYPOINT ["npm", "run", "start"]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nothing about this Dockerfile suggests anything out of the ordinary; everything here would be applicable to most Node.js applications, npm workspaces monorepo or not. It copies package.json and package-lock.json, installs dependencies with &lt;code&gt;npm ci&lt;/code&gt;, copies the rest of the code into the image, and specifies an entrypoint for the container.&lt;/p&gt;

&lt;p&gt;Using &lt;code&gt;npm ci&lt;/code&gt; is a best practice here, as it uses the package-lock.json to verify that the packages are installed with consistent versions using checksums. Unfortunately, this doesn't work out of the box.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# This will fail.
docker build ./apps/project-a/ -t project-a

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Building the image from the Dockerfile as normal will fail in this situation with an error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; =&amp;gt; ERROR [base 4/5] RUN npm ci
 0.3s
------
 &amp;gt; [base 4/5] RUN npm ci:
#0 0.284 npm ERR! code EUSAGE
#0 0.286 npm ERR!
#0 0.286 npm ERR! The `npm ci` command can only install with an existing package-lock.json or
#0 0.286 npm ERR! npm-shrinkwrap.json with lockfileVersion &amp;gt;= 1. Run an install with npm@5 or
#0 0.286 npm ERR! later to generate a package-lock.json file, then try again.
...

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To understand why, and how to prevent this, let's take a look at how npm workspaces work at a high level.&lt;/p&gt;

&lt;h2&gt;
  
  
  Workspaces
&lt;/h2&gt;

&lt;p&gt;npm workspaces is a built-in tool set for creating monorepos. Applications within the monorepo each have their own package.json, with their own dependencies, scripts, and other configurations specified per-project. The root of the monorepo also has a &lt;code&gt;package.json&lt;/code&gt; file, and it gets a special property: &lt;code&gt;workspaces&lt;/code&gt;. This is an array containing the paths to each application within this monorepo.&lt;/p&gt;

&lt;p&gt;An application or library that is specified as a workspace is treated a little differently than a normal npm project. The biggest difference is that running &lt;code&gt;npm install&lt;/code&gt; (and some other related commands) does not create a &lt;code&gt;node_modules&lt;/code&gt; folder within the application directory, and it does not generate a &lt;code&gt;package-lock.json&lt;/code&gt; file at that location either. Instead, both &lt;code&gt;node_modules&lt;/code&gt; and &lt;code&gt;package-lock.json&lt;/code&gt; are placed at the root of the monorepo. This way, shared dependencies between applications in the monorepo are installed once and shared to the projects.&lt;/p&gt;
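
&lt;p&gt;As a minimal sketch (assuming the layout of the example repo; the package name is arbitrary), the root &lt;code&gt;package.json&lt;/code&gt; might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "name": "monorepo-root",
  "private": true,
  "workspaces": [
    "apps/project-a",
    "apps/project-b"
  ]
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;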

&lt;p&gt;Knowing this, the above error makes sense: there is no package-lock.json in the application directories where the Dockerfiles are.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;p&gt;We need to provide a &lt;code&gt;package-lock.json&lt;/code&gt; file to the applications in order to run &lt;code&gt;npm ci&lt;/code&gt; when building the image. The only one available to us is the one in the root of the monorepo, so we will temporarily copy it from the root to each application before building the image.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cp ./package-lock.json ./apps/project-a/

docker build ./apps/project-a/ -t project-a

rm ./apps/project-a/package-lock.json

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You might wonder if that means that all dependencies of all projects are going to end up in each bundled application. The short answer is no, not if things are configured properly. Fortunately, npm is smart enough to handle this.&lt;/p&gt;

&lt;p&gt;npm will not install all dependencies in the package-lock.json; instead, it will install only the dependencies in package.json, but use package-lock.json to verify checksums of the specified versions.&lt;/p&gt;

&lt;p&gt;We can verify this in the example codebase by checking the /app/node_modules/ directory in the containers: project-a has Fastify packages installed but not Express, and project-b has Express but not Fastify.&lt;/p&gt;
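
&lt;p&gt;The copy, build, and clean-up steps can be generalized with a small shell loop. This is a sketch assuming the repo layout above; image tags are simply derived from the directory names.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Build an image for every app, using the root lockfile
for app in ./apps/*/ ; do
  name=$(basename "$app")
  cp ./package-lock.json "$app"
  docker build "$app" -t "$name"
  rm "${app}package-lock.json"
done

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;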

&lt;p&gt;A Makefile is provided in the root of the repo for building and starting both services in Docker.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;make start

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In summary, this demonstrates a simple way to fix the &lt;code&gt;npm ERR! code EUSAGE&lt;/code&gt; error when using Docker with npm workspaces. There may be alternative approaches, but this is the solution I've been using. It does not rely on any special code or configuration within the applications, and they could safely be removed from npm workspaces without requiring changes to the Dockerfile or the package.json in each application.&lt;/p&gt;

</description>
      <category>docker</category>
      <category>workspaces</category>
      <category>monorepo</category>
      <category>npm</category>
    </item>
    <item>
      <title>Internal Developer Platform Misses the Mark</title>
      <dc:creator>Nathan Fries</dc:creator>
      <pubDate>Sun, 02 Jul 2023 13:00:00 +0000</pubDate>
      <link>https://dev.to/npfries/internal-developer-platform-misses-the-mark-31a7</link>
      <guid>https://dev.to/npfries/internal-developer-platform-misses-the-mark-31a7</guid>
      <description>&lt;p&gt;2023 seems to be the year of the Platform Team™️ and the Internal Developer Platform™️. DevOps is once again a process instead of a role, and PagerDuty is raking in the cash because everyone is a service owner and needs to be added to the on-call rotation.&lt;/p&gt;

&lt;p&gt;Tools like Backstage and Port are enabling product teams to ship new microservices and microfrontends at &lt;em&gt;blazingly fast&lt;/em&gt; speeds, armed with confidence that the platform team has blessed the Golden Path, deeming it secure, performant, and maintainable.&lt;/p&gt;

&lt;p&gt;Since our developers are building microservices, &lt;em&gt;which are inherently developed in isolation without exception&lt;/em&gt;, service owners are free to determine the development experience. External teams are never impacted by these decisions, and platform teams need not concern themselves.&lt;/p&gt;

&lt;p&gt;Oops, dropped my /s.&lt;/p&gt;

&lt;p&gt;If the project has reached any level of maturity, there are inevitably going to be some skeletons in the closet, some "temporary" coupling due to business needs, a shared database, &lt;em&gt;something&lt;/em&gt; that prevents completely isolated development in all cases.&lt;/p&gt;

&lt;p&gt;We have the Internal Developer Platform, encompassing the different resources, internal tools, and service and package skeletons, which may or may not be exposed by an Internal Developer Portal - but what about those legacy systems, or those tightly coupled services, those things we simply haven't had time to onboard to our portal or adopt into our platform? Who owns the developer experience there?&lt;/p&gt;

&lt;p&gt;Perhaps our golden application skeleton, which can be provisioned and deployed in minutes, is configured to work in isolation and encourages development in isolation. It comes with integration testing and contract testing, and is properly versioned. But those legacy portions, the parts configured by the last regime (which may or may not be the same regime having gone through a Platform Engineering rebrand), are not included in our IDP.&lt;/p&gt;

&lt;p&gt;The local development experience is seemingly forgotten along the way. I think there is room for it in the Internal Developer Platform.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>idp</category>
      <category>platform</category>
    </item>
    <item>
      <title>RabbitMQ Exchange and Queue Playground in Docker</title>
      <dc:creator>Nathan Fries</dc:creator>
      <pubDate>Sat, 01 Jul 2023 00:25:00 +0000</pubDate>
      <link>https://dev.to/npfries/rabbitmq-exchange-and-queue-playground-in-docker-4jpi</link>
      <guid>https://dev.to/npfries/rabbitmq-exchange-and-queue-playground-in-docker-4jpi</guid>
      <description>&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;If you are looking to get started with message queues, implement an event driven architecture, or are supporting services that rely on RabbitMQ for asynchronous communication, this post will provide you with an overview and sample code that will help you hit the ground running. If you are already familiar with RabbitMQ but want to understand more about how its load balancing or exchanges work, you can skip to the Playground Overview section near the bottom.&lt;/p&gt;

&lt;p&gt;Dependencies for this project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Docker&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Npfries/rabbitmq-playground"&gt;https://github.com/Npfries/rabbitmq-playground&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Project setup
&lt;/h2&gt;

&lt;p&gt;I minimized the required setup to a couple of commands, and the environment should start within seconds (plus the time to pull the RabbitMQ image, depending on network speed).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/Npfries/rabbitmq-playground


make start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running &lt;code&gt;make start&lt;/code&gt; will bring up the services using &lt;code&gt;docker compose up&lt;/code&gt; (with some specific arguments) internally.&lt;/p&gt;

&lt;h2&gt;
  
  
  RabbitMQ
&lt;/h2&gt;

&lt;p&gt;RabbitMQ is a lightweight, flexible, and open source message broker that requires very little configuration. Queues and exchanges are asserted into existence by the applications publishing messages to, and consuming messages from, RabbitMQ.&lt;/p&gt;

&lt;p&gt;There are a couple of components that are important to understand when working with RabbitMQ.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exchanges&lt;/li&gt;
&lt;li&gt;Queues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;RabbitMQ exchanges are configurable brokers that take incoming messages, perform some filtering and routing, and publish to queues. There are several types of exchanges including direct, fanout, topic, and headers exchanges.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Exchange Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Direct Exchange&lt;/td&gt;
&lt;td&gt;Routes messages to the queue(s) whose binding key exactly matches the message's routing key. The default exchange is a direct exchange.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fanout Exchange&lt;/td&gt;
&lt;td&gt;Broadcasts messages to all bound queues, ignoring the routing key.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Topic Exchange&lt;/td&gt;
&lt;td&gt;Routes messages based on wildcard matches between the routing key and the queue binding pattern.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Headers Exchange&lt;/td&gt;
&lt;td&gt;Routes messages based on message header attributes instead of the routing key.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;RabbitMQ queues are simple message queues that can be bound implicitly or explicitly to exchanges. An implicit binding to the default direct exchange is created when the amqplib channel method &lt;code&gt;sendToQueue()&lt;/code&gt; is used. An explicit binding is created using the channel method &lt;code&gt;bindQueue()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;RabbitMQ supports multiple subscribers on the same queue, and messages will be load balanced between them in round-robin fashion. If you wish to have multiple services react to the same message, a fanout exchange can be used to publish to multiple queues, and those services can subscribe to the queues individually.&lt;/p&gt;
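&lt;p&gt;The distinction can be sketched without a broker. The toy model below (plain JavaScript, not amqplib) mimics the two delivery behaviors: one queue dealing its messages round-robin among subscribers, versus a fanout exchange giving every bound queue a full copy.&lt;/p&gt;

```javascript
// Toy model of RabbitMQ delivery semantics (illustration only, no broker).

// One queue, N subscribers: messages are dealt out round-robin.
function roundRobin(messages, consumerCount) {
  const received = Array.from({ length: consumerCount }, () => []);
  messages.forEach((msg, i) => received[i % consumerCount].push(msg));
  return received;
}

// Fanout exchange, N bound queues: every queue receives a full copy.
function fanout(messages, queueNames) {
  const delivered = {};
  for (const name of queueNames) delivered[name] = [...messages];
  return delivered;
}

console.log(roundRobin(["m1", "m2", "m3", "m4"], 2));
// [ [ 'm1', 'm3' ], [ 'm2', 'm4' ] ]
console.log(fanout(["m1", "m2"], ["tasks1", "tasks2"]));
// { tasks1: [ 'm1', 'm2' ], tasks2: [ 'm1', 'm2' ] }
```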

&lt;p&gt;In order to connect to a RabbitMQ instance using the amqplib npm package, the &lt;code&gt;amqplib.connect()&lt;/code&gt; function is used.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const conn = await amqplib.connect(process.env.RMQ_HOST);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a persistent connection to the RabbitMQ instance. From there &lt;em&gt;channels&lt;/em&gt; can be created, which are containers for our different queue and exchange operations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const ch1 = await conn.createChannel();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Queues and exchanges are defined in the application code, by asserting them into existence.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;await ch1.assertExchange('name_of_exchange', '', { ... });
await ch1.assertQueue('name_of_queue');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then the queue can be bound to the exchange.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;await ch1.bindQueue('name_of_queue', 'name_of_exchange');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alternatively, instead of explicitly asserting an exchange, the default direct exchange can be used simply by asserting a queue, and using the &lt;code&gt;channel.sendToQueue()&lt;/code&gt; method.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;await ch1.assertQueue('name_of_queue');
ch1.sendToQueue('name_of_queue', message);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This hides the exchange from the caller, but the default direct exchange is still used internally as an intermediary: &lt;code&gt;sendToQueue(queue, message)&lt;/code&gt; is equivalent to &lt;code&gt;publish('', queue, message)&lt;/code&gt;, publishing to the default exchange with the queue name as the routing key.&lt;/p&gt;

&lt;p&gt;When explicitly asserting an exchange, the &lt;code&gt;channel.publish()&lt;/code&gt; method should be used.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;await ch1.assertExchange('name_of_exchange', '', { ... });
await ch1.assertQueue('name_of_queue');
ch1.publish('name_of_exchange', '' message);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is a complete implementation demonstrating a fanout exchange and the default direct exchange, utilizing two channels and publishing a simple message to both exchanges, for a total of three queues (two for the fanout, one for the direct). The messages are published every 100 milliseconds.&lt;/p&gt;

&lt;p&gt;In either case, the type of &lt;code&gt;message&lt;/code&gt; should be a Buffer. This is often prepared by using &lt;code&gt;Buffer.from(data)&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// ./apps/sender/src/index.js

import amqplib from "amqplib";

(async () =&amp;gt; {
  const exchange = "tasks_exchange";
  const queue1 = "tasks1";
  const queue2 = "tasks2";
  const queue3 = "tasks3";

  const conn = await amqplib.connect(
    process.env.RABBIT_MQ_HOST ?? "localhost"
  );

  const ch1 = await conn.createChannel();
  await ch1.assertExchange(exchange, "fanout", {});
  await ch1.assertQueue(queue1);
  await ch1.assertQueue(queue2);
  await ch1.bindQueue(queue1, exchange, "");
  await ch1.bindQueue(queue2, exchange, "");

  const ch2 = await conn.createChannel();
  await ch2.assertQueue(queue3);

  setInterval(() =&amp;gt; {
    const message = Buffer.from("something to do");
    ch1.publish(exchange, "", message);
    ch2.sendToQueue(queue3, message);
  }, 100);
})();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


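&lt;p&gt;Since message bodies are raw bytes, structured data is typically JSON-encoded into the Buffer by the sender and decoded from &lt;code&gt;msg.content&lt;/code&gt; on the other side. A minimal sketch of that round trip (plain Node.js, no broker required):&lt;/p&gt;

```javascript
// Sender side: serialize a payload into the Buffer that gets published.
const payload = { task: "resize_image", id: 42 };
const message = Buffer.from(JSON.stringify(payload));

// Consumer side: msg.content arrives as the same bytes; decode it back.
const decoded = JSON.parse(message.toString());
console.log(decoded.task, decoded.id); // resize_image 42
```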

&lt;p&gt;Since subscribers always consume from queues, not exchanges, the code for them is much more consistent across implementations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// ./apps/receiver/src/index.js

import amqplib from "amqplib";

(async () =&amp;gt; {
  /** @type {string} */
  // @ts-ignore
  const queue = process.env.QUEUE_NAME;
  const conn = await amqplib.connect(
    process.env.RABBIT_MQ_HOST ?? "localhost"
  );

  const channel = await conn.createChannel();
  await channel.assertQueue(queue);

  channel.consume(queue, (msg) =&amp;gt; {
    if (msg !== null) {
      console.log("Received:", msg.content.toString());
      channel.ack(msg);
    } else {
      console.log("Consumer cancelled by server");
    }
  });
})();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Playground overview
&lt;/h2&gt;

&lt;p&gt;The Node.js services provided are configured to communicate with RabbitMQ using the &lt;a href="https://www.rabbitmq.com/tutorials/amqp-concepts.html"&gt;AMQP 0-9-1 protocol&lt;/a&gt;. There is a fantastic package, &lt;a href="https://www.npmjs.com/package/amqplib"&gt;amqplib&lt;/a&gt;, which we will be using as the client in our Node.js services. Speaking of services, here are the services defined by the docker-compose.yml file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ./docker-compose.yml

version: "3.9"

services:
  sender:
    build:
      context: ./apps/sender/
    environment:
      - RABBIT_MQ_HOST=amqp://rabbitmq
    depends_on:
      rabbitmq:
        condition: service_healthy
    deploy:
      replicas: 1

  tasks1_receiver:
    build:
      context: ./apps/receiver/
    environment:
      - RABBIT_MQ_HOST=amqp://rabbitmq
      - QUEUE_NAME=tasks1
    depends_on:
      rabbitmq:
        condition: service_healthy
    deploy:
      replicas: 1

  tasks2_receiver:
    build:
      context: ./apps/receiver/
    environment:
      - RABBIT_MQ_HOST=amqp://rabbitmq
      - QUEUE_NAME=tasks2
    depends_on:
      rabbitmq:
        condition: service_healthy
    deploy:
      replicas: 1

  tasks3_receiver:
    build:
      context: ./apps/receiver/
    environment:
      - RABBIT_MQ_HOST=amqp://rabbitmq
      - QUEUE_NAME=tasks3
    depends_on:
      rabbitmq:
        condition: service_healthy
    deploy:
      replicas: 1

  rabbitmq:
    image: rabbitmq:management-alpine
    container_name: rabbitmq
    ports:
      - 15672:15672
    healthcheck:
      test: rabbitmq-diagnostics check_port_connectivity
      interval: 3s
      timeout: 30s
      retries: 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are two types of Node.js services included out of the box:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;sender&lt;/li&gt;
&lt;li&gt;receiver&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The source code for the sender service is located in &lt;code&gt;./apps/sender/&lt;/code&gt; and the source code for the three receiver services is shared, located in &lt;code&gt;./apps/receiver/&lt;/code&gt;. The sender, by default, is a single container producing messages to two exchanges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tasks_exchange (fanout exchange)&lt;/li&gt;
&lt;li&gt;default (direct exchange)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tasks_exchange pushes messages to two queues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tasks1&lt;/li&gt;
&lt;li&gt;tasks2&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The services defined in docker-compose.yml as tasks1_receiver and tasks2_receiver subscribe to tasks1 and tasks2, respectively.&lt;/p&gt;

&lt;p&gt;The default direct exchange is used when the sender service sends messages to the tasks3 queue, to which the tasks3_receiver subscribes.&lt;/p&gt;

&lt;p&gt;Starting the project spawns a single instance of the sender, as well as a single instance of each receiver. The number of senders or receivers can be increased by raising &lt;code&gt;replicas&lt;/code&gt; in the docker-compose.yml file from 1 to the desired number of instances. Increasing the replicas of any of the receivers is useful for observing the round-robin load balancing that RabbitMQ performs when multiple instances of a service subscribe to the same queue.&lt;/p&gt;

&lt;p&gt;Note that messages sent to tasks_exchange are delivered to both the tasks1 and tasks2 queues. tasks1_receiver and tasks2_receiver are not load balanced against each other, because the exchange is a fanout type and the queues are distinct; neither queue is aware of the other.&lt;/p&gt;

&lt;p&gt;To watch in real time how RabbitMQ handles delayed acknowledgement of messages, how it load balances, and how messages are passed between exchanges and queues, you can adjust the number of replicas, modify the source code to send more messages, or experiment with different types of exchanges. RabbitMQ metrics can be observed by opening the management UI on port 15672 (if the project is running locally).&lt;/p&gt;

&lt;p&gt;If you make changes to the docker-compose.yml file, you will need to run either&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;make start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;make dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I recommend using &lt;code&gt;make dev&lt;/code&gt; as it creates a volume mount to the source code and has a file watcher, so the container should be updated immediately when changes are made.&lt;/p&gt;

&lt;p&gt;If you want to find out more about how I created this Docker local development environment, you can &lt;a href="https://nathanfries.com/posts/advanced-docker-local-development/"&gt;read about it here.&lt;/a&gt;&lt;/p&gt;

</description>
      <category>rabbitmq</category>
      <category>amqp</category>
      <category>docker</category>
      <category>node</category>
    </item>
    <item>
      <title>Docker for Next.js and Beyond with Hot-Swappable Containers</title>
      <dc:creator>Nathan Fries</dc:creator>
      <pubDate>Mon, 26 Jun 2023 00:20:00 +0000</pubDate>
      <link>https://dev.to/npfries/docker-for-nextjs-and-beyond-with-hot-swappable-containers-4aed</link>
      <guid>https://dev.to/npfries/docker-for-nextjs-and-beyond-with-hot-swappable-containers-4aed</guid>
      <description>&lt;p&gt;Demonstrating the stages of development, testing, deployment, and long term support with Docker multi-stage builds, Docker Compose, and GitHub Actions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;Hot-swappable containers are one of my favorite features of Docker, and one that is often not implemented for local development. This post should hopefully equip you with enough information to implement this, or something similar, in your own codebase (if you find it to be useful). If you read my previous post, &lt;a href="https://nathanfries.com/posts/advanced-docker-local-development/"&gt;Advanced Docker Patterns for Local Development&lt;/a&gt;, and feel like you don't need to revisit any of that information, you can skip to "Common Pitfalls."&lt;/p&gt;

&lt;p&gt;I should probably define "hot-swappable containers." What I mean is the ability, through a single command, to swap out a running container with another.&lt;/p&gt;

&lt;p&gt;Dependencies for this project&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Docker&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Npfries/docker-e2e-nextjs"&gt;https://github.com/Npfries/docker-e2e-nextjs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This project was initially scaffolded with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npx create-next-app@latest

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To clone the repo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/Npfries/docker-e2e-nextjs.git

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Scenario
&lt;/h2&gt;

&lt;p&gt;Scenario: I am a developer working on the User Service. I am happy with User Service, because it provides me with a "development mode" that reflects changes instantly thanks to a Docker volume binding the source code on the host to the container, and a file watcher that restarts the service when changes are made. I usually need the Search Service &lt;em&gt;running&lt;/em&gt; in my local environment to facilitate testing the behavior of the two services working together. I don't usually need to make changes to the Search Service, so I have been happy to just pull the latest prebuilt image from the image repository. Something happens, and now I need to make a change to Search Service. How can I quickly swap this Search Service container with a "development mode" container, like the one I have been using for User Service?&lt;/p&gt;

&lt;p&gt;In this scenario, wouldn't it be great if there was a single command I could run like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Search Service
&amp;gt; make dev          # Removes the lightweight container
                     # Starts a dev mode container

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This would replace my existing Search Service container with a development mode container, built from source, with the same file watcher and dev server setup as User Service. Ideally, the inverse could also be applied to User Service (a lightweight container without bind mounts and file watchers), since I am not going to be making changes to it for the time being.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# User Service
&amp;gt; make start_local  # Removes the dev mode container
                     # Starts a lightweight container

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By the end of this, hopefully you will feel equipped to bring these features to your Next.js (or any) codebase.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common pitfalls
&lt;/h2&gt;

&lt;p&gt;Often, engineering teams already utilizing Docker for local development fall into one of two camps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Many containers running development servers and file watchers (&lt;code&gt;next dev&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Many containers serving built version of the app that must be rebuilt when changes are made (&lt;code&gt;next build &amp;amp;&amp;amp; next start&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the case of many development servers, you generally also have a volume for passing the source code from the host to the running container. The development server is watching the mounted source code files within the container, and will restart when changes are detected. This is good for us developers, as our changes are immediately reflected within our containerized application.&lt;/p&gt;

&lt;p&gt;Unfortunately, this approach does not come without its tradeoffs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Resource contention: When you have multiple Docker bind mounts, file watchers, and development servers running concurrently, you can quickly reach resource limits of both Docker and your file system.&lt;/li&gt;
&lt;li&gt;Managing version control: Mounting the source code on the host can mean ensuring you are always pulling latest changes down in individual repositories unless you are working in a monorepo.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the case of built services (&lt;code&gt;next start&lt;/code&gt;), they usually come in two flavors: prebuilt images pulled from the project's image repository, or images built from source on the host machine. Images pulled from an image repository have the advantage of not needing to be rebuilt on each host, reducing the time required to start the service and alleviating the concern of working on an old version of the image. When you need a specific branch, or have local changes that should be reflected in the container, the image can be rebuilt from source whenever necessary.&lt;/p&gt;

&lt;p&gt;This also carries some downside:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Changes to the source code require the image to be rebuilt, or a development server to run on the host (this is not always possible in complex projects, such as when the container expects to be within the Docker network or served by a reverse proxy).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There is no good reason each method cannot be supported, and even selectively utilized on a per-container basis. This is a best-of-both-worlds scenario where containers in this project can be toggled between three different "modes": "dev", "start_local", and "start_latest". Defining each "mode":&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"start_latest": The latest image is pulled from the image repository, with an entrypoint of "next start"&lt;/li&gt;
&lt;li&gt;"start_local": The image for the app is built from source, with an entrypoint of "next start"&lt;/li&gt;
&lt;li&gt;"dev": The image for the Next.js app is built from source, with an entrypoint of "next dev" and the project source code is mounted from the host to the container.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ideally, this should be implemented in a way that doesn't require the user to have a large amount of knowledge or context within the project. On the other hand, this should not be implemented in a way that reduces flexibility. Escape hatches should be provided. Things to avoid:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hiding the Dockerfile, docker-compose.yml, or other required files away in the hopes that the user never needs to leave your abstraction.&lt;/li&gt;
&lt;li&gt;Creating a merely implicit contract between repositories.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://nathanfries.com/posts/advanced-docker-local-development/"&gt;In my prior post&lt;/a&gt; I outlined how this can be achieved with with a simple project. I wanted to follow up on that and implement a more comprehensive end-to-end demo using a common framework, Next.js, and also demonstrate how this would be implemented into a more complex deployment process. Unlike the last project, this one also utilizes GitHub Actions to push our latest image to my &lt;a href="https://github.com/Npfries/docker-e2e-nextjs/pkgs/container/docker-e2e-nextjs"&gt;GitHub image repository&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building the image in CI
&lt;/h2&gt;

&lt;p&gt;I am using GitHub Actions to build the image for the Next.js application and push it to the image repository. This was achieved by utilizing one of the prebuilt "Docker" GitHub Actions suggested when I opened the GitHub Actions tab in GitHub.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name: Create and publish a Docker image

on:
  push:
    branches: ['main']

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  build-and-push-image:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write

    steps:
      - name: Checkout repository
        uses: actions/checkout@v3

      - name: Log in to the Container registry
        uses: docker/login-action@65b78e6e13532edd9afa3aa52ac7964289d1a9c1
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata (tags, labels) for Docker
        id: meta
        uses: docker/metadata-action@9ec57ed1fcdbf14dcef7dfbe97b2010124a938b7
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}

      - name: Build and push image
        uses: docker/build-push-action@f2a1d5e99d037542a71f64918e516c093c6f3fc4
        with:
          context: .
          target: start
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I am targeting a specific stage in the project's multi-stage Dockerfile, "start". This is the image that will be pulled when running the "start_latest" script in the provided Makefile. The same stage is also targeted when running the "start_local" script; however, in that case the stage is targeted in the docker-compose.start.yml file.&lt;/p&gt;

&lt;p&gt;After this image was built and pushed to the image repository, I updated the docker-compose.yml file to use this image. Now, when a plain &lt;code&gt;docker compose up -d&lt;/code&gt; is executed, the image is pulled and the container is started without needing to have any knowledge about the project. If someone simply wants to start the service in order to test it or run it alongside other services being developed, they can use standard Docker commands. This is a useful escape hatch. It also allows the image to be pulled when running the &lt;code&gt;make start_latest&lt;/code&gt; script, without providing a different image in the docker-compose.start.yml file.&lt;/p&gt;

&lt;h2&gt;
  
  
  Helper scripts
&lt;/h2&gt;

&lt;p&gt;I provided several helper scripts that run Docker commands achieving the different modes in which we can run the service. Each of the scripts starts with the base docker-compose.yml file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# docker-compose.yml
version: "3.9"

services:
  next_app:
    image: ghcr.io/npfries/docker-e2e-nextjs:main
    build:
      context: ../
      # target: # this is specified in the other compose files
    ports:
      - 3000:80

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a pretty bare-bones docker-compose.yml file! Something a bit out of the ordinary is the port mapping. You might be accustomed to the service in a development container listening on some non-80 port, but in this case it is 80. It is simpler to be consistent with the port mapping, so that it can be defined once in the base docker-compose.yml. Because the httpd image is used for prod_static and serves from port 80, the mapping needs to be consistent. Remember, list-style configs such as port mappings cannot be overridden by passing multiple docker-compose.yml files to Docker commands; those entries are merged additively.&lt;br&gt;
&lt;/p&gt;
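&lt;p&gt;As an illustration of that additive merging (the override file and its 3001 mapping below are hypothetical, not part of this project): if a second compose file added another port mapping, the effective configuration would contain both entries rather than replacing the first. The merged result can be inspected with &lt;code&gt;docker compose config&lt;/code&gt;.&lt;/p&gt;

```yaml
# Base docker-compose.yml maps 3000:80; suppose a hypothetical override
# file (docker-compose.override.yml) adds:
#
#   services:
#     next_app:
#       ports:
#         - 3001:80
#
# The effective merged config (viewable with `docker compose config`)
# then contains BOTH mappings -- the lists are appended, not replaced:
services:
  next_app:
    ports:
      - 3000:80
      - 3001:80
```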

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Makefile
start_latest:
    docker compose -f ./docker/docker-compose.yml -f ./docker/docker-compose.start.yml -p my_project up -d --no-build --pull always

# docker-compose.start.yml
services:
  next_app:
    container_name: "next_app-start"
    build:
      target: start

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I am passing in, in addition to the docker-compose.yml file, a docker-compose.start.yml. This file overrides the container name, appending "-start". This is useful for knowing in which mode the currently running service is, without impacting the hot-swappability of the container.&lt;/p&gt;

&lt;p&gt;As for the additional arguments, &lt;code&gt;--no-build --pull always&lt;/code&gt;: these prevent the image from being built locally, ensuring that the script isn't lying to the user and is always using the latest image in the repository. By default, if an image were available in the remote repository, Docker would pull it, but fall back to building.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Makefile
start_local:
    docker compose -f ./docker/docker-compose.yml -f ./docker/docker-compose.start.yml -p my_project up -d --build

# same docker-compose.start.yml above

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This uses the same docker-compose.yml files as above, but uses the &lt;code&gt;--build&lt;/code&gt; argument in order to force Docker to build the image from source. As mentioned above, the default behavior would be for Docker to first attempt to pull the image from the remote image repository, which in this case, I don't want.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Makefile
dev:
    docker compose -f ./docker/docker-compose.yml -f ./docker/docker-compose.dev.yml -p my_project up -d --build

# docker-compose.dev.yml
services:
  next_app:
    container_name: "next_app-dev"
    build:
      target: dev
    volumes:
      - ../:/app/

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead of docker-compose.start.yml, this one uses docker-compose.dev.yml. This file overrides the target stage to the &lt;code&gt;dev&lt;/code&gt; stage of the Dockerfile, and also creates a volume providing the source code from the host to the container. When changes are made to the source code on the host, the bind mount reflects the changes inside the container, and the file watcher from &lt;code&gt;next dev&lt;/code&gt; updates the running application. This way, live reloading works just as it does when running &lt;code&gt;npm run dev&lt;/code&gt; in a Next.js project outside of Docker.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;prod:
    docker compose -f ./docker/docker-compose.yml -f ./docker/docker-compose.prod.yml -p my_project up -d --build

prod_static:
    docker compose -f ./docker/docker-compose.yml -f ./docker/docker-compose.prod_static.yml -p my_project up -d --build

# the additional docker-compose.yml files target the "prod" and "prod_static" stages, respectively.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These two scripts are largely interchangeable for local development, but there is a significant difference when it comes to deployment. This will be detailed later when examining the Dockerfile.&lt;/p&gt;

&lt;p&gt;There are some downsides to all this container hot-swapping. Between our multi-stage builds, and resulting images being pulled and built frequently, we end up with &lt;em&gt;a lot&lt;/em&gt; of "dangling" images. This will quickly fill up a hard drive, and should be cleaned up frequently. I have a small confession to make. All of the scripts above have a second line: &lt;code&gt;docker rmi $$(docker images -f "dangling=true" -q)&lt;/code&gt;. This cleans up those dangling images. However you choose to implement this feature, you should plan to deal with this somehow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dockerfile
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM node:18 AS base

WORKDIR /app

COPY package*.json ./

RUN npm ci

COPY . .

FROM base AS builder

RUN npm run build

CMD []

FROM base AS builder_static

COPY next.static.config.js /app/next.config.js

RUN npm run build

CMD []

FROM builder AS start

ENTRYPOINT ["npm", "run", "start:docker"]

FROM base AS dev

ENTRYPOINT ["npm", "run", "dev:docker"]

FROM node:18-alpine AS prod

WORKDIR /app

COPY --from=base /app/package*.json ./
COPY --from=base /app/node_modules/ ./node_modules/
COPY --from=base /app/public/ ./public/
COPY --from=builder /app/.next/ ./.next/

ENTRYPOINT ["npm", "run", "start:docker"]

FROM httpd:2.4 AS prod_static

COPY --from=builder_static /app/out/ /usr/local/apache2/htdocs/

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not a comprehensive guide to multi-stage builds, but essentially what is happening here is defining multiple stages that can be targeted with the &lt;code&gt;--target&lt;/code&gt; argument of &lt;code&gt;docker build&lt;/code&gt;, or within a docker-compose.yml file. Docker will build that stage, and any stage that it inherits from (the FROM statement). It will not necessarily execute from top to bottom, but instead can skip any stages that are not dependencies of the target stage.&lt;/p&gt;

&lt;p&gt;I have tried to minimize duplicate work, and adhere to some best practices for production stages. For example the prod_static stage does not inherit from &lt;code&gt;base&lt;/code&gt; or &lt;code&gt;builder&lt;/code&gt;, but instead only copies the produced artifacts from the &lt;code&gt;builder_static&lt;/code&gt; stage. This reduces the surface area of potential attack vectors, by not including unnecessary code or tools in the final image.&lt;/p&gt;

&lt;p&gt;Here is a chart to illustrate the stages that will be utilized by each Makefile script.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;start_latest&lt;/th&gt;
&lt;th&gt;start_local&lt;/th&gt;
&lt;th&gt;dev&lt;/th&gt;
&lt;th&gt;prod&lt;/th&gt;
&lt;th&gt;prod_static&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;base&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;builder&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;builder_static&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;start&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dev&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;prod&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;prod_static&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Try it out
&lt;/h2&gt;

&lt;p&gt;If you would like to try out the commands, clone or fork the repo and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; make start_local

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the build completes, checking the running containers should yield a container named next_app-start.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; docker ps

CONTAINER ID IMAGE NAMES
afd2e26dd86a ghcr.io/npfries/... next_app-start

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Switching to development mode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; make dev

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see some output about the build, and lastly some output about the prior dangling image being cleaned up. If you want to see what happens without the cleanup scripts, you can remove them and watch the dangling images pile up for fun.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; docker ps

CONTAINER ID IMAGE NAMES
7406f5af332a ghcr.io/npfries/... next_app-dev

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that the container name was changed to next_app-dev. However, we never explicitly told Docker to remove the old next_app-start. Docker did this automatically because as far as Docker is concerned, they are the same service. They share the same top-level service definition in the docker-compose.yml files, so it automatically shuts down the old container when the new one is built and ready to be started. This avoids any port mapping conflicts or other potential issues.&lt;/p&gt;

&lt;p&gt;My motivation for writing these posts is a perceived gap in how Docker is understood and used. The Dockerfile in a repo tends to be owned either by a DevOps team, focused on CI and deployment, or by the dev team alone, focused on local development workflows. I hope that this example provides some ideas around how a great local development experience can be maintained alongside complex deployment requirements.&lt;/p&gt;

&lt;p&gt;Feel free to use the ideas or code in your own projects. If you have any questions or feedback, feel free to email me at &lt;a href="mailto:me@nathanfries.com"&gt;me@nathanfries.com&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>docker</category>
      <category>nextjs</category>
      <category>github</category>
    </item>
    <item>
      <title>Advanced Docker Patterns for Local Development</title>
      <dc:creator>Nathan Fries</dc:creator>
      <pubDate>Thu, 22 Jun 2023 01:34:00 +0000</pubDate>
      <link>https://dev.to/npfries/advanced-docker-patterns-for-local-development-4gk1</link>
      <guid>https://dev.to/npfries/advanced-docker-patterns-for-local-development-4gk1</guid>
      <description>&lt;p&gt;How multi-stage builds, yaml merges, and some creativity can overcome the complexity that Docker brings to local environments for larger organizations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;Often, Docker is suggested for local development to simplify dependency management, provide isolation and reproducibility, and smooth over architecture differences between environments. Most examples do not go into detail about how Docker can be introduced without significantly impacting workflow. Various blogs and YouTube channels make it seem as though simply adding a 5-6 line Dockerfile and running &lt;code&gt;docker run ...&lt;/code&gt; will satisfy most requirements. That is far from the case, and if introduced improperly, Docker will cause more headaches than it solves.&lt;/p&gt;

&lt;p&gt;Dependencies for this project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Docker&lt;/li&gt;
&lt;li&gt;Node&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Npfries/advanced-docker-local-development.git"&gt;https://github.com/Npfries/advanced-docker-local-development.git&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For this post, I will frequently be referring to &lt;a href="https://github.com/Npfries/advanced-docker-local-development"&gt;the repository&lt;/a&gt;, which contains a "basic" configuration that actually solves some of the problems I have encountered while introducing Docker into my local development workflow for existing projects.&lt;/p&gt;

&lt;p&gt;The project largely consists of two Docker services&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: "3.9"

services:
  my_service:
    image: my_service
    container_name: my_service
    build:
      context: .
      target: ${DOCKER_STAGE}
    environment:
      - PORT=3000
      - DB_USER=${DB_USER}
      - DB_PASSWORD=${DB_PASSWORD}
      - DB_URL=${DB_URL}
    ports:
      - 3000:3000
    depends_on:
      - mariadb
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000"]
      start_period: 5s
      interval: 5s
      timeout: 5s
      retries: 55

  mariadb:
    image: mariadb:jammy
    container_name: mariadb
    environment:
      MYSQL_ROOT_PASSWORD: ${DB_ROOT_PASSWORD}
      MYSQL_DATABASE: db
      MYSQL_USER: ${DB_USER}
      MYSQL_PASSWORD: ${DB_PASSWORD}
    volumes:
      - ./data:/var/lib/mysql
    ports:
      - "3306:3306"
    healthcheck:
      test: mysqladmin ping -h 127.0.0.1 -u $$DB_USER --password=$$DB_PASSWORD
      start_period: 5s
      interval: 5s
      timeout: 5s
      retries: 55

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It also contains a Dockerfile at the root of the project from which my_service will be built, with the source code contained in src/. It is a Node based project, and the package.json file contains helper scripts for executing some of the longer docker commands needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;The main application, my_service, is a simple Express server, with only two endpoints: a health check, and an endpoint at &lt;code&gt;/&lt;/code&gt; that checks that we can connect to the database for no reason in particular other than to prove it works.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import express from "express";
const app = express();
import { PrismaClient } from "@prisma/client";

const PORT = process.env.PORT ?? 3000;

app.get("/", async (req, res) =&amp;gt; {
  const prisma = new PrismaClient();
  await prisma.$connect();
  await prisma.$disconnect();
  res.send("Hello, World!");
});

app.get("/healthcheck", (req, res) =&amp;gt; {
  res.send({ health: "healthy" });
});

app.listen(PORT);
console.log(`App listening on port: ${PORT}`);

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are two docker compose files, docker-compose.yml and docker-compose.dev.yml, a single multi-stage Dockerfile, and three entrypoint shell scripts that will be used by the three usable stages in the Dockerfile.&lt;/p&gt;

&lt;p&gt;I have also provided a .env file that Docker will use when using &lt;code&gt;docker compose&lt;/code&gt; commands.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DOCKER_STAGE=start
DB_USER=user
DB_PASSWORD=password
DB_ROOT_PASSWORD=password
DB_URL=mysql://root:password@mariadb:3306/db

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Start
&lt;/h2&gt;

&lt;p&gt;There are two "modes" in which the project can be "started". Many developers might be familiar with the "start" vs. "dev" pattern for local development. What I mean by "start" is a simple lightweight process just executing the code. A "dev" command usually implies some additional developer tooling, like a file watcher and automatic restart, maybe some hot module reloading, etc. Our first complication is that when you introduce containers you have a new decision tree. What are we starting? A node process on our host machine? A prebuilt container image that is pulled from some image repository? A container image that should be built from our local repository? Great question, and one that is only likely asked &lt;em&gt;after&lt;/em&gt; Docker has been approved and chosen for use.&lt;/p&gt;

&lt;p&gt;The package.json file provides three start scripts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;npm run start&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;npm run start:docker:latest&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;npm run start:docker:local&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first, &lt;code&gt;npm run start&lt;/code&gt;, is the most straightforward.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node ./src/index.js

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the sort of script you would typically find in most package.json files created from any boilerplate or framework that isn't using Typescript. Since we're using Docker, we won't be calling this script ourselves. No, it will be reserved for our containerized application.&lt;/p&gt;

&lt;p&gt;The second, &lt;code&gt;npm run start:docker:latest&lt;/code&gt;, is much more involved, and is one of the scripts we should expect to use directly, or ideally in larger systems, have some outside manager call.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker compose up --no-build --pull always -d

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We're doing some creative things here. We are explicitly telling Docker to &lt;em&gt;not&lt;/em&gt; build this image from source, and &lt;em&gt;forcing&lt;/em&gt; it to pull the image from a remote repository, even if one is locally available. The purpose of this script is for developers who need to have this service locally, but do not need to modify it at present. Ideally this image is available on a remote repository, pushed by some pipeline automatically on commits or merges to the main branch.&lt;/p&gt;

&lt;p&gt;The last, &lt;code&gt;npm run start:docker:local&lt;/code&gt; is a little simpler.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker compose up --build -d

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tells Docker to build the image from source. It should be used when a developer does not need to actively develop this service but would like to make a simple change or check out a specific branch, or when a remote image is not available for any reason.&lt;/p&gt;

&lt;p&gt;We prefer to use start scripts over dev scripts in general when a service is not being developed. It might not be critical for a single or a handful of services, but when we start to talk about ten, twenty, &lt;em&gt;or more&lt;/em&gt; microservices and microfrontends, plus databases, reverse proxies, message queues, etc. those volume mounts and file watchers really weigh down a system. Docker is powerful but its performance is almost entirely dependent on the workload being performed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dev
&lt;/h2&gt;

&lt;p&gt;Moving on to the &lt;code&gt;dev&lt;/code&gt; scripts, we have two.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;npm run dev&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;npm run dev:docker&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first, &lt;code&gt;npm run dev&lt;/code&gt;, like the first start script, is going to now be reserved for the containerized application. Keep in mind, there is nothing stopping a user from calling this script, and it will indeed work, but what about all those other services you wanted Docker for in the first place?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nodemon ./src/index.js

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Just a simple file watcher and node execution.&lt;/p&gt;

&lt;p&gt;The last script we need to talk about is &lt;code&gt;npm run dev:docker&lt;/code&gt;. This is where the real magic happens.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker compose -f docker-compose.yml -f docker-compose.dev.yml up --build -d

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that we are again using &lt;code&gt;--build&lt;/code&gt; in order to force docker to build the image from source. I can't really imagine wanting to use a dev server on an image from a remote repository. If there is a use-case I've overlooked please reach out and let me know.&lt;/p&gt;

&lt;p&gt;Additionally, we are passing in two .yml files for the configuration. Passing in multiple files like this merges them. It is a "deep" merge: matching keys are merged recursively rather than the later file simply replacing top-level keys wholesale. This is important because we do not want to completely duplicate a service definition; we only want to define new values or override specific keys from the base config.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;services:
  my_service:
    build:
      target: dev
    volumes:
      - "./src/:/app/src/"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The docker-compose.dev.yml specifies a specific target for the build. This is targeting the &lt;code&gt;dev&lt;/code&gt; stage of our multi-stage Dockerfile. This stage specifies the &lt;code&gt;entrypoint-dev.sh&lt;/code&gt; which includes our first dev script, &lt;code&gt;npm run dev&lt;/code&gt;. This additional docker compose file also includes a new volume, so that any changes made to the /src directory are reflected in the container, and the file watcher from nodemon will pick up the changes and restart the server.&lt;/p&gt;

&lt;p&gt;This is what the merged file would look like if it was exposed to the user:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: "3.9"

services:
  my_service:
    image: my_service
    container_name: my_service
    build:
      context: .
      target: dev
    volumes:
      - "./src/:/app/src/"
    environment:
      - PORT=3000
      - DB_USER=${DB_USER}
      - DB_PASSWORD=${DB_PASSWORD}
      - DB_URL=${DB_URL}
    ports:
      - 3000:3000
    depends_on:
      - mariadb
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000"]
      start_period: 5s
      interval: 5s
      timeout: 5s
      retries: 55

# rest of docker-compose.yml

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
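&lt;p&gt;To make the merge semantics concrete, here is a minimal JavaScript sketch of a compose-style deep merge. This is illustrative only, not how Compose is implemented: mappings merge recursively, scalars from the later file win, and lists are simply concatenated here, whereas Compose's real rules vary by field.&lt;/p&gt;

```javascript
// Illustrative deep merge, approximating how `docker compose -f a.yml -f b.yml`
// combines files. Not Compose's actual implementation.
function deepMerge(base, override) {
  const result = { ...base };
  for (const [key, value] of Object.entries(override)) {
    const current = result[key];
    if (Array.isArray(value) && Array.isArray(current)) {
      result[key] = [...current, ...value]; // concatenate lists (simplification)
    } else if (value && current && typeof value === "object" && typeof current === "object") {
      result[key] = deepMerge(current, value); // recurse into mappings
    } else {
      result[key] = value; // later file overrides scalars / adds new keys
    }
  }
  return result;
}

// Trimmed-down stand-ins for docker-compose.yml and docker-compose.dev.yml
const base = {
  my_service: { build: { context: ".", target: "start" }, ports: ["3000:3000"] },
};
const devOverride = {
  my_service: { build: { target: "dev" }, volumes: ["./src/:/app/src/"] },
};

console.log(JSON.stringify(deepMerge(base, devOverride), null, 2));
```

Running it shows `target` replaced with `dev` while `context` and `ports` from the base file are preserved, mirroring the merged file shown above.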



&lt;p&gt;Speaking of multi-stage builds, let's take a look at the Dockerfile.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM node:20-alpine as base

WORKDIR /app/
COPY package*.json /app/

FROM base AS start

RUN npm ci
COPY . .
CMD ["sh", "./scripts/entrypoint-start.sh"]

FROM base AS dev

RUN npm i
COPY . .
CMD ["sh", "./scripts/entrypoint-dev.sh"]

FROM base AS prod

RUN npm ci
COPY . .
CMD ["sh", "./scripts/entrypoint-prod.sh"]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that each stage inherits from &lt;code&gt;base&lt;/code&gt;; that step is common to all of them. I made a small change to the &lt;code&gt;dev&lt;/code&gt; stage (&lt;code&gt;npm i&lt;/code&gt; rather than &lt;code&gt;npm ci&lt;/code&gt;) to give it a difference beyond the entrypoint, which I hoped would make the distinction clearer.&lt;/p&gt;

&lt;p&gt;The various entrypoints are essentially the application start scripts, plus anything else that would normally need to happen at startup. If you were deploying this Node app outside of a container, you might run database migrations from your ORM's schema as part of the application deployment. I demonstrate that here, even though the application does not actually use any data in the database, to show how it might work.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: If you want to see a more complete example using Next.js and multiple deployment options, you might want to check out my follow-up article: &lt;a href="https://dev.to/npfries/docker-for-nextjs-and-beyond-with-hot-swappable-containers-4aed"&gt;Docker for Next.js and Beyond with Hot-Swappable Containers&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;entrypoint-dev.sh:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/bin/sh

npx prisma migrate reset --force

npm run dev

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;entrypoint-start.sh:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/bin/sh

npx prisma migrate reset --force

npm run start

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;entrypoint-prod.sh:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/bin/sh

npx prisma migrate deploy

npm run start

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I actually quite like this setup. It might be a little enterprise-y, but I feel that it is quite clear what is happening once you have a decent understanding of Docker, and how different use-cases can be approached without too much duplication of configuration or scripts.&lt;/p&gt;

&lt;p&gt;In a future post, I plan to expand on this to demonstrate how larger organizations could utilize these patterns in internal tooling to orchestrate a large and complex environment.&lt;/p&gt;

&lt;p&gt;If you hated this and want to let me know, or if you have any questions, please email me at &lt;a href="mailto:me@nathanfries.com"&gt;me@nathanfries.com&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>docker</category>
      <category>development</category>
      <category>tooling</category>
    </item>
    <item>
      <title>Unsupported: Opensearch Migrations</title>
      <dc:creator>Nathan Fries</dc:creator>
      <pubDate>Mon, 19 Jun 2023 00:38:00 +0000</pubDate>
      <link>https://dev.to/npfries/unsupported-opensearch-migrations-40fn</link>
      <guid>https://dev.to/npfries/unsupported-opensearch-migrations-40fn</guid>
      <description>&lt;h3&gt;
  
  
  Intro
&lt;/h3&gt;

&lt;p&gt;MySQL, Postgres, and MongoDB all have something in common: numerous ORMs and open source tools that provide migrations as a first-class citizen. Why? Because they are extremely powerful for ensuring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Changes are tracked in version control&lt;/li&gt;
&lt;li&gt;Reproducibility&lt;/li&gt;
&lt;li&gt;Ability to quickly revert changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Opensearch (and Elasticsearch), for one reason or another, does not share this trait with other data stores. I won't be exploring the reasons behind this, but instead will be focusing on trying to convince you that migrations for Opensearch are beneficial and worth integrating into your workflow for application search.&lt;/p&gt;

&lt;h3&gt;
  
  
  Background
&lt;/h3&gt;

&lt;p&gt;Opensearch usage is generally classified into two distinct buckets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Static data&lt;/li&gt;
&lt;li&gt;Dynamic data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Elasticsearch, and Opensearch by extension, would classify things like application search (providing search functionality to users within your application) as "static," while things like time-series data such as logs would be considered "dynamic." The distinction matters less when exploring the data in tools like Kibana or Opensearch Dashboards, and more when considering how that data will be searched by applications. I will be focusing on static data in this post, as this is the most likely scenario in which users discover the fragility of most Opensearch configurations.&lt;/p&gt;

&lt;p&gt;When we integrate our applications with database APIs, we expect some stability and guarantees around expected - &lt;em&gt;and actual&lt;/em&gt; - state. We reach into our developer toolboxes for tools such as integration tests, contract tests, migrations, etc. to provide confidence when deploying our applications. For example, using popular ORMs like Sequelize, TypeORM, or Prisma provides out-of-the-box migration functionality. Running a migration for creating a new table is as simple as defining a model and running some variant of a &lt;code&gt;migrate up&lt;/code&gt; command. Regardless of the state of the database or environment, we can be assured that when the command completes successfully the database will be in the expected state.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem
&lt;/h3&gt;

&lt;p&gt;Opensearch and Elasticsearch have limited (or no) support from most well-known ORMs. Any support at all is uncommon, and full support for defining models and executing migrations is rarer still. Frequently configurations such as index settings, mappings, analyzers, search templates, index templates, aliases, etc. are managed by making changes via the REST API, or the Kibana or Opensearch Dashboards Dev Tools.&lt;/p&gt;

&lt;p&gt;Imagine making schema changes in MySQL by manually connecting to the database with MySQL Workbench, or defining models in a bash script full of cURL commands. You have no guarantees that drift hasn't occurred, and often no simple way to safely test changes, or even roll back when incorrect changes are deployed. It doesn't need to be this way.&lt;/p&gt;

&lt;p&gt;Elasticsearch and Opensearch might seem closely related to other document databases; however, their complexity often results in schemas that are just as rigid as a relational database's. Defined field mappings prevent dissimilar types across documents, as do dynamic mappings once they are set. When we need to make a change to an existing mapping, we must reindex. That can be a scary operation, and not one I would personally like to perform manually.&lt;/p&gt;
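&lt;p&gt;For reference, the reindex itself is a single Query DSL operation (the index names here are hypothetical):&lt;/p&gt;

```plaintext
POST _reindex
{
  "source": { "index": "products_v1" },
  "dest": { "index": "products_v2" }
}
```

The scary part is everything around it: creating the new index with the corrected mapping first, swapping aliases, and keeping writes consistent while it runs, which is exactly the kind of sequence worth capturing in a migration.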

&lt;p&gt;When changes are inevitably made, we would like to perform some automated testing. I tend to rely heavily on integration tests in short lived ephemeral environments in most of my processes, and a simple App -&amp;gt; Opensearch integration test should not be too complicated. However, IaC and quick deployment of Opensearch can be somewhat cumbersome. We can quickly get a working Docker project spun up with Opensearch to perform the tests, however it will be a blank slate. How do we get to the point where our various index configurations are applied consistently in a continuous integration pipeline?&lt;/p&gt;

&lt;h3&gt;
  
  
  Migrations to the rescue
&lt;/h3&gt;

&lt;p&gt;Database migrations should be considered table stakes for serious projects, and Opensearch should be no exception. If you are familiar with Opensearch Dashboards or Kibana, the following syntax may be familiar to you.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PUT my_index
{
  "mappings": {
    "properties": {
      "name": { "type": "text" },
      "description": { "type": "text" },
      "timestamp": { "type": "date" }
    }
  }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Elasticsearch and Opensearch use what they call Query DSL. It is a JSON-like syntax, with syntactic sugar for defining the REST methods and endpoints. In this example, maintainers might expect to use this within the Dev Tools of either Opensearch Dashboards or Kibana. Without migrations, where would this be tracked, versioned, or code reviewed? Perhaps we have a DBA that runs stored commands from our repository, perhaps we have this formatted as a cURL command in a bash script we can run against the host, or perhaps some other solution.&lt;/p&gt;

&lt;p&gt;There is no reason we could not adopt patterns from other database tooling, and implement migrations ourselves if our current tooling does not support it. &lt;a href="https://nathanfries.com/posts/opensearch-migrations"&gt;In another post,&lt;/a&gt; I go through the process of implementing a migration CLI for consuming migrations written in Javascript using the Opensearch client library, however this could be extended to support migrations written in Query DSL.&lt;/p&gt;

&lt;p&gt;Once we have some rudimentary migration tooling in place, we can execute migrations with commands as simple as &lt;code&gt;migrate up&lt;/code&gt; or &lt;code&gt;migrate down&lt;/code&gt; in order for our environments to be configured or torn down, respectively.&lt;/p&gt;
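&lt;p&gt;As a sketch of the shape such commands can take (the names and migration here are hypothetical, with in-memory stand-ins for the cluster and the migrations index, not an actual library API):&lt;/p&gt;

```javascript
// Hypothetical sketch of `migrate up` / `migrate down` semantics.
// `state` stands in for the cluster; `applied` for the migrations index.
const migrations = [
  {
    name: "001-create-products-index",
    up: (state) => state.indices.add("products"),
    down: (state) => state.indices.delete("products"),
  },
];

function migrateUp(state, applied) {
  for (const m of migrations) {
    if (!applied.has(m.name)) {
      m.up(state);         // apply the change to the cluster
      applied.add(m.name); // record it, like a logMigration() call
    }
  }
}

function migrateDown(state, applied) {
  for (const m of [...migrations].reverse()) {
    if (applied.has(m.name)) {
      m.down(state);          // revert the change
      applied.delete(m.name); // remove the record, like unlogMigration()
    }
  }
}

const state = { indices: new Set() };
const applied = new Set();

migrateUp(state, applied);   // creates "products"
migrateUp(state, applied);   // idempotent: nothing runs twice
migrateDown(state, applied); // environment torn back down
```

Because the applied set is consulted before each run, `migrate up` is safe to execute repeatedly, which is what makes it usable in a CI pipeline.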

&lt;p&gt;Now, implementing integration tests in our continuous integration pipeline is just a matter of spinning up a small Elasticsearch or Opensearch host, executing a &lt;code&gt;migrate up&lt;/code&gt; command, and running our test suite. We can tear the environment down, with the confidence that our state is managed externally, and we can recreate it any time we wish.&lt;/p&gt;

&lt;p&gt;DBAs and developers no longer need to worry about a copy-paste mishap, and we can sleep easy at night.&lt;/p&gt;

</description>
      <category>opensearch</category>
      <category>platformengineering</category>
      <category>elasticsearch</category>
      <category>migrations</category>
    </item>
    <item>
      <title>Opensearch Migrations for Node.js Applications</title>
      <dc:creator>Nathan Fries</dc:creator>
      <pubDate>Sun, 18 Jun 2023 19:20:00 +0000</pubDate>
      <link>https://dev.to/npfries/opensearch-migrations-for-nodejs-applications-1p4c</link>
      <guid>https://dev.to/npfries/opensearch-migrations-for-nodejs-applications-1p4c</guid>
      <description>&lt;h3&gt;
  
  
  Intro
&lt;/h3&gt;

&lt;p&gt;Opensearch provides no functionality for managing the state of a cluster with respect to index templates, index mappings, analyzers, and search templates. In a typical database, we would expect to find some tooling for migrations, usually provided by an ORM. The Opensearch client library does not provide anything of the sort, and ORMs either do not support Opensearch, or do not have migration tooling that supports it.&lt;/p&gt;

&lt;p&gt;The code for this post can be found at &lt;a href="https://github.com/Npfries/opensearch-umzug"&gt;https://github.com/Npfries/opensearch-umzug&lt;/a&gt;. It contains a docker-compose.yml file with basic development Opensearch and Opensearch Dashboards containers for testing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Outline
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Planning

&lt;ul&gt;
&lt;li&gt;Umzug migrations framework for Javascript projects&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Implementation

&lt;ul&gt;
&lt;li&gt;Abstract &lt;code&gt;CustomStorage&lt;/code&gt; class&lt;/li&gt;
&lt;li&gt;Opensearch REST API vs Typescript client&lt;/li&gt;
&lt;li&gt;Providing executed migrations&lt;/li&gt;
&lt;li&gt;Logging migrations&lt;/li&gt;
&lt;li&gt;Unlogging migrations&lt;/li&gt;
&lt;li&gt;Creating migrations&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Summary&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Planning
&lt;/h3&gt;

&lt;p&gt;Implementing good migration tooling from scratch is enough work that many teams will not consider it. Fortunately, we do not have to start completely from scratch. Taking a look at Sequelize, a popular open source ORM for Javascript, we can see that its maintainers also publish the open source Umzug migration framework, which powers the migration tooling provided by Sequelize. We can leverage this existing tooling and extend it to suit our needs. This saves a considerable amount of work, with the added benefit of behaving similarly to Sequelize.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation
&lt;/h3&gt;

&lt;p&gt;Umzug supports multiple databases out of the box, each implementing the abstract &lt;code&gt;UmzugStorage&lt;/code&gt; class. This class is exposed by the framework, and can be provided to the Umzug constructor as the &lt;code&gt;customStorage&lt;/code&gt; property on the configuration. We can create our own &lt;code&gt;OpensearchStorage&lt;/code&gt; that implements &lt;code&gt;UmzugStorage&lt;/code&gt;. Implementations of &lt;code&gt;UmzugStorage&lt;/code&gt; must implement three methods:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;executed()&lt;/code&gt; - this should return a list of executed migrations.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;logMigration()&lt;/code&gt; - this should log the migration.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;unlogMigration()&lt;/code&gt; - this should unlog the migration.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// OpensearchStorage.js
import { UmzugStorage } from "umzug";

class OpensearchStorage implements UmzugStorage {
    async executed() {}
    async logMigration(params) {}
    async unlogMigration(params) {}
}

export { OpensearchStorage };

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Another thing to consider when using Opensearch with Typescript codebases is whether you prefer to use the Opensearch Javascript client library or the exposed REST API. It is typical to use the Opensearch client, as it provides a pleasant and mostly-typed abstraction over the Opensearch REST API.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;On one hand, the REST API is relatively stable, and the request format will be nearly identical to the Query DSL syntax provided by Opensearch. However, we need to check for breaking changes to the REST API before upgrading Opensearch, as we will not receive deprecation warnings in our IDE.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;On the other hand, the Opensearch client library is easy to use, and we will receive deprecation warnings in our IDE when upgrading the client, but we may need to consider frequency of package updates to keep the library up to date with our Opensearch cluster.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We will be choosing the Opensearch client library for convenience. Keep in mind that the choice does not need to be the same for our implementation of &lt;code&gt;UmzugStorage&lt;/code&gt; and the actual migrations themselves, which we will get to eventually. The team maintaining this implementation may differ from the team or teams maintaining the actual migrations, and they may have differing needs or preferences.&lt;/p&gt;

&lt;p&gt;Before we can implement the &lt;code&gt;OpensearchStorage&lt;/code&gt; class, we will need to prepare the Opensearch client. We will be importing it from &lt;code&gt;@opensearch-project/opensearch&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// OpensearchClient.js
import { Client } from "@opensearch-project/opensearch";

class OpensearchClient extends Client {
    constructor() {
        super({
            node: process.env.OPENSEARCH_HOST,
        });
    }
}

export { OpensearchClient };

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can then consume the client in &lt;code&gt;OpensearchStorage&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// OpensearchStorage.js
import { UmzugStorage } from "umzug";
import { OpensearchClient } from "./OpensearchClient.js";

class OpensearchStorage implements UmzugStorage {
    client: OpensearchClient;

    constructor() {
        this.client = new OpensearchClient();
    }

    async executed() {}
    async logMigration(params) {}
    async unlogMigration(params) {}
}

export { OpensearchStorage };

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we are ready to implement the methods on &lt;code&gt;OpensearchStorage&lt;/code&gt;. Starting with &lt;code&gt;executed&lt;/code&gt;, this method does not take any arguments, and should return all previously executed migrations as an array of strings representing the names of each applied migration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async executed() {
  const migrationsIndexExists = (
    await this.client.indices.exists({ index: 'migrations' })
  ).body

  if (!migrationsIndexExists) {
    await this.client.indices.create({ index: 'migrations' })
    return []
  }

  const response = await this.client.search({
    index: 'migrations',
    body: {
      query: {
        match_all: {}
      },
      size: 100
    }
  })

  const result = response?.body?.hits?.hits?.map(m =&amp;gt; m['_source']['name']) ?? []

  return result
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's break down what is happening here.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;First, we check whether the migrations index exists. If it does not, we create it and, since no migrations can have been applied yet, return early with an empty array. Because this method is called before the others, it is a convenient place to handle creation of the migrations index.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Next, we perform a search with a &lt;code&gt;match_all&lt;/code&gt; query to return all results, setting the size to an arbitrarily high number, comfortably above the foreseeable number of migrations. This can be adjusted to suit your needs, but it deserves care: accidentally truncating results here would cause migrations to be run more than once.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Finally, we map the results to an array containing only the names of the migrations, which is our desired return value for this method.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
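
&lt;p&gt;Conceptually, Umzug diffs the migrations it discovers on disk against the names returned by &lt;code&gt;executed&lt;/code&gt; to decide what still needs to run. A minimal sketch of that diff (the function name here is hypothetical, not part of Umzug's API):&lt;/p&gt;

```javascript
// Hypothetical sketch: how a runner can compute pending migrations
// from discovered file names and the names returned by executed().
function pendingMigrations(discovered, executedNames) {
    const done = new Set(executedNames);
    return discovered.filter((name) => !done.has(name));
}

console.log(
    pendingMigrations(
        ["20230801_create_index.cjs", "20230802_add_alias.cjs"],
        ["20230801_create_index.cjs"]
    )
); // prints [ '20230802_add_alias.cjs' ]
```

&lt;p&gt;This is why truncated results from &lt;code&gt;executed&lt;/code&gt; are dangerous: any name missing from the returned array looks pending and would be run again.&lt;/p&gt;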

&lt;p&gt;Moving on to &lt;code&gt;logMigration&lt;/code&gt;, this method should do what its name suggests, and log the migration to the migrations index we created in &lt;code&gt;executed&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async logMigration(params) {
  await this.client.index({
    index: 'migrations',
    body: {
      name: params.name,
      timestamp: new Date().toISOString()
    },
    refresh: true
  })
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This method is much simpler: all it needs to do is record the migration in the migrations index. We include a timestamp field to track when each migration completed. We add &lt;code&gt;refresh: true&lt;/code&gt; to wait for the record to become queryable before returning; otherwise, calling &lt;code&gt;executed&lt;/code&gt; shortly after running a migration can result in duplicate migration executions. This is most likely to happen during integration tests rather than normal migration runs, but we include it to be safe. Lastly, we can implement &lt;code&gt;unlogMigration&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async unlogMigration(params) {
  await this.client.deleteByQuery({
    index: 'migrations',
    body: {
      query: {
        bool: {
          filter: [
            {
              term: {
                name: params.name
              }
            }
          ]
        }
      }
    }
  })
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is almost as simple as &lt;code&gt;logMigration&lt;/code&gt;, though we must use a filter query to delete the correct migration. One caveat: with default dynamic mapping, &lt;code&gt;name&lt;/code&gt; is indexed as &lt;code&gt;text&lt;/code&gt; with a &lt;code&gt;name.keyword&lt;/code&gt; sub-field, and a &lt;code&gt;term&lt;/code&gt; query against the analyzed &lt;code&gt;name&lt;/code&gt; field may not match multi-token migration names, so targeting &lt;code&gt;name.keyword&lt;/code&gt; is the safer choice in that case. We can now put it all together for a complete implementation of &lt;code&gt;UmzugStorage&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// OpensearchStorage.js
import { OpensearchClient } from "./OpensearchClient.js";

class OpensearchStorage {
    client;

    constructor() {
        this.client = new OpensearchClient();
    }

    async executed() {
        const migrationsIndexExists = (await this.client.indices.exists({ index: "migrations" })).body;

        if (!migrationsIndexExists) {
            await this.client.indices.create({ index: "migrations" });
            return [];
        }

        const response = await this.client.search({
            index: "migrations",
            body: {
                query: {
                    match_all: {},
                },
                size: 100,
            },
        });

        const result = response?.body?.hits?.hits?.map((m) =&amp;gt; m["_source"]["name"]) ?? [];

        return result;
    }

    async logMigration(params) {
        await this.client.index({
            index: "migrations",
            body: {
                name: params.name,
                timestamp: new Date().toISOString(),
            },
            refresh: true,
        });
    }

    async unlogMigration(params) {
        await this.client.deleteByQuery({
            index: "migrations",
            body: {
                query: {
                    bool: {
                        filter: [
                            {
                                term: {
                                    name: params.name,
                                },
                            },
                        ],
                    },
                },
            },
        });
    }
}

export { OpensearchStorage };

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
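
&lt;p&gt;To sanity-check this contract without a live cluster, we can sketch an in-memory stand-in for the handful of client calls the storage uses. Everything below is hypothetical test scaffolding, not the Opensearch client's real behavior; the storage class repeats the same logic as above, modified to accept an injected client:&lt;/p&gt;

```javascript
// Hypothetical in-memory fake of the client calls used by the storage.
class FakeClient {
    constructor() {
        this.store = {}; // index name -> array of documents
        this.indices = {
            exists: async ({ index }) => ({ body: Boolean(this.store[index]) }),
            create: async ({ index }) => {
                this.store[index] = [];
            },
        };
    }
    async index({ index, body }) {
        this.store[index].push(body);
    }
    async search({ index }) {
        const hits = this.store[index].map((doc) => ({ _source: doc }));
        return { body: { hits: { hits } } };
    }
    async deleteByQuery({ index, body }) {
        const name = body.query.bool.filter[0].term.name;
        this.store[index] = this.store[index].filter((d) => d.name !== name);
    }
}

// Same storage logic as above, with the client injected for testability.
class InMemoryStorage {
    constructor(client) {
        this.client = client;
    }
    async executed() {
        const exists = (await this.client.indices.exists({ index: "migrations" })).body;
        if (!exists) {
            await this.client.indices.create({ index: "migrations" });
            return [];
        }
        const response = await this.client.search({
            index: "migrations",
            body: { query: { match_all: {} }, size: 100 },
        });
        return response.body.hits.hits.map((m) => m._source.name);
    }
    async logMigration(params) {
        await this.client.index({
            index: "migrations",
            body: { name: params.name, timestamp: new Date().toISOString() },
            refresh: true,
        });
    }
    async unlogMigration(params) {
        await this.client.deleteByQuery({
            index: "migrations",
            body: { query: { bool: { filter: [{ term: { name: params.name } }] } } },
        });
    }
}

(async () => {
    const storage = new InMemoryStorage(new FakeClient());
    console.log(await storage.executed()); // prints []
    await storage.logMigration({ name: "20230801_create_index.cjs" });
    console.log(await storage.executed()); // prints [ '20230801_create_index.cjs' ]
    await storage.unlogMigration({ name: "20230801_create_index.cjs" });
    console.log(await storage.executed()); // prints []
})();
```

&lt;p&gt;Injecting the client this way also makes it easy to point the same storage at different clusters later.&lt;/p&gt;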



&lt;p&gt;In order to use &lt;code&gt;OpensearchStorage&lt;/code&gt; when we are executing migrations, we should pass it to an instance of &lt;code&gt;Umzug&lt;/code&gt;. This is usually going to be run in a CI environment, or by a user in a local environment. As such, we will set up &lt;code&gt;Umzug&lt;/code&gt; to run as a script, instead of embedding it in our application code directly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// migrate.js
#!/user/bin/env node
import { OpensearchStorage } from './OpensearchStorage.js'
import { OpensearchClient } from './OpensearchClient.js'
import { Umzug } from `umzug`

const client = new OpensearchClient()

const umzug = new Umzug({
  migrations: {
    glob: 'migrations/**/*.cjs'
  },
  logger: console,
  context: client,
  storage: new OpensearchStorage()
})

umzug.runAsCLI()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's name this file &lt;code&gt;migrate.js&lt;/code&gt;. Umzug provides a helpful &lt;code&gt;runAsCLI()&lt;/code&gt; helper that will parse arguments for us, saving a bit of effort. By default, Umzug is able to execute JavaScript migrations without any additional work; alternatively, you can pass in an array of migrations generated from any arbitrary source. Let's go ahead and create our first migration. Umzug expects two exports for each migration: an up step and a down step. As mentioned above, we will be using the Opensearch client both for the migration tooling and for the migrations themselves.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// [date]_[name].cjs
module.exports = {
    async up({ context: client }) {
        await client.indices.create({
            index: "test_index",
        });
    },
    async down({ context: client }) {
        await client.indices.delete({
            index: "test_index",
        });
    },
};

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running &lt;code&gt;node ./migrate.js up&lt;/code&gt; should execute any pending migrations in the migrations directory.&lt;/p&gt;
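
&lt;p&gt;Beyond &lt;code&gt;up&lt;/code&gt;, Umzug's CLI exposes commands for rolling back and inspecting state. A few likely invocations (see the Umzug documentation for the full command list):&lt;/p&gt;

```shell
# Apply all pending migrations
node ./migrate.js up

# Revert the most recently applied migration
node ./migrate.js down

# List migrations that have not yet been applied
node ./migrate.js pending

# List migrations that have already been applied
node ./migrate.js executed
```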

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;In summary, we built a custom storage implementation for Opensearch and provided it to Umzug, our migration tool of choice. We opted to use the Opensearch client library over the REST API for simplicity, then configured Umzug to run as a CLI after extending it with our custom storage. This is a natural point to extend the tooling further, for example to support multiple environments and hosts in the case of mirrored environments. Finally, we created our first Opensearch migration. In the future, I may explore implementing an adapter to support QDSL migrations, in order to provide interoperability between Opensearch Dashboards Dev Tools and our migrations tooling.&lt;/p&gt;

</description>
      <category>opensearch</category>
      <category>platformengineering</category>
      <category>elasticsearch</category>
      <category>node</category>
    </item>
  </channel>
</rss>
