<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Felipe Malaquias</title>
    <description>The latest articles on DEV Community by Felipe Malaquias (@malaquf).</description>
    <link>https://dev.to/malaquf</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1256889%2F7bda1a7b-3456-49d0-ab98-b080e8eeae30.jpeg</url>
      <title>DEV Community: Felipe Malaquias</title>
      <link>https://dev.to/malaquf</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/malaquf"/>
    <language>en</language>
    <item>
      <title>Using Docker in AWS Amplify Builds: A Step-by-Step Guide</title>
      <dc:creator>Felipe Malaquias</dc:creator>
      <pubDate>Sun, 27 Apr 2025 20:14:01 +0000</pubDate>
      <link>https://dev.to/aws-builders/using-docker-in-aws-amplify-builds-a-step-by-step-guide-4mn6</link>
      <guid>https://dev.to/aws-builders/using-docker-in-aws-amplify-builds-a-step-by-step-guide-4mn6</guid>
      <description>&lt;p&gt;After nearly a year of happily using Amplify Gen2, I started facing problems as soon as I added a Docker image asset to my project.&lt;/p&gt;

&lt;p&gt;I had something along these lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const cluster = new ecs.Cluster(stack, 'ECSCluster', {
  clusterName: 'ECSCluster',
  vpc: vpc,
});

const taskDefinition = new ecs.FargateTaskDefinition(stack, 'TaskDefinition', {
  cpu: 4096,
  memoryLimitMiB: 16384,
  runtimePlatform: {
    cpuArchitecture: ecs.CpuArchitecture.X86_64,
    operatingSystemFamily: ecs.OperatingSystemFamily.LINUX,
  },
})

const dockerImageAsset = new DockerImageAsset(stack, 'DockerImageAsset', {
  directory: resolvePath('./docker'),
  platform: Platform.LINUX_AMD64,
});
// see https://aws.amazon.com/blogs/aws/aws-fargate-enables-faster-container-startup-using-seekable-oci/
// and https://github.com/aws/aws-cdk/issues/26413
SociIndexBuild.fromDockerImageAsset(stack, 'Index', dockerImageAsset);

taskDefinition.addContainer('Container', {
  containerName: 'Container',
  image: ecs.ContainerImage.fromDockerImageAsset(dockerImageAsset),
  essential: true,
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As soon as I added this to my Amplify Gen2 + CDK project, my build started to fail without any clear error message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2025-04-27T08:46:25.054Z [INFO]: 8:46:25 AM Building and publishing assets...
2025-04-27T08:46:26.340Z [INFO]: 
2025-04-27T08:46:26.341Z [WARNING]: ampx pipeline-deploy
                                    Command to deploy backends in a custom CI/CD pipeline. This command is not intended to be used locally.
                                    Options:
                                    --debug            Print debug logs to the console                    [boolean] [default: false]
                                    --branch           Name of the git branch being deployed              [string] [required]
                                    --app-id           The app id of the target Amplify app               [string] [required]
                                    --outputs-out-dir  A path to directory where amplify_outputs is written. If not provided defaults to current process working directory.  [string]
                                    --outputs-version  Version of the configuration. Version 0 represents classic amplify-cli config file amplify-configuration and 1 represents newer config file amplify_outputs
                                    [string] [choices: "0", "1", "1.1", "1.2", "1.3", "1.4"] [default: "1.4"]
                                    --outputs-format   amplify_outputs file format                        [string] [choices: "mjs", "json", "json-mobile", "ts", "dart"]
                                    -h, --help         Show help                                          [boolean]
2025-04-27T08:46:26.341Z [INFO]: 
2025-04-27T08:46:26.342Z [INFO]: [CDKAssetPublishError] CDK failed to publish assets
                                  ∟ Caused by: [_ToolkitError] Failed to publish asset data Nested Stack Template (current_account-current_region)
                                  Resolution: Check the error message for more details.
2025-04-27T08:46:26.342Z [INFO]: 
2025-04-27T08:46:26.343Z [INFO]: 
2025-04-27T08:46:26.345Z [INFO]: 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Appending the debug flag to the Amplify deploy command helped identify that Docker was not available in the path, although the hint was buried in an INFO message in the middle of the logs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: 1
backend:
  phases:
    build:
      commands:
        - npm i
        - npx ampx pipeline-deploy --branch $AWS_BRANCH --app-id $AWS_APP_ID --debug
frontend:
  phases:
    build:
      commands:
        - npm run build
  artifacts:
    baseDirectory: dist
    files:
      - '**/*'
  cache:
    paths:
      - .npm/**/*
      - node_modules/**/*
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2025-04-27T10:15:35.435Z [INFO]: 10:15:35 AM [deploy: CDK_TOOLKIT_E0000] 10:15:35 AM amplify-main-branch: fail: Unable to execute 'docker' in order to build a container asset. Please install 'docker' and try again.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this case, we need to use a different image compatible with AWS CodeBuild. A list of official AWS CodeBuild curated Docker images can be found &lt;a href="https://github.com/aws/aws-codebuild-docker-images?tab=readme-ov-file" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;By scrolling down in the Hosting/Build settings area, it is possible to set a custom image, such as &lt;em&gt;public.ecr.aws/codebuild/amazonlinux-x86_64-standard:5.0&lt;/em&gt;, which contains Docker.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkmlvsuz39q7m59r0kr1k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkmlvsuz39q7m59r0kr1k.png" alt="Using custom image in build settings" width="800" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After switching to the image mentioned above, it is still necessary to start &lt;em&gt;dockerd&lt;/em&gt; by running the &lt;em&gt;/usr/local/bin/dockerd-entrypoint.sh&lt;/em&gt; script before any Docker build steps in your amplify.yml, as in the example below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: 1
backend:
  phases:
    build:
      commands:
        - /usr/local/bin/dockerd-entrypoint.sh      
        - npm i
        - npx ampx pipeline-deploy --branch $AWS_BRANCH --app-id $AWS_APP_ID --debug
frontend:
  phases:
    build:
      commands:
        - npm run build
  artifacts:
    baseDirectory: dist
    files:
      - '**/*'
  cache:
    paths:
      - .npm/**/*
      - node_modules/**/*
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And voilà! Your project should now be able to build the Docker image successfully.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thanks for reading!&lt;/strong&gt; Do you have any cool ideas or feedback you'd like to share? Please drop a comment, send me a message, or &lt;a href="https://www.linkedin.com/in/fmalaquias/" rel="noopener noreferrer"&gt;follow me&lt;/a&gt;, and let’s keep building!&lt;/p&gt;

</description>
      <category>amplify</category>
      <category>aws</category>
      <category>codebuilder</category>
      <category>docker</category>
    </item>
    <item>
      <title>Automate Email Processing using Event Driven Architecture and Generative AI</title>
      <dc:creator>Felipe Malaquias</dc:creator>
      <pubDate>Wed, 05 Feb 2025 03:10:17 +0000</pubDate>
      <link>https://dev.to/aws-builders/automate-email-processing-using-event-driven-architecture-and-generative-ai-27gj</link>
      <guid>https://dev.to/aws-builders/automate-email-processing-using-event-driven-architecture-and-generative-ai-27gj</guid>
      <description>&lt;p&gt;Recently, I came up with a use case for the application I am currently working on where I wanted to automate email processing in order to extract key information in a structured format I could then use later on in other internal processes. To achieve this with the least possible cost in AWS while maintaining scalability and efficiency, I implemented the following architecture:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0hwbdi6elsu1x6q2c6co.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0hwbdi6elsu1x6q2c6co.png" alt="Event Driven Architecture for Email Processing" width="688" height="532"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I am using &lt;a href="https://aws.amazon.com/ses/" rel="noopener noreferrer"&gt;SES&lt;/a&gt; as an entry point for the emails. SES will then store the raw email in an &lt;a href="https://docs.aws.amazon.com/ses/latest/dg/receiving-email-action-s3.html" rel="noopener noreferrer"&gt;S3 bucket&lt;/a&gt; according to the rules defined in the CDK stack below:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const bucket = new Bucket(sesStack, 'SesBucket',
  {
    bucketName: `my-bucket-${process.env.AWS_BRANCH}`,
    publicReadAccess: false,
    removalPolicy: RemovalPolicy.DESTROY,
    intelligentTieringConfigurations: [
      {
        name: 'SES-Intelligent-Tiering',
        archiveAccessTierTime: Duration.days(90),
        deepArchiveAccessTierTime: Duration.days(365),
      },
    ],
    eventBridgeEnabled: true,
  }
);

new ReceiptRuleSet(sesStack, 'SesRuleSet', {
  rules: [
    {
      recipients: ['myemail@myaddress.com'],
      actions: [
        new S3({
          bucket,
          objectKeyPrefix: 'emails/raw',
        }),
      ],
      scanEnabled: true,
    },
  ],
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because the code above enables &lt;a href="https://aws.amazon.com/eventbridge/" rel="noopener noreferrer"&gt;EventBridge&lt;/a&gt; events on the bucket, we can then create a new EventBridge &lt;a href="https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-rules.html" rel="noopener noreferrer"&gt;rule&lt;/a&gt; to trigger a &lt;a href="https://aws.amazon.com/step-functions/" rel="noopener noreferrer"&gt;StepFunction&lt;/a&gt; that will then process the emails as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const cleanseEmailFunction = new NodejsFunction(stack, 'CleanseEmailFunction', {
  ...getCommonLambdaProps(),
  entry: './amplify/functions/cleanseEmail.ts'
})
bucket.grantReadWrite(cleanseEmailFunction)

const extractDataFunction = new NodejsFunction(stack, 'ExtractDataFunction', {
  ...getCommonLambdaProps(),
  entry: './amplify/functions/extractDataFunction.ts',
})
table.grantReadWriteData(extractDataFunction)
bucket.grantRead(extractDataFunction)

const stateMachine = new StateMachine(stack, 'EmailProcessingStateMachine', {
  definitionBody: DefinitionBody.fromFile('./amplify/step-functions/processEmails.asl.json'),
  timeout: Duration.minutes(29),
  tracingEnabled: true,
  stateMachineType: StateMachineType.STANDARD,
  logs: {
    level: LogLevel.ALL,
    destination: new LogGroup(stack, 'ProcessEmailsStateMachineLogs', {
      logGroupName: '/aws/vendedlogs/states/ProcessEmailsStateMachine',
      retention: RetentionDays.ONE_WEEK,
    }),
    includeExecutionData: true,
  },
  definitionSubstitutions: {
    TableName: table.tableName,
    CleanseEmailFunction: cleanseEmailFunction.functionName,
    ExtractDataFunction: extractDataFunction.functionName,
  },
  comment: 'State machine to process email',
});

extractDataFunction.grantInvoke(stateMachine);
cleanseEmailFunction.grantInvoke(stateMachine);

const rule = new Rule(stack, 'EmailS3ObjectCreatedRule', {
  eventPattern: {
    source: ['aws.s3'],
    detailType: ['Object Created'],
    detail: {
      bucket: {
        name: [bucket.bucketName],
      },
      object: {
        key: [
          {
            prefix: 'emails/raw/',
          },
        ],
      },
    },
  },
});

rule.addTarget(new SfnStateMachine(stateMachine));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This state machine is basically composed of two functions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CleanseEmailFunction:&lt;/strong&gt; removes all sensitive data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ExtractDataFunction:&lt;/strong&gt; uses &lt;a href="https://www.langchain.com/" rel="noopener noreferrer"&gt;Langchain&lt;/a&gt; and &lt;a href="https://www.langchain.com/langsmith" rel="noopener noreferrer"&gt;LangSmith&lt;/a&gt; to validate and extract structured JSON info through &lt;a href="https://aws.amazon.com/bedrock/" rel="noopener noreferrer"&gt;Bedrock&lt;/a&gt; and &lt;a href="https://aws.amazon.com/bedrock/claude/" rel="noopener noreferrer"&gt;Claude 3.5 Sonnet v2&lt;/a&gt;, and then stores it in &lt;a href="https://aws.amazon.com/dynamodb/" rel="noopener noreferrer"&gt;DynamoDB&lt;/a&gt; for later use&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
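&lt;p&gt;The article doesn’t show the Lambda code itself, so here is a minimal, hypothetical sketch of what the cleansing step could look like. The redaction rules (email addresses and phone numbers) and the &lt;em&gt;emails/raw&lt;/em&gt; to &lt;em&gt;emails/cleansed&lt;/em&gt; key convention are illustrative assumptions, not the actual implementation:&lt;/p&gt;

```typescript
// Hypothetical sketch of the CleanseEmailFunction handler. The redaction
// rules below (emails, phone numbers) are illustrative assumptions.
const EMAIL_RE = /[\w.+-]+@[\w-]+\.[\w.-]+/g;
const PHONE_RE = /\+?\d[\d\s().-]{7,}\d/g;

export function redactSensitiveData(text: string): string {
  return text.replace(EMAIL_RE, '[email]').replace(PHONE_RE, '[phone]');
}

// Lambda handler shape: read the raw email from S3, redact it, and write
// the cleansed copy back under a different prefix for the next step.
// (S3 client calls omitted; bucket/key conventions are assumptions.)
export const handler = async (event: { bucket: string; key: string }) => {
  // const raw = await s3.getObject(...)    — fetch the raw email body
  // const cleansed = redactSensitiveData(raw);
  // await s3.putObject(...)                — store the cleansed copy
  return { cleansedKey: event.key.replace('emails/raw/', 'emails/cleansed/') };
};
```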

&lt;p&gt;Later, I could use this data to perform summarization and analytics, send tailored push notifications to users, and so on.&lt;/p&gt;

&lt;p&gt;This straightforward email-processing architecture can be extended to perform additional tasks, such as evaluating the model’s response and applying further transformations if needed. Best of all, I pay only for what I use.&lt;/p&gt;

&lt;p&gt;If you want to scale further and optimize costs, you may add an &lt;a href="https://aws.amazon.com/pt/sqs/" rel="noopener noreferrer"&gt;SQS&lt;/a&gt; queue in between and use &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/batch-inference.html" rel="noopener noreferrer"&gt;batch inference&lt;/a&gt; to process emails in bulk.&lt;/p&gt;

&lt;p&gt;Note that batch inference may not be enabled in your account, and you may need to request it through support (I’ve been fighting with support for more than a month now to get access, even with a business support plan).&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Thanks for reading!&lt;/strong&gt; Got any cool ideas or feedback you want to share? Drop a comment, send me a message, or &lt;a href="https://www.linkedin.com/in/fmalaquias/" rel="noopener noreferrer"&gt;follow me,&lt;/a&gt; and let’s keep building!&lt;/p&gt;

</description>
      <category>bedrock</category>
      <category>rag</category>
      <category>ses</category>
      <category>stepfunctions</category>
    </item>
    <item>
      <title>Migrating From Redis to Valkey Serverless</title>
      <dc:creator>Felipe Malaquias</dc:creator>
      <pubDate>Mon, 27 Jan 2025 01:58:04 +0000</pubDate>
      <link>https://dev.to/aws-builders/migrating-from-redis-to-valkey-serverless-7f1</link>
      <guid>https://dev.to/aws-builders/migrating-from-redis-to-valkey-serverless-7f1</guid>
      <description>&lt;h2&gt;
  
  
  Why?
&lt;/h2&gt;

&lt;p&gt;In the summer of last year, we started refactoring one of our services to get rid of I/O-blocking operations. It was almost a complete rewrite of the service, done in about two months.&lt;br&gt;
After running the updated service for some time, we found out the hard way about ElastiCache Redis’ &lt;a href="https://repost.aws/pt/questions/QU2iFKoyuFSp2OFBNhsBg35g/elasticache-redis-network-bandwidth-in-out-allowance-exceeded" rel="noopener noreferrer"&gt;maximum bandwidth allowance&lt;/a&gt; constraints during weeks of intense traffic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F9162%2F1%2AVP9OchB395GEBw6c2tpOJg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F9162%2F1%2AVP9OchB395GEBw6c2tpOJg.png" alt="Sudden increase of network bytes in/out exceeding maximum nodes bandwidth" width="800" height="152"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The traffic doubled in a very short period and exceeded the maximum bandwidth allowance of one of our Redis nodes for a &lt;strong&gt;sustained&lt;/strong&gt; period of time. When this happens, the network queue grows and AWS starts to drop packets. A couple of issues caused this to cascade, practically taking our service down for almost an hour while we identified the problem, scaled the nodes accordingly, and waited for the cluster to rebalance.&lt;/p&gt;

&lt;p&gt;Notice the “sustained” wording in the paragraph above.&lt;/p&gt;

&lt;p&gt;Although we had load-tested this service, we did not catch the issue earlier because each test ran for 10 to 30 minutes at most, and ElastiCache allows you to exceed the network baseline for an undetermined period of time (up to an hour, if I am not mistaken) before it starts to drop packets.&lt;/p&gt;

&lt;p&gt;Several problems combined to cause the downtime:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem #1:&lt;/strong&gt; during the refactoring, we forgot to set a timeout for the Redis client, leaving it at the 60-second default and causing our clients to time out first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem #2:&lt;/strong&gt; during the refactoring, we also forgot to migrate our circuit breaker implementation to the new custom cache handling. As a result, the service never skipped the struggling cluster to go directly to the DB and serve requests normally (note that the cache here exists to answer clients quickly, and the DB should always be sized to handle normal load on its own in case caching is unavailable for any reason).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem #3:&lt;/strong&gt; during the rewrite, we optimised our startup time down to a few seconds by removing an in-memory local cache for one of our data structures. What we didn’t realise is that by doing so, we created a hot shard in our cluster: the key for those lookups was not hashed and was essentially the same for all requests, so the load was not distributed across our shards, consuming more bandwidth from a single node and making the workload not horizontally scalable.&lt;/p&gt;
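&lt;p&gt;A common mitigation for this kind of hot key, sketched below purely for illustration (not necessarily the fix we shipped), is to spread the key over a small set of buckets so lookups land on different hash slots:&lt;/p&gt;

```typescript
// Illustrative hot-key spreading: the bucket count and key format are
// arbitrary choices for this sketch, not values from the incident.
const BUCKETS = 8;

// Pick a random bucket for reads, so traffic spreads across hash slots.
function shardedKey(baseKey: string): string {
  const bucket = Math.floor(Math.random() * BUCKETS);
  return `${baseKey}:${bucket}`;
}

// Writes must fan out to every bucket so any copy of the key stays fresh.
function allShardedKeys(baseKey: string): string[] {
  return Array.from({ length: BUCKETS }, (_, i) => `${baseKey}:${i}`);
}
```

&lt;p&gt;The trade-off is write amplification proportional to the bucket count, which is usually acceptable for small, read-heavy structures like the one described above.&lt;/p&gt;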

&lt;p&gt;&lt;strong&gt;Problem #4:&lt;/strong&gt; because we could not predict this sudden load, our ElastiCache cluster was not sized for it (e.g. with a higher number of shards) and hit the network limitation even though the cluster seemed healthy at first glance (CPU and MEM).&lt;/p&gt;

&lt;p&gt;The mitigation was to simply scale the cluster vertically and increase the number of shards; the final solution was to identify and fix each of the points mentioned above over the following days.&lt;/p&gt;

&lt;p&gt;We had known for a while that AWS offered a serverless version of ElastiCache, which we had been planning to look into, and we had just heard about Valkey, &lt;a href="https://aws.amazon.com/about-aws/whats-new/2024/10/amazon-elasticache-valkey/" rel="noopener noreferrer"&gt;announced&lt;/a&gt; about a month prior to this incident.&lt;/p&gt;
&lt;h2&gt;
  
  
  Elasticache Valkey
&lt;/h2&gt;

&lt;p&gt;Valkey is an open-source project forked from Redis right before Redis’ transition to its new source-available licenses. Because of that transition, AWS and other tech giants started contributing to the project, looking to keep Redis compatibility while enhancing overall functionality and performance.&lt;/p&gt;

&lt;p&gt;The highlights from my point of view and experience so far are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;lower price than other engines (up to 33% lower)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;it provides microsecond read and write latency and can scale to 500 million requests per second (RPS) on a single self-designed (node-based) cluster&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;it is compatible with Redis OSS APIs and data formats&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;zero downtime migration&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;continuous updates (in exchanges with some of the people involved in the project, they shared ideas and plans to improve the service further, which will make it even more attractive in the future)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Valkey is offered in both cluster and serverless variants.&lt;/p&gt;

&lt;p&gt;You can read about all its bells and whistles in the official AWS documentation and also see how it works in this video, presented by one of the maintainers last summer:&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/tJZUVkMBdcg"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Serverless Elasticache Valkey
&lt;/h2&gt;

&lt;p&gt;In addition to the highlights mentioned above, the serverless variant abstracts away cluster management (minor updates) and sizing, which is especially interesting if your traffic can change suddenly, as pictured at the beginning of this article.&lt;/p&gt;

&lt;p&gt;Of course, not everything is rosy. The serverless variant may become very expensive depending on your workload; if you have a predictable and sustained load, you would probably pay much more for the serverless variant than for a self-designed cluster.&lt;/p&gt;

&lt;p&gt;In the serverless variant, you are billed for Storage and ECPUs (ElastiCache Processing Units). Storage is straightforward, and you can estimate it based on your current values. ECPU is a bit trickier, as it is basically processing time, which is affected by the payload size and the type of commands you execute. In general, 1 ECPU corresponds to approx. 1 KB of payload data (read &lt;a href="https://docs.aws.amazon.com/AmazonElastiCache/latest/dg/WhatIs.corecomponents.html" rel="noopener noreferrer"&gt;this&lt;/a&gt; for more information).&lt;/p&gt;
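&lt;p&gt;As a rough illustration of how that billing model plays out, the sketch below estimates a monthly bill from average RPS, payload size, and storage. The dollar rates are made-up placeholders, not real AWS prices; always check the current pricing page for your region:&lt;/p&gt;

```typescript
// Back-of-the-envelope ECPU cost model. The rates below are illustrative
// placeholders, NOT real AWS prices.
const ECPU_PRICE_PER_MILLION = 0.0034;    // assumed $/million ECPUs
const STORAGE_PRICE_PER_GB_MONTH = 0.125; // assumed $/GB-month

// ~1 ECPU per KB of payload per command, with a 1 ECPU minimum per command.
function estimateMonthlyCost(avgRps: number, avgPayloadKb: number, storageGb: number): number {
  const secondsPerMonth = 30 * 24 * 3600;
  const ecpus = avgRps * Math.max(1, avgPayloadKb) * secondsPerMonth;
  return (ecpus / 1_000_000) * ECPU_PRICE_PER_MILLION + storageGb * STORAGE_PRICE_PER_GB_MONTH;
}
```

&lt;p&gt;For example, a steady 1,000 RPS of 1 KB commands with 10 GB stored lands around $10/month under these placeholder rates, while the same traffic at 10 KB payloads multiplies the ECPU part tenfold.&lt;/p&gt;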

&lt;p&gt;However, the very cool things about the serverless variant are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;If your workload is periodical or irregular, you might save on costs during low usage periods&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ElastiCache Serverless for Valkey can &lt;a href="https://docs.aws.amazon.com/AmazonElastiCache/latest/dg/WhatIs.corecomponents.html" rel="noopener noreferrer"&gt;double the supported requests per second (RPS) every 2–3 minutes&lt;/a&gt;, reaching 5M RPS per cache from zero in under 13 minutes, with consistent sub-millisecond p50 read latency&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Take the following metrics as an example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc15wthlzvcpmvp7ncnz0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc15wthlzvcpmvp7ncnz0.png" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9vfbkdukljx0tsbfcm1g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9vfbkdukljx0tsbfcm1g.png" width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The red and yellow areas in some of those panels are set only for cost-control purposes; the cluster can scale far beyond them. There we can see traffic doubling in a very short amount of time with no throttling observed. ElastiCache Serverless for Valkey scales seamlessly to support such a traffic increase and scales back down to the minimum configured ECPU value when traffic decreases.&lt;/p&gt;

&lt;p&gt;Because our service’s traffic has such a sinusoidal shape and the amount of data we transfer per second under normal load is not very high, we end up saving on costs, with the benefit of autoscaling as needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Careful considerations
&lt;/h2&gt;

&lt;p&gt;Before deciding to switch to ElastiCache Serverless, analyse your workload carefully against the AWS documentation in order to answer, for example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;is your cache traffic predictable and relatively constant, or is it periodic or unpredictable?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;how many ECPUs would your application consume under normal load?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;how much storage does your application require?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;do you need/want to set a minimum and/or maximum ECPU/s?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;do you need/want to set a minimum and/or maximum storage?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Be aware, for example, that if you set maximum constraints, your application might receive errors from ElastiCache when it surpasses those values, instead of scaling. However, setting limits might be a good idea if your application can tolerate errors and you want to avoid excessive costs (e.g. by bypassing the cache with circuit breakers and low client timeouts).&lt;/p&gt;

&lt;p&gt;Setting minimum values, on the other hand, could be a good idea to guarantee that your Elasticache will serve at least that amount of data at any given time.&lt;/p&gt;
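&lt;p&gt;In CDK, these floors and ceilings map to the usage-limits block of the serverless cache construct. The numbers below are arbitrary examples for illustration; the property names follow the AWS::ElastiCache::ServerlessCache CloudFormation schema:&lt;/p&gt;

```typescript
// Example usage limits for an ElastiCache Serverless cache (values are
// arbitrary; tune them to your own workload and cost tolerance).
const cacheUsageLimits = {
  dataStorage: { minimum: 1, maximum: 10, unit: 'GB' }, // storage floor and ceiling
  ecpuPerSecond: { minimum: 1000, maximum: 100000 },    // ECPU/s floor and ceiling
};

// This object would be passed as the `cacheUsageLimits` property of
// `new elasticache.CfnServerlessCache(stack, 'Cache', { engine: 'valkey', ... })`
// from aws-cdk-lib/aws-elasticache.
```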

&lt;p&gt;Use &lt;a href="https://aws.amazon.com/elasticache/pricing/" rel="noopener noreferrer"&gt;AWS’ pricing calculator&lt;/a&gt; to estimate how much it would cost you, and make the best decision for your own use case.&lt;/p&gt;

&lt;p&gt;Make sure to also double-check your security group rules: the serverless variant requires port 6380 for the reader nodes in addition to the standard 6379. Otherwise, your application might start, but you may experience latency. Read more &lt;a href="https://medium.com/@felipemalaquias/migrating-from-redis-to-valkey-serverless-0d3f3e8cffe7" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
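&lt;p&gt;In CDK terms, that means opening both ports from your application’s security group to the cache’s security group. The non-runnable sketch below assumes both security groups already exist elsewhere in your stack:&lt;/p&gt;

```typescript
import * as ec2 from 'aws-cdk-lib/aws-ec2';

// Assumed to exist elsewhere in the stack; declared here only for the sketch.
declare const cacheSecurityGroup: ec2.ISecurityGroup;
declare const appSecurityGroup: ec2.ISecurityGroup;

// 6379 serves the primary endpoint; 6380 serves the reader endpoint.
cacheSecurityGroup.addIngressRule(appSecurityGroup, ec2.Port.tcp(6379), 'Valkey primary');
cacheSecurityGroup.addIngressRule(appSecurityGroup, ec2.Port.tcp(6380), 'Valkey reader');
```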

&lt;p&gt;Good luck!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thanks for reading!&lt;/strong&gt; Got any cool ideas or feedback you want to share? Drop a comment, send me a message or &lt;a href="https://www.linkedin.com/in/fmalaquias/" rel="noopener noreferrer"&gt;follow me&lt;/a&gt; and let’s keep moving things forward!&lt;/p&gt;

</description>
      <category>valkey</category>
      <category>elasticache</category>
      <category>serverless</category>
    </item>
    <item>
      <title>How did I contribute for OpenAI’s Xmas Bonus before cutting 50% costs while scaling 10x with GenAI processing</title>
      <dc:creator>Felipe Malaquias</dc:creator>
      <pubDate>Tue, 24 Dec 2024 13:40:04 +0000</pubDate>
      <link>https://dev.to/aws-builders/how-did-i-contribute-for-openais-xmas-bonus-before-cutting-50-costs-while-scaling-10x-with-genai-10b7</link>
      <guid>https://dev.to/aws-builders/how-did-i-contribute-for-openais-xmas-bonus-before-cutting-50-costs-while-scaling-10x-with-genai-10b7</guid>
      <description>&lt;p&gt;Yet another screw-up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TLDR:&lt;/strong&gt; use OpenAI’s &lt;a href="https://platform.openai.com/docs/api-reference/files" rel="noopener noreferrer"&gt;Files&lt;/a&gt; and &lt;a href="https://platform.openai.com/docs/api-reference/batch" rel="noopener noreferrer"&gt;Batch&lt;/a&gt; APIs for async non-time-sensitive processing. Code below.&lt;/p&gt;

&lt;p&gt;I screw up so often that I actually learned to love the process of screwing up!&lt;/p&gt;

&lt;p&gt;Please don’t confuse it with being reckless, though. Think of it as a fast incremental learning process, just like fine-tuning a model.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Most Recent Screw-Up
&lt;/h2&gt;

&lt;p&gt;At the beginning of this year (2024), I created my first automation using GenAI, for prototyping a travel app I used in a talk at the AWS User Group Berlin meetup showcasing the new Amplify Gen 2 services (&lt;a href="https://www.youtube.com/watch?v=uIh3cI3FgTg" rel="noopener noreferrer"&gt;link here&lt;/a&gt;). At that time, unfortunately, there was only one way to generate chat completions: OpenAI’s &lt;a href="https://platform.openai.com/docs/guides/text-generation" rel="noopener noreferrer"&gt;&lt;em&gt;/v1/chat/completions&lt;/em&gt;&lt;/a&gt; API.&lt;/p&gt;

&lt;p&gt;Recently, I needed to automate another process for which GenAI was a good fit. Therefore, I reused the same approach I had before and created the following state machine using AWS Step Functions:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foge78gm4ar8jd2blvi03.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foge78gm4ar8jd2blvi03.png" alt="Step Function iterating over DB records for assembling GenAI prompts and invoking OpenAI chat completion" width="424" height="1779"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I somehow have a fetish for diagrams, so I found myself proud and smart after I finished it, a feeling that didn’t last long: I soon realized I had neither the money for my brother’s Xmas gift nor the smarts.&lt;/p&gt;

&lt;p&gt;Without going into details on the workflow itself, there are mainly two issues with this approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Although Step Functions are great for automation, they have their limitations (e.g. a maximum of 25,000 history events, i.e. steps, per execution)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AWS offers 4,000 free state transitions as part of its free tier. Anything above that, you pay for.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;APIs are &lt;a href="https://platform.openai.com/docs/guides/rate-limits" rel="noopener noreferrer"&gt;rate-limited&lt;/a&gt; in order to avoid abuse. OpenAI is, of course, no different (especially as it is being used by so many people around the world right now). This greatly restricts your ability to parallelize.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GenAI is generally still very slow compared to the low-latency APIs all around us nowadays, especially if you require more complex models like &lt;a href="https://openai.com/o1/" rel="noopener noreferrer"&gt;o1&lt;/a&gt;. Therefore, if you need to iterate over 25,000 prompts without &lt;a href="https://platform.openai.com/docs/guides/fine-tuning" rel="noopener noreferrer"&gt;fine-tuning&lt;/a&gt; your model, you will find yourself waiting more than 6 hours for it to complete, only for it to eventually fail.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Models like o1 are still &lt;a href="https://openai.com/api/pricing/" rel="noopener noreferrer"&gt;relatively expensive&lt;/a&gt; when a large number of tokens is requested.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
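&lt;p&gt;To see why the 25,000-event cap bites so quickly, note that every iteration produces several history events. The numbers below are illustrative assumptions, not measured values; the real count depends on your state machine’s shape:&lt;/p&gt;

```typescript
// Back-of-envelope check on the Step Functions history limit.
// ASSUMPTION: each Map iteration that invokes a task records roughly
// 5 history events (entered, scheduled, started, succeeded, exited).
const MAX_HISTORY_EVENTS = 25_000; // hard limit per execution
const EVENTS_PER_ITERATION = 5;    // assumed average, not measured

const maxIterations = Math.floor(MAX_HISTORY_EVENTS / EVENTS_PER_ITERATION);
console.log(`roughly ${maxIterations} iterations before the execution fails`);
```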

&lt;p&gt;So, after my first frustrating run for my new use case, I had:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Waited 5h45m until the step function failed with a runtime error: &lt;em&gt;The execution reached the maximum number of history events (25000).&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Spent $90 on tokens&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Exceeded my free 4000 state transitions limit in my AWS account&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiu5byrfsc1601kmv1pdi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiu5byrfsc1601kmv1pdi.png" alt="Step Function Failure after more than 5h…" width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;p&gt;As I was, of course, in disbelief that there wouldn’t be a better way to achieve what I wanted, I started re-reading the documentation… and, to my happy surprise, found a shiny new &lt;a href="https://platform.openai.com/docs/guides/batch" rel="noopener noreferrer"&gt;batch API&lt;/a&gt; I had overlooked, &lt;a href="https://community.openai.com/t/batchapi-is-now-available/718416" rel="noopener noreferrer"&gt;launched in April&lt;/a&gt; this year (2024).&lt;/p&gt;
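&lt;p&gt;The batch input is a JSONL file: one self-contained request per line, correlated back to your own data via &lt;em&gt;custom_id&lt;/em&gt;. A minimal sketch of building such a line (the model name and prompt are placeholders; the field names follow the Batch API input format):&lt;/p&gt;

```typescript
// One request per line; custom_id lets you correlate results later.
interface BatchLine {
  custom_id: string;
  method: 'POST';
  url: '/v1/chat/completions';
  body: {
    model: string;
    messages: { role: 'system' | 'user'; content: string }[];
  };
}

// Serialize requests to JSONL (one JSON object per line).
const toJsonl = (lines: BatchLine[]): string =>
  lines.map((line) => JSON.stringify(line)).join('\n');

const jsonl = toJsonl([
  {
    custom_id: 'req-1',
    method: 'POST',
    url: '/v1/chat/completions',
    body: { model: 'gpt-4o-mini', messages: [{ role: 'user', content: 'Say hi' }] },
  },
]);
console.log(jsonl);
```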

&lt;p&gt;So I wrote the following Lambda function (in TypeScript) instead:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import OpenAI, { toFile } from 'openai';
import { BatchWriteCommand, BatchWriteCommandInput, DynamoDBDocumentClient } from '@aws-sdk/lib-dynamodb';
import { DynamoDBClient, GetItemCommand, QueryCommand } from '@aws-sdk/client-dynamodb';
import { v4 as uuidv4 } from 'uuid';
import { GenerateTipsEvent } from '../shared/types/tips';
import { FileLike } from 'openai/uploads.mjs';
import { ChatCompletionCreateParamsNonStreaming } from 'openai/resources/chat/completions.mjs';

// ParamsAndSecretsLayerVersion used for secrets retrieval/caching
const AWS_SECRETS_EXTENTION_SERVER_ENDPOINT = "http://localhost:2773/secretsmanager/get?secretId="

let openai: OpenAI | undefined;

const BATCH_REQUESTS_TABLE = process.env.BATCH_REQUESTS_TABLE || 'BatchRequests';

const ddbClient = new DynamoDBClient({});
const ddb = DynamoDBDocumentClient.from(ddbClient);

// one single request within the batch
interface BatchRequestLineItem {
  custom_id: string;
  method: string;
  url: string;
  body: ChatCompletionCreateParamsNonStreaming;
}

interface BatchRequestInput {
  lineItem: BatchRequestLineItem;
  custom_id: string;
  // any other field you may want to use later in post-processing
}

// split db command chunks to avoid exceeding max limits
const chunk = &amp;lt;T&amp;gt;(arr: T[], size: number): T[][] =&amp;gt; {
  return Array.from({ length: Math.ceil(arr.length / size) }, (_, i) =&amp;gt;
    arr.slice(i * size, i * size + size)
  );
};

async function initOpenAi() {
  if (!openai) {
    const openAiSecret = JSON.parse(await getSecretValue(process.env.OPEN_AI_SECRET_NAME!))

    openai = new OpenAI({
      apiKey: openAiSecret.apiKey,
      organization: openAiSecret.orgId,
    });
  }
}

const getSecretValue = async (secretName: string) =&amp;gt; {
  const url = `${AWS_SECRETS_EXTENTION_SERVER_ENDPOINT}${secretName}`;
  const response = await fetch(url, {
    method: "GET",
    headers: {
      "X-Aws-Parameters-Secrets-Token": process.env.AWS_SESSION_TOKEN!,
    },
  });

  if (!response.ok) {
    throw new Error(
      `Error occurred while requesting secret ${secretName}. Response status was ${response.status}`
    );
  }

  const secretContent = (await response.json()) as { SecretString: string };
  return secretContent.SecretString;
};

const getTopicsForCategory = async (categoryId: string) =&amp;gt; {
  const result = await ddb.send(new QueryCommand({
    // ... boring stuff
  }));
  return result.Items || [];
};

const getSectionsForTopic = async (topicId: string) =&amp;gt; {
  const result = await ddb.send(new QueryCommand({
    // ... boring stuff
  }));
  return result.Items || [];
};

const getUnitsForSection = async (sectionId: string) =&amp;gt; {
  const result = await ddb.send(new QueryCommand({
    // ... boring stuff
  }));
  return result.Items || [];
};

const getCategory = async (categoryId: string) =&amp;gt; {
  const result = await ddb.send(new GetItemCommand({
    // ... boring stuff
  }));
  return result.Item;
};

async function generateBatchRequests(
  categoryId: string,
  model: string,
  numberOfTips: number
): Promise&amp;lt;BatchRequestInput[]&amp;gt; {
  try {
    // Fetch initial data in parallel
    const [category, topics] = await Promise.all([
      getCategory(categoryId),
      getTopicsForCategory(categoryId),
    ]);

    if (!category?.id.S) {
      throw new Error(`Category not found: ${categoryId}`);
    }

    console.log('Generating batch requests for category', category.id.S);
    console.log('Found topics:', topics.length);

    const batchRequests: BatchRequestInput[] = [];

    // Process topics
    for (const topic of topics) {
      if (!topic.id.S) continue;

      const sections = await getSectionsForTopic(topic.id.S);
      console.log(`Found ${sections.length} sections for topic ${topic.id.S}`);

      // Process sections
      for (const section of sections) {
        if (!section.id.S) continue;

        const units = await getUnitsForSection(section.id.S);
        console.log(`Found ${units.length} units for section ${section.id.S}`);

        // Process units
        for (const unit of units) {
          if (!unit.id.S) continue;

          const event = {
            // ... any custom data used in your prompts
            model: model,
          };

          const customId = uuidv4();
          batchRequests.push({
            custom_id: customId,
            lineItem: {
              custom_id: customId,
              method: 'POST',
              url: '/v1/chat/completions',
              body: {
                model: model,
                messages: [
                  { role: 'system', content: getSystemPrompt(event) },
                  { role: 'user', content: getUserPrompt(event) }
                ],
                response_format: { type: "json_object" },
                temperature: 0.3,
              },
            },
            // ... any other field you may want to correlate later on post processing
          });

          console.log(`Created batch request for unit ${unit.id.S}`);
        }
      }
    }

    console.log(`Total batch requests generated: ${batchRequests.length}`);
    return batchRequests;

  } catch (error) {
    console.error('Error generating batch requests:', error);
    throw error;
  }
}

interface GenerateTipsBatchRequestEvent {
  categoryId: string;
  model: string;
  numberOfTips: number;
}

export const handler = async (event: GenerateTipsBatchRequestEvent) =&amp;gt; {
  try {
    const { categoryId, model, numberOfTips } = event;
    if (!model || !categoryId || !numberOfTips) {
      return {
        statusCode: 400,
        body: JSON.stringify({ error: 'Missing required parameters' }),
      };
    }

    await initOpenAi();

    console.log('Generating batch requests');
    const batchRequestInputs = await generateBatchRequests(categoryId, model, numberOfTips);

    if (batchRequestInputs.length === 0) {
      return {
        statusCode: 404,
        body: JSON.stringify({ error: 'No units found' }),
      };
    }

    const files = await createBatchFile(batchRequestInputs.map(batchRequestInput =&amp;gt; batchRequestInput.lineItem));

    for (const file of files) {
      console.log('Uploading file ', file.name);
      const upload = await openai?.files.create({
        purpose: 'batch',
        file: file
      });
      if (!upload) continue;

      console.log('File uploaded', JSON.stringify(upload, null, 2));

      console.log('Creating batch');
      const requestedBatch = await openai?.batches.create({
        completion_window: '24h',
        endpoint: '/v1/chat/completions',
        input_file_id: upload.id,
        metadata: {
          // ... any metadata you may want to add
        }
      });
      console.log('Batch created', JSON.stringify(requestedBatch, null, 2));

      if (!requestedBatch) continue;

      console.log('Storing batch request', batchRequestInputs.length);
      await storeBatchRequest(batchRequestInputs.map(batchRequestInput =&amp;gt; ({
        customId: batchRequestInput.custom_id,
        body: JSON.stringify(batchRequestInput.lineItem),
        batchId: requestedBatch.id,
        filename: upload?.filename,
        bytes: upload?.bytes,
        status: requestedBatch.status,
        // ... and more fields you may want to correlate later on post processing
      })));

      console.log('Batch request stored');
    }

    return {
      statusCode: 200,
      body: JSON.stringify({
        message: 'Batch request stored',
        batchRequestInputsLength: batchRequestInputs.length,
      }),
    };
  } catch (error) {
    console.error('Error:', error);
    throw error;
  }
}

async function storeBatchRequest(requests: {
  status: string;
  body: string;
  batchId: string | undefined;
  filename: string | undefined;
  bytes: number | undefined;
  customId: string;
  // ... and more properties you want to correlate later on post processing
}[]) {
  try {
    console.log('Starting to store batch requests:', requests.length);

    // Split requests into chunks of 25 (DynamoDB batch write limit)
    const batches = chunk(requests, 25);
    console.log('Split into', batches.length, 'batches');

    for (let i = 0; i &amp;lt; batches.length; i++) {
      const batch = batches[i];
      console.log(`Processing batch ${i + 1} of ${batches.length}`);

      const batchWriteParams: BatchWriteCommandInput = {
        RequestItems: {
          [BATCH_REQUESTS_TABLE]: batch.map(request =&amp;gt; ({
            PutRequest: {
              Item: {
                id: request.customId,
                ...request,
                createdAt: new Date().toISOString(),
                updatedAt: new Date().toISOString()
              }
            }
          }))
        }
      };

      try {
        console.log(`Sending batch write command for batch ${i + 1}`);
        const result = await ddb.send(new BatchWriteCommand(batchWriteParams));

        // Check for unprocessed items
        if (result.UnprocessedItems &amp;amp;&amp;amp; Object.keys(result.UnprocessedItems).length &amp;gt; 0) {
          console.warn('Unprocessed items:', result.UnprocessedItems);
          // Optionally retry unprocessed items
          await retryUnprocessedItems(result.UnprocessedItems);
        }

        console.log(`Successfully processed batch ${i + 1}`);
      } catch (error) {
        console.error(`Error writing batch ${i + 1}:`, error);
        throw error;
      }
    }

    console.log('Successfully stored all batch requests');
  } catch (error) {
    console.error('Error in storeBatchRequest:', error);
    throw error;
  }
}

// Helper function to retry unprocessed items
async function retryUnprocessedItems(unprocessedItems: Record&amp;lt;string, any&amp;gt;) {
  try {
    const retryParams: BatchWriteCommandInput = {
      RequestItems: unprocessedItems
    };
    await ddb.send(new BatchWriteCommand(retryParams));
  } catch (error) {
    console.error('Error retrying unprocessed items:', error);
    throw error;
  }
}

async function createBatchFile(items: BatchRequestLineItem[], maxSizeMB: number = 200): Promise&amp;lt;FileLike[]&amp;gt; {
  const MAX_FILE_SIZE = maxSizeMB * 1024 * 1024; // Convert MB to bytes
  const files: FileLike[] = [];
  let currentItems: BatchRequestLineItem[] = [];
  let currentSize = 0;
  let fileIndex = 0;

  const createBatchFile = async (items: BatchRequestLineItem[], index: number): Promise&amp;lt;FileLike&amp;gt; =&amp;gt; {
    // Convert items to JSONL string
    const jsonlContent = items
      .map(item =&amp;gt; JSON.stringify(item))
      .join('\n');

    // Create file using OpenAI's toFile utility
    return await toFile(
      new Blob([jsonlContent], { type: 'application/jsonl' }),
      `batch_${Date.now()}_${index}.jsonl`
    );
  };

  // Process items and create batches
  for (const item of items) {
    const line = JSON.stringify(item) + '\n';
    const itemSize = Buffer.byteLength(line, 'utf-8');

    // Check if adding this item would exceed the size limit
    if (currentSize + itemSize &amp;gt; MAX_FILE_SIZE) {
      // Create file from current batch
      const file = await createBatchFile(currentItems, fileIndex);
      files.push(file);

      // Reset for next batch
      currentItems = [];
      currentSize = 0;
      fileIndex++;
    }

    // Add item to current batch
    currentItems.push(item);
    currentSize += itemSize;
  }

  // Process remaining items
  if (currentItems.length &amp;gt; 0) {
    const file = await createBatchFile(currentItems, fileIndex);
    files.push(file);
  }

  return files;
}

const getSystemPrompt = (event: GenerateTipsEvent) =&amp;gt; `You are... rest of prompt`

const getUserPrompt = (event: GenerateTipsEvent) =&amp;gt; `Generate ... rest of prompt`
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This function ran in 4,518 ms to upload one 4 MB file with 391 prompts totaling 1,301,368 tokens (in and out). &lt;br&gt;
Processing those prompts cost about $14, and the batch took 18 minutes to complete.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fesvclr5rf3bailbtyjqf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fesvclr5rf3bailbtyjqf.png" width="800" height="1099"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Yes, the numbers in the title don’t match: the figures above are only partial, covering just the tip generation shown in the workflow screenshot above. For the sake of simplicity and deduplication, I omitted the processing of quizzes, which follows the same approach.&lt;/p&gt;

&lt;p&gt;Please bear in mind that batch results may take up to 24h to become available, so this approach is only worth considering if you don’t need an answer right away.&lt;/p&gt;
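&lt;p&gt;Given that delay, a scheduled check (e.g., a rule triggering a Lambda that calls &lt;em&gt;openai.batches.retrieve(batchId)&lt;/em&gt;) fits better than waiting synchronously. Below is a sketch of the post-processing side: a terminal-status check and a parser for one line of the batch output file. The status names and output shape follow the Batch API documentation as I understand it; verify them against the current API reference before relying on this.&lt;/p&gt;

```typescript
// Batch statuses that require no further polling (per the Batch API docs).
const TERMINAL_STATUSES = new Set(['completed', 'failed', 'expired', 'cancelled']);

const isBatchDone = (status: string): boolean => TERMINAL_STATUSES.has(status);

// Assumed shape of one line in the batch output file.
interface BatchOutputLine {
  custom_id: string;
  response: { status_code: number; body: { choices: { message: { content: string } }[] } } | null;
  error: { code: string; message: string } | null;
}

// Extract the completion text (or an error) keyed by custom_id.
function extractCompletion(line: string): { customId: string; content?: string; error?: string } {
  const parsed = JSON.parse(line) as BatchOutputLine;
  if (parsed.error || !parsed.response || parsed.response.status_code !== 200) {
    return { customId: parsed.custom_id, error: parsed.error?.message ?? 'request failed' };
  }
  return { customId: parsed.custom_id, content: parsed.response.body.choices[0].message.content };
}

const sample = JSON.stringify({
  custom_id: 'req-1',
  response: { status_code: 200, body: { choices: [{ message: { content: '{"tips":[]}' } }] } },
  error: null,
});
console.log(extractCompletion(sample));
```

In the real workflow, each extracted result would be matched against the stored DynamoDB batch request rows via its custom_id.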

&lt;p&gt;&lt;strong&gt;Learning:&lt;/strong&gt; Why do now what we can put off until later?! :)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thanks for reading!&lt;/strong&gt; Got any cool ideas or feedback you want to share? Drop a comment, send me a message or &lt;a href="https://www.linkedin.com/in/fmalaquias/" rel="noopener noreferrer"&gt;follow me&lt;/a&gt; and let’s keep moving things forward!&lt;/p&gt;

</description>
      <category>openai</category>
      <category>chatgpt</category>
      <category>lambda</category>
      <category>costsavings</category>
    </item>
    <item>
      <title>So long DocumentDB, hello MongoDB Atlas</title>
      <dc:creator>Felipe Malaquias</dc:creator>
      <pubDate>Sun, 27 Oct 2024 12:11:39 +0000</pubDate>
      <link>https://dev.to/aws-builders/so-long-documentdb-hello-mongodb-atlas-1d14</link>
      <guid>https://dev.to/aws-builders/so-long-documentdb-hello-mongodb-atlas-1d14</guid>
      <description>&lt;h3&gt;
  
  
  Why did we replace DocumentDB with MongoDB Atlas?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmo7a1q9tj6u1nlmcf4jn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmo7a1q9tj6u1nlmcf4jn.png" alt="CPU IO Waits" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Update (August 01, 2025):&lt;/strong&gt; DocumentDB now supports serverless configuration, which is ideal for spikey workloads. See &lt;a href="https://aws.amazon.com/blogs/aws/amazon-documentdb-serverless-is-now-available/" rel="noopener noreferrer"&gt;https://aws.amazon.com/blogs/aws/amazon-documentdb-serverless-is-now-available/&lt;/a&gt; for more information.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update (March 16, 2025):&lt;/strong&gt; Since this article was originally published, the AWS DocumentDB team has implemented several significant improvements that address some points raised here, including NVMe support, enhanced memory usage for network I/O, additional compression options, and configurable shard instances. While I believe this article still provides valuable insights, I strongly recommend reviewing the latest AWS DocumentDB and MongoDB Atlas documentation for the most up-to-date information.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;TLDR:&lt;/strong&gt; Write &lt;a href="https://pt.wikipedia.org/wiki/IOPS" rel="noopener noreferrer"&gt;IOPS&lt;/a&gt; scaling capabilities (&lt;a href="https://www.mongodb.com/docs/manual/sharding/" rel="noopener noreferrer"&gt;sharding&lt;/a&gt;) and minor perks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disclaimer:&lt;/strong&gt; This article does not cover application optimization but rather focuses on comparing these two database services.&lt;/p&gt;

&lt;p&gt;Before we start, why did we even use DocumentDB?&lt;/p&gt;

&lt;p&gt;While moving our on-premise workload to AWS, &lt;a href="https://aws.amazon.com/documentdb/" rel="noopener noreferrer"&gt;DocumentDB&lt;/a&gt; seemed like a natural choice to replace our in-house-maintained MongoDB cluster.&lt;/p&gt;

&lt;p&gt;DocumentDB is a robust and reliable database service for MongoDB-based applications and can make your &lt;a href="https://aws.amazon.com/products/storage/lift-and-shift/" rel="noopener noreferrer"&gt;lift and shift&lt;/a&gt; process easier, but there are probably better fits for your needs.&lt;/p&gt;

&lt;p&gt;Why? The reasons vary according to your needs, but they will probably boil down to one or more of the ones described below.&lt;/p&gt;

&lt;h2&gt;
  
  
  Did you write your code using MongoDB drivers and expect it to behave like MongoDB? It won’t.
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Amazon DocumentDB is built on top of AWS’s custom Aurora platform, which has historically been used to host relational databases.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the first statement on the &lt;a href="https://www.mongodb.com/resources/compare/documentdb-vs-mongodb/architecture" rel="noopener noreferrer"&gt;architectural comparison&lt;/a&gt; at the MongoDB website. Amazon DocumentDB supports MongoDB v4.0 and v5.0, but it does not support all features from those versions or from newer versions (e.g., the latest &lt;a href="https://www.mongodb.com/resources/products/mongodb-version-history" rel="noopener noreferrer"&gt;MongoDB v8.0&lt;/a&gt;). DocumentDB is currently only about &lt;a href="https://www.isdocumentdbreallymongodb.com/" rel="noopener noreferrer"&gt;34% compatible with MongoDB&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Main differences that were relevant for us:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;no support for index prefix compression&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;no support for network compression&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;poor support for data compression (&lt;a href="https://docs.aws.amazon.com/documentdb/latest/developerguide/doc-compression.html" rel="noopener noreferrer"&gt;requires manual setup&lt;/a&gt; per collection and a threshold per document)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;further index type limitations in elastic setup&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;built on top of Aurora, making it difficult to keep up with MongoDB updates and to support the same capabilities&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Low capabilities for scaling write operations
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;While Aurora’s storage layer is distributed, its compute layer is not, limiting scaling options.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the second statement on the &lt;a href="https://www.mongodb.com/resources/compare/documentdb-vs-mongodb/architecture" rel="noopener noreferrer"&gt;architectural comparison&lt;/a&gt; page, and it is true.&lt;/p&gt;

&lt;p&gt;While DocumentDB is great for scaling read-heavy applications (currently, up to 15 instances—see the limits described &lt;a href="https://docs.aws.amazon.com/documentdb/latest/developerguide/limits.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;), it provides only a few options for scaling writes, and even so, those features were mainly &lt;a href="https://docs.aws.amazon.com/documentdb/latest/developerguide/release-notes.html#release-notes.09-18-2024" rel="noopener noreferrer"&gt;introduced in mid-2024&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;That means if your cluster is spending too much time waiting for I/O (check your cluster’s &lt;a href="https://docs.aws.amazon.com/documentdb/latest/developerguide/performance-insights-concepts.html" rel="noopener noreferrer"&gt;performance insights&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/documentdb/latest/developerguide/best_practices.html" rel="noopener noreferrer"&gt;disk queue depth&lt;/a&gt;) and you need to scale write IOPS/throughput, you can enable the I/O-optimized flag, which reduces costs and uses SSDs to increase performance. However, I found no clear &lt;a href="https://docs.aws.amazon.com/documentdb/latest/developerguide/db-cluster-storage-configs.html" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; on how many IOPS this translates to; from tests, it seems to use &lt;a href="https://aws.amazon.com/ebs/general-purpose/" rel="noopener noreferrer"&gt;EBS gp3 volumes&lt;/a&gt;, which means a 3,000 IOPS baseline.&lt;/p&gt;

&lt;p&gt;In addition to that, the only remaining options are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scaling vertically:&lt;/strong&gt; very expensive, with no clear documentation on how it translates to improved IOPS; it increases CPU, allowing more incoming requests to be processed, but storage remains a bottleneck for writes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sharding:&lt;/strong&gt; Sharding in DocumentDB is a &lt;a href="https://docs.aws.amazon.com/documentdb/latest/developerguide/release-notes.html#release-notes.09-18-2024" rel="noopener noreferrer"&gt;brand new feature&lt;/a&gt; released in 2024, and it is still not supported in all regions. In addition, it currently has some serious &lt;a href="https://docs.aws.amazon.com/documentdb/latest/developerguide/docdb-using-elastic-clusters.html" rel="noopener noreferrer"&gt;limitations&lt;/a&gt;: for example, last I checked, one could only configure 2 nodes per shard, which means very low resilience.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And that’s it. Those are the only options for scaling write operations in DocumentDB at the moment. Considering financial cost, resilience, and scalability, they are likely not a fit for write-heavy applications that require availability close to 100% and high write IOPS.&lt;/p&gt;

&lt;h2&gt;
  
  
  Better cloud-native technologies
&lt;/h2&gt;

&lt;p&gt;The last reason could be that you are building a new application and have the flexibility of choosing some other cloud-native technology that better fits your use case without the concern of having to refactor your application’s data layer and introduce breaking changes.&lt;/p&gt;

&lt;p&gt;Document-based DBs offer great flexibility for storing data, but nowadays several technologies, like DynamoDB, can provide better scalability and throughput with serverless offerings. However, each requires a different approach in your application, with its own pitfalls.&lt;/p&gt;

&lt;p&gt;If you are writing a brand new application, think about how your data changes over time, how it is structured, and how often it will be inserted, updated, deleted, and queried. Think about the required availability of your DB cluster, its resilience, and its costs vs. the risk of financial loss from downtime. Then choose a technology and sizing that best fit your specific use case. You will quite likely end up with something other than DocumentDB.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are the benefits of migrating from DocumentDB to MongoDB Atlas?
&lt;/h2&gt;

&lt;p&gt;In short, besides vertical scaling (usually a much more expensive solution), MongoDB Atlas has better support for sharding, a wider range of instance types (low-CPU, general, and NVMe), and provisioned IOPS, which gives you more fine-grained control over your cluster’s capabilities and costs.&lt;/p&gt;

&lt;p&gt;The table below shows a brief comparison between both solutions and their current functionality as of the writing of this article.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs388cm9ayjmpjiiardxk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs388cm9ayjmpjiiardxk.png" alt="DocumentDB vs MongoDB Atlas brief comparison" width="800" height="314"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One can also set up an Atlas cluster using &lt;a href="https://www.mongodb.com/docs/atlas/security-private-endpoint/" rel="noopener noreferrer"&gt;private links&lt;/a&gt;, which might improve latency (especially for queries that require &lt;a href="https://www.mongodb.com/docs/manual/reference/command/getMore/" rel="noopener noreferrer"&gt;getMore&lt;/a&gt;), but I haven’t tested this yet.&lt;/p&gt;

&lt;p&gt;In addition, one can write code to have some compute autoscaling in DocumentDB, but it’s not supported out of the box. For more info, see recommendations for DocumentDB scaling &lt;a href="https://docs.aws.amazon.com/documentdb/latest/developerguide/db-cluster-manage-performance.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  NVMe instances
&lt;/h2&gt;

&lt;p&gt;At first, I thought NVMe instances would solve all our problems. Don’t let yourself get fooled by it!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0o7kz4r8bbkj1p1c4y1j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0o7kz4r8bbkj1p1c4y1j.png" alt="MongoDB Atlas M40 Local NVMe SSD current offering" width="800" height="328"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Yes, they do provide crazy amounts of IOPS, which may result in excellent throughput, and they are relatively cheap compared to the much poorer results you would get from something like provisioned IOPS (capped at 6,000 due to AWS storage block restrictions). Still, they come at a more significant cost: &lt;strong&gt;a significantly long time to recover&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In general, NVMe clusters are very robust and performant, but to achieve such high storage throughput, they use locally attached ephemeral &lt;a href="https://www.mongodb.com/docs/atlas/manage-clusters/#std-label-nvme-storage" rel="noopener noreferrer"&gt;NVMe&lt;/a&gt; (non-volatile memory express) SSDs. As a consequence, a &lt;a href="https://www.mongodb.com/docs/manual/core/replica-set-sync/#file-copy-based-initial-sync" rel="noopener noreferrer"&gt;file copy based initial sync&lt;/a&gt; is always used to sync the nodes of an NVMe cluster whenever an initial sync is required. Because of that, scaling the cluster up or down, or recovering backups, takes a very long time; if you need to scale fast, you will find yourself in a terrible situation.&lt;/p&gt;

&lt;p&gt;My advice? Avoid those as much as you can by optimizing your application and scaling the DB cluster horizontally by adding shards if you need write IOPS, simply scaling the number of instances if you need read IOPS only, or both in the worst case. You might even end up with a cheaper price for similar or even better performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Low-CPU instances
&lt;/h2&gt;

&lt;p&gt;Those instances are great if your application doesn’t require much processing on the DB side, which is a good practice.&lt;/p&gt;

&lt;p&gt;CPU and memory are much more valuable resources on the DB side than in your application container. Scaling application containers and distributing parallel processing is a much cheaper and less time-sensitive operation than scaling DB clusters, especially with all the easy-to-use and great capabilities of Kubernetes, Karpenter in combination with spot instances, and so on.&lt;/p&gt;

&lt;p&gt;By using low-CPU instances, you may benefit from better pricing when choosing instances with more available memory. Memory is very important for caching, which speeds up your queries by reducing the need to load data from disk, an operation that is slow and can quickly degrade your cluster’s performance.&lt;/p&gt;

&lt;p&gt;If you have questions about sizing, I recommend reading the &lt;a href="https://www.mongodb.com/docs/atlas/sizing-tier-selection/" rel="noopener noreferrer"&gt;official MongoDB documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  General instances
&lt;/h2&gt;

&lt;p&gt;Those instances have double the CPU of the low-CPU tier, but they also come at about a 20% price increase. So, if you expect processing peaks and can afford the price, go for it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Atlas Console
&lt;/h2&gt;

&lt;p&gt;The Atlas console offers great features for executing queries and aggregation pipelines against your databases, as well as intelligent detection of inefficient indexes and much more.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9iokpdgem19n1dyhwl0n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9iokpdgem19n1dyhwl0n.png" alt="MongoDB Atlas Collection View" width="800" height="240"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Because of the features offered within the console and the ease of connecting to the cluster through Mongo Shell, we no longer need licenses for third-party tools such as &lt;a href="https://studio3t.com/" rel="noopener noreferrer"&gt;Studio3T&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In addition, it offers much more in-depth metrics than DocumentDB for analyzing your cluster, like how much data is being compressed on disk.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdmhgw8qg4u89jhzvsuh6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdmhgw8qg4u89jhzvsuh6.png" alt="Normalized system CPU metrics of sharded cluster" width="800" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You might want to ship these metrics to another place like Grafana, though: if you want to analyze peaks in the past, MongoDB Atlas averages older metrics over 1-hour windows to save processing, so short spikes get flattened and the built-in charts are not very useful in that regard.&lt;/p&gt;
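&lt;p&gt;A minimal sketch (with made-up numbers) of why hourly averaging hides peaks:&lt;/p&gt;

```python
# Illustrative only: a short CPU spike averaged over a 1-hour window.
minute_cpu = [10.0] * 60          # one hour of per-minute CPU utilization (%)
minute_cpu[30:35] = [95.0] * 5    # a 5-minute spike to 95%

hourly_average = sum(minute_cpu) / len(minute_cpu)
peak = max(minute_cpu)

print(f"peak={peak:.0f}%, hourly average={hourly_average:.2f}%")
```

&lt;p&gt;The 95% spike that caused trouble shows up as a roughly 17% average, which is why exporting full-resolution metrics to an external store matters.&lt;/p&gt;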

&lt;h2&gt;
  
  
  Query Insights
&lt;/h2&gt;

&lt;p&gt;The main reason we opted to migrate from DocumentDB to MongoDB Atlas was the ability to scale write throughput. Still, I have to confess that the metrics and tools offered by Atlas make developers' lives much easier. They provide an excellent overview of overall DB performance, point out slow queries that can very likely be optimized on the application side (making applications faster and more reliable for end users), and create opportunities to reduce costs by fine-tuning the DB cluster to your needs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzs8ktwkhrqnehmkyjdm6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzs8ktwkhrqnehmkyjdm6.png" alt="P99 latency of reads and writes operations" width="800" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Query Profiler
&lt;/h2&gt;

&lt;p&gt;The query profiler clusters queries so that you can analyze how the engine processed them in great detail. When you click on one of those clusters below, you will find information about how many keys were examined, how many documents were read, how long the query planner took to process the query, how long it took to read the documents from disk, and much more.&lt;/p&gt;

&lt;p&gt;The coloring also makes it very easy to identify the slowest collections in your DB, which may help to identify strange access patterns and inefficient data structure and/or indexing, among other possible problems.&lt;/p&gt;
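&lt;p&gt;The per-query numbers the profiler shows can also be read off explain output. A rough sketch of the ratios worth watching (the field names match MongoDB’s explain("executionStats") output; the values here are made up):&lt;/p&gt;

```python
# Hypothetical numbers; field names are from MongoDB's explain("executionStats").
execution_stats = {
    "nReturned": 100,
    "totalKeysExamined": 100,    # index keys scanned
    "totalDocsExamined": 5000,   # full documents fetched
}

# Ratios close to 1 mean the index matches the query well; large ratios
# mean the engine examines far more data than it returns (a slow query).
keys_ratio = execution_stats["totalKeysExamined"] / execution_stats["nReturned"]
docs_ratio = execution_stats["totalDocsExamined"] / execution_stats["nReturned"]

print(f"keys examined per returned document: {keys_ratio:.1f}")
print(f"docs examined per returned document: {docs_ratio:.1f}")
```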

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj1zbeqpcqbbtkv7amabu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj1zbeqpcqbbtkv7amabu.png" alt="Slow queries overview" width="800" height="549"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Support
&lt;/h2&gt;

&lt;p&gt;I believe nobody can provide better support for a service, tool, or framework than its source. So, if you have a MongoDB-based application, MongoDB's own experts may be able to help you :)&lt;/p&gt;

&lt;p&gt;We had a good experience with tailored support for evaluating and identifying bottlenecks, exchanging solutions, and sizing our cluster. Although AWS also offers good support, in my personal experience DocumentDB experts will only analyze the health of your cluster itself and will not dive deep into your needs or make recommendations based on your application's implementation.&lt;/p&gt;

&lt;p&gt;As we have an enterprise contract with MongoDB Atlas (no, they are not sponsoring this article in any way, and all the content here expresses my own opinion and experience), we benefited from an in-depth analysis of our needs, from before we migrated the data until after go-live.&lt;/p&gt;

&lt;h2&gt;
  
  
  Drawbacks
&lt;/h2&gt;

&lt;p&gt;If you migrate things as they are, without identifying and solving issues in your application beforehand, you might find yourself paying a lot for an overprovisioned cluster, as Atlas clusters are not the cheapest thing around.&lt;/p&gt;

&lt;p&gt;In addition, it might add complexity to your setup and require developers to gain more in-depth knowledge of DB setup, sharding, and query optimization. Still, I see this as a benefit: knowing that things work without knowing why is dangerous.&lt;/p&gt;

&lt;p&gt;On the other hand, more complexity and more fine-tuning opportunities also pose more risk of messing things up, so you will need to pay closer attention to detail while setting up the DB.&lt;/p&gt;

&lt;h2&gt;
  
  
  Compute Auto-Scaling
&lt;/h2&gt;

&lt;p&gt;As odd as it sounds, I considered putting auto-scaling under the Drawbacks section. As good as auto-scaling sounds, and as well as it is portrayed in MongoDB Atlas’ documentation, it may cause more harm than good in your applications.&lt;/p&gt;

&lt;p&gt;The reason is that autoscaling happens on a rolling basis, which is fine in principle. However, it takes nodes down one by one before updating them, which degrades the performance of your cluster even further: the load shifts to the remaining nodes, and the degradation may last longer than it would have if autoscaling were disabled and your application had been left to stabilize through caching and other mechanisms. Therefore, if your application needs to handle such peaks without any unavailability, you might need to disable autoscaling and overprovision beforehand, for example when you know when your application expects peaks.&lt;/p&gt;

&lt;p&gt;If this scenario is not your concern, auto-scaling might be a handy tool for optimizing costs while dealing with extra load when necessary.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hatchet
&lt;/h2&gt;

&lt;p&gt;Well, this has nothing to do with MongoDB Atlas itself, but I learned about this tool from a MongoDB consultant during one of our sessions and thought it would be helpful to share it here.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2108%2F0%2AnePMG1s7yfjbULIc" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2108%2F0%2AnePMG1s7yfjbULIc" alt="Hatchet log summary — example extracted from github repository" width="1054" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/simagix/hatchet" rel="noopener noreferrer"&gt;Hatchet&lt;/a&gt; is a MongoDB JSON log analyzer and viewer implemented by someone from MongoDB that provides great support for query analysis. It also has a text search that makes it quicker to find issues. You just need to export the logs directly from the MongoDB console and import them into Hatchet, which will provide you with a summary of the insights in addition to some details about them.&lt;/p&gt;

&lt;p&gt;Check it out if you ever need to go through MongoDB logs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance and Costs
&lt;/h2&gt;

&lt;p&gt;Finally, let’s talk about what really matters.&lt;/p&gt;

&lt;p&gt;Before we get to performance and cost comparisons, let’s look at our use case so we can better understand the problem.&lt;/p&gt;

&lt;p&gt;This specific database serves two backend services. One backend service (let’s call it the Writer application) listens to several Kafka topics, aggregates the data in an optimal way for reads by the other service, and writes it to the DB. It connects to primaries only (primary &lt;a href="https://www.mongodb.com/docs/manual/core/read-preference/" rel="noopener noreferrer"&gt;read preference&lt;/a&gt;) and is write-heavy with few parallel connections to the DB.&lt;/p&gt;

&lt;p&gt;In this Writer application, we want to keep the &lt;a href="https://docs.confluent.io/cloud/current/_glossary.html#term-consumer-lag" rel="noopener noreferrer"&gt;consumer lag&lt;/a&gt; always close to 0 in order to provide real-time, up-to-date data to the other application (let’s call it the Reader application). Lag in this application translates to outdated data in the Reader application, which should not happen (or at least the data should stay as close to real-time as possible).&lt;/p&gt;

&lt;p&gt;The Reader application preferably connects to secondaries (the secondaryPreferred read preference) and is a read-only application that performs thousands of queries per second and provides output to other applications. The Reader application is read-heavy, and latency is critical, in addition to high availability. It must run 24 hours a day, 365 days a year, with an overall average latency per processed request under 100ms, which ideally translates to less than 10ms per DB query on average.&lt;/p&gt;
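&lt;p&gt;As a sketch, the two read preferences boil down to a single standard URI option per application (host and database names below are placeholders):&lt;/p&gt;

```python
# Placeholders for illustration; readPreference is a standard MongoDB URI option.
HOST = "cluster0.example.mongodb.net"

# Writer: write-heavy, reads from primaries only.
writer_uri = f"mongodb+srv://{HOST}/appdb?readPreference=primary"

# Reader: read-heavy, prefers secondaries to spread the query load.
reader_uri = f"mongodb+srv://{HOST}/appdb?readPreference=secondaryPreferred"

print(writer_uri)
print(reader_uri)
```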

&lt;p&gt;Scaling read operations in DocumentDB is not a problem and is not very expensive. One scales the number of replicas, distributes the load among them, and is done.&lt;/p&gt;

&lt;p&gt;Scaling write operations in DocumentDB is, however, the challenge.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7c543en8kswd5x65ramz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7c543en8kswd5x65ramz.png" alt="Kafka consumer lag caused by CPU IO wait leading to outdated data in DB reader nodes" width="800" height="197"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see in the example above, there were times when peaks of updates in some Kafka topics took a long time to be processed and stored in the DB. This was mainly because of CPU I/O waits on the single primary node in our DocumentDB cluster, which already had its CPU overprovisioned in a failed attempt to scale IOPS (more CPU won’t scale IOPS beyond the storage's capacity).&lt;/p&gt;

&lt;p&gt;That is how sharding solves the problem. By distributing the data across multiple primary nodes, each in its own shard, we can scale write operations horizontally, similar to how we scale read operations by increasing the number of nodes. Say one primary can handle 3000 IOPS: by distributing the data over three shards, we triple the capacity to 9000 IOPS, provided the data is distributed evenly.&lt;/p&gt;
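&lt;p&gt;The arithmetic above can be sketched as follows (3000 IOPS per primary is the illustrative figure from the paragraph, not a guaranteed number):&lt;/p&gt;

```python
IOPS_PER_PRIMARY = 3000  # illustrative figure, not a guaranteed number

def cluster_write_iops(shards: int, per_primary: int = IOPS_PER_PRIMARY) -> int:
    """Aggregate write IOPS, assuming a perfectly even shard key distribution."""
    return shards * per_primary

print(cluster_write_iops(1))  # 3000: a single primary
print(cluster_write_iops(3))  # 9000: three shards, writes scaled horizontally
```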

&lt;p&gt;Unfortunately, DocumentDB offers only limited sharding support (&lt;a href="https://docs.aws.amazon.com/documentdb/latest/developerguide/elastic-how-it-works.html" rel="noopener noreferrer"&gt;elastic cluster&lt;/a&gt;), with only two nodes per shard, which means low resilience for critical workloads.&lt;/p&gt;

&lt;p&gt;Be careful when dealing with shards, though. Sharding collections may make them far less efficient, so you’ll need to dig into optimizations, access patterns, index efficiency, and many other aspects before deciding to shard your collections. Also, be extra careful when choosing your shard keys to avoid &lt;a href="https://www.mongodb.com/docs/manual/core/sharding-troubleshooting-shard-keys/#uneven-load-distribution" rel="noopener noreferrer"&gt;hot shards&lt;/a&gt; and query inefficiency.&lt;/p&gt;
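&lt;p&gt;To illustrate why hashed shard keys avoid hot shards (MD5 is used here for simplicity; MongoDB’s hashed indexes use their own 64-bit hash, so this is only a sketch of the idea):&lt;/p&gt;

```python
import hashlib
from collections import Counter

def shard_for(key: str, shards: int = 3) -> int:
    # Any uniform hash spreads keys evenly across shards. By contrast, a
    # monotonically increasing range shard key (e.g., a timestamp) would
    # land every new write on the same shard: a hot shard.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % shards

distribution = Counter(shard_for(f"order-{i}") for i in range(9000))
print(distribution)  # roughly 3000 documents per shard
```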

&lt;p&gt;So, what does it mean in terms of performance and costs?&lt;/p&gt;

&lt;p&gt;In DocumentDB, we operated an &lt;a href="https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-documentdb-i-o-optimized/" rel="noopener noreferrer"&gt;I/O-optimized&lt;/a&gt; db.r6g.8xlarge cluster, which cost us about EUR 11k/month. In MongoDB Atlas, we used an M40 cluster with 4 shards of 3 nodes each for initial tests and comparison, which would cost us about half that price: EUR 4.7k/month.&lt;/p&gt;

&lt;p&gt;The best thing is that in Atlas, you are not restricted to only two nodes per shard, which significantly helps resilience and read load distribution.&lt;/p&gt;

&lt;p&gt;In our tests, we used one specific collection that has a high frequency of updates and is very large in size, making it a very good candidate for sharding. We basically reset the offset of the consumer writing to this collection, waited for it to process all messages against both the MongoDB Atlas and DocumentDB clusters, and obtained the following results:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy824i3wolgf6vue1oiqw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy824i3wolgf6vue1oiqw.png" alt="DocumentDB updated documents per minute count" width="800" height="289"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqnpas7bou3uonslxkeuj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqnpas7bou3uonslxkeuj.png" alt="MongoDB Atlas M40, 4 shards, documents updated per second count" width="800" height="281"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you convert DocumentDB metrics to updated documents per second, the throughput of the MongoDB Atlas sharded cluster is about 5 times higher than that of DocumentDB with no shards. Not to mention that in DocumentDB the CPU was blocked most of the time waiting for I/O, which made it very slow at processing other data and, as a consequence, led to multiple outdated collections and slow processing of all writes on the single primary node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs2d3ugm86qyul71zcieo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs2d3ugm86qyul71zcieo.png" width="800" height="945"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The difference can also be seen on the client side, in the Writer application, by looking at its Kafka consumer lag:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3tut2xby33d05cup8v0f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3tut2xby33d05cup8v0f.png" alt="Kafka lag while resetting offsets with DocumentDB cluster with single primary node" width="800" height="269"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz4l4he8qcgksysob7i45.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz4l4he8qcgksysob7i45.png" alt="Kafka lag while resetting offsets with MongoDB Atlas M40 4 Shards" width="800" height="597"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While DocumentDB processed all messages in about two hours, the sharded cluster in MongoDB Atlas took about 20 minutes.&lt;/p&gt;

&lt;p&gt;Note that the tests were performed on different dates, so the lag won’t precisely match the 5-to-7-times-higher throughput: the MongoDB Atlas test ran earlier, when the Kafka topic held fewer messages than it did when tested against DocumentDB. The primary metric for comparison is therefore updated documents per second, but the Kafka lag metric still gives you a sense of what the difference means for keeping data up to date.&lt;/p&gt;

&lt;p&gt;In summary, we achieved about five times the throughput for half the money. In reality, our current setup is slightly different, and we end up paying about the same as we used to for DocumentDB, but that has to do with current autoscaling capabilities and load shifting during the scaling process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Provisioned IOPS
&lt;/h2&gt;

&lt;p&gt;Sharding is great for scaling write operations. However, it is not something you can use to quickly scale your IOPS when your system is already under heavy load, as it requires balancing documents between the shards. This process takes both time and resources, and it is usually scheduled to run during known periods of low traffic in your DB so as not to affect the performance of your application. &lt;/p&gt;

&lt;p&gt;MongoDB Atlas offers the possibility of provisioning IOPS on demand, scaling from 3000 to 6000 IOPS per shard. This lets you double the IOPS capacity of the whole cluster in a matter of minutes, enabling more read/write capacity without creating new shards and waiting for the cluster to rebalance.&lt;/p&gt;
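&lt;p&gt;Using the figures above (3000 base IOPS per shard, 6000 when provisioned), the effect on a 4-shard cluster looks like this:&lt;/p&gt;

```python
# Figures from the paragraph above: 3000 base IOPS per shard, 6000 provisioned.
BASE_IOPS, PROVISIONED_IOPS = 3000, 6000

def cluster_iops(shards: int, per_shard: int) -> int:
    return shards * per_shard

# A 4-shard cluster doubles its aggregate IOPS in minutes, with no
# new shards to create and no rebalancing to wait for.
before = cluster_iops(4, BASE_IOPS)
after = cluster_iops(4, PROVISIONED_IOPS)
print(before, after)
```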

&lt;p&gt;One could use provisioned IOPS as a temporary solution for a short period, postponing the creation of new shards, as provisioned IOPS tends to be relatively expensive.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Technology is rarely good or bad in itself; perhaps it’s being misused, or it’s simply not the best fit for your needs. No single size fits all.&lt;/p&gt;

&lt;p&gt;Don't change anything if you have no problems (cost, maintenance, performance, availability, etc.). Spend your time somewhere else.&lt;/p&gt;

&lt;p&gt;If you do have some real problem to solve, get to know your data and your write and read patterns. Dig into query and index optimizations before you even think about any migration. Invest time in understanding what is really happening in your application that is causing latencies. Do not simply throw more money at cloud providers to scale things up indefinitely, postponing the unavoidable review of your own code and choices.&lt;/p&gt;

&lt;p&gt;If you are writing a new application, evaluate which database solutions fit your needs best. It could be an SQL database, a serverless database, or both. Perhaps you expect frequent changes in your data structure and want a document-based DB, such as DynamoDB or MongoDB Atlas.&lt;/p&gt;

&lt;p&gt;Thankfully, the number of choices nowadays is vast, and some technologies will better suit your use case than others.&lt;/p&gt;

&lt;p&gt;If you need to scale writes, consider sharding or some of the newer serverless options with provisioned IOPS and the like (be careful with provisioned IOPS, though, as they tend to be very expensive).&lt;/p&gt;

&lt;p&gt;And, very importantly, make decisions backed by data and facts. Perform benchmark tests and try different technologies and scenarios. Evaluate their monitoring, recovery, and scaling capabilities. Get to know their support, and be well prepared to avoid unexpected costs.&lt;/p&gt;

&lt;p&gt;Good luck with your decisions!&lt;/p&gt;

</description>
      <category>mongodb</category>
      <category>aws</category>
      <category>docdb</category>
      <category>database</category>
    </item>
    <item>
      <title>Accelerating App Development with AppSync Gen2 &amp; Generative AI</title>
      <dc:creator>Felipe Malaquias</dc:creator>
      <pubDate>Mon, 08 Apr 2024 18:20:39 +0000</pubDate>
      <link>https://dev.to/aws-builders/accelerating-app-development-with-appsync-gen2-generative-ai-3l21</link>
      <guid>https://dev.to/aws-builders/accelerating-app-development-with-appsync-gen2-generative-ai-3l21</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;In this insightful AWS meetup hosted by Idealo in their Berlin office, Felipe Malaquias delves into the transformative power of AWS Amplify Gen2 and generative AI in expediting app development. AWS Amplify Gen2 offers an array of libraries and services and empowers frontend developers to create full-stack applications with ease. With it, creating robust applications becomes seamless, requiring minimal infrastructure knowledge. He'll also explore the burgeoning field of generative AI, including large language models, showcasing their potential to revolutionize innovation and time to market.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/uIh3cI3FgTg"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

</description>
      <category>aws</category>
      <category>awscommunity</category>
      <category>awsmeetup</category>
      <category>ai</category>
    </item>
    <item>
      <title>Secure Proxy Server in AWS</title>
      <dc:creator>Felipe Malaquias</dc:creator>
      <pubDate>Tue, 26 Mar 2024 18:40:32 +0000</pubDate>
      <link>https://dev.to/aws-builders/secure-proxy-server-in-aws-3cfc</link>
      <guid>https://dev.to/aws-builders/secure-proxy-server-in-aws-3cfc</guid>
      <description>&lt;h2&gt;
  
  
  Securely access third-party content with whitelisted IP from wherever you are.
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2BV7d5qc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/9204/0%2A98Yr1hQnNRYPnDqr" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2BV7d5qc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/9204/0%2A98Yr1hQnNRYPnDqr" alt="Photo by [Sander Weeteling](https://unsplash.com/@sanderweeteling?utm_source=medium&amp;amp;utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral)" width="800" height="534"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to jump directly to the solution using CDK, go &lt;a href="https://github.com/malaquf/aws-cdk-proxy-server"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Proxy servers are used for several purposes; in general, they provide a gateway between users and the destinations they want to access.&lt;/p&gt;

&lt;p&gt;In this example, we will tackle the scenario where you want to access, from wherever you are, third-party content protected by a firewall that only accepts specific whitelisted IPs.&lt;/p&gt;

&lt;p&gt;Well, I’m sure there are a couple of ways to solve this, but here I’ll describe a solution that doesn’t add much cost or complexity to your infrastructure (like dealing with &lt;a href="https://aws.amazon.com/vpn/"&gt;VPN clients&lt;/a&gt; or &lt;a href="https://aws.amazon.com/directconnect/"&gt;Direct Connect&lt;/a&gt; links) while not compromising security.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eqDwJiHQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AIUvXKo8sELwODlP_5ytZaA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eqDwJiHQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AIUvXKo8sELwODlP_5ytZaA.png" alt="Tunneling to EC2 on a private subnet with SSM" width="741" height="486"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Although &lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/vpc-nat-gateway.html"&gt;NAT Gateways&lt;/a&gt; should generally be avoided, as they might incur &lt;a href="https://www.cloudzero.com/blog/reduce-nat-gateway-costs"&gt;unnecessary costs&lt;/a&gt; when outbound connectivity from private subnets could be solved differently (e.g., by using &lt;a href="https://docs.aws.amazon.com/vpc/latest/privatelink/create-interface-endpoint.html"&gt;VPC endpoints&lt;/a&gt; or &lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/egress-only-internet-gateway.html"&gt;IPv6 egress-only internet gateways&lt;/a&gt;), in this case I opted to configure my &lt;a href="https://repost.aws/knowledge-center/nat-gateway-vpc-private-subnet"&gt;VPC with a private subnet with NAT Gateway&lt;/a&gt;. This gives me enough flexibility to control how requests are proxied to the third-party service while still maintaining a small, fixed list of public IPs to be whitelisted at the third-party service (e.g., 3 &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/elastic-ip-addresses-eip.html"&gt;elastic IPs&lt;/a&gt; from the NAT Gateway, one for each &lt;a href="https://aws.amazon.com/about-aws/global-infrastructure/regions_az/"&gt;AZ&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;For the proxy server, I chose &lt;a href="https://aws.amazon.com/ec2/"&gt;EC2&lt;/a&gt; with &lt;a href="https://www.squid-cache.org/"&gt;Squid Cache&lt;/a&gt;, as I wanted to keep things simple and didn’t need &lt;a href="https://uptime.is/"&gt;4 9s availability&lt;/a&gt; for this server; if I ever do, I can restart it or quickly spin up a new one. Of course, if you want to go cheaper, you might also consider &lt;a href="https://aws.amazon.com/de/ec2/spot/"&gt;spot instances&lt;/a&gt;, set up &lt;a href="https://repost.aws/knowledge-center/start-stop-lambda-eventbridge"&gt;functions for starting and stopping the instance&lt;/a&gt; when you need it, or even do it manually if it’s only used once in a while.&lt;/p&gt;

&lt;p&gt;A couple of important things about the EC2 setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Do not assign a key pair for login (it is not needed; generally, having long-lived keys lying around is a security risk).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Assign the EC2 instance to a private subnet to reduce exposure to the public internet.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use a security group with no INGRESS rules (we will use &lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/setup-create-vpc.html"&gt;SSM VPC endpoints&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-connect-methods.html"&gt;instance connect&lt;/a&gt; to access it instead).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ensure that EGRESS TCP is enabled for all destinations (or restrict it to the ephemeral ports used to communicate with AWS services and the services you want your proxy to access).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Make sure you assign the ‘AmazonSSMRoleForInstancesQuickSetup’ IAM instance profile (or a custom one with the same permissions) to it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use Amazon Linux 2 or newer AMI (&lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-connect-prerequisites.html#eic-prereqs-amis"&gt;check which AMIs support instance connect&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Add a ‘Name’ tag with a value of ‘proxy-server’, for example, so you can easily automate the tunnel creation later with a script.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
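&lt;p&gt;As a sketch, the checklist above maps to boto3 run_instances parameters roughly like this (all IDs and the AMI are placeholders; note that no KeyName is set, on purpose):&lt;/p&gt;

```python
# Sketch only: placeholder IDs/AMI; intentionally no "KeyName" (no key pair).
run_instances_kwargs = {
    "ImageId": "ami-0123456789abcdef0",            # placeholder: Amazon Linux 2+ AMI
    "InstanceType": "t3.micro",
    "MinCount": 1,
    "MaxCount": 1,
    "SubnetId": "subnet-0123456789abcdef0",        # placeholder: the private subnet
    "SecurityGroupIds": ["sg-0123456789abcdef0"],  # placeholder: no ingress rules
    "IamInstanceProfile": {"Name": "AmazonSSMRoleForInstancesQuickSetup"},
    "TagSpecifications": [{
        "ResourceType": "instance",
        "Tags": [{"Key": "Name", "Value": "proxy-server"}],
    }],
}
# e.g. boto3.client("ec2").run_instances(**run_instances_kwargs)
print(sorted(run_instances_kwargs))
```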

&lt;p&gt;Use the following user data to install and start Squid, EC2 Instance Connect, and the SSM agent:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/bin/bash

yum update -y -q

sudo yum install -y ec2-instance-connect
sudo systemctl enable amazon-ssm-agent
sudo systemctl start amazon-ssm-agent

sudo yum -y install squid

sudo service squid restart
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;To access the EC2 instance, we will use instance connect. As the EC2 is in a private subnet, we need to create the following VPC Endpoints to be able to access it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;ssm.&lt;em&gt;region&lt;/em&gt;.amazonaws.com&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ssmmessages.&lt;em&gt;region&lt;/em&gt;.amazonaws.com&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ec2messages.&lt;em&gt;region&lt;/em&gt;.amazonaws.com&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
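&lt;p&gt;These endpoints follow AWS’s interface endpoint naming scheme, which can be derived from the region (shown here for eu-central-1):&lt;/p&gt;

```python
# AWS interface endpoint service names follow "com.amazonaws.<region>.<service>".
region = "eu-central-1"

endpoint_services = [
    f"com.amazonaws.{region}.{service}"
    for service in ("ssm", "ssmmessages", "ec2messages")
]
print(endpoint_services)
```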

&lt;p&gt;Finally, you need a role you will assume for creating a tunnel and port forwarding to your proxy server, through a temporary ssh session started by SSM. This role must have the following permissions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Action": [
        "ssm:StartSession",
        "ec2-instance-connect:SendSSHPublicKey"
      ],
      "Resource": [
        "arn:aws:ec2:*:*:instance/*"
      ],
      "Condition": {
        "StringEquals": { "aws:ResourceTag/Name": "proxy-server" }
      }
    },
    {
      "Sid": "",
      "Effect": "Allow",
      "Action": [
        "ssm:StartSession"
      ],
      "Resource": [
        "arn:aws:ssm:*:*:document/AWS-StartSSHSession"
      ]
    },
    {
      "Sid": "",
      "Effect": "Allow",
      "Action": [
        "ssm:TerminateSession",
        "ssm:ResumeSession"
      ],
      "Resource": ["arn:aws:ssm:*:*:session/${aws:username}-*"]
    },
    {
      "Sid": "",
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances"
      ],
      "Resource": "*"
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The statements above grant permissions to describe all EC2 instances in your account, send SSH public keys, and start and terminate sessions on EC2 instances named ‘proxy-server’.&lt;/p&gt;

&lt;p&gt;On your computer, to start the session and create the tunnel, you need to install and configure the aws-cli and the &lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-working-with-install-plugin.html"&gt;session manager plugin for aws-cli&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Finally, you can create the tunnel with the following script:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FORWARDED_PORT=3128
AWS_REGION=&amp;lt;&amp;lt;your AWS region&amp;gt;&amp;gt;

ec2_instance_id=$(aws ec2 describe-instances \
  --filters Name=tag:Name,Values=proxy-server Name=instance-state-name,Values=running \
  --output text --query 'Reservations[*].Instances[*].InstanceId')

ec2_az=$(aws ec2 describe-instances \
                --filters Name=tag:Name,Values=proxy-server Name=instance-state-name,Values=running \
                --output text --query 'Reservations[*].Instances[*].Placement.AvailabilityZone')


echo "Generating temporary keys"

TMP=$(mktemp -u key.XXXXXX)".pem"

ssh-keygen -t rsa -f "$TMP" -N "" -q -m PEM

aws ec2-instance-connect send-ssh-public-key \
  --region ${AWS_REGION} \
  --instance-id ${ec2_instance_id} \
  --availability-zone ${ec2_az} \
  --instance-os-user ec2-user \
  --ssh-public-key "file://$TMP.pub"

ssh -i $TMP \
      -Nf -M \
      -L ${FORWARDED_PORT}:localhost:${FORWARDED_PORT} \
      -o "UserKnownHostsFile=/dev/null" \
      -o "StrictHostKeyChecking=no" \
      -o IdentitiesOnly=yes \
      -o ProxyCommand="aws ssm start-session --target %h --document AWS-StartSSHSession --parameters portNumber=%p --region ${AWS_REGION}" \
      ec2-user@${ec2_instance_id}

rm $TMP "$TMP.pub"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Before running the script above, ensure the aws-cli is properly configured (e.g., via SSO) to access the account you deployed your proxy server to.&lt;/p&gt;

&lt;p&gt;The script above creates a temporary key pair and uploads the public key to the EC2 instance, granting you temporary access to it. The key is automatically removed after 60 seconds; if you haven’t connected with it during that period, you need to create and send a new one. After the key is sent, an SSH tunnel is established, forwarding local port 3128 to the proxy. Finally, you can access the third-party content through your proxy via localhost:3128, as in the curl example below:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -x "127.0.0.1:3128" "http://httpbin.org/ip"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;To destroy the tunnel, you may execute the following command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lsof -P | grep ':'${FORWARDED_PORT} | awk '{print $2}' | xargs kill -9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That’s it. Consider fine-tuning /etc/squid/squid.conf to secure it even further against misuse.&lt;/p&gt;
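
&lt;p&gt;For instance, a minimal tightening could restrict the proxy to tunnelled localhost clients and an explicit destination allow-list (a sketch; the domain below is a placeholder):&lt;/p&gt;

```
# /etc/squid/squid.conf (sketch)
acl allowed_dst dstdomain .example-partner.com
http_access allow localhost allowed_dst
http_access deny all
http_port 3128
```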

&lt;p&gt;An example with CDK is available on &lt;a href="https://github.com/malaquf/aws-cdk-proxy-server"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cdk</category>
      <category>squid</category>
      <category>proxy</category>
    </item>
    <item>
      <title>Simple and Cost-Effective Testing Using Functions</title>
      <dc:creator>Felipe Malaquias</dc:creator>
      <pubDate>Mon, 25 Mar 2024 10:35:57 +0000</pubDate>
      <link>https://dev.to/aws-builders/simple-and-cost-effective-testing-using-functions-42dm</link>
      <guid>https://dev.to/aws-builders/simple-and-cost-effective-testing-using-functions-42dm</guid>
      <description>&lt;h2&gt;
  
  
  Don’t limit yourself to pipeline tests
&lt;/h2&gt;

&lt;p&gt;While tests in your pipeline are a must, you should not assume everything is fine just because your pipeline is green. Here are a couple of reasons why:&lt;/p&gt;

&lt;h3&gt;
  
  
  Complexity in Distributed Systems
&lt;/h3&gt;

&lt;p&gt;You’re not alone. Nowadays, a service is rarely an isolated system with no connections to other services, be it databases, Kafka clusters, other services of your team, other teams’ services, third-party services, you name it.&lt;/p&gt;

&lt;p&gt;Imagine the following system:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpcl4a5nzutl54o9gabpy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpcl4a5nzutl54o9gabpy.png" alt="Simple system with multiple integrations" width="331" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You deploy a shiny new feature for your web login in the web frontend service, test it locally, create unit and integration tests, deploy, and make sure it works in production.&lt;/p&gt;

&lt;p&gt;After some time, you or someone else (of course, always someone else 😛) wants to provide an MFA feature for mobile apps and therefore modifies the account service to provide some additional context to the apps, ending up breaking the login for the web frontend. Let’s say neither the account service nor the mobile app is your team's responsibility. How long would it take for you to notice this feature is broken? Of course, you have metrics and alarms in place, but let’s make it less obvious: instead of breaking the feature completely, the change only breaks it for a small subset of users. Depending on your alarm thresholds, evaluation periods, data points to alarm, and so on, detection may take a very long time (and by very long, I mean 5 minutes or more). Worse, if you don’t have alarms and metrics in place (shame on you), you might only notice during your next build or after a couple of customer complaints.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fail Fast Fix Faster
&lt;/h3&gt;

&lt;p&gt;As mentioned in the previous section, it may take time to detect a failure, and as a consequence, even more time to detect the root cause, as you may not be able to find out so quickly when it started to happen (e.g., short log retention time, missing metrics, etc.). If you are constantly testing your system, you know exactly when something stopped working, making it easier to find the subset of changes during that timeframe that could have led to the issue.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hidden Intermittent Failures
&lt;/h3&gt;

&lt;p&gt;Few things irritate me more than green peas and the act of restarting a failed build and, if it passes, proceeding as if everything were fine and it was just a ‘glitch’.&lt;/p&gt;

&lt;p&gt;There is no such thing as a ‘glitch’ in mathematics and, therefore, computer science. Behind everything, there is a reason, and you should always know the reason so you do not get caught off guard in the near future. If an issue can happen, it will happen. Did you get it? Are you sure?&lt;/p&gt;

&lt;p&gt;I’ve seen teams run buggy software for days, months, and even years without fixing intermittent failures because they seemed like randomness. No one could explain the reason, and the frequency was low enough that no one bothered to check the root cause. At some point, such an issue comes back and bites you: if you don’t know the reason, you might make the same mistake again in another scenario, service, or system, where it leads to a higher impact on your business.&lt;/p&gt;

&lt;p&gt;Chasing the reason for things to happen should be the number one goal of software engineers because only then can we learn and improve.&lt;/p&gt;

&lt;h2&gt;
  
  
  So, What’s My Suggestion?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Continuously Test Your Applications
&lt;/h3&gt;

&lt;p&gt;By continuously, I really mean continuously, and not only during deployments. Test at a one-minute frequency, for example, so you have enough resolution to know when things started to go bad and how frequently an issue occurs. Does it always occur? Every x requests? Only during the quiet night period? All these questions can help you find the root cause faster. Also, make sure the tests trigger an alarm when they fail, and also when they stop running properly.&lt;/p&gt;
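
&lt;p&gt;The idea can be sketched in a few lines (Python; the check names are made up, and scheduling and alerting are omitted):&lt;/p&gt;

```python
import time


def run_smoke_tests(checks, clock=time.time):
    """Run each named check once, returning timestamped pass/fail results.

    Call this on a schedule (e.g. every minute) so a failure can be
    pinned to a narrow time window and its frequency measured.
    """
    results = {}
    for name, check in checks.items():
        started = clock()
        try:
            check()
            results[name] = {"ok": True, "at": started}
        except Exception as exc:
            results[name] = {"ok": False, "at": started, "error": str(exc)}
    return results
```

&lt;p&gt;Scheduling this every minute and alarming on any result with ok=false gives you both the detection latency and the failure frequency.&lt;/p&gt;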

&lt;h2&gt;
  
  
  A Possible Solution with Functions
&lt;/h2&gt;

&lt;p&gt;There are a couple of companies out there that offer continuous testing services, such as &lt;a href="https://www.uptrends.com/" rel="noopener noreferrer"&gt;Uptrends&lt;/a&gt;. However, if you’re looking to run some continuous integration tests, I believe you could have a much more cost-effective, simpler, and more useful solution if you build it on your own using &lt;a href="https://www.postman.com/" rel="noopener noreferrer"&gt;Postman&lt;/a&gt; as a basis.&lt;/p&gt;

&lt;p&gt;Postman is a great tool that has been on the market for a very long time. It is very reliable, has very good features for end users, and has enough flexibility to adapt to your needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  More Useful
&lt;/h3&gt;

&lt;p&gt;I occasionally realize that many developers are not very familiar with their own APIs. By that, I mean they often don’t have a shared collection of API calls ready to run on demand against each stage, for example.&lt;/p&gt;

&lt;p&gt;Postman allows you to share collections of HTTP, GraphQL, gRPC, Websocket, Socket.IO, and MQTT requests and organize them into multiple environments, each with its variables (e.g., hostname, secrets, user names, etc.).&lt;/p&gt;

&lt;p&gt;By sharing these collections with the team, everyone can quickly understand your APIs by calling them whenever needed, at any stage, and then integrate them into their own systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  Simpler
&lt;/h3&gt;

&lt;p&gt;Before implementing the solution mentioned in this article, I encountered integration test suites written in Java. These suites lived in their own Maven projects and contained a lot of verbose, redundant code for performing and verifying HTTP calls. The projects were checked out during the build and executed for each stage, and each execution also needed some Spring Boot bootstrap time, making the pipeline slower.&lt;/p&gt;

&lt;p&gt;With Postman, creating new test cases is much quicker and simpler: in a user-friendly UI, you enter the address, add per-environment variables as needed, add straightforward individual assertions per test case, and run it with the click of a button to verify it. See some examples &lt;a href="https://learning.postman.com/docs/writing-scripts/script-references/test-examples/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F534tvs6o1u90juvmfhxz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F534tvs6o1u90juvmfhxz.png" alt="Example of Postman collection with HTTP request test for 200 status code" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost Effective
&lt;/h3&gt;

&lt;p&gt;You can use Postman for free with some limitations if you like (you can share your collections with up to 3 people), and this would be enough to implement the solution I’ll describe here. However, if you want to share the collections with your team, it’s good to look at their &lt;a href="https://www.postman.com/pricing/" rel="noopener noreferrer"&gt;plans and pricing&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Also, by building your own infrastructure to run them, you may even be able to run these tests almost for free! The idea is to execute the test collections exported from Postman with a collection runner inside functions. Lambda functions are a &lt;a href="https://aws.amazon.com/lambda/pricing/" rel="noopener noreferrer"&gt;very affordable&lt;/a&gt; way of executing code for a short period of time.&lt;/p&gt;
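
&lt;p&gt;As a back-of-the-envelope sketch (the prices below are illustrative figures that vary by region, and the free tier is ignored; check the pricing page for real numbers):&lt;/p&gt;

```python
def monthly_lambda_cost(runs_per_day, avg_seconds, memory_gb,
                        gb_second_price=0.0000166667,
                        per_million_requests=0.20):
    """Rough monthly Lambda cost for a scheduled test runner.

    The default prices are example figures only; look up the
    current ones for your region.
    """
    invocations = runs_per_day * 30
    compute = invocations * avg_seconds * memory_gb * gb_second_price
    requests = invocations / 1000000 * per_million_requests
    return compute + requests


# e.g. one run per minute, roughly 10s per run, at 512 MB:
cost = monthly_lambda_cost(runs_per_day=1440, avg_seconds=10, memory_gb=0.5)
```

&lt;p&gt;With those example figures, a full month of per-minute runs comes out to just a few dollars.&lt;/p&gt;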

&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff5mry2melmqd4f3fx94q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff5mry2melmqd4f3fx94q.png" alt="Continuous Smoke Testing Infrastructure" width="541" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see in the above diagram, &lt;a href="https://aws.amazon.com/eventbridge/" rel="noopener noreferrer"&gt;EventBridge&lt;/a&gt; schedules a &lt;a href="https://aws.amazon.com/pm/lambda/" rel="noopener noreferrer"&gt;lambda function&lt;/a&gt; to be executed periodically. This lambda function retrieves the assets exported from Postman (test collection, environment, and global variables), injects secrets from the secrets manager, executes the tests using the &lt;a href="https://www.npmjs.com/package/newman" rel="noopener noreferrer"&gt;Newman&lt;/a&gt; npm package, and, in case of failures, updates metrics in CloudWatch and stores test results in the S3 bucket. An alarm is triggered if the metrics exceed a threshold (in this case, a count of 1).&lt;/p&gt;
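
&lt;p&gt;The “update metrics” step boils down to building a CloudWatch PutMetricData payload like the following sketch (the namespace, metric, and dimension names here are illustrative, not necessarily the ones used in the repository; the actual call would go through the AWS SDK):&lt;/p&gt;

```python
def failure_metric_payload(test_name, failed_count):
    """Build a CloudWatch PutMetricData payload reporting failed
    assertions for one test collection run. Names are illustrative."""
    return {
        "Namespace": "SmokeTests",
        "MetricData": [{
            "MetricName": "FailedAssertions",
            "Dimensions": [{"Name": "TestName", "Value": test_name}],
            "Value": float(failed_count),
            "Unit": "Count",
        }],
    }
```

&lt;p&gt;An alarm with a threshold of 1 on such a metric then fires on the very first failed run.&lt;/p&gt;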

&lt;p&gt;The complete solution with &lt;a href="https://aws.amazon.com/serverless/sam/" rel="noopener noreferrer"&gt;SAM&lt;/a&gt; is available &lt;a href="https://github.com/malaquf/postman-api-testing-sam" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The infrastructure is defined in the &lt;a href="https://github.com/malaquf/postman-api-testing-sam/blob/main/template.yaml" rel="noopener noreferrer"&gt;template.yaml&lt;/a&gt; file, and the lambda function handler with all testing logic is defined in &lt;a href="https://github.com/malaquf/postman-api-testing-sam/blob/main/src/handlers/api-testing-handler.ts" rel="noopener noreferrer"&gt;api-testing-handler.ts&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This infrastructure can be reused for any Postman testing (HTTP, REST APIs, etc.). An example of an exported Postman collection is available &lt;a href="https://github.com/malaquf/postman-test-example" rel="noopener noreferrer"&gt;here&lt;/a&gt;. Please notice that these files were not created manually but &lt;a href="https://learning.postman.com/docs/getting-started/importing-and-exporting/exporting-data/" rel="noopener noreferrer"&gt;exported&lt;/a&gt; from the UI. All these files must be placed inside the S3 bucket generated by the infrastructure, in the folder defined by the &lt;a href="https://github.com/malaquf/postman-api-testing-sam/blob/aabcd2024ee3524ed76d0c7da05bfeea75838a9a/template.yaml#L9" rel="noopener noreferrer"&gt;TestName&lt;/a&gt; parameter input during the infrastructure deployment (in this case, ‘MyService’ by default).&lt;/p&gt;

&lt;p&gt;Also, notice that the &lt;a href="https://github.com/malaquf/postman-api-testing-sam/blob/aabcd2024ee3524ed76d0c7da05bfeea75838a9a/template.yaml#L12" rel="noopener noreferrer"&gt;SecretId&lt;/a&gt; secret must exist in order for the lambda function to inject any secret needed by the test collection.&lt;/p&gt;

&lt;p&gt;Have fun playing around with it.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>testing</category>
      <category>lambda</category>
      <category>postman</category>
    </item>
    <item>
      <title>A Ride Through Optimising Legacy Spring Boot Services For High Throughput</title>
      <dc:creator>Felipe Malaquias</dc:creator>
      <pubDate>Sun, 24 Mar 2024 08:40:22 +0000</pubDate>
      <link>https://dev.to/aws-builders/a-ride-through-optimising-legacy-spring-boot-services-for-high-throughput-477n</link>
      <guid>https://dev.to/aws-builders/a-ride-through-optimising-legacy-spring-boot-services-for-high-throughput-477n</guid>
      <description>&lt;p&gt;Oops! Did I fix it or screw it up for real?&lt;/p&gt;

&lt;p&gt;Even though we could easily scale this system up vertically and/or horizontally as desired, and the load tested was 20x the expected peak, the rate of failed responses on our load tests before my quest was about 8%.&lt;/p&gt;

&lt;p&gt;8% for me is a lot!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;8% of €1.000.000.000.000,00 is a lot of money for me (maybe not for Elon Musk).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;8% of the world’s population is a good number of people&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;8% of a gold bar — I’d love to own it!&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;8% of Charlie Sheen’s ex-girlfriends, damn… That must be tough to handle!&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why is that? Because of alarms and metrics I had set up along the way, I knew something was off and our throughput was suboptimal, even considering the fairly small number of pods we were running for these services, and even accounting for the outdated libs/technology. And worse… we are only talking about a few hundred requests per second. It should just work, and at a higher scale!&lt;/p&gt;

&lt;p&gt;As you will see at the end of this article, performing such load tests on your services can reveal a lot of real issues hidden in your architecture, code, and/or configuration. The smell that “something is off” here indicated that something was indeed off for regular usage of those services as well. Chasing the root cause of problems is always worth it: never ignore errors by writing them off as a “hiccup”. There’s no such thing as a “hiccup” in software. The least that can happen is that you learn more about the software you wrote, the frameworks you use, and the infrastructure that hosts it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;p&gt;As there are so many variables in software development (pun intended), I think context is important in this case, so we will limit the discussion to optimizations on the following pretty common legacy tech stack (the concepts apply to others as well, yes, including the latest shit):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Spring Boot 2.3 + Thymeleaf&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MongoDB&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Java&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;Architectural changes are not the focus of this article, so I assume some basic understanding of resilient architectures and won’t go into detail beyond a few notes on what you are expected to have in place when talking about highly available systems (but again, not limited to):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The network is protected (e.g., divided into &lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/vpc-example-private-subnets-nat.html"&gt;public and private&lt;/a&gt; or equivalent subnets)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The network is resilient (e.g.: redundant subnets are distributed across different &lt;a href="https://aws.amazon.com/about-aws/global-infrastructure/regions_az/"&gt;availability zones&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use clusters and multiple nodes in distributed locations when applicable (e.g. &lt;a href="https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/Clusters.html"&gt;Redis cache clusters&lt;/a&gt;, &lt;a href="https://docs.aws.amazon.com/documentdb/latest/developerguide/db-cluster-create.html"&gt;db clusters&lt;/a&gt;, service instances deployed in multiple availability zones, and so on)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use &lt;a href="https://aws.amazon.com/elasticloadbalancing/"&gt;load balancers&lt;/a&gt; and distribute load accordingly to your redundant spring boot services.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Have &lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-auto-scaling.html"&gt;autoscaling&lt;/a&gt; in place based on common metrics (e.g.: CPU, memory, latency)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Add cache and edge servers to avoid unnecessary service load when possible (e.g.: &lt;a href="https://aws.amazon.com/cloudfront/"&gt;Cloudfront&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Add a firewall and other mechanisms to protect your endpoints against malicious traffic and bots before they hit your workload and consume those precious worker threads (e.g.: &lt;a href="https://aws.amazon.com/waf/"&gt;WAF&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Health checks are set up.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The minimum number of desired instances/pods is set according to your normal load&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For simplification purposes, I’m reducing the context of this article (and therefore the diagram below) to the Spring Boot fine-tuning part, in a system similar to the following one:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JkG9Gu2z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2A3LVwF57YK51cQAxmMs-PWg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JkG9Gu2z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2A3LVwF57YK51cQAxmMs-PWg.png" alt="Simplified System Diagram" width="641" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  First things first
&lt;/h2&gt;

&lt;p&gt;As mentioned, the progress I’m describing here was only possible due to measurements and monitoring introduced before the changes. How can you improve something if you don’t know where you are and have no idea where you want to go? Set the f*cking monitoring and alarms up before you proceed with implementing that useless feature that won’t work properly anyway if you don’t build it right and monitor it.&lt;/p&gt;

&lt;p&gt;A few indicators you may want to monitor in advance (at least, but not limited to):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SpringBoot Service:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;api response codes (5xx and 4xx)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;latency per endpoint&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;requests per second per endpoint&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;tomcat metrics (servlet errors, connections, current threads)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CPU&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;memory&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;DB Cluster:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;top queries&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;replication latency&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;read/write latency&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;slow queries&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;document locks&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;system locks&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CPU&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;current sessions&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For Spring Boot, this is easily measurable by enabling the management endpoints, collecting the metrics using Prometheus, and shipping them to Grafana or CloudWatch, for example. After the metrics are shipped, set alarms on reasonable thresholds.&lt;/p&gt;
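
&lt;p&gt;For example, with Actuator and the Micrometer Prometheus registry on the classpath, exposing a scrape endpoint is a small piece of configuration (a sketch of application.yml, not the exact setup used here):&lt;/p&gt;

```yaml
# application.yml (sketch): expose health and Prometheus scrape endpoints
management:
  endpoints:
    web:
      exposure:
        include: "health,prometheus"
  endpoint:
    health:
      show-details: never
```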

&lt;p&gt;For the database, it depends on the technology, and you should monitor both the client side (Spring Boot DB metrics) and the server side. Monitoring the client side is important to see whether any proxy or firewall is occasionally blocking your commands. Believe me, these connection drops may happen even if everything seems to work just fine when you test it, and you want to catch them! For example, a misconfiguration of outbound traffic on the DB port of your proxy sidecar may leave the Spring Boot side holding stale connections that were already closed on the server side.&lt;/p&gt;

&lt;h2&gt;
  
  
  Alright, let’s crash it.
&lt;/h2&gt;

&lt;p&gt;It’s time to set up your load test based on the most crucial processes that you know will be under high pressure during peak periods (or maybe that’s already your normal case, and that’s what we should aim for anyway… maximum efficiency with the least amount of resources possible).&lt;/p&gt;

&lt;p&gt;In this case, we chose to use this &lt;a href="https://aws.amazon.com/solutions/implementations/distributed-load-testing-on-aws/"&gt;solution from AWS&lt;/a&gt; for load testing just because the setup is very simple and we already had compatible JMeter scripts ready for use, but I’d rather suggest using &lt;a href="https://docs.locust.io/en/stable/running-cloud-integration.html"&gt;distributed Locust&lt;/a&gt; instead for better reporting and flexibility.&lt;/p&gt;

&lt;p&gt;In our case, we started simulating load with 5 instances and 50 threads each, with a ramp-up period of 5 minutes. This simulates something like 250 clients accessing the system at the same time. That is way above the normal load on those particular services, but we should know the limits of our services… in this case, the limit was pretty low, and we reached it quite fast. Shame on us!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Euvuf75j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2736/1%2AizNGLd5ANEUW83MQPjYm8g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Euvuf75j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2736/1%2AizNGLd5ANEUW83MQPjYm8g.png" alt="" width="800" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IxEvNo1q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/5284/1%2Al6-VV5uXNMgY1znSz3XJbw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IxEvNo1q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/5284/1%2Al6-VV5uXNMgY1znSz3XJbw.png" alt="" width="800" height="248"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SLVcV8Ut--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2020/1%2ATAJMZg_ceDadeoPJCO4XmQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SLVcV8Ut--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2020/1%2ATAJMZg_ceDadeoPJCO4XmQ.png" alt="" width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The metrics above were extracted from our API gateways’ &lt;a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/implementing-logging-monitoring-cloudwatch/cloudwatch-dashboards-visualizations.html"&gt;automatically generated CloudWatch dashboard&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There, you can see a couple of things:&lt;/p&gt;

&lt;p&gt;1- The load test starts around 15:55 and ends around 16:10&lt;br&gt;
2- The load is only applied to one of the services (see “count” metric)&lt;br&gt;
3- The load applied to one upstream service caused latency in three services to increase (the service the load was applied to + 2 downstream services)&lt;br&gt;
4- A high rate of requests failed with 500s on the service we applied the load to&lt;/p&gt;

&lt;p&gt;Therefore, we can conclude that the increased request rate caused bottlenecks in downstream services, which increased latency and probably caused upstream services to time out; those errors were not handled properly, resulting in 500 errors to the client (the load test). As autoscaling was set up, we can also draw another important conclusion: autoscaling did not help, and our services would become unavailable due to the bottlenecks!&lt;/p&gt;

&lt;h2&gt;
  
  
  Connection Pools and HTTP Clients
&lt;/h2&gt;

&lt;p&gt;The first thing I checked was the connection pools and HTTP clients set up so I could get an idea of the maximum parallel connections those services could open, how fast they would start rejecting new connections after all the current connections were busy, and how long they would wait for responses until they started to time out.&lt;/p&gt;

&lt;p&gt;In our case, we were not using &lt;a href="https://docs.spring.io/spring-framework/reference/web/webflux.html"&gt;WebFlux&lt;/a&gt;, so I didn’t want to start refactoring services and deal with breaking changes. I was more interested in first checking what I could optimize with minimal changes, preferably configuration only, and only performing larger changes if really needed. Think about the “Pareto rule”, “choose your battles wisely”, “time is money”, and so on. In this case, we were using the &lt;a href="https://www.baeldung.com/rest-template"&gt;Rest Template&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let’s review what a request flow would look like at a high level:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--g_FWOTb_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2Axd8MvaSQnqD4dLiwOE98DA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--g_FWOTb_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2Axd8MvaSQnqD4dLiwOE98DA.png" alt="" width="634" height="811"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, you can see the incoming request is handled by one of the servlet container’s (tomcat) threads, dispatched by Spring’s DispatcherServlet to the right controller, which calls a service containing some business logic. This service then calls a downstream remote service using an &lt;a href="https://hc.apache.org/httpcomponents-client-4.5.x/index.html"&gt;HTTP client&lt;/a&gt; through a traditional Rest template.&lt;/p&gt;

&lt;p&gt;The downstream service handles the request in a similar manner, but in this case, it interacts with MongoDB, which also uses a connection pool managed by &lt;a href="https://www.mongodb.com/docs/drivers/java-drivers/"&gt;Mongo Java Driver&lt;/a&gt; behind &lt;a href="https://spring.io/projects/spring-data-mongodb"&gt;Spring Data MongoDB&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The first thing I noticed was that the setup was inconsistent, sometimes with missing timeout configurations, using default connection managers, and so on, making it a bit tricky to predict consistent behavior.&lt;/p&gt;

&lt;h3&gt;
  
  
  Timeouts
&lt;/h3&gt;

&lt;p&gt;As a good practice, one should always check timeouts. Why? Not every application is the same; hence, the defaults might not fit your use cases. For example, database drivers tend to have absurdly long default timeouts for queries (such as infinite), because many simple applications just run a background query once in a while and hand the result to some task or job. However, when we are talking about highly scalable, high-throughput systems, one must not wait forever for a query to complete. Otherwise, any issue with a specific collection, query, or index that blocks your DB instances will pile up requests and overload all your systems very quickly.&lt;/p&gt;

&lt;p&gt;Think of it like a very long supermarket line with a very slow cashier, where the line keeps growing indefinitely and you are at the very end of the line. You can either wait forever and maybe get to the front of the line before the shop closes (and other people will keep queueing behind you), or you (and all the others) can decide after 3s to get out of that crowded place and come back later.&lt;/p&gt;

&lt;p&gt;Timeouts are a mechanism to give clients a quick response, avoiding keeping upstream services waiting and blocking them from accepting new requests. &lt;a href="https://resilience4j.readme.io/docs/circuitbreaker"&gt;Circuit breakers&lt;/a&gt;, on the other hand, are safeguards to avoid overloading your downstream services in case of trouble (connection drops, CPU overload, etc.). Circuit breakers are, for example, those waiters or waitresses who send customers back home without a chance to wait for a table when the restaurant is full.&lt;/p&gt;
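
&lt;p&gt;To make the mechanism concrete, here is a toy sketch of a circuit breaker (Python, for illustration only; in a real Spring Boot service you would use a library such as Resilience4j instead):&lt;/p&gt;

```python
import time


class CircuitBreaker:
    """Toy circuit breaker: open after N consecutive failures,
    reject calls while open, allow a trial call after a cooldown."""

    def __init__(self, max_failures=3, cooldown=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True  # closed: calls pass through
        # half-open: allow one trial call once the cooldown elapses
        return self.clock() - self.opened_at > self.cooldown

    def record_success(self):
        self.failures = 0
        self.opened_at = None  # close the circuit again

    def record_failure(self):
        self.failures += 1
        if self.failures > self.max_failures - 1:
            self.opened_at = self.clock()  # trip the circuit
```

&lt;p&gt;Closed while calls succeed, open after repeated failures, and half-open again once the cooldown elapses.&lt;/p&gt;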

&lt;h3&gt;
  
  
  Connection Pool
&lt;/h3&gt;

&lt;p&gt;Remote connections, such as the ones used for communicating with databases or Rest APIs, are expensive resources to create for each request. It requires opening a connection, establishing a &lt;a href="https://www.cloudflare.com/learning/ssl/what-happens-in-a-tls-handshake/"&gt;handshake&lt;/a&gt;, verifying certificates, and so on.&lt;/p&gt;

&lt;p&gt;Connection pools allow us to reuse connections to optimize performance and increase concurrency in our applications by maintaining multiple parallel connections, each in its own thread. Given certain configurations, they also give us the flexibility to queue requests for a certain amount of time if all connections from the pool are busy so they are not immediately rejected, giving our services more chances to serve all requests successfully within a certain period.&lt;/p&gt;

&lt;p&gt;For more background, have a quick read of this article about connection &lt;a href="https://www.baeldung.com/httpclient-connection-management"&gt;pools&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;So that’s more or less how it looks after the changes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@Bean
HttpClient httpClient(PoolingHttpClientConnectionManager connectionManager) {
    return HttpClientBuilder.create()
            .setConnectionManager(connectionManager)
            .build();
}

@Bean
PoolingHttpClientConnectionManager connectionManager() {
    PoolingHttpClientConnectionManager connectionManager = new PoolingHttpClientConnectionManager();
    connectionManager.setMaxTotal(POOL_MAX_TOTAL);
    connectionManager.setDefaultMaxPerRoute(POOL_DEFAULT_MAX_PER_ROUTE);
    return connectionManager;
}

@Bean
ClientHttpRequestFactory clientHttpRequestFactory(HttpClient httpClient) {
    HttpComponentsClientHttpRequestFactory factory = new HttpComponentsClientHttpRequestFactory(httpClient);
    factory.setConnectTimeout(CONNECT_TIMEOUT_IN_MILLISECONDS);
    factory.setReadTimeout(READ_TIMEOUT_IN_MILLISECONDS);
    factory.setConnectionRequestTimeout(CONNECTION_REQUEST_TIMEOUT);
    return factory;
}

@Bean
public RestTemplate restTemplate(ClientHttpRequestFactory clientHttpRequestFactory) {
    RestTemplate restTemplate = new RestTemplateBuilder()
        .requestFactory(() -&amp;gt; clientHttpRequestFactory)
        .build();
    return restTemplate;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The beans above ensure the HTTP client used by the RestTemplate uses a connection manager with a reasonable maximum number of connections per route and in total. If more requests come in than those settings can serve, the connection manager queues them until the connection request timeout is reached. If a queued request still has not obtained a connection when that timeout elapses, it fails. Read more about the different types of HTTP client timeouts &lt;a href="https://docs.spring.io/spring-framework/docs/current/javadoc-api/org/springframework/http/client/HttpComponentsClientHttpRequestFactory.html"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Make sure to adjust the constants according to your needs and server resources. Be aware that each connection occupies a thread, and threads are limited by OS resources, so you can’t simply raise those limits to unreasonable values.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let’s try it again!
&lt;/h2&gt;

&lt;p&gt;So there I was, looking forward to another try after increasing the number of parallel requests handled by the HTTP clients, and hoping to see better overall performance across all services!&lt;/p&gt;

&lt;p&gt;But to my surprise, this happened:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IxEvNo1q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/5284/1%2Al6-VV5uXNMgY1znSz3XJbw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IxEvNo1q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/5284/1%2Al6-VV5uXNMgY1znSz3XJbw.png" alt="" width="800" height="248"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Oc73S0i1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2644/1%2AQBM0rZF4ferFuEMhRa0Ryw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Oc73S0i1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2644/1%2AQBM0rZF4ferFuEMhRa0Ryw.png" alt="" width="800" height="192"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tOB5Jdu5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2A5gsN1d5OTz6xWHyDzAikkg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tOB5Jdu5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2A5gsN1d5OTz6xWHyDzAikkg.png" alt="" width="754" height="322"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So now our average latency has increased almost 8 times, and the number of errors has also increased! How come?!&lt;/p&gt;

&lt;h2&gt;
  
  
  MongoDB
&lt;/h2&gt;

&lt;p&gt;Luckily, I also had monitoring set up for our MongoDB cluster, and there it was easy to spot the culprit: a document locked by several concurrent write attempts. The changes had indeed increased throughput, and now our DB was overloaded with so many parallel writes to the same document, each update spending a huge amount of time waiting for the document to be unlocked before the next query could run.&lt;/p&gt;

&lt;p&gt;You may want to read more about MongoDB concurrency and locks &lt;a href="https://www.mongodb.com/docs/manual/faq/concurrency/#:~:text=see%20consistent%20data.-,What%20type%20of%20locking%20does%20MongoDB%20use%3F,document%2Dlevel%20in%20WiredTiger"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As a consequence, the DB connection pool was busy queueing requests, and because of the synchronous nature of RestTemplate calls waiting for a response, the upstream services also started to fill their thread pools with incoming requests. This increased CPU consumption in the upstream services and caused the higher processing times and failures we observed in the previous graphs!&lt;/p&gt;

&lt;p&gt;Since MongoDB monitoring pointed me to the exact collection containing the locked document, and I had CPU profiling enabled (described in the next section), I could easily find the line of code causing the lock: an unnecessary save() call on the same document on every service execution, updating one single field which, to my surprise, never changed its value.&lt;/p&gt;

&lt;p&gt;Document locks are necessary for correctness under concurrency, but frequent locking is bad news: it can easily start blocking your DB connections, and it usually indicates problems with either your code or your collection design. Always review both whenever you see signs that your documents are being locked.&lt;/p&gt;

&lt;p&gt;After removing the unnecessary save() call, things started looking better — but still not good.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qcDYPHUN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2264/1%2AbjpKEaRK5loIh1P717BueQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qcDYPHUN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2264/1%2AbjpKEaRK5loIh1P717BueQ.png" alt="" width="800" height="225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gZyGI6au--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2ANjpQ6QHaCPlz9L9UMyb1HA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gZyGI6au--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2ANjpQ6QHaCPlz9L9UMyb1HA.png" alt="" width="610" height="314"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In comparison to the initial measurements, latency is higher, though the error rate dropped to almost 1/3 of the initial amount. Also, compared to the first try, the errors seem to accumulate more slowly than before.&lt;/p&gt;

&lt;p&gt;Before proceeding to fix the next bottleneck, let’s review one more thing. OK, we had an issue in the code that caused the locks, but why did we let queries run for so long? Remember what I wrote earlier in the connection pool, HTTP client, and timeout sections: the same applies here. Always review the default values for your connections and timeouts. MongoDB allows you to override the defaults through its &lt;a href="https://www.mongodb.com/docs/drivers/java/sync/current/fundamentals/connection/connection-options/"&gt;connection options&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Connections are created based on both minPoolSize and maxPoolSize: if queries take longer to execute and new queries come in, new connections are opened until maxPoolSize is reached. From there, waitQueueTimeoutMS defines how long a query may wait for a connection before it is rejected. For DB writes, which was our case here, you should also review wtimeoutMS, which, by default, keeps the connection busy until the DB finishes the write. If you set a value different from the default (never time out), you may also want a circuit breaker around the DB to ensure you don’t overload it with additional requests. If your DB cluster contains multiple nodes, distribute the read load by setting readPreference=secondaryPreferred, but be aware of the implications for &lt;a href="https://www.mongodb.com/docs/manual/core/read-isolation-consistency-recency/"&gt;consistency, read isolation, and recency&lt;/a&gt;.&lt;/p&gt;
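&lt;p&gt;For example, most of these options can be set directly on the connection string. The hosts, credentials, and values below are purely illustrative, not recommendations:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mongodb://user:pass@host1,host2/mydb?minPoolSize=5&amp;amp;maxPoolSize=50&amp;amp;waitQueueTimeoutMS=2000&amp;amp;wtimeoutMS=5000&amp;amp;readPreference=secondaryPreferred
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;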

&lt;h2&gt;
  
  
  CPU Profiling
&lt;/h2&gt;

&lt;p&gt;If you are working on performance issues, the first thing you should care about is profiling your application. This can be done locally using your favorite IDE or remotely attaching the profiler agent to your JVM process.&lt;/p&gt;

&lt;p&gt;Application profiling enables you to see which frames of your application consume the most processing time or memory.&lt;/p&gt;

&lt;p&gt;You can read more about Java profilers &lt;a href="https://www.baeldung.com/java-profilers"&gt;here&lt;/a&gt;. I used the &lt;a href="https://docs.aws.amazon.com/codeguru/latest/profiler-ug/setting-up.html"&gt;CodeGuru profiler&lt;/a&gt; from AWS in this case.&lt;/p&gt;

&lt;p&gt;See below an example of an application containing performance issues profiled with CodeGuru.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TXDLQhfh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/6904/1%2AtzCbwLrMCvM_IyoIG2DWaw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TXDLQhfh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/6904/1%2AtzCbwLrMCvM_IyoIG2DWaw.png" alt="" width="800" height="323"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The large frames indicate a large amount of processing time, and the blue color indicates namespaces recognized as your code. On top of that, sometimes you may have some recommendations based on a detected issue. However, don’t expect it to always point you precisely to the issues in your code. Focus on the large frames and use them to detect parts of the code that normally should not consume so much processing time.&lt;/p&gt;

&lt;p&gt;In the example above, one of the main issues seems to be creating SQS clients in the Main class. After fixing it, come back and check what the profiling results look like after some period of time monitoring the new code.&lt;/p&gt;

&lt;p&gt;In our case, the profiler indicated a couple of problematic frames in different applications, which caused bottlenecks and, as a consequence, the 500 errors and long latency in the previous graphics.&lt;/p&gt;

&lt;p&gt;In general, this either indicates low-performant code (e.g., strong encryption algorithms executed repeatedly) or leaks in general (e.g., the creation of a new object mapper in each request). In our case, it pointed to some namespaces, and after analyzing them, I could find opportunities for caching expensive operations, for example.&lt;/p&gt;

&lt;h2&gt;
  
  
  Thymeleaf Cache
&lt;/h2&gt;

&lt;p&gt;This was a funny one. A cache is always supposed to speed up our code execution, as we don’t need to obtain a resource for the source again, right? Right…?&lt;/p&gt;

&lt;p&gt;Yes, if configured properly!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.thymeleaf.org/"&gt;Thymeleaf&lt;/a&gt; serves frontend resources in this service, and it has cache enabled for static resources based on content. Something like the following properties:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spring.resources.chain.enabled=true
spring.resources.chain.strategy.content.enabled=true
spring.resources.chain.strategy.content.paths=/**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;However, there are two issues introduced with these three lines.&lt;/p&gt;

&lt;p&gt;1- Versioning is based on resource content; however, with each request, the content is read from disk over and over again so its hash can be recalculated, because the result of that hash calculation is itself not cached. To solve this, don’t forget to add the following property:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spring.resources.chain.cache=true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;2- Unfortunately, the service does not use a common base path for its static resources, so by default every link, even plain controller paths, was probed as a static resource on disk. Keep in mind that disk operations are, in general, expensive.&lt;/p&gt;

&lt;p&gt;I didn’t want to introduce a breaking change by moving all static resources into a new directory within the resources folder, as that would change their links. Since the static resources already had very well-defined paths, I could simply solve it with &lt;a href="https://docs.spring.io/spring-framework/docs/current/javadoc-api/org/springframework/web/servlet/config/annotation/ResourceHandlerRegistration.html#setOptimizeLocations(boolean)"&gt;setOptimizeLocations()&lt;/a&gt; from ResourceHandlerRegistration.&lt;/p&gt;
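&lt;p&gt;As a sketch, registering such a handler could look like the following; the /static/** pattern and the location are hypothetical and depend on your project layout:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@Configuration
public class StaticResourceConfig implements WebMvcConfigurer {

    @Override
    public void addResourceHandlers(ResourceHandlerRegistry registry) {
        // Serve static assets only under a well-defined pattern, and resolve
        // the concrete locations once at startup instead of probing the disk
        // on every request.
        registry.addResourceHandler("/static/**")
                .addResourceLocations("classpath:/static/")
                .setOptimizeLocations(true);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;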

&lt;h2&gt;
  
  
  Disabling Expensive Debug Logs
&lt;/h2&gt;

&lt;p&gt;Another common mistake is excessive logging, especially logs that print too much too often (e.g., frequent full stack traces). If your systems have high throughput, make sure to set an appropriate log level and log only the necessary information. Review your logs regularly, and once they are clean (no debug/trace information at the wrong level), consider alerting on warnings and errors.&lt;/p&gt;
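&lt;p&gt;In Spring Boot, for instance, log levels can be restricted per logger in application.properties; the package name below is hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Keep third-party noise down, log your own code at a sane level
logging.level.root=warn
logging.level.com.example.myservice=info
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;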

&lt;p&gt;In this specific case, we had one log line printing a full stack trace in common scenarios. It was supposed to be enabled only for a short period of time for debugging purposes and disabled afterward, but it was probably just forgotten, so I disabled it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Auto Scaling Tuning
&lt;/h2&gt;

&lt;p&gt;Auto-scaling settings are easy to get working, but tricky to get working optimally. The most basic approach is to enable auto-scaling based on CPU and memory metrics. However, knowing how many requests per second your services can handle lets you scale out before performance starts to degrade.&lt;/p&gt;

&lt;p&gt;Check possible different metrics you may want to observe for scaling, set reasonable thresholds, fine-tune &lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/userguide/service-configure-auto-scaling.html"&gt;scale-in and scale-out cooldown periods&lt;/a&gt;, define minimum desired instances according to your expected load, and define a maximum number of instances to avoid unexpectedly high costs. Know your infrastructure and your service implementations inside out.&lt;/p&gt;
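&lt;p&gt;As an illustration, a target-tracking policy for an ECS service can be created with the AWS CLI roughly as follows; the cluster and service names, target value, and cooldowns are hypothetical and must be tuned to your own load:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/my-cluster/my-service \
  --policy-name cpu-target-tracking \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{"TargetValue": 60.0, "PredefinedMetricSpecification": {"PredefinedMetricType": "ECSServiceAverageCPUUtilization"}, "ScaleInCooldown": 120, "ScaleOutCooldown": 60}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Note that the service must first be registered as a scalable target via register-scalable-target.&lt;/p&gt;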

&lt;h2&gt;
  
  
  Giving Another Try
&lt;/h2&gt;

&lt;p&gt;Performing the same load test one more time yielded the following results in comparison with the initial results:&lt;/p&gt;

&lt;h3&gt;
  
  
  Count: Sum
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YYLKm0mW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4348/1%2AycZtGj8omrg6U-zTUuF3_Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YYLKm0mW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4348/1%2AycZtGj8omrg6U-zTUuF3_Q.png" alt="" width="800" height="186"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We are now handling more than double the number of requests within the same 15 minutes of testing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Integration Latency: Average
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--58LzhGBn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4820/1%2AcerDGSP-5qS_Nh7I1o7Nuw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--58LzhGBn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4820/1%2AcerDGSP-5qS_Nh7I1o7Nuw.png" alt="" width="800" height="143"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The average integration latency in the service under load was cut by more than half compared to before. Meanwhile, downstream services kept an almost constant latency during the tests, so no more domino effect was observed.&lt;/p&gt;

&lt;h3&gt;
  
  
  5XXError: Sum
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--luEDfQh9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4492/1%2AWcBxPrf_kMX28HEYh855dA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--luEDfQh9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4492/1%2AWcBxPrf_kMX28HEYh855dA.png" alt="" width="800" height="197"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;More importantly, errors were gone. The remaining errors we see on the graphic on the right are unrelated to the load test, as we can see in the following report.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rLoY21W1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3020/1%2AX-VqNAADIbJFmNnkTnb6nA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rLoY21W1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3020/1%2AX-VqNAADIbJFmNnkTnb6nA.png" alt="" width="800" height="694"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, we can see that auto-scaling changes helped us reduce the average response time to the normal state and keep it stable after about 10 minutes of the test.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--T4ln1AzF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/7136/1%2Ai5pZYUu-oudsFVN-1ftEuw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--T4ln1AzF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/7136/1%2Ai5pZYUu-oudsFVN-1ftEuw.png" alt="" width="800" height="252"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Are we done? Of course not.&lt;/p&gt;

&lt;p&gt;These optimizations took me about 24 hours of work in total, and such work should be performed regularly, across multiple systems and in different parts of them. At the scale of a large enterprise, however, it can quickly become very expensive.&lt;/p&gt;

&lt;p&gt;Striking a good balance between leaving things as they are and becoming obsessed with optimizing every millisecond is tricky, and you should keep in mind that such optimizations come with opportunity costs.&lt;/p&gt;

&lt;p&gt;Do not forget that it’s not only about tuning services to perform under high load, but also about making sure they produce consistent and correct results under normal conditions (e.g., heavy I/O dependencies can lead to “random”, unexpectedly long response times if background jobs happen to run on the operating system of your service instance).&lt;/p&gt;

&lt;p&gt;Finally, I often see developers using frameworks and infrastructure without knowing their internals, a behavior that introduces several issues that go unnoticed. Make sure you understand how your systems behave, which bottlenecks they create, which security issues could be exploited, and which settings are available to tune them to your needs.&lt;/p&gt;

&lt;p&gt;I hope this article helps you set the mindset of caring about such aspects of your systems. Good luck!&lt;/p&gt;

</description>
      <category>springboot</category>
      <category>performance</category>
      <category>testing</category>
      <category>profiling</category>
    </item>
    <item>
      <title>DocumentDB Vacuum Locks</title>
      <dc:creator>Felipe Malaquias</dc:creator>
      <pubDate>Sat, 23 Mar 2024 12:25:41 +0000</pubDate>
      <link>https://dev.to/aws-builders/documentdb-vacuum-locks-57fn</link>
      <guid>https://dev.to/aws-builders/documentdb-vacuum-locks-57fn</guid>
      <description>&lt;h2&gt;
  
  
  Beware possible locks on large updates/deletions
&lt;/h2&gt;

&lt;p&gt;Historically, traditional databases used pessimistic locks on records during writes to avoid inconsistency, with the obvious drawback of handling concurrency poorly, as concurrent transactions could fail.&lt;/p&gt;

&lt;p&gt;This is solved by &lt;a href="https://en.wikipedia.org/wiki/Multiversion_concurrency_control"&gt;MVCC&lt;/a&gt; (Multiversion Concurrency Control), by creating a new version of a record on every update, circumventing the need to lock records, and allowing concurrency (see &lt;a href="https://www.youtube.com/watch?v=iM71d2krbS4"&gt;this&lt;/a&gt; video from Cameron McKenzie for a nice and simple illustrated explanation).&lt;/p&gt;

&lt;p&gt;However, to clean up the old versions, a vacuum process must run in the background, which may cause locks in your complete collection, bottlenecks, and possibly unexpected downtimes in your application.&lt;/p&gt;

&lt;p&gt;There is no permanent fix for this at the moment, but if you need to perform such updates, you may contact support and ask them to disable the process that reclaims unused storage space. This will not negatively impact your workload, and space reclaimed by the garbage collector will continue to be recycled. However, the size of your collections will never decrease, even if a significant amount of data has been deleted.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The good news&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;AWS is currently working on a fix for it, which may be available at any time, so you should keep an eye on the &lt;a href="https://docs.aws.amazon.com/documentdb/latest/developerguide/release-notes.html"&gt;DocumentDB release notes&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  This is how we experienced it
&lt;/h2&gt;

&lt;p&gt;On a lovely Tuesday morning, we reset one of our Kafka topics (~71 GB) to re-consume all our data for a particular domain to aggregate it with new fields in our database. All messages were successfully consumed and written in the primary DB instance in a few minutes as expected:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--a2y5x31n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AfiPdi6OEk4nb2aO6tF39Qg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--a2y5x31n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AfiPdi6OEk4nb2aO6tF39Qg.png" alt="" width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What we did not expect, though, were those waves of increased latency in our workload, appearing hours after the records were consumed and initially without much of a pattern, until approx. 5 pm:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KAjUjpEd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2948/1%2ALxNFDKkmQ9hZSSm4E4TnWQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KAjUjpEd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2948/1%2ALxNFDKkmQ9hZSSm4E4TnWQ.png" alt="" width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Those were all caused by locks on a particular collection in the read replicas, as shown by the pink bars in the &lt;a href="https://docs.aws.amazon.com/documentdb/latest/developerguide/performance-insights.html"&gt;DocumentDB performance insights&lt;/a&gt; metrics below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Hk8DrqNE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/6504/1%2ACqNo7oVd6SCRw-KUs5UzCQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Hk8DrqNE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/6504/1%2ACqNo7oVd6SCRw-KUs5UzCQ.png" alt="" width="800" height="136"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, the locks were gone after around 11 pm, matching the end of the changes in the DocumentDB freeable memory metric below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cjK77wPs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AmEzDP4eWYU-zFr3BCvyhKg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cjK77wPs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AmEzDP4eWYU-zFr3BCvyhKg.png" alt="" width="761" height="248"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We reached out to AWS support, who investigated the issue and confirmed it was caused by the reclamation of unused space during the vacuum process’s garbage collection. This process must be synchronized between the writer and the readers, because the readers might still have in-flight transactions that can see the deleted data. In some rare circumstances, and in the presence of a large amount of reclaimable data, this synchronization can adversely affect the workload on the replicas.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Be aware of the MVCC strategy and check how it may affect your database (not only DocumentDB) in case of large updates as described above, and probably most importantly, always test it in staging first ;)&lt;/p&gt;

&lt;p&gt;Also, be aware of the known issue with the DocumentDB vacuum process and watch the &lt;a href="https://docs.aws.amazon.com/documentdb/latest/developerguide/release-notes.html"&gt;release notes&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>documentdb</category>
      <category>mvcc</category>
      <category>aws</category>
      <category>database</category>
    </item>
  </channel>
</rss>
