<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rehan van der Merwe</title>
    <description>The latest articles on DEV Community by Rehan van der Merwe (@rehanvdm).</description>
    <link>https://dev.to/rehanvdm</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F131633%2F4781e00e-abfa-4adb-840f-b55cf7b36d26.png</url>
      <title>DEV Community: Rehan van der Merwe</title>
      <link>https://dev.to/rehanvdm</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rehanvdm"/>
    <language>en</language>
    <item>
      <title>TypeScript Type Safety with AJV Standalone</title>
      <dc:creator>Rehan van der Merwe</dc:creator>
      <pubDate>Sat, 15 Jan 2022 22:00:00 +0000</pubDate>
      <link>https://dev.to/rehanvdm/typescript-type-safety-with-ajv-standalone-4861</link>
      <guid>https://dev.to/rehanvdm/typescript-type-safety-with-ajv-standalone-4861</guid>
      <description>&lt;p&gt;Originally published on my blog: &lt;a href="https://www.rehanvdm.com/blog/typescript-type-safety-with-ajv-standalone" rel="noopener noreferrer"&gt;https://www.rehanvdm.com/blog/typescript-type-safety-with-ajv-standalone&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this blog we start off by exploring what a Type means in TypeScript and how to achieve type safety at runtime. We then briefly compare the current tools available and deep dive into an implementation using the AJV Standalone method.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; TypeScript does a great job at compile time type safety, but &lt;strong&gt;we still need to do runtime checks&lt;/strong&gt;, just like in JavaScript. There are many packages and tools to help with this; we focus on AJV Standalone, which outputs JS validation functions at compile time to be used at runtime. &lt;strong&gt;Going from TS Types to JSON Schema to JS functions&lt;/strong&gt; allows us to validate TS Types directly, whereas the other packages all work with classes and reflection.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The &lt;strong&gt;code&lt;/strong&gt; referenced in this blog can be found in the example project here: &lt;a href="https://github.com/rehanvdm/ajv-standalone-type-saftey" rel="noopener noreferrer"&gt;https://github.com/rehanvdm/ajv-standalone-type-saftey&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  TS in the land of type safety
&lt;/h2&gt;

&lt;p&gt;TypeScript (TS) is a statically typed language that outputs JavaScript (JS) when compiled. The type safety that TS provides at compile time allows for much higher quality projects than plain JS.&lt;/p&gt;

&lt;p&gt;TS produces JS files and corresponding type definition files (&lt;code&gt;.d.ts&lt;/code&gt;) that are used at compile time; many npm packages only expose these two kinds of files and not the actual TS source. The type definition files inform TS how to use the JS files in the absence of the actual TS files.&lt;/p&gt;
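
&lt;p&gt;As a minimal sketch (the &lt;code&gt;add&lt;/code&gt; function is a hypothetical example, not from the example project), a published package typically ships the compiled JS alongside its declaration file:&lt;/p&gt;

```typescript
// math.ts — the TS source that a package author compiles before publishing.
export function add(a: number, b: number): number {
  return a + b;
}

// Running `tsc --declaration` emits two files:
//   math.js    — the runtime code
//   math.d.ts  — only the types:
//     export declare function add(a: number, b: number): number;
// Consumers type-check against math.d.ts while executing math.js.
```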

&lt;p&gt;Something often forgotten is that TS, just like JS, does not provide runtime type safety. At compile time it cannot know that a JSON object being cast to Type A actually has the properties and definitions that fit Type A. This does not fall within the &lt;a href="https://github.com/Microsoft/TypeScript/wiki/TypeScript-Design-Goals" rel="noopener noreferrer"&gt;scope of TS goals&lt;/a&gt;. Runtime safety has to be added just like in any other JS project.&lt;/p&gt;
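
&lt;p&gt;A minimal sketch of this gap, assuming a hypothetical &lt;code&gt;User&lt;/code&gt; type: the cast below compiles, yet nothing verifies the payload at runtime:&lt;/p&gt;

```typescript
// Hypothetical type we expect an API to return.
interface User {
  userId: number;
  name: string;
}

// The payload is missing `name` and `userId` is a string, but `JSON.parse`
// returns `any`, so the cast compiles without complaint.
const payload = '{"userId": "not-a-number"}';
const user = JSON.parse(payload) as User;

// No error is thrown; the mismatch only surfaces later at runtime.
console.log(user.name);          // undefined
console.log(typeof user.userId); // "string", despite the declared `number`
```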

&lt;h2&gt;
  
  
  Ensuring runtime type safety
&lt;/h2&gt;

&lt;p&gt;The following is just a quick rundown of the &lt;strong&gt;current tools available&lt;/strong&gt; to help achieve runtime type safety.&lt;/p&gt;

&lt;h3&gt;
  
  
  1) Classes with decorators
&lt;/h3&gt;

&lt;p&gt;The most popular implementation is &lt;a href="https://www.npmjs.com/package/class-validator" rel="noopener noreferrer"&gt;class-validator&lt;/a&gt; paired with &lt;a href="https://www.npmjs.com/package/class-transformer" rel="noopener noreferrer"&gt;class-transformer&lt;/a&gt;. It uses decorators on class properties to define the rules that are evaluated at runtime.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Post&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Length&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;IsInt&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nx"&gt;rating&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;post&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Post&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Hello&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// should not pass&lt;/span&gt;
&lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rating&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// should not pass&lt;/span&gt;

&lt;span class="nf"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;errors&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// errors is an array of validation errors&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The major downside of this approach is that decorators do not work with interfaces and types, which forces you to represent every type as a class. You can not compose subtypes from other classes, leading to many duplicated, unconnected types.&lt;/p&gt;

&lt;p&gt;Composing types from other types using utility types like &lt;code&gt;Partial&lt;/code&gt;, &lt;code&gt;Omit&lt;/code&gt; and &lt;code&gt;Pick&lt;/code&gt; is only possible with TS Types; this superpower is lost as soon as you force every type to be a class.&lt;/p&gt;
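
&lt;p&gt;As a sketch of that superpower, assuming a hypothetical &lt;code&gt;User&lt;/code&gt; type: the derived types below stay connected to the base type, so a change to &lt;code&gt;User&lt;/code&gt; propagates automatically:&lt;/p&gt;

```typescript
interface User {
  userId: number;
  name: string;
  email: string;
}

// All fields optional, e.g. for a PATCH request body.
type UserUpdate = Partial<User>;

// Same shape without the server-generated id, e.g. for a create request.
type NewUser = Omit<User, "userId">;

// Only the fields that are safe to expose publicly.
type UserSummary = Pick<User, "userId" | "name">;

const update: UserUpdate = { name: "Piet" };                      // other fields optional
const created: NewUser = { name: "Piet", email: "piet@example.com" }; // no userId allowed
const summary: UserSummary = { userId: 1, name: "Piet" };
```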

&lt;h3&gt;
  
  
  2) All types as classes
&lt;/h3&gt;

&lt;p&gt;Then there is a set of packages that also force you to represent every type as a class, but instead of using decorators to define the rules, the properties of the class are other classes/functions. One of these packages is &lt;a href="https://gcanti.github.io/io-ts/" rel="noopener noreferrer"&gt;io-ts&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;t&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;io-ts&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;User&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;type&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;userData&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Person&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;person&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="na"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Piet&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Composing types from other types is possible, but it gets hairy real quickly. Another library, &lt;a href="https://github.com/gcanti/fp-ts" rel="noopener noreferrer"&gt;fp-ts&lt;/a&gt;, is needed to do this composition and, like &lt;code&gt;io-ts&lt;/code&gt;, has its own learning curve. Another downside is that we still can’t use the native TS Types and make use of their composability.&lt;/p&gt;

&lt;p&gt;In many cases this approach is “technically” better and faster than the previous one that uses class decorators and reflection for validation. I say “technically” because it feels like it ignores the native Type in TypeScript and forces you to use a different type system (opinion).&lt;/p&gt;

&lt;h3&gt;
  
  
  3) AJV
&lt;/h3&gt;

&lt;p&gt;Another JSON Schema Validator (AJV) uses &lt;a href="https://json-schema.org/" rel="noopener noreferrer"&gt;JSON Schema&lt;/a&gt; to define the types. JSON Schema is a vocabulary that allows you to annotate and validate JSON documents. Both the data and the JSON Schema are passed to a validator that checks the rules defined in the schema.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;foo&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;integer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;foo&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;additionalProperties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;foo&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;abc&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;valid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ajv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ajv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At first glance this does not seem ideal, since you have to define all your TS Types and their corresponding JSON Schemas. Fortunately there are tools that transform TS Types into JSON Schema for us; this is what we do in the example project.&lt;/p&gt;

&lt;p&gt;One downside is that AJV transforms/compiles JSON Schema to an actual JS function and then caches it for future use. This means that the initial startup time can be significant, especially in short-lived environments like Lambda functions.&lt;/p&gt;

&lt;h3&gt;
  
  
  4) AJV Standalone
&lt;/h3&gt;

&lt;p&gt;This is where AJV Standalone mode really shines: the validation function produced by the AJV compiler is saved as an actual JS file at compile time. This function is then just imported/required at runtime and used as a standard JS function that validates the input.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generating the JS Validation function&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;fs&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;path&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;Ajv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ajv&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;standaloneCode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ajv/dist/standalone&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;$id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://example.com/bar.json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;$schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http://json-schema.org/draft-07/schema#&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;required&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;bar&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// For ESM, the export name needs to be a valid export name, it can not be `export const https://example.com/bar.json = ...;` so we&lt;/span&gt;
&lt;span class="c1"&gt;// need to provide a mapping between a valid name and the $id field. Below will generate&lt;/span&gt;
&lt;span class="c1"&gt;// `export const Bar = ...;`&lt;/span&gt;
&lt;span class="c1"&gt;// This mapping would not have been needed if the `$id` was just `Bar`&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ajv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Ajv&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="na"&gt;schemas&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="na"&gt;code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;esm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;}})&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;moduleCode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;standaloneCode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ajv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Bar&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://example.com/bar.json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Now you can write the module code to file, notice the file is saved as `.mjs` so that it can be imported as a module.&lt;/span&gt;
&lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writeFileSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;__dirname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;../consume/validate-esm.mjs&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nx"&gt;moduleCode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Consuming the JS Validation function&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Bar&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./validate-esm.mjs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;barPass&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;something&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;barFail&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// bar: "something" // &amp;lt;= empty/omitted property that is required&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;validateBar&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;Bar&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nf"&gt;validateBar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;barPass&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ERRORS 1:&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;validateBar&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;//Never reaches this because valid&lt;/span&gt;

&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nf"&gt;validateBar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;barFail&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ERRORS 2:&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;validateBar&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;//Errors array gets logged&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means we end up with multiple TS Types, their corresponding JSON Schema files, and the JS validation function created at compile time. It might seem a bit excessive to achieve runtime type safety, but it is the best option in my opinion. Here is why:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We can use TS Types and &lt;strong&gt;make use of&lt;/strong&gt; all the TS superpowers like &lt;strong&gt;type composition&lt;/strong&gt; (&lt;code&gt;Partial&lt;/code&gt;, &lt;code&gt;Omit&lt;/code&gt;, &lt;code&gt;Pick&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Many other systems and tools can generate/export TS native types from entities like an OpenAPI (aka Swagger) file or a database schema. These &lt;strong&gt;types can then be consumed without changing any of their definitions&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;It is &lt;strong&gt;faster&lt;/strong&gt; than most other implementations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One downside is that an extra compile step is needed to transform the TS Type to JSON Schema to the JS Validation function.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using AJV Standalone to make a TS project type safe
&lt;/h2&gt;

&lt;p&gt;We will use the AJV Standalone method and multiple other tools to create a TS project that is both compile and runtime safe, as well as still providing a good developer experience (DX).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Frehanvdm.com%2Fimages%2Fblog%2Ftypescript-type-safety-with-ajv-standalone%2F01_dir_structure.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Frehanvdm.com%2Fimages%2Fblog%2Ftypescript-type-safety-with-ajv-standalone%2F01_dir_structure.jpg" alt="Project Directory Structure"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Project directory structure of example project&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Generating the JSON Schema and JS validation function at compile time
&lt;/h3&gt;

&lt;p&gt;Pseudo logic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Generate the JSON Schema files from the TS Types&lt;/strong&gt; and save them as individual JSON Schema files using the &lt;a href="https://github.com/YousefED/typescript-json-schema" rel="noopener noreferrer"&gt;typescript-json-schema&lt;/a&gt; package; these can optionally be committed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create the AJV validation code&lt;/strong&gt; in ESM format from the JSON Schema files, save this file.&lt;/li&gt;
&lt;li&gt;Optionally pass the generated code through esbuild to bundle and minify the function and all its dependencies into one file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create the TS Definition file (typings)&lt;/strong&gt; from the JS validation file using TSC so that the file can be imported in TS.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are the basic steps that happen when you run &lt;code&gt;npm run build-types&lt;/code&gt;, which runs the &lt;code&gt;build_types&lt;/code&gt; Gulp task by executing &lt;code&gt;gulp --color --gulpfile gulpfile.js build_types&lt;/code&gt;. Gulp is a build tool that runs JS code, and the defined Gulp task &lt;code&gt;build_types&lt;/code&gt; implements the above pseudo logic by calling the CLI where necessary or calling the JS libraries directly. The same could have been done with a Bash script, but I always favor JS build/automation tools as they are cross-platform.&lt;/p&gt;

&lt;p&gt;More notes (extra detail, might want to skip on first read):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;a href="https://github.com/YousefED/typescript-json-schema" rel="noopener noreferrer"&gt;typescript-json-schema&lt;/a&gt; package can only be pointed to a single directory and can not exclude certain paths. It will go through your whole &lt;code&gt;node_modules&lt;/code&gt; folder if you point it at the project root. That is why I point it at a specific folder that contains all the project types.&lt;/li&gt;
&lt;li&gt;The generated JSON Schema file will be a single JSON Schema object with each type in the &lt;code&gt;definitions&lt;/code&gt; property. Each of these definitions is moved into its own JSON Schema object and saved to file, but only after replacing all the &lt;code&gt;#/definitions/&lt;/code&gt; references within the file so that schemas can correctly reference each other (from within AJV) and also have valid &lt;code&gt;$id&lt;/code&gt; fields, which become the ESM exported names.&lt;/li&gt;
&lt;li&gt;Each of the JSON Schema files are then added to the AJV schemas before compiling the validation function.&lt;/li&gt;
&lt;li&gt;We add &lt;code&gt;ajv-formats&lt;/code&gt; when compiling the JSON Schemas into functions because some types have a &lt;code&gt;Date&lt;/code&gt; property, which is represented in JSON Schema as &lt;code&gt;{ "type": "string", "format": "date-time" }&lt;/code&gt;. Adding these extra AJV formats enables validating that such strings are in ISO 8601 format.&lt;/li&gt;
&lt;/ul&gt;
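
&lt;p&gt;For example, a hypothetical &lt;code&gt;Order&lt;/code&gt; type with a &lt;code&gt;createdAt: Date&lt;/code&gt; property would be represented by a schema along these lines (the exact output of &lt;code&gt;typescript-json-schema&lt;/code&gt; may differ slightly):&lt;/p&gt;

```json
{
  "$id": "Order",
  "type": "object",
  "properties": {
    "orderId": { "type": "number" },
    "createdAt": { "type": "string", "format": "date-time" }
  },
  "required": ["orderId", "createdAt"]
}
```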

&lt;p&gt;Below is the pseudo logic as implemented in the Gulp file; this can be skimmed for now, as the runtime usage is more important than the generation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;typeScriptToJsonSchema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;srcDir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;destDir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;srcDir&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/**/*.ts&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;*&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;

  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;schemas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;* TS TO JSONSCHEMA&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;schemaRaw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;tsj&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createGenerator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;createSchema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;timeEnd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;* TS TO JSONSCHEMA&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="cm"&gt;/* Remove all `#/definitions/` so that we can use the Type name as the $id and have matching $refs with the other Types */&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;schemaRaw&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/#&lt;/span&gt;&lt;span class="se"&gt;\/&lt;/span&gt;&lt;span class="sr"&gt;definitions&lt;/span&gt;&lt;span class="se"&gt;\/&lt;/span&gt;&lt;span class="sr"&gt;/gm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

  &lt;span class="cm"&gt;/* Save each Type jsonschema individually, use the Type name as $id */&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;definition&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nb"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;definitions&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;singleTypeDefinition&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$schema&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http://json-schema.org/draft-07/schema#&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;definition&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="nx"&gt;schemas&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;singleTypeDefinition&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writeFileSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;destDir&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;singleTypeDefinition&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;schemas&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;compileAjvStandalone&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;schemas&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;validationFile&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;* AJV COMPILE&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ajv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Ajv&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="na"&gt;schemas&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;schemas&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;esm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;}});&lt;/span&gt;
  &lt;span class="nf"&gt;addFormats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ajv&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;moduleCode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;standaloneCode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ajv&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;timeEnd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;* AJV COMPILE&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writeFileSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;validationFile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;moduleCode&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;esBuildCommonToEsm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;validationFile&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;* ES BUILD&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;esbuild&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;buildSync&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="c1"&gt;// minify: true,&lt;/span&gt;
    &lt;span class="na"&gt;bundle&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;node14&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="na"&gt;keepNames&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;platform&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;node&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;esm&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;entryPoints&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;validationFile&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="na"&gt;outfile&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;validationFile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;allowOverwrite&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;timeEnd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;* ES BUILD&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;generateTypings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;validationFile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;validationFileFolder&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;* TSC DECLARATIONS&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;execCommand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;tsc&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;-allowJs --declaration --emitDeclarationOnly &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="nx"&gt;validationFile&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; --outDir &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="nx"&gt;validationFileFolder&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;timeEnd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;* TSC DECLARATIONS&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;buildTypes&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;paths&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;types&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;__dirname&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/types&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;typesJsonSchema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;__dirname&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/types/schemas&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;validationFile&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;__dirname&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/types/schemas/validations.js&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;

  &lt;span class="cm"&gt;/* Clear the output dir for the AJV validation code, definition and JSON Schema definitions */&lt;/span&gt;
  &lt;span class="nf"&gt;clearDir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;typesJsonSchema&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="cm"&gt;/* Create the JSON Schema files from the TS Types and save them as individual JSON Schema files */&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;schemas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;typeScriptToJsonSchema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;types&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;typesJsonSchema&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="cm"&gt;/* Create the AJV validation code in ESM format from the JSON Schema files */&lt;/span&gt;
  &lt;span class="nf"&gt;compileAjvStandalone&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;schemas&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;validationFile&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="cm"&gt;/* Bundle the AJV validation code file in ESM format */&lt;/span&gt;
  &lt;span class="nf"&gt;esBuildCommonToEsm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;validationFile&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="cm"&gt;/* Create TypeScript typings for the generated AJV validation code */&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;generateTypings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;validationFile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;typesJsonSchema&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;gulp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;build_types&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;buildTypes&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The console output of the &lt;code&gt;build_types&lt;/code&gt; task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;51&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;57&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="nx"&gt;Starting&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;build_types&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;TS&lt;/span&gt; &lt;span class="nx"&gt;TO&lt;/span&gt; &lt;span class="nx"&gt;JSONSCHEMA&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;2.586&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;
&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;AJV&lt;/span&gt; &lt;span class="nx"&gt;COMPILE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;56.419&lt;/span&gt;&lt;span class="nx"&gt;ms&lt;/span&gt;
&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;ES&lt;/span&gt; &lt;span class="nx"&gt;BUILD&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;76.501&lt;/span&gt;&lt;span class="nx"&gt;ms&lt;/span&gt;
&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;TSC&lt;/span&gt; &lt;span class="nx"&gt;DECLARATIONS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;2.995&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;52&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;03&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="nx"&gt;Finished&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;build_types&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="nx"&gt;after&lt;/span&gt; &lt;span class="mf"&gt;5.72&lt;/span&gt; &lt;span class="nx"&gt;s&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We also have an additional Gulp task that watches all the types in the type directory and recompiles the JSON Schema and validation function as soon as the types change.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;gulp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;watch_types&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;buildTypes&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="nx"&gt;gulp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;watch&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;types/*.ts&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;buildTypes&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="nf"&gt;cb&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Consuming the JS validation function for our types at runtime
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;Post&lt;/code&gt; and &lt;code&gt;NewPost&lt;/code&gt; (not used at the moment) types in &lt;code&gt;/types/Post.ts&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;Post&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;description&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;rating&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;createAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;NewPost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Omit&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Post&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;createAt&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;Blog&lt;/code&gt; type in &lt;code&gt;/types/Blog.ts&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Post&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;./Post&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;Blog&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;site&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;about&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="c1"&gt;// twitter: string; //Test watch command&lt;/span&gt;
  &lt;span class="nl"&gt;posts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Post&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A basic example of using the validation functions generated in &lt;code&gt;/types/schemas/validations.js&lt;/code&gt; by the Gulp task. Here we create a blog with one valid post and then push another, invalid post that is missing the &lt;code&gt;createAt&lt;/code&gt; property. This is possible because we cast the object to the &lt;code&gt;Post&lt;/code&gt; type; the TypeScript compiler does not complain since the cast is valid at compile time, but validation will fail at runtime.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Post&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;./types/Post&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Blog&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;./types/Blog&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;DateTime&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;luxon&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;ValidateFunction&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ajv&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;validations&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./types/schemas/validations&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;blog&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Blog&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;site&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;rehanvdm.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;rehan.nope@gmail.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;about&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;My blog, the one I never have time to write for but do it anyway.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;posts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Valid Post&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;rating&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;createAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;DateTime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toJSDate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;postInValid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Invalid Post! Missing createAt, forcing by casting&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;rating&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;Post&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nx"&gt;blog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;posts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;postInValid&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;validateBlog&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;validations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Blog&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;ValidateFunction&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Blog&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;validateBlog&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Validate not defined, schema not found&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="cm"&gt;/* Casting to and from JSON forces the object to be represented in its primitive types.
* The Date object for example will be forced to a ISO 8601 representation which is what we want */&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nf"&gt;validateBlog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;blog&lt;/span&gt;&lt;span class="p"&gt;))))&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;validateBlog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;validateBlog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Blog not valid&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The magic happens here:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;validateBlog&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;validations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Blog&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;ValidateFunction&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Blog&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we cast the function that returns a boolean as a &lt;code&gt;ValidateFunction&amp;lt;Blog&amp;gt;&lt;/code&gt;. The function still returns a boolean, but AJV also attaches an extra property, an &lt;code&gt;errors&lt;/code&gt; array, to the function object when validation fails.&lt;/p&gt;

&lt;p&gt;With a bit of abstraction we can declare runtime-safe types in a more intuitive way, as in the following example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Post&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;./types/Post&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Blog&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;./types/Blog&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;DateTime&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;luxon&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;ValidateFunction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ErrorObject&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ajv&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;validations&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./types/schemas/validations&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TypeError&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="nx"&gt;ajvErrors&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ErrorObject&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
  &lt;span class="nf"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ajvErrors&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ErrorObject&lt;/span&gt;&lt;span class="p"&gt;[])&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ajvErrors&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;TypeError&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ajvErrors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ajvErrors&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;ensureType&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;validationFunc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;instancePath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;parentData&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;parentDataProperty&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;rootData&lt;/span&gt; &lt;span class="p"&gt;}?:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;instancePath&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;parentData&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;parentDataProperty&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;rootData&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;T&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;validate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;validationFunc&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;ValidateFunction&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Validate not defined, schema not found&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="cm"&gt;/* Casting to and from JSON forces the object to be represented in its primitive types.
   * The Date object for example will be forced to a ISO 8601 representation which is what we want */&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;isValid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)));&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;isValid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;TypeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;blog&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Blog&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;site&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;rehanvdm.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;rehan.nope@gmail.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;about&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;My blog, the one I never have time to write for but do it anyway.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;posts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Valid Post&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;rating&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;createAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;DateTime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toJSDate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;postInValid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Invalid Post! Missing createAt, forcing by casting&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;rating&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;Post&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nx"&gt;blog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;posts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;postInValid&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;try&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="cm"&gt;/* Passes */&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="na"&gt;another&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Post&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ensureType&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Post&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;validations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Post&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Quick way to ensure type is valid&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Just initiate differently like this&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;createAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;DateTime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toJSDate&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;rating&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="cm"&gt;/* Fails, similar to the basic test */&lt;/span&gt;
  &lt;span class="nx"&gt;ensureType&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Blog&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;validations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Blog&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;blog&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt; &lt;span class="k"&gt;instanceof&lt;/span&gt; &lt;span class="nx"&gt;TypeError&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;TYPE ERROR WITH STACK:&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ajvErrors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt; &lt;span class="k"&gt;instanceof&lt;/span&gt; &lt;span class="nb"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ERROR:&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;else&lt;/span&gt;
    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We created a new error class, &lt;code&gt;TypeError&lt;/code&gt;, that is used by the &lt;code&gt;ensureType&lt;/code&gt; function. This error attaches the AJV errors array to the JS error when it is thrown, and we also get a full stack trace so we can easily identify the type that failed at runtime.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;ensureType&lt;/code&gt; function takes the validation function generated by AJV and the unvalidated object as arguments. It uses generics to return the object when it is valid, or to throw the &lt;code&gt;TypeError&lt;/code&gt; when validation fails. This means your syntax changes from:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;another&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Post&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Quick way to ensure type is valid&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Just initiate differently like this&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;createAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;DateTime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toJSDate&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;rating&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;another&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Post&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ensureType&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Post&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;validations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Post&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Quick way to ensure type is valid&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Just initiate differently like this&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;createAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;DateTime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toJSDate&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;rating&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a very small change for the benefit of validation and runtime type safety. Another benefit is that you don’t need to change any of your already defined types, you just need to wrap the initialization of those types.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion (TL;DR)
&lt;/h2&gt;

&lt;p&gt;TypeScript does a great job at compile time type safety, but &lt;strong&gt;we still need to do runtime checks&lt;/strong&gt; just like in JavaScript. There are many packages and tools to help with this, we focused on AJV Standalone that outputs JS validation functions at compile time to be used at runtime. &lt;strong&gt;Going from TS Types to JSON Schema to JS functions&lt;/strong&gt; allows us to validate TS Types where the other packages all work with classes and reflection.&lt;/p&gt;

&lt;p&gt;Representing all your types as classes is not the solution; you lose a lot of TS superpowers (like type composability) as soon as you do. It is crucial to do runtime checks against your inputs and outputs, the boundaries of the system. Doing internal runtime checks is optional but always welcome.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;code&lt;/strong&gt; referenced in this blog can be found in the example project here: &lt;a href="https://github.com/rehanvdm/ajv-standalone-type-saftey" rel="noopener noreferrer"&gt;https://github.com/rehanvdm/ajv-standalone-type-saftey&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  PS - The rabbit hole
&lt;/h2&gt;

&lt;p&gt;This blog took quite a while to write (idea + code + blog), and the code did not initially do what I wanted it to. The AJV standalone functionality only exported the code as CJS (module.exports/require) and I needed it to be ESM (export/import) to correctly generate TS typings and to play nice with esbuild and TS. I made my first noteworthy open source PR to fix these issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/ajv-validator/ajv/issues/1523" rel="noopener noreferrer"&gt;Option to generate ESM for standalone validators - https://github.com/ajv-validator/ajv/issues/1523&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/ajv-validator/ajv/issues/1831" rel="noopener noreferrer"&gt;Concrete example of using standalone at runtime - https://github.com/ajv-validator/ajv/issues/1831&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These two PRs did the trick; the second one is the documentation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/ajv-validator/ajv/pull/1861" rel="noopener noreferrer"&gt;Add option to generate ESM exports instead of CJS - https://github.com/ajv-validator/ajv/pull/1861&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/ajv-validator/ajv/issues/1831" rel="noopener noreferrer"&gt;Concrete example of using standalone at runtime - https://github.com/ajv-validator/ajv/issues/1831&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It was fun contributing to the AJV project; it has great test coverage, guidelines and conventions.&lt;/p&gt;

</description>
      <category>typescript</category>
      <category>architecture</category>
      <category>design</category>
      <category>validation</category>
    </item>
    <item>
      <title>LRU cache fallback strategy</title>
      <dc:creator>Rehan van der Merwe</dc:creator>
      <pubDate>Thu, 13 Jan 2022 22:00:00 +0000</pubDate>
      <link>https://dev.to/rehanvdm/lru-cache-fallback-strategy-3gc8</link>
      <guid>https://dev.to/rehanvdm/lru-cache-fallback-strategy-3gc8</guid>
      <description>&lt;p&gt;Originally published on my blog: &lt;a href="https://www.rehanvdm.com/blog/lru-cache-fallback-strategy"&gt;https://www.rehanvdm.com/blog/lru-cache-fallback-strategy&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A &lt;a href="https://en.wikipedia.org/wiki/Cache_replacement_policies#Least_recently_used_(LRU)"&gt;Least Recently Used (LRU)&lt;/a&gt; cache stores items in-memory and evicts the oldest (least recently used) ones as soon as the allocated memory (or item count) limit has been reached. Storing data in-memory before reaching for an external cache increases speed and decreases the dependency on the external cache. It also makes it possible to fall back to an in-memory cache, like an LRU cache, during periods when your external cache goes down, without seeing a significant impact on performance.&lt;/p&gt;
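&lt;p&gt;The eviction behaviour can be sketched in a few lines of TypeScript using the insertion order of a &lt;code&gt;Map&lt;/code&gt;; this is a minimal illustration, not the implementation used in the example repository:&lt;/p&gt;

```typescript
// Minimal LRU cache sketch. A Map preserves insertion order, so the first
// key is always the least recently used one once we re-insert on access.
class LruCache {
  private max: number;
  private map = new Map();

  constructor(max: number) {
    this.max = max;
  }

  get(key: string) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    // Re-insert so the key moves to the back (most recently used)
    this.map.delete(key);
    this.map.set(key, value);
    return value;
  }

  set(key: string, value: unknown) {
    if (this.map.has(key)) this.map.delete(key);
    else if (this.map.size >= this.max) {
      // Evict the least recently used entry (first key in insertion order)
      this.map.delete(this.map.keys().next().value);
    }
    this.map.set(key, value);
  }
}
```

&lt;p&gt;Reading an entry marks it as recently used, so frequently accessed items survive eviction while cold items fall out first.&lt;/p&gt;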

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Conclusion (TL;DR)&lt;/strong&gt; By leveraging an in-memory cache you reduce pressure on the downstream caches, reduce costs and speed up systems. Consider having an in-memory first policy, instead of having an in-memory fallback cache strategy to see the biggest gains.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Experiment
&lt;/h2&gt;

&lt;p&gt;I took some time to write a small little program to simulate and compare the following scenarios:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;No caching, fetching data directly from the DB.&lt;/li&gt;
&lt;li&gt;Caching the DB data in Redis.&lt;/li&gt;
&lt;li&gt;Caching the DB data in Redis first (which fails) and falling back to an in-memory LRU cache.&lt;/li&gt;
&lt;li&gt;Caching the DB data in an LRU cache first, then reaching for Redis (which fails) if it is not in the LRU cache.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;strong&gt;code can be found here: &lt;a href="https://github.com/rehanvdm/lru-cache-fallback-strategy"&gt;https://github.com/rehanvdm/lru-cache-fallback-strategy&lt;/a&gt;&lt;/strong&gt;. We won’t go into too many details about the code in this blog, see the README file for more info on the code.&lt;/p&gt;

&lt;p&gt;In a nutshell, &lt;strong&gt;we are simulating&lt;/strong&gt; an API call, where we get the user ID from the JWT and then need &lt;strong&gt;to look that user up in the DB to get their allowed permissions.&lt;/strong&gt; I mocked all the DB (200ms) and Redis (20ms) network calls, with the help of environment variables used as hyperparameters.&lt;/p&gt;

&lt;p&gt;The program runs 100 executions, and in the scenarios where Redis fails (3, 4) it fails on the 50th execution. A user ID is selected by a seeded random number generator from a sample space of 10 users, ensuring a high cache hit ratio. The sample space is kept small so that all data is fetched before Redis goes down at the halfway mark.&lt;/p&gt;

&lt;p&gt;When Redis goes down at the halfway mark, we implement a circuit breaker pattern, returning false from the lazy-loaded &lt;code&gt;connect()&lt;/code&gt; function that is called before every Redis command. Only after 10 seconds have elapsed do we attempt a reconnection, and if it fails, we store the time of the attempt for the circuit breaker. This way we don't waste time trying to connect on every request and also reduce pressure on the cache.&lt;/p&gt;
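&lt;p&gt;The circuit breaker described above can be sketched as follows; the names (&lt;code&gt;connectWithBreaker&lt;/code&gt;, &lt;code&gt;redisClient&lt;/code&gt;) are illustrative, and the failing client stub only simulates an outage. This is not the actual code from the repository:&lt;/p&gt;

```typescript
// Circuit breaker around a lazy Redis connect. After a failed attempt we
// skip reconnecting for RETRY_AFTER_MS so requests fail fast to the fallback.
const RETRY_AFTER_MS = 10000;
let connected = false;
let lastFailedAttempt = 0;

// Stand-in for a real Redis client; connect() rejects to simulate an outage.
const redisClient = {
  async connect() {
    throw new Error("Redis down");
  },
};

async function connectWithBreaker() {
  if (connected) return true;

  // Breaker open: skip the attempt until the 10 s cool-down has elapsed
  if (RETRY_AFTER_MS > Date.now() - lastFailedAttempt) return false;

  try {
    await redisClient.connect();
    connected = true;
  } catch {
    lastFailedAttempt = Date.now(); // record failure time for the breaker
    connected = false;
  }
  return connected;
}
```

&lt;p&gt;Every Redis command would first await &lt;code&gt;connectWithBreaker()&lt;/code&gt; and, on false, fall through to the in-memory cache.&lt;/p&gt;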

&lt;p&gt;Below is the &lt;strong&gt;pseudocode for scenario 3&lt;/strong&gt;, where Redis is used first with a fallback to an LRU cache. Not visible here are the environment variables that toggle Redis availability and many other options. The important parts I want to highlight are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Redis is queried first; if the user is not found it returns &lt;code&gt;undefined&lt;/code&gt;. If Redis is down, we instead look the value up in the in-memory LRU cache, which also returns undefined when the value is not found.&lt;/li&gt;
&lt;li&gt;If the user is still undefined, we fetch the user from the DB. The user is then stored in the Redis cache, but if Redis is down, we store it in the LRU cache instead.&lt;/li&gt;
&lt;/ul&gt;
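&lt;p&gt;The two bullet points above can be sketched roughly as follows; plain &lt;code&gt;Map&lt;/code&gt; objects stand in for the DB, Redis and the LRU cache, and all names are hypothetical rather than taken from the repository:&lt;/p&gt;

```typescript
// Scenario 3 sketch: Redis first, in-memory LRU only as a fallback.
// Maps are stand-ins for the mocked DB, Redis and LRU cache in the article.
const db = new Map([["u1", { id: "u1", permissions: ["read"] }]]);
const redisStore = new Map();
const lru = new Map();   // plain Map stands in for the LRU cache here
let redisUp = true;      // toggled to simulate the Redis outage

async function getUser(userId: string) {
  // 1. Redis first; fall back to the in-memory cache only when Redis is down
  let user = redisUp ? redisStore.get(userId) : lru.get(userId);

  // 2. Still undefined: fetch from the DB, then cache in Redis or the LRU
  if (user === undefined) {
    user = db.get(userId);             // mocked ~200 ms DB call in the article
    if (redisUp) redisStore.set(userId, user);
    else lru.set(userId, user);
  }
  return user;
}
```

&lt;p&gt;Note that while Redis is up, the LRU cache is never consulted or populated, which is exactly why this variant gains nothing on the happy path.&lt;/p&gt;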

&lt;p&gt;The &lt;a href="https://github.com/rehanvdm/lru-cache-fallback-strategy/blob/master/app/index.mjs"&gt;/app/index.mjs&lt;/a&gt; file contains the logic used for scenario 3. It is also used for scenarios 1 and 2 by setting environment variables that either disable all caching or ensure Redis does not go down.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MoY8KUWe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://rehanvdm.com/images/blog/lru-cache-fallback-strategy/01_redis_order.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MoY8KUWe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://rehanvdm.com/images/blog/lru-cache-fallback-strategy/01_redis_order.png" alt="Caching order" width="766" height="1077"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Scenario 3, accessing strategies with Redis first, the happy path with Redis up&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here the LRU is only used as a fallback for Redis, but we can get better performance if we leverage the LRU cache and put it first as in scenario 4.&lt;/p&gt;

&lt;p&gt;In the &lt;strong&gt;pseudocode for scenario 4&lt;/strong&gt;, the LRU cache is used first with a fallback to Redis. The logic is as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The LRU is queried first; if the user is not found, we query Redis.&lt;/li&gt;
&lt;li&gt;If we still don't have a user, either Redis does not have it or Redis is down, so we look it up in the DB and then store the user in both Redis and the in-memory LRU cache.&lt;/li&gt;
&lt;/ul&gt;
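&lt;p&gt;Sketched in the same style as the scenario 3 example, again with plain &lt;code&gt;Map&lt;/code&gt; objects and hypothetical names standing in for the repository code:&lt;/p&gt;

```typescript
// Scenario 4 sketch: in-memory LRU first, then Redis, then the DB.
// On a DB fetch we write back to BOTH cache layers.
const db = new Map([["u1", { id: "u1", permissions: ["read"] }]]);
const redisStore = new Map();
const lru = new Map();   // plain Map stands in for the LRU cache

async function getUser(userId: string) {
  // 1. In-memory LRU first, the fastest layer
  let user = lru.get(userId);

  // 2. Then Redis (returns undefined when not cached or when Redis is down)
  if (user === undefined) user = redisStore.get(userId);

  // 3. Finally the DB; populate both cache layers for subsequent lookups
  if (user === undefined) {
    user = db.get(userId);
    redisStore.set(userId, user);
    lru.set(userId, user);
  }
  return user;
}
```

&lt;p&gt;Because the LRU is checked first, warm lookups never leave the process, and a Redis outage only affects entries that are not yet in memory.&lt;/p&gt;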

&lt;p&gt;The &lt;a href="https://github.com/rehanvdm/lru-cache-fallback-strategy/blob/master/app/index-lru-first.mjs"&gt;app/index-lru-first.mjs&lt;/a&gt; file contains the logic used for scenario 4.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--e_z0xRb3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://rehanvdm.com/images/blog/lru-cache-fallback-strategy/02_lru_order.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--e_z0xRb3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://rehanvdm.com/images/blog/lru-cache-fallback-strategy/02_lru_order.png" alt="Caching order" width="766" height="1077"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Scenario 4, accessing strategies with LRU first, the happy path with Redis up&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With this method, we prefer the in-memory cache and only fallback to the slower methods/layers when we don’t have the value. When we get the value, we cache it on all layers so that subsequent executions and other processes can get it without needing to query the source.&lt;/p&gt;

&lt;p&gt;It should come as no surprise that this is the fastest method. It can really improve the performance of a high-throughput system and reduce both the dependency on and the back pressure applied to downstream systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;I used &lt;a href="https://www.npmjs.com/package/asciichart"&gt;asciichart&lt;/a&gt; to draw charts with &lt;code&gt;console.log(...)&lt;/code&gt; and then piped the results to files.&lt;/p&gt;

&lt;p&gt;Here are the results of each scenario.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario 1 - DB only, ~200ms avg
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; APP VARIABLES
REDIS_GET 0
REDIS_SET 0
DB_FETCHES 100
LRU_SET 0
LRU_GET 0

&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; TEST OUTPUTS
TEST_RUN_TIME &lt;span class="o"&gt;(&lt;/span&gt;s&lt;span class="o"&gt;)&lt;/span&gt; 21
APP_EXECUTIONS_COUNT 100
APP_EXECUTION_TIME &lt;span class="o"&gt;(&lt;/span&gt;ms&lt;span class="o"&gt;)&lt;/span&gt; 214
AVERAGE&lt;span class="o"&gt;(&lt;/span&gt;APP_EXECUTION_TIME&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;ms&lt;span class="o"&gt;)&lt;/span&gt; 212

&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; CHART
     405.00 ┤ ╭╮ ╭╮  
     384.50 ┼╮ ││ ││  
     364.00 ┤│ ││ ││  
     343.50 ┤│ ││ ││  
     323.00 ┤│ ││ ││  
     302.50 ┤│ ││ ││  
     282.00 ┤│ ││ ││  
     261.50 ┤│ ││ ││  
     241.00 ┤│ ││ ││  
     220.50 ┤│ ╭╮╭╮ ││ ╭╮ ╭╮ ╭╮ ││  
     200.00 ┤╰───────────────────────────────╯╰╯╰───────────╯╰────────╯╰──────────────╯╰─────╯╰──────────────╯╰─ 


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Partial log file from db-only.log generated by running &lt;code&gt;npm run app-save-output-db-only&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We see that all 100 queries were made against the DB (&lt;code&gt;DB_FETCHES 100&lt;/code&gt;). The spikes occur when Redis reconnects, which it arguably shouldn’t be doing in this scenario; I just didn’t want to add another environment variable. We can see from &lt;code&gt;AVERAGE(APP_EXECUTION_TIME) (ms) 212&lt;/code&gt; that the average app execution time over the 100 iterations was 212ms, but it is actually closer to 200ms if we ignore the spikes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario 2 - DB =&amp;gt; Redis (no failure), 55ms avg
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; APP VARIABLES
REDIS_GET 100
REDIS_SET 10
DB_FETCHES 10
LRU_SET 0
LRU_GET 0

&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; TEST OUTPUTS
TEST_RUN_TIME &lt;span class="o"&gt;(&lt;/span&gt;s&lt;span class="o"&gt;)&lt;/span&gt; 6
APP_EXECUTIONS_COUNT 100
APP_EXECUTION_TIME &lt;span class="o"&gt;(&lt;/span&gt;ms&lt;span class="o"&gt;)&lt;/span&gt; 31
AVERAGE&lt;span class="o"&gt;(&lt;/span&gt;APP_EXECUTION_TIME&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;ms&lt;span class="o"&gt;)&lt;/span&gt; 55

&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; CHART
     318.00 ┼╮                                                                                                   
     289.10 ┤│                                                                                                   
     260.20 ┤╰───╮ ╭╮╭╮ ╭╮╭╮ ╭╮                                                                    
     231.30 ┤ │ ││││ ││││ ││                                                                    
     202.40 ┤ │ ││││ ││││ ││                                                                    
     173.50 ┤ │ ││││ ││││ ││                                                                    
     144.60 ┤ │ ││││ ││││ ││                                                                    
     115.70 ┤ │ ││││ ││││ ││                                                                    
      86.80 ┤ │ ││││ ││││ ││                                                                    
      57.90 ┤ │ ││││ ││││ ││                                                                    
      29.00 ┤ ╰─╯╰╯╰─╯╰╯╰───────────────╯╰─────────────────────────────────────────────────────────────────── 


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Partial log file from db-with-redis.log generated by running &lt;code&gt;npm run app-save-output-db-with-redis&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this scenario Redis does not fail. We can see that we did 10 DB fetches, 10 Redis SET commands and 100 Redis GET commands.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario 3 - DB =&amp;gt; Redis (failure) =&amp;gt; LRU, 62ms avg
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; APP VARIABLES
REDIS_GET 50
REDIS_SET 10
DB_FETCHES 20
LRU_SET 10
LRU_GET 50

&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; TEST OUTPUTS
TEST_RUN_TIME &lt;span class="o"&gt;(&lt;/span&gt;s&lt;span class="o"&gt;)&lt;/span&gt; 6
APP_EXECUTIONS_COUNT 100
APP_EXECUTION_TIME &lt;span class="o"&gt;(&lt;/span&gt;ms&lt;span class="o"&gt;)&lt;/span&gt; 0
AVERAGE&lt;span class="o"&gt;(&lt;/span&gt;APP_EXECUTION_TIME&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;ms&lt;span class="o"&gt;)&lt;/span&gt; 62

&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; CHART
     425.00 ┼ ╭╮                                                 
     382.50 ┤ ││                                                 
     340.00 ┼╮ ││                                                 
     297.50 ┤│ ││                                                 
     255.00 ┤╰───╮ ╭╮╭╮ ╭╮╭╮ ╭╮ ││                                                 
     212.50 ┤ │ ││││ ││││ ││ │╰───╮╭╮╭─╮ ╭╮ ╭╮       
     170.00 ┤ │ ││││ ││││ ││ │ ││││ │ ││ ││       
     127.50 ┤ │ ││││ ││││ ││ │ ││││ │ ││ ││       
      85.00 ┤ │ ││││ ││││ ││ │ ││││ │ ││ ││       
      42.50 ┤ ╰─╯╰╯╰─╯╰╯╰───────────────╯╰─────────────────╯ ││││ │ ││ ││       
       0.00 ┤ ╰╯╰╯ ╰─────────────────────╯╰────────╯╰────── 


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Partial log file from db-with-redis-fail-halfway-with-fallback-lru.log generated by running &lt;code&gt;npm run app-save-output-db-with-redis-fail-halfway-with-fallback-lru&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here we can see that the DB data is fetched, stored in the cache and then read from it until the halfway mark, just like before. Then Redis fails, which forces the app to fall back to the DB and cache the response in the in-memory LRU cache. By comparing the baseline for when data was fetched from Redis in the first half with the baseline for when data was fetched from the LRU cache, we observe that &lt;strong&gt;the LRU is much faster.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Why not store data in the LRU first then? That is exactly what we do in scenario 4.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario 4 - DB =&amp;gt; LRU =&amp;gt; Redis (failure), 27ms avg
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; APP VARIABLES
REDIS_GET 10
REDIS_SET 10
DB_FETCHES 10
LRU_SET 10
LRU_GET 100

&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; TEST OUTPUTS
TEST_RUN_TIME &lt;span class="o"&gt;(&lt;/span&gt;s&lt;span class="o"&gt;)&lt;/span&gt; 3
APP_EXECUTIONS_COUNT 100
APP_EXECUTION_TIME &lt;span class="o"&gt;(&lt;/span&gt;ms&lt;span class="o"&gt;)&lt;/span&gt; 0
AVERAGE&lt;span class="o"&gt;(&lt;/span&gt;APP_EXECUTION_TIME&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;ms&lt;span class="o"&gt;)&lt;/span&gt; 27

&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; CHART
     326.00 ┼╮                                                                                                   
     293.40 ┤│                                                                                                   
     260.80 ┤╰───╮ ╭╮╭╮ ╭╮╭╮ ╭╮                                                                    
     228.20 ┤ │ ││││ ││││ ││                                                                    
     195.60 ┤ │ ││││ ││││ ││                                                                    
     163.00 ┤ │ ││││ ││││ ││                                                                    
     130.40 ┤ │ ││││ ││││ ││                                                                    
      97.80 ┤ │ ││││ ││││ ││                                                                    
      65.20 ┤ │ ││││ ││││ ││                                                                    
      32.60 ┤ │ ││││ ││││ ││                                                                    
       0.00 ┤ ╰─╯╰╯╰─╯╰╯╰───────────────╯╰─────────────────────────────────────────────────────────────────── 


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Partial log file from db-with-lru-fallback-redis-which-fails-halfway.log generated by running &lt;code&gt;npm run app-save-output-db-with-lru-fallback-redis-which-fails-halfway&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now when Redis fails, we don’t even notice it because all the users have already been cached in the LRU. This method reduces the dependency on Redis and made our demo application &lt;strong&gt;twice as fast&lt;/strong&gt;, going from ~60ms to ~30ms average execution time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Not all data can be stored indefinitely
&lt;/h2&gt;

&lt;p&gt;In our example, we assumed that the user data does not change and can be cached forever. That is a luxury most applications don’t have. Most of the time, you would store data in Redis with an expiration, using commands like &lt;code&gt;SET key data&lt;/code&gt; followed by &lt;code&gt;EXPIRE key 60&lt;/code&gt;, or just &lt;code&gt;SETEX key 60 data&lt;/code&gt;. The data is then removed from the cache after 60 seconds.&lt;/p&gt;

&lt;p&gt;Caching information even for short periods can have a huge impact on high-throughput systems. The &lt;a href="https://github.com/isaacs/node-lru-cache"&gt;LRU library&lt;/a&gt; used in the examples also supports setting data with an expiration time using &lt;code&gt;set(key, value, maxAge)&lt;/code&gt;. This means it is possible to set the value both in-memory and in Redis with the same TTL (Time To Live).&lt;/p&gt;

&lt;p&gt;One thing to watch out for: when you get the data from Redis, it has already been in the cache for a certain period of time. &lt;strong&gt;So you cannot store the data in the LRU with the same TTL as when you stored it in Redis.&lt;/strong&gt; You can either:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Use the &lt;code&gt;TTL&lt;/code&gt; Redis command to get the key’s remaining time&lt;/strong&gt; before expiration and set that in the LRU so that both expire at the same time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Make a few tradeoffs&lt;/strong&gt; and set the TTL to the same value as when you stored it in Redis. Then the maximum time the item can exist in either Redis or the LRU becomes twice as long, and there might be stale data for a certain period. This is better explained with an example:

&lt;ul&gt;
&lt;li&gt;We set the Redis item with TTL of 60 seconds&lt;/li&gt;
&lt;li&gt;We get the item from Redis at second 59 and set it in our LRU cache for 60 seconds, so the item is “alive” for almost 2 minutes instead of 1.&lt;/li&gt;
&lt;li&gt;Another process might query Redis, see that the item does not exist/has expired, fetch a new one and store it. Now two versions of the same item exist across these two processes: the first process has the old version and the new process has the new version.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
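&lt;p&gt;Option 1 can be sketched with plain timestamps. In a real implementation you would read the remaining lifetime with Redis’s &lt;code&gt;TTL&lt;/code&gt; (or &lt;code&gt;PTTL&lt;/code&gt;) command and pass it as the &lt;code&gt;maxAge&lt;/code&gt; to the LRU’s &lt;code&gt;set(key, value, maxAge)&lt;/code&gt;; the numbers here are only for illustration:&lt;/p&gt;

```javascript
// Option 1: use the key's REMAINING time in Redis, not the original TTL,
// as the LRU maxAge, so both layers expire at the same moment.
// Simulated with millisecond timestamps instead of real Redis calls.
function remainingTtlMs(redisExpiresAt, now) {
  const left = redisExpiresAt - now;
  return left > 0 ? left : 0;
}

// Item stored in Redis at t=0 with a 60 second TTL:
const redisExpiresAt = 60000;

// We read it from Redis at t=59s, so the LRU entry should only live 1s more:
const maxAge = remainingTtlMs(redisExpiresAt, 59000);
// maxAge is now 1000ms; pass it to lru.set(key, value, maxAge) so that
// the in-memory copy never outlives the Redis copy.
```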

&lt;p&gt;A side note, only relevant if you are digging deeper into this: the in-memory LRU library used in this project checks the expiration time when it fetches an object; if the object has expired, it returns undefined as if the key was not found and deletes it from memory. There is no background process that proactively deletes expired keys. The &lt;code&gt;prune()&lt;/code&gt; method can be called to check the TTL of all items and remove them if necessary; this is only needed in scenarios where data is set with a TTL and hardly used after that.&lt;/p&gt;
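&lt;p&gt;A minimal sketch of this lazy, on-read expiration behaviour (hypothetical code, not the internals of the LRU library):&lt;/p&gt;

```javascript
// Lazy (on-read) expiration: expiry is only checked when a key is read,
// nothing runs in the background. `now` is passed in to keep this testable.
const cache = new Map(); // key -> { value, expiresAt }

function setWithTtl(key, value, ttlMs, now) {
  cache.set(key, { value, expiresAt: now + ttlMs });
}

function get(key, now) {
  const entry = cache.get(key);
  if (entry === undefined) return undefined;
  if (now >= entry.expiresAt) {
    cache.delete(key); // expired: delete lazily and report a miss
    return undefined;
  }
  return entry.value;
}
```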

&lt;h2&gt;
  
  
  Let’s talk about Lambda
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The majority of Lambda workloads are CPU intensive, which leads to over-provisioning memory&lt;/strong&gt; to get proportionally more CPU and network throughput. This reduces execution time and cost, but leaves most of the memory unused.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/alexcasalboni/aws-lambda-power-tuning"&gt;AWS Lambda Power Tuning&lt;/a&gt; is a great open source tool to help you find the best memory setting to optimize cost &amp;amp; execution time. Many times the cheapest and fastest option is a higher memory setting.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tKjH6d5e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://rehanvdm.com/images/blog/lru-cache-fallback-strategy/03_power_tools.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tKjH6d5e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://rehanvdm.com/images/blog/lru-cache-fallback-strategy/03_power_tools.jpg" alt="Caching order" width="880" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Chart from the output of the AWS Lambda Power Tuning&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To support my previous statement that “the majority of workloads are CPU intensive”, observe that the same Lambda &amp;amp; program that works with 1024MB also works on 128MB. The only thing that changes is that we have more CPU, network throughput and other proportionally scaled resources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The unused, over-provisioned memory can be used for caching data in-memory&lt;/strong&gt; instead of reaching for an external cache every time. This can reduce downstream pressure on those caches, reduce costs and speed up systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion (TL;DR)
&lt;/h2&gt;

&lt;p&gt;By leveraging an in-memory cache you reduce pressure on the downstream caches, reduce costs and speed up systems. For the biggest gains, consider an in-memory-first policy instead of using the in-memory cache only as a fallback.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>cache</category>
      <category>redis</category>
      <category>lambda</category>
    </item>
    <item>
      <title>Should you use microservices?</title>
      <dc:creator>Rehan van der Merwe</dc:creator>
      <pubDate>Sun, 14 Nov 2021 22:00:00 +0000</pubDate>
      <link>https://dev.to/rehanvdm/should-you-use-microservices-p5o</link>
      <guid>https://dev.to/rehanvdm/should-you-use-microservices-p5o</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Originally published on my blog: &lt;a href="https://www.rehanvdm.com/blog/should-you-use-microservices"&gt;https://www.rehanvdm.com/blog/should-you-use-microservices&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Microservices have been around for more than a decade and yet so many still don’t get it right. I am no different; I have created some pretty coupled architectures in my pursuit of building microservices. But just as in programming, having failed many times and knowing how not to do something is also a kind of success.&lt;/p&gt;

&lt;p&gt;I recently &lt;a href="https://twitter.com/der_rehan/status/1450044694013022209"&gt;posted my decision tree to Twitter&lt;/a&gt; for choosing between a Microservice architecture and a Monolith for a greenfield project. It blew up more than I thought it would and I got some pretty good feedback; click on the above tweet or see the original image below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tY4sE55X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://rehanvdm.com/images/blog/should-you-use-microservices/0_1_oroginal.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tY4sE55X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://rehanvdm.com/images/blog/should-you-use-microservices/0_1_oroginal.png" alt="Revised diagram" width="880" height="705"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Original diagram&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The revised diagram (thank you internet)
&lt;/h2&gt;

&lt;p&gt;The revised diagram is a direct result of all the feedback I got. I could not fit it into a decision tree 🤷‍♂️ anymore, so have a table instead:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4yrsc26E--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://rehanvdm.com/images/blog/should-you-use-microservices/2_monolith_microservice_architecture.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4yrsc26E--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://rehanvdm.com/images/blog/should-you-use-microservices/2_monolith_microservice_architecture.jpg" alt="Revised diagram" width="729" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Revised diagram&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Important points raised
&lt;/h2&gt;

&lt;p&gt;Team size and communication plays a huge factor in choosing the right architecture.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization’s communication structure. - Melvin E. Conway (1967)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Bringing Conway’s law into this discussion: multiple teams will gravitate towards microservices, while a large singular team will tend to build monolithic software.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--m4SS5XPj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://rehanvdm.com/images/blog/should-you-use-microservices/1_conways_law.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--m4SS5XPj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://rehanvdm.com/images/blog/should-you-use-microservices/1_conways_law.jpg" alt="Conways Law" width="880" height="746"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://sketchplanations.com/conways-law"&gt;https://sketchplanations.com/conways-law&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you are doing microservices as a single team, then be prepared for the &lt;strong&gt;cognitive overhead&lt;/strong&gt; and discipline needed to implement microservices properly. Many times it does not make sense from a business perspective; you will most likely have lower development velocity and increased cost.&lt;/p&gt;

&lt;p&gt;Your &lt;strong&gt;engineering standards have to be really high to do microservices as a single team&lt;/strong&gt;. On the other hand, multiple teams where each team owns at least one service can quickly lead to &lt;strong&gt;disjointed communication if the right processes aren’t followed.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Serverless != Microservices&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Another misconception I often see is the belief that as soon as you are doing Serverless, you are doing Microservices. Just because you have hundreds of Lambdas for your API does not mean you are doing Microservices. It is still a Monolith if it is a singular deployment with business logic shared between the Lambdas.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges of microservices
&lt;/h2&gt;

&lt;p&gt;In the original diagram I had limited space, so I padded the empty space with some of the most important points but could not include all of them. Here they are again, including what I didn’t mention and some points raised by others:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Schema versioning and management is difficult, you need to honor backward and forward compatibility.&lt;/li&gt;
&lt;li&gt;Transactions become distributed and need to be handled with the saga pattern, event choreography or event orchestration.&lt;/li&gt;
&lt;li&gt;Data duplication and consistency make bulk changes (or “fixes”) to entities difficult. You cannot just update all rows in a DB column; events need to be emitted to the other services as well.&lt;/li&gt;
&lt;li&gt;Synchronous service communication introduces latency and is unavoidable at times. It is easy to inadvertently create long call chains between your services, leading to overly chatty and dependent services.&lt;/li&gt;
&lt;li&gt;Asynchronous service communication introduces eventual consistency, which usually requires systems to handle failures and be idempotent as well.&lt;/li&gt;
&lt;li&gt;End-to-end monitoring is difficult; distributed logging and tracing are now needed.&lt;/li&gt;
&lt;li&gt;Testing, especially E2E, becomes more difficult. Your environment X, let’s say pre-prod, needs to communicate with all other services in environment X with clean data for tests to pass.&lt;/li&gt;
&lt;li&gt;Reporting and aggregations become difficult, as data needs to be pulled from many services.&lt;/li&gt;
&lt;li&gt;Each service needs its own CI/CD pipeline, testing, environments etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Research
&lt;/h2&gt;

&lt;p&gt;Research and educate the team; at least have a shallow understanding of why these concepts exist and what problems they solve. Many of these points are not directly related to microservices but will be needed to implement microservices correctly.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Event sourcing and event storming&lt;/li&gt;
&lt;li&gt;Event choreography and event orchestration&lt;/li&gt;
&lt;li&gt;The saga pattern and the circuit breaker pattern&lt;/li&gt;
&lt;li&gt;Eventual consistency and the CAP theorem&lt;/li&gt;
&lt;li&gt;Event Carried State Transfer&lt;/li&gt;
&lt;li&gt;DDD: Domain Driven Design and bounded context&lt;/li&gt;
&lt;li&gt;CQRS: Command Query Responsibility Segregation
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What I had wrong
&lt;/h2&gt;

&lt;p&gt;There are two closely related concepts: the Distributed monolith and the Modular monolith. The main difference is in how you structure your business logic and deploy. To sum it up, &lt;strong&gt;your architecture will fall within one of these categories:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Traditional monolith:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No boundaries between code, it is one application.&lt;/li&gt;
&lt;li&gt;Deployed as a single application.&lt;/li&gt;
&lt;li&gt;No physical boundaries.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Distributed monolith:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Loose Boundaries, sharing business logic between services.&lt;/li&gt;
&lt;li&gt;Services deployed independently (usually in the same repo, but can be multiple).&lt;/li&gt;
&lt;li&gt;Always has physical boundaries between services (think each service having its own compute or DB).
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Modular monolith:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strict Boundaries, services might be sharing code(think utility functions) but not business logic.&lt;/li&gt;
&lt;li&gt;Services deployed together (usually in the same repo, but can be multiple).&lt;/li&gt;
&lt;li&gt;Might not have physical boundaries.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Microservice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strict Boundaries.&lt;/li&gt;
&lt;li&gt;Services deployed independently.&lt;/li&gt;
&lt;li&gt;Always have physical boundaries.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What sets a &lt;strong&gt;distributed monolith&lt;/strong&gt; apart from a modular monolith is that it shares business logic between services, has independent service deployments, and gives each service its own physical boundaries and resources. This &lt;strong&gt;is the worst of the 4 categories.&lt;/strong&gt; Some signs of a distributed monolith include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shared business logic like each service querying the authentication service DB directly for authentication.&lt;/li&gt;
&lt;li&gt;Long deployment times because each service needs to deploy independently.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Shoot for a Microservice architecture and fall back on a Modular monolith if you have to.&lt;/strong&gt; They are usually my go-to for any small-to-medium project with a single team. Modular monoliths usually live within a single project, where each folder might be a service.&lt;/p&gt;

&lt;p&gt;Purists will argue that no code must be shared between services, but within a modular monolith utility functions usually are shared, as long as business logic stays within the boundaries of a service. Services also share physical resources like DBs and are deployed together.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion (TL;DR)
&lt;/h2&gt;

&lt;p&gt;Use the right architecture for the job, &lt;strong&gt;consider the non-technical requirements first&lt;/strong&gt; when deciding to use microservices. Does it fit with our company/project structure? Does the team have the required skills? Do I understand the business impact this will have?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;There is nothing wrong with a monolith&lt;/strong&gt;, you can still implement best practices and hit the targets set out by business. At some point your team will grow or usage patterns might change; only then consider the transition to microservices.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>microservices</category>
      <category>design</category>
      <category>eventdriven</category>
    </item>
    <item>
      <title>4 Methods to configure multiple environments in the AWS CDK</title>
      <dc:creator>Rehan van der Merwe</dc:creator>
      <pubDate>Sun, 24 Jan 2021 15:56:56 +0000</pubDate>
      <link>https://dev.to/rehanvdm/4-methods-to-configure-multiple-environments-in-the-aws-cdk-2fg7</link>
      <guid>https://dev.to/rehanvdm/4-methods-to-configure-multiple-environments-in-the-aws-cdk-2fg7</guid>
      <description>&lt;p&gt;In this post I will explore 4 different methods that can be used to pass configuration values to the AWS CDK. We will first look at using the &lt;strong&gt;context&lt;/strong&gt; variables in the cdk.json file, then move those same variables out to &lt;strong&gt;YAML&lt;/strong&gt; files. The third method will read the exact same config via &lt;strong&gt;SDK(API) call&lt;/strong&gt; from AWS SSM Parameter Store. The fourth and my favourite is a &lt;strong&gt;combination of two and three in conjunction with using GULP.js as a build tool&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The accompanying code for this blog can be found here: &lt;a href="https://github.com/rehanvdm/cdk-multi-environment"&gt;https://github.com/rehanvdm/cdk-multi-environment&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. The CDK recommended method of Context
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--neCcgPKp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2021/01/cdk_multi_env_01.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--neCcgPKp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2021/01/cdk_multi_env_01.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first method follows the &lt;a href="https://docs.aws.amazon.com/cdk/latest/guide/context.html"&gt;recommended method&lt;/a&gt; of reading external variables into the CDK at build time. The main idea behind it is to have the &lt;strong&gt;configuration&lt;/strong&gt; values that determine what resources are being built, &lt;strong&gt;committed alongside your CDK code&lt;/strong&gt;. This way you are assured of repeatable and consistent deployments without side effects.&lt;/p&gt;

&lt;p&gt;There are a few different ways to &lt;strong&gt;pass context values&lt;/strong&gt; into your CDK code. The first and easiest might be to use the context variables on the CDK CLI command line &lt;strong&gt;via&lt;/strong&gt; &lt;code&gt;--context&lt;/code&gt; &lt;strong&gt;or&lt;/strong&gt; &lt;code&gt;-c&lt;/code&gt; for short. Then in your code you can use &lt;code&gt;construct.node.tryGetContext(…)&lt;/code&gt; to get the value. Be sure to &lt;strong&gt;validate the returned values; TypeScript’s (TS) compile-time safety won’t cut it for values read at runtime&lt;/strong&gt;, more on this in the validation section at the end. Passing a lot of variables like this isn’t ideal, so you can also populate the context from a file.&lt;/p&gt;
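&lt;p&gt;That runtime validation can be as simple as a helper that throws when a context value is missing or empty. The keys below are hypothetical, and in the real project the values come from &lt;code&gt;construct.node.tryGetContext(…)&lt;/code&gt; rather than a plain object:&lt;/p&gt;

```javascript
// Illustrative runtime validation of context values: compile-time types
// cannot guarantee what `cdk deploy -c config=prod` actually passed in,
// so each value is checked before it is used.
function ensureString(object, key) {
  const value = object[key];
  if (typeof value !== 'string' || value.trim() === '') {
    throw new Error('Build config key ' + key + ' is missing or empty');
  }
  return value;
}

function getConfig(context) {
  return {
    environment: ensureString(context, 'environment'),
    awsAccountId: ensureString(context, 'awsAccountId'),
    awsRegion: ensureString(context, 'awsRegion'),
  };
}
```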

&lt;p&gt;When you start a new project, every &lt;code&gt;cdk.json&lt;/code&gt; will have a context property with some values already populated that are used by the CDK itself. This was my first pain point with this method; it just didn’t feel right to store parameters used by the CDK CLI in the same file as my application configuration (opinionated). Note that it is also possible to store the .json file in other places, please check out the official docs (link above) for more info.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://100.25.80.98/contents/data/2021/01/cdk_multi_env_02.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XJ0kx87n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2021/01/cdk_multi_env_02.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We are storing both development and production configuration values in the same file. Then when executing the CDK CLI commands we pass another context variable called config.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CW9AYF2r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2021/01/cdk_multi_env_03.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CW9AYF2r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2021/01/cdk_multi_env_03.png" alt=""&gt;&lt;/a&gt;Passing in a context variable to select the correct config within index.ts&lt;/p&gt;

&lt;p&gt;This is read within &lt;code&gt;index.ts&lt;/code&gt; and &lt;strong&gt;chooses one of the available environment configurations&lt;/strong&gt; as defined in our &lt;code&gt;cdk.json&lt;/code&gt; file. It is all done inside the &lt;code&gt;getConfig(…)&lt;/code&gt; function; notice that we read each context value individually and assign them to our own &lt;code&gt;BuildConfig&lt;/code&gt; interface, located at &lt;code&gt;/stacks/lib/build-config.ts&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YLcH6ID5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2021/01/cdk_multi_env_03_02.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YLcH6ID5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2021/01/cdk_multi_env_03_02.png" alt=""&gt;&lt;/a&gt;/stacks/lib/build-config.ts&lt;/p&gt;

&lt;p&gt;An instance of the &lt;code&gt;buildConfig&lt;/code&gt; &lt;strong&gt;is then passed down to every stack&lt;/strong&gt;, of which we only have one in this example. We also add tags to the CDK app, which will place them on every stack and resource when/if possible. Passing the region and account to the stack enables us to deploy that specific stack to other accounts and/or regions, provided the &lt;code&gt;--profile&lt;/code&gt; argument passed in has the correct permissions for that account.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2021/01/cdk_multi_env_04.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AyO6D4-F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2021/01/cdk_multi_env_04-773x1024.png" alt=""&gt;&lt;/a&gt;Loading config and passing buildConfig to the MainStack in /index.ts [Click to enlarge]&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The next methods all have the exact same code and structure; the only differences are the&lt;/strong&gt; &lt;code&gt;getConfig&lt;/code&gt; &lt;strong&gt;function and the execution of CLI commands.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The MainStack (below) that we are deploying has a single Lambda in it, with a few environment variables and the Lambda Insights layer, all of which we get from the config file.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2021/01/cdk_multi_env_05.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DrYBKW9j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2021/01/cdk_multi_env_05-1024x822.png" alt=""&gt;&lt;/a&gt;/stacks/main.ts&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Read config from a YAML file
&lt;/h2&gt;

&lt;p&gt;With this method we &lt;strong&gt;split our application configuration&lt;/strong&gt; from the CDK context file and &lt;strong&gt;store it in multiple YAML files&lt;/strong&gt;, where the &lt;strong&gt;name of each file indicates the environment.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2021/01/cdk_multi_env_06.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bg3L-y6P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2021/01/cdk_multi_env_06-1024x286.png" alt=""&gt;&lt;/a&gt;YAML config files&lt;/p&gt;

&lt;p&gt;Then we make a slight change to the &lt;code&gt;getConfig&lt;/code&gt; function in our &lt;code&gt;index.ts&lt;/code&gt; so that it reads and parses the new YAML files instead of the JSON from the context.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2021/01/cdk_multi_env_07.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wHMkAUr2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2021/01/cdk_multi_env_07.png" alt=""&gt;&lt;/a&gt;getConfig now reading from file and parsing YAML&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Read config from AWS SSM Parameter Store
&lt;/h2&gt;

&lt;p&gt;This method is &lt;strong&gt;not limited to the AWS SSM Parameter Store&lt;/strong&gt;; &lt;strong&gt;any third-party API/SDK call&lt;/strong&gt; can be used to fetch config and plug it into the CDK build process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2021/01/cdk_multi_env_08.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YSw8bV4B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2021/01/cdk_multi_env_08-567x1024.png" alt=""&gt;&lt;/a&gt;Loading config from AWS SSM Parameter Store&lt;/p&gt;

&lt;p&gt;The first “trick” is to &lt;strong&gt;wrap all the code inside an async function&lt;/strong&gt; and then execute it. Now we can make full use of &lt;strong&gt;async/await&lt;/strong&gt; calls before the stack is created. Inside the &lt;code&gt;getConfig(…)&lt;/code&gt; function we now also require that the profile and region context variables be passed when executing the CLI commands.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2021/01/cdk_multi_env_09.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VP00Dhb5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2021/01/cdk_multi_env_09.png" alt=""&gt;&lt;/a&gt;Now we also need to pass the profile and region used by the AWS SDK&lt;/p&gt;

&lt;p&gt;This is so that we can pass them to the AWS SDK, which in turn makes authenticated API calls to AWS for us. We created the SSM Parameter Store record (below) with exactly the same content as the YAML files, so that after retrieving it, we parse and populate the &lt;code&gt;BuildConfig&lt;/code&gt; exactly as we did for the YAML files method.&lt;/p&gt;
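&lt;p&gt;The async wrapper pattern looks roughly like this (the commented AWS SDK and YAML calls are assumptions of this sketch, not the exact blog code):&lt;/p&gt;

```typescript
// Stub standing in for the real SSM fetch so the structure is visible.
async function getConfig() {
  // In the real index.ts this would be something like:
  //   const ssm = new AWS.SSM({ region });
  //   const param = await ssm.getParameter({ Name: ssmParameterName }).promise();
  //   return yaml.load(param.Parameter.Value);
  return { Environment: "dev" };
}

// Wrapping app construction in an async function lets us await SDK calls
// before any stack is defined.
async function main() {
  const buildConfig = await getConfig();
  // const app = new cdk.App();
  // new MainStack(app, "MainStack", { env: { ... } }, buildConfig);
  return buildConfig;
}

main().catch(function (err) {
  console.error(err);
  process.exit(1);
});
```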

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2021/01/cdk_multi_env_ssm_03.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0wq8eSxC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2021/01/cdk_multi_env_ssm_03.png" alt=""&gt;&lt;/a&gt;AWS SSM Parameter Store record storing config as YAML&lt;/p&gt;

&lt;p&gt;This method has the advantage that your &lt;strong&gt;configuration file is now independent of any project&lt;/strong&gt;, is stored in a single location and can even be used by multiple projects. Storing the complete project config like this is a bit &lt;strong&gt;unorthodox&lt;/strong&gt; and not something that you will do often. You would &lt;strong&gt;ideally store most of the config on a project level and then pull a few global values used by all projects&lt;/strong&gt;; more on this in the next method.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Make use of an external build script with both local and global config
&lt;/h2&gt;

&lt;p&gt;In this example we make use of methods 2 and 3 above by having:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Project config&lt;/strong&gt; (YAML file), for this project, including AWS profile and region.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A global config&lt;/strong&gt; (AWS SSM Parameter Store) to be used by all projects.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We only &lt;strong&gt;store the Lambda Insights layer ARN in our global config&lt;/strong&gt;, which is the AWS SSM Parameter Store. So when AWS releases a new version of the layer, we can update it once in our global config and &lt;strong&gt;all projects will update their usage of it&lt;/strong&gt; the next time they are deployed.&lt;/p&gt;

&lt;p&gt;We are using a &lt;strong&gt;GULP.js&lt;/strong&gt; script and executing it with Node. It basically &lt;strong&gt;does the following&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reads the local YAML config file depending on the environment, which defaults to the branch name.&lt;/li&gt;
&lt;li&gt;Gets the AWS SSM Parameter Store name (from the local config) that holds the global config, then fetches the global config and adds it to the local config.&lt;/li&gt;
&lt;li&gt;Validates the complete configuration against a JSON Schema using the AJV package.&lt;/li&gt;
&lt;li&gt;Writes the complete config to disk so that it is committed with the repo.&lt;/li&gt;
&lt;li&gt;Runs the npm build script to transpile the CDK TS to JS.&lt;/li&gt;
&lt;li&gt;Builds and executes the CDK command, passing arguments like the AWS profile and the config context variable. When the CDK is synthesised to CloudFormation in &lt;code&gt;index.ts&lt;/code&gt;, just like before in method 2, it reads the complete config that we wrote to disk at step 4.&lt;/li&gt;
&lt;/ol&gt;
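&lt;p&gt;Step 6 can be sketched with a small helper that assembles the CLI command (the helper and its names are illustrative; the flags themselves are real CDK CLI flags):&lt;/p&gt;

```typescript
// Assemble the CDK CLI command with the AWS profile and the config
// context variable, mirroring step 6 above. This is an illustration,
// not the blog's exact gulp code.
function buildCdkCommand(action: "diff" | "deploy", profile: string, env: string): string {
  const parts = ["cdk", action, "--profile", profile, "--context", "config=" + env];
  if (action === "deploy") {
    parts.push("--require-approval", "never"); // the SKIP_APPROVAL variant
  }
  return parts.join(" ");
}

// The gulp task would then execute it after steps 1-5, roughly:
//   require("child_process").execSync(buildCdkCommand("diff", profile, env), { stdio: "inherit" });
```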

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2021/01/cdk_multi_env_11.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--o9szuTiV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2021/01/cdk_multi_env_11.png" alt=""&gt;&lt;/a&gt;The generateConfig method in the /config/gulpfile.js&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2021/01/cdk_multi_env_12.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hdeB8D6H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2021/01/cdk_multi_env_12-1024x313.png" alt=""&gt;&lt;/a&gt;The gulp task that builds the CDK, gets the config and runs the diff CDK CLI command&lt;/p&gt;

&lt;p&gt;Now instead of running &lt;code&gt;npm run cdk-diff-dev&lt;/code&gt;, we run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node node_modules\gulp\bin\gulp.js --color --gulpfile config\gulpfile.js generate_diff

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and for deploying:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node node_modules\gulp\bin\gulp.js --color --gulpfile config\gulpfile.js deploy_SKIP_APPROVAL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that we &lt;strong&gt;don’t pass the environment&lt;/strong&gt; in these commands and &lt;strong&gt;let it default to the branch name&lt;/strong&gt;, with the exception that the master branch uses the prod config. The getConfig(…) function within the GULP.js file also allows the environment to be passed explicitly. This deployment method works on CI tools as well.&lt;/p&gt;
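&lt;p&gt;That defaulting logic amounts to a few lines (the function and parameter names are assumptions of this sketch):&lt;/p&gt;

```typescript
// Default the environment to the current git branch name, mapping the
// master branch to the prod config; an explicitly passed value always wins.
function resolveEnvironment(branch: string, explicitEnv?: string): string {
  if (explicitEnv) {
    return explicitEnv;
  }
  return branch === "master" ? "prod" : branch;
}
```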

&lt;p&gt;The getConfig function used in the &lt;code&gt;index.ts&lt;/code&gt; is similar to method 2, except that it does validation using AJV and JSON Schema (see section below on validation).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2021/01/cdk_multi_env_14.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VhbnZR7t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2021/01/cdk_multi_env_14-1024x508.png" alt=""&gt;&lt;/a&gt;The getConfig(…) function inside index.ts&lt;/p&gt;

&lt;p&gt;One of the &lt;strong&gt;biggest advantages of using a GULP.js&lt;/strong&gt; file and executing it with Node is that it &lt;strong&gt;makes our deployment process operating system (OS) independent&lt;/strong&gt;. This is important to me since I am on Windows, and most people write Make and Bash scripts, forcing me to use Ubuntu on WSL2.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;One of the biggest advantages of using a GULP.js file and executing it with Node is that it makes our deployment process OS independent.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This deployment process is quite versatile. I have used this GULP.js method from before I was using Infrastructure as Code (IaC) tools, back when we only wanted to update Lambda code. Some form of it has since been used to deploy CloudFormation, then SAM and now the AWS CDK.&lt;/p&gt;

&lt;h2&gt;
  
  
  A few words about:
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Validation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;TypeScript only does compile-time checking,&lt;/strong&gt; which means it does not know whether the YAML/JSON that you are decoding is actually a string or even defined at runtime. Thus, we need to manually &lt;strong&gt;verify and put safeguards in place at runtime.&lt;/strong&gt; Methods 1 through 3 just did a basic check within &lt;code&gt;index.ts&lt;/code&gt; using the &lt;code&gt;ensureString(…)&lt;/code&gt; function where the config is read.&lt;/p&gt;

&lt;p&gt;For this method we are using a slightly more advanced approach. The &lt;a href="https://www.npmjs.com/package/ajv"&gt;AJV&lt;/a&gt; package &lt;strong&gt;validates a JSON object against the JSON Schema of our BuildConfig&lt;/strong&gt;. This way we can write a single schema file that defines rules, like ensuring certain properties are set and start with the correct AWS ARN.&lt;/p&gt;

&lt;p&gt;Writing JSON Schema and keeping it up to date is cumbersome, which is why we opted to use the &lt;a href="https://www.npmjs.com/package/typescript-json-schema"&gt;typescript-json-schema&lt;/a&gt; package. It converts our already existing TypeScript BuildConfig interface (at &lt;code&gt;/stacks/lib/build-config.ts&lt;/code&gt;) into a JSON Schema and stores it in the config directory at &lt;code&gt;/config/schema.json&lt;/code&gt;. Now when the GULP.js and &lt;code&gt;index.ts&lt;/code&gt; files read the config, they both validate it against this JSON Schema.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2021/01/cdk_multi_env_13.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fwQWp3UA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2021/01/cdk_multi_env_13-1024x694.png" alt=""&gt;&lt;/a&gt;BuildConfig TS Interface converted to JSON Schema&lt;/p&gt;

&lt;h3&gt;
  
  
  Project structure
&lt;/h3&gt;

&lt;p&gt;If you are following along with the code, you will also notice that I don’t structure my CDK projects like the initial/standard projects.&lt;/p&gt;

&lt;p&gt;This again is &lt;strong&gt;opinionated&lt;/strong&gt;, but &lt;strong&gt;the initial structure doesn’t seem logical&lt;/strong&gt; to me and doesn’t always work for every project.&lt;/p&gt;

&lt;p&gt;All stacks go into &lt;code&gt;/stacks&lt;/code&gt;, the main CDK construct is on the root as &lt;code&gt;index.ts&lt;/code&gt; and all &lt;strong&gt;application-specific code goes into&lt;/strong&gt; &lt;code&gt;/src&lt;/code&gt;. The &lt;code&gt;/src&lt;/code&gt; dir will have subdirectories for things like &lt;code&gt;/lambda&lt;/code&gt;, &lt;code&gt;/docker&lt;/code&gt; and &lt;code&gt;/frontend&lt;/code&gt;, as long as it makes logical sense. Not displayed here is the sometimes needed &lt;code&gt;/build&lt;/code&gt; &lt;strong&gt;dir, where the&lt;/strong&gt; &lt;code&gt;/src&lt;/code&gt; &lt;strong&gt;code gets built for production&lt;/strong&gt; and stored. The CDK then reads from &lt;code&gt;/build&lt;/code&gt; instead of &lt;code&gt;/src&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion (TL;DR)
&lt;/h2&gt;

&lt;p&gt;The accompanying code for this blog can be found here: &lt;a href="https://github.com/rehanvdm/cdk-multi-environment"&gt;https://github.com/rehanvdm/cdk-multi-environment&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are many different ways to store config for a CDK project, my favourite being the last method: storing it as YAML files at the project level and using a GULP.js script as a build tool. Whichever method you choose, always remember to validate the inputs.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cdk</category>
      <category>typescript</category>
      <category>node</category>
    </item>
    <item>
      <title>CloudFront reverse proxy API Gateway to prevent CORS</title>
      <dc:creator>Rehan van der Merwe</dc:creator>
      <pubDate>Tue, 20 Oct 2020 20:10:50 +0000</pubDate>
      <link>https://dev.to/rehanvdm/cloudfront-reverse-proxy-api-gateway-to-prevent-cors-449e</link>
      <guid>https://dev.to/rehanvdm/cloudfront-reverse-proxy-api-gateway-to-prevent-cors-449e</guid>
      <description>&lt;p&gt;In this blog we will do a quick recap of CORS and reverse proxies. Then we will show how a reverse proxy can eliminate CORS, specifically in the context of a SPA hosted on CloudFront with an API Gateway backend. The sample code focuses on public, authenticated routes (Authorization header) and IAM signed request all being reverse proxied through CloudFront. Everything is done with the AWS CDK and can be found here =&amp;gt; &lt;a href="https://github.com/rehanvdm/cloudfront-reverse-proxy-apigw"&gt;https://github.com/rehanvdm/cloudfront-reverse-proxy-apigw&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This article was originally published on my &lt;a href="https://www.rehanvdm.com/serverless/cloudfront-reverse-proxy-api-gateway-to-prevent-cors/index.html"&gt;blog&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I wonder if it is safe to assume that every developer that has worked with an API, knows what CORS is. I bet you are here because like many others you have lost countless hours against the battle to properly implement CORS.&lt;/p&gt;

&lt;p&gt;Pre-requisites and assumptions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have basic knowledge of AWS services like CloudFront, S3, Lambda, API GW and IAM.&lt;/li&gt;
&lt;li&gt;If you are following the code samples, it is assumed that you already have the AWS SDK and CDK installed.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick CORS intro
&lt;/h2&gt;

&lt;p&gt;CORS stands for &lt;strong&gt;Cross-Origin Resource Sharing&lt;/strong&gt;; it restricts a web application running on one &lt;strong&gt;origin (protocol &amp;amp; domain &amp;amp; port)&lt;/strong&gt; from accessing resources on a different origin.&lt;/p&gt;

&lt;p&gt;The browser will first have to do what is known as a &lt;strong&gt;preflight&lt;/strong&gt; request. It sends an &lt;strong&gt;OPTIONS&lt;/strong&gt; request to the target, asking if it is allowed and willing to accept the actual request. This way the target domain can decide who and what the trusted sources are.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_01_wikipedia_logic.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--R1-c1E4K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_01_wikipedia_logic-1024x528.png" alt=""&gt;&lt;/a&gt;CORS logic, credits &lt;a href="https://en.wikipedia.org/wiki/File:Flowchart_showing_Simple_and_Preflight_XHR.svg%20" rel="noreferrer noopener"&gt;Wikipedia&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This results in &lt;strong&gt;two separate requests for every request that you make&lt;/strong&gt;, which makes for a &lt;strong&gt;terrible user experience, especially for users that are not close to your origin&lt;/strong&gt;. To give you a better idea: here in South Africa the round trip to &lt;em&gt;us-east-1&lt;/em&gt; is anything from 400 to 800ms on a decent connection. With the latency that CORS adds, you can expect a basic request in our region, as observed by the user, to take between 800 and 1200ms. &lt;em&gt;Eliminating CORS would yield almost a 50% decrease in API latency.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_02_mozilla_double_request.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6kKJwi_Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_02_mozilla_double_request.png" alt=""&gt;&lt;/a&gt; &lt;br&gt;CORS flow showing latency, credits &lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS" rel="noreferrer noopener"&gt;Mozilla&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The only way to eliminate CORS and prevent the preflight requests is to have both the frontend and the backend on the same origin&lt;/strong&gt;.  This problem basically does not exist in a traditional application where both the frontend and the backend are on the same server and thus origin.&lt;/p&gt;

&lt;p&gt;The problem arises only when your frontend and backend are on different origins, as &lt;strong&gt;in our case, where the frontend runs as a SPA on CloudFront and the backend is an API Gateway&lt;/strong&gt;. A reverse proxy solves this by allowing the frontend to call a path on its own origin that forwards the request to API Gateway.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does a Reverse Proxy Solve?
&lt;/h2&gt;

&lt;p&gt;A reverse proxy &lt;strong&gt;requests resources on behalf of the current origin from one or more of the target origins.&lt;/strong&gt; These resources are &lt;strong&gt;returned to the client appearing as if they originated from the current origin&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is better understood visually, in the image below, CORS will be present as the frontend code makes requests to API Gateway directly to interact with the backend.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6d_wlyXw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_04_cors_setup-1024x454.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6d_wlyXw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_04_cors_setup-1024x454.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The solution here is to set CloudFront up as a reverse proxy on let’s say path &lt;em&gt;/backend-api/*&lt;/em&gt; so that whenever data is sent to &lt;em&gt;/backend-api/*&lt;/em&gt;, it is sent to the API Gateway. The frontend code then needs to make requests to itself (the origin it uses) at path &lt;em&gt;/backend-api&lt;/em&gt; instead of using the different origin that is API Gateway.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ee8SWzyM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_05_no_cors_setup.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ee8SWzyM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_05_no_cors_setup.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CloudFront acts as both a CDN and a reverse proxy&lt;/strong&gt;. The benefits that we gain from having this specific CloudFront setup includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No CORS preflight request&lt;/strong&gt; is needed, both frontend and backend API are on the same origin. Thus an &lt;strong&gt;approximate 50% decrease in API request latency.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More consistent (and usually faster) API request routing.&lt;/strong&gt; From a user perspective, the API requests will hit the closest CloudFront Point-of-Presence(POP) and then &lt;strong&gt;traverse to API Gateway on the AWS backbone network&lt;/strong&gt; as opposed to traversing the public internet to the API Gateway.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can also leverage other functionality, for example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Terminate HTTPS at CloudFront and send data using HTTP to the backend, saving backend resources from doing the computationally expensive SSL dance.&lt;/li&gt;
&lt;li&gt;Response compression.&lt;/li&gt;
&lt;li&gt;Independent request caching that can be controlled by the backend.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I am not going into further details about why/what and how CORS &amp;amp; Reverse Proxies work, this should be sufficient information for the rest of the post. I linked to additional resources at the end of the post if you want to do further reading.&lt;/p&gt;

&lt;h2&gt;
  
  
  Show me the codes!
&lt;/h2&gt;

&lt;p&gt;The project has a stock-standard &lt;strong&gt;CDK&lt;/strong&gt; layout and is written in &lt;strong&gt;TypeScript&lt;/strong&gt;, while the backend &lt;strong&gt;Lambdas&lt;/strong&gt; are written in &lt;strong&gt;JavaScript&lt;/strong&gt;. The frontend is basic HTML and JavaScript; there is both a &lt;em&gt;/src&lt;/em&gt; and a &lt;em&gt;/dist&lt;/em&gt; folder for the frontend, as one of the libraries needed is bundled using &lt;em&gt;Browserify&lt;/em&gt;. The complete project can be found here =&amp;gt; &lt;a href="https://github.com/rehanvdm/cloudfront-reverse-proxy-apigw"&gt;https://github.com/rehanvdm/cloudfront-reverse-proxy-apigw&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_06_project_setup_apigw.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fwePmofd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_06_project_setup_apigw.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our CloudFront has a specific behavior to forward all requests at path &lt;strong&gt;/cf-apigw&lt;/strong&gt; &lt;strong&gt;to our API Gateway domain&lt;/strong&gt;. It is very important that we use the API Gateway stage as the origin path. This configuration eliminates CORS, as the frontend no longer has to call the API Gateway directly but just a path on the same frontend domain.&lt;/p&gt;

&lt;p&gt;The majority of the time you don’t want to integrate directly with an API Gateway domain, so that it can be treated as ephemeral when you are doing Infrastructure as Code (IaC). Most scenarios front API Gateway with an API Gateway custom domain that you own, which then forwards to the AWS API Gateway domain and stage.&lt;/p&gt;

&lt;p&gt;I have also included this as a different path on the same CloudFront. This path is &lt;strong&gt;/cf-cust-domain&lt;/strong&gt;, &lt;strong&gt;which forwards all requests to the custom domain, which in turn forwards them to the actual API Gateway&lt;/strong&gt;. Keep in mind that the API Gateway custom domain service is a “specially” designed CloudFront that AWS controls for you.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_07_project_setup_custom_domain.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Qj2XUb3a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_07_project_setup_custom_domain.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The main takeaway from these diagrams is to show the configuration for the CloudFronts and what path it will result in when it hits your API Gateway. If you haven’t figured it out already, &lt;strong&gt;CloudFront prepends the behavior path pattern to the path that it is forwarding&lt;/strong&gt;. This means that we need the whole API to be beneath a resource with the same name as the path pattern. This &lt;strong&gt;logic holds true for anything that you reverse proxy through CloudFront.&lt;/strong&gt;&lt;/p&gt;
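&lt;p&gt;A tiny illustration of that prepending behaviour (the domain below is a made-up example):&lt;/p&gt;

```typescript
// CloudFront forwards the full request path (including the matched behavior
// path pattern) and prepends the origin path, here the API Gateway stage.
function forwardedUrl(originDomain: string, originPath: string, requestPath: string): string {
  return "https://" + originDomain + originPath + requestPath;
}

// A request to /cf-apigw/auth on the CloudFront domain therefore arrives at
// API Gateway as /prod/cf-apigw/auth, which is why the whole API must live
// beneath a resource named after the path pattern.
```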

&lt;p&gt;Below are a few screenshots of what the AWS console looks like for these resources configured with the CDK in the above setup. I am &lt;strong&gt;only showing the API Gateway Custom domain setup&lt;/strong&gt; as it is the trickier of the two to get right.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XgZzieII--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_08_setup_origin.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XgZzieII--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_08_setup_origin.png" alt=""&gt;&lt;/a&gt;CloudFront Origin for behavior&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--K6oWvbwo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_09_setup_behavior_1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--K6oWvbwo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_09_setup_behavior_1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ye-6vnQ9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_09_setup_behavior_2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ye-6vnQ9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_09_setup_behavior_2.png" alt=""&gt;&lt;/a&gt;CloudFront Behavior&lt;/p&gt;

&lt;p&gt;Be sure to &lt;strong&gt;allow ALL methods for your API&lt;/strong&gt;, as you will need the full range when implementing REST. Since we have authenticated paths, you &lt;strong&gt;must whitelist the Authorization header&lt;/strong&gt;. Never whitelist the Host header, or the second CloudFront (the custom domain) will refuse the request.&lt;/p&gt;

&lt;p&gt;I am effectively &lt;strong&gt;disabling the caching for this behavior by setting all the TTL values to 0&lt;/strong&gt;. The backend processing, in our case Lambda, can respond with the appropriate caching headers and CloudFront will apply them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fKOY6pDJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_10_setup_custom_domain.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fKOY6pDJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_10_setup_custom_domain.png" alt=""&gt;&lt;/a&gt;API Gateway custom domain&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6u0U0RYS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_11_setup_apigw_stage.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6u0U0RYS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_11_setup_apigw_stage.png" alt=""&gt;&lt;/a&gt;API Gateway prod stage&lt;/p&gt;

&lt;p&gt;This is the &lt;strong&gt;important&lt;/strong&gt; part, especially if you are not proxying all requests to the same Lambda. &lt;strong&gt;All of your API resources now need to live under the same path as the CloudFront behavior path&lt;/strong&gt; , in our case, it is /cf-cust-domain. &lt;strong&gt;All of these methods hit the same Lambda function that just echoes back the event object as received by the Lambda.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I want to focus on 3 different requests as made from CloudFront:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;/cf-cust-domain/no-auth&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This route will &lt;strong&gt;not specifically match anything&lt;/strong&gt; that we defined in our API Gateway stage above, thus it will &lt;strong&gt;fall under the /{proxy+} path&lt;/strong&gt; under the root. I specifically set up this “catch undefined routes” resource to debug and get my head around the problem. This is also how I am testing unauthenticated routes.&lt;/p&gt;

&lt;p&gt;By inspecting the browser web console, we can see the responses for this request, observe the path that the Lambda saw and what resource was hit on API Gateway:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iwikncYP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_12_path_auth.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iwikncYP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_12_path_auth.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;/cf-cust-domain/auth&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This route does have a resource and &lt;strong&gt;method defined which is set up to use a Lambda Token Authorizer&lt;/strong&gt; with the Token Source set to the Authorization header. We can see from the response that this route resource was hit.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5QnEUjP5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_12_path_proxy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5QnEUjP5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_12_path_proxy.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can also inspect the values added to the request context by the attached authorizer, which is the stock-standard &lt;a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/apigateway-use-lambda-authorizer.html"&gt;AWS example of the Lambda Token Authorizer&lt;/a&gt; that I implemented. The request included an &lt;strong&gt;Authorization header with the value of allow&lt;/strong&gt;, as per the example, which allowed the authorizer to pass.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;/cf-cust-domain/auth-iam&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the last and most complicated route; the method on &lt;strong&gt;API Gateway has Auth set to AWS_IAM&lt;/strong&gt;. This requires you to &lt;strong&gt;first sign the request&lt;/strong&gt; with your current IAM profile/role and &lt;strong&gt;then add the signing headers&lt;/strong&gt; when you make the request. You can read more about this &lt;a href="https://docs.aws.amazon.com/general/latest/gr/sigv4_signing.html"&gt;here&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/general/latest/gr/signing_aws_api_requests.html"&gt;here&lt;/a&gt;. I used the &lt;a href="https://www.npmjs.com/package/aws4"&gt;aws4&lt;/a&gt; npm package to do the signing process for me, as &lt;a href="https://github.com/mhart/aws4/tree/cfbf3e38012ab82be48518048d1ed87fbca5bf5a/browser"&gt;per the browser example&lt;/a&gt;.&lt;/p&gt;
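&lt;p&gt;As a sketch of that signing step: the aws4 package takes a plain request options object and adds the signature headers to it. The API id, region, stage and path below are hypothetical placeholders.&lt;/p&gt;

```javascript
// Sketch of preparing a request for SigV4 signing with the aws4 npm package.
// The host, region, stage and path are hypothetical placeholders.
function buildRequestToSign(stage, path, body) {
  return {
    host: 'abc123.execute-api.eu-west-1.amazonaws.com',
    path: `/${stage}${path}`,
    method: 'POST',
    service: 'execute-api', // the signing service name for API Gateway
    region: 'eu-west-1',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body),
  };
}

// const aws4 = require('aws4');
// const opts = buildRequestToSign('prod', '/person/create-person', { name: 'Jane' });
// aws4.sign(opts, { accessKeyId: '...', secretAccessKey: '...' });
// // opts.headers now also contains Authorization and X-Amz-Date; send those
// // headers unchanged with the actual HTTP request.
```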

&lt;p&gt;The function that we have been using so far looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zzqQU9Jd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_13_request_basic.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zzqQU9Jd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_13_request_basic.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The one that includes the signed headers that we must use for IAM auth looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_13_request_signed.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vyXrST56--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_13_request_signed.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You still require the Custom domain &amp;amp; path mapping&lt;/strong&gt; &lt;strong&gt;&lt;em&gt;OR&lt;/em&gt;&lt;/strong&gt; &lt;strong&gt;the API Gateway domain &amp;amp; stage for signing the request&lt;/strong&gt;. You can then make the request as per usual. Below are all the requests made to both CloudFront paths; more info can be found under &lt;em&gt;/src/frontend/src/index.js&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_14_requesting.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FgfS0OXG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_14_requesting.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Result
&lt;/h2&gt;

&lt;p&gt;Navigating to the CloudFront domain that was created by the CDK stack will greet you with this very basic html page to capture the details required to make the requests above.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zze4rmg1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_15_basic_form.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zze4rmg1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_15_basic_form.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I left the default values for example only, please change these if you are following along in code. &lt;strong&gt;Running this on my localhost&lt;/strong&gt; means I have a &lt;strong&gt;different domain than the CloudFront&lt;/strong&gt; one, the browser will thus send OPTIONS requests as per CORS specification. I left CORS enabled on the API Gateway with very permissive values of all (or *) so that we can compare the result while running on our localhost during development.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_16_local_run.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VqbDJMtR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_16_local_run-1024x402.png" alt=""&gt;&lt;/a&gt;Requests made from localhost origin&lt;/p&gt;

&lt;p&gt;You can see the impact CORS has on latency by inspecting the Waterfall, the &lt;strong&gt;GET request is made and is then blocked until the OPTIONS request informs the browser to continue with the GET request.&lt;/strong&gt; The same requests on the CloudFront site results in this network output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pELbphn---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_17_cloudfront_run-1024x309.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pELbphn---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_17_cloudfront_run-1024x309.png" alt=""&gt;&lt;/a&gt;Requests made from CloudFront origin&lt;/p&gt;

&lt;p&gt;We see that there are &lt;strong&gt;no more OPTIONS requests being made, success!&lt;/strong&gt;&lt;/p&gt;
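&lt;p&gt;For reference, whether the browser sends a preflight at all follows mechanical rules from the CORS specification. The helper below is a simplified sketch of those "simple request" rules (some edge cases are omitted):&lt;/p&gt;

```javascript
// Simplified check for whether a cross-origin request triggers an OPTIONS
// preflight, per the CORS "simple request" rules (edge cases omitted).
const SIMPLE_METHODS = ['GET', 'HEAD', 'POST'];
const SIMPLE_HEADERS = ['accept', 'accept-language', 'content-language', 'content-type'];
const SIMPLE_CONTENT_TYPES = [
  'application/x-www-form-urlencoded', 'multipart/form-data', 'text/plain',
];

function needsPreflight(method, headers = {}) {
  if (!SIMPLE_METHODS.includes(method.toUpperCase())) return true;
  for (const [name, value] of Object.entries(headers)) {
    const lower = name.toLowerCase();
    if (!SIMPLE_HEADERS.includes(lower)) return true;
    if (lower === 'content-type') {
      if (!SIMPLE_CONTENT_TYPES.includes(value.split(';')[0].trim())) return true;
    }
  }
  return false;
}

// A JSON POST always preflights cross-origin, which is exactly the latency
// the reverse proxy removes:
// needsPreflight('POST', { 'Content-Type': 'application/json' }) -> true
// needsPreflight('GET') -> false
```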

&lt;p&gt;For those who might be interested in the full project code, it can be found here =&amp;gt; &lt;a href="https://github.com/rehanvdm/cloudfront-reverse-proxy-apigw"&gt;https://github.com/rehanvdm/cloudfront-reverse-proxy-apigw&lt;/a&gt;. For the curious, below is the &lt;strong&gt;200-line CDK stack that produced the 1,500-line CloudFormation&lt;/strong&gt; template for this example.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_18_complete_cdk.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JFoyk99r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/10/cf_reverse_proxy_18_complete_cdk.png" alt=""&gt;&lt;/a&gt;Click to enlarge&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion (TL;DR)
&lt;/h2&gt;

&lt;p&gt;We showed you can use CloudFront to reverse proxy to your backend on API Gateway. This eliminates CORS preflight requests, which can decrease request latency by up to 50%. If you control both the backend and the frontend, there is little excuse not to eliminate CORS; every technology and framework has some concept of a reverse proxy.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Final note: you can use Lambda@Edge to remove the path that CloudFront prepends when it forwards the request. This adds extra latency to the request, but it will be much less than a round trip. My recommendation is to create a completely new API Gateway and API Custom domain for CloudFront that is an exact copy of the original Custom domain. This becomes really easy if you are using an IaC tool.&lt;/em&gt;&lt;/p&gt;




&lt;h4&gt;
  
  
  Additional Resources:
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;CORS:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://javascript.info/fetch-crossorigin"&gt;https://javascript.info/fetch-crossorigin&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Cross-origin_resource_sharin"&gt;https://en.wikipedia.org/wiki/Cross-origin_resource_sharin&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS"&gt;https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Reverse Proxy:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.cloudflare.com/learning/cdn/glossary/reverse-proxy/"&gt;https://www.cloudflare.com/learning/cdn/glossary/reverse-proxy/&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Reverse_proxy"&gt;https://en.wikipedia.org/wiki/Reverse_proxy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.imperva.com/learn/performance/reverse-proxy/"&gt;https://www.imperva.com/learn/performance/reverse-proxy/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>cdk</category>
    </item>
    <item>
      <title>Refactoring a distributed monolith to microservices</title>
      <dc:creator>Rehan van der Merwe</dc:creator>
      <pubDate>Thu, 30 Jul 2020 04:28:50 +0000</pubDate>
      <link>https://dev.to/aws-heroes/refactoring-a-distributed-monolith-to-microservices-1g02</link>
      <guid>https://dev.to/aws-heroes/refactoring-a-distributed-monolith-to-microservices-1g02</guid>
      <description>&lt;p&gt;This article documents the thought process and steps involved in refactoring a distributed monolith to microservices. We are going to remove API GW, use Amazon Event Bridge and implement BASE consistency in the system to truly decouple our microservices.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This blog is also available as a presentation. Reach out if you would like me to present it at an event.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I will use the codebase from the previous installment to the series that can be found &lt;a href="https://www.rehanvdm.com/general/aws-serverless-you-might-not-need-third-party-monitoring/index.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;. The first part focuses on creating the codebase and implementing AWS native observability, monitoring and alerting services. As always you can find the code that we used in the respective GitHub repositories over here: &lt;a href="https://github.com/rehanvdm/MicroService" rel="noopener noreferrer"&gt;https://github.com/rehanvdm/MicroService&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Original System
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://www.rehanvdm.com/general/aws-serverless-you-might-not-need-third-party-monitoring/index.html" rel="noopener noreferrer"&gt;original system&lt;/a&gt; is a distributed monolith that consists of three microservices.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/07/refactor_dist_mon_01_SystemArchitecture.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2020%2F07%2Frefactor_dist_mon_01_SystemArchitecture-1024x715.png"&gt;&lt;/a&gt;Original system (Click to enlarge)&lt;/p&gt;

&lt;p&gt;Within each project you can find an &lt;strong&gt;OpenAPI&lt;/strong&gt; (&lt;a href="https://github.com/rehanvdm/MicroServicePerson/blob/master/part1/src/lambda/api/api-definition.yaml" rel="noopener noreferrer"&gt;part2/src/lambda/api/api-definition.yaml&lt;/a&gt;) file that defines the API definition for each service. &lt;strong&gt;AWS CDK&lt;/strong&gt; is used and they all follow the similar stock standard CDK project layout: Typescript for the CDK and &lt;strong&gt;ES6 JS&lt;/strong&gt; for the application code. NPM commands have been written to do deployments and it also contains &lt;strong&gt;end-to-end tests&lt;/strong&gt; using Mocha and Chai. In addition, each service contains a detailed README inside the /part2 path. Note that I only have a single Lambda for the API endpoint and do internal routing. Yes, I believe in a &lt;strong&gt;Lambalith for the API&lt;/strong&gt;!😊 and also prefer JSON POST over REST (more about this later).&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;problem&lt;/strong&gt; arises as soon as these microservices start to call one another. We will focus on the creation of a new person to demonstrate how this &lt;strong&gt;tight coupling&lt;/strong&gt; is working.&lt;/p&gt;

&lt;p&gt;The client service stores clients and has basic &lt;em&gt;create-client&lt;/em&gt; and &lt;em&gt;find-client&lt;/em&gt; functionalities as well as an endpoint to increment the person count for a specific client. The person service also has basic &lt;em&gt;create-person&lt;/em&gt; and &lt;em&gt;find-person&lt;/em&gt; endpoints. When a person is created, it calls the common service which notifies me by email about the new person that was added using an SNS subscription. The common service first needs to do a lookup on the client service so that it can enrich the email. It also increments the counter on the client. Click on the image below to see the step-by-step path for creating a person:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/07/refactor_dist_mon_02_SystemArchitecturePathFocus.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2020%2F07%2Frefactor_dist_mon_02_SystemArchitecturePathFocus-1024x553.png"&gt;&lt;/a&gt; &lt;br&gt;Original system – full create person flow (Click to enlarge) &lt;/p&gt;

&lt;p&gt;The &lt;em&gt;create-person&lt;/em&gt; call is highly dependent on the common service and does not even know that the common service is dependent on the client service. As a result, the person service is also dragged down if either the common or the client service is down. Not to mention that it now has to wait for the completion of every step in the synchronous chain. This wastes money and increases the probability of hitting the API Gateway timeout of 29 seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decoupling with Amazon Event Bridge
&lt;/h2&gt;

&lt;p&gt;Amazon Event Bridge is a serverless event bus that makes it easy to work with &lt;strong&gt;event-driven architectures&lt;/strong&gt;. It works on a basic &lt;strong&gt;publish and subscribe&lt;/strong&gt; model. We use it to emit certain events like &lt;em&gt;person-created&lt;/em&gt; and &lt;em&gt;client-created&lt;/em&gt;. Other services can then listen to only the events that they want and act on them. The new system is refactored to incorporate this and remove the direct HTTP API calls between services.&lt;/p&gt;
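&lt;p&gt;Publishing such an event is a single SDK call. Below is a minimal sketch of the entry shape that &lt;em&gt;putEvents&lt;/em&gt; expects; the bus name, source and detail values are hypothetical examples, not the exact names used in the repo.&lt;/p&gt;

```javascript
// Builds an EventBridge entry for the create-person event. The bus name,
// source and detail values are hypothetical examples.
function buildCreatePersonEvent(person) {
  return {
    EventBusName: 'default',
    Source: 'service.person',
    DetailType: 'create-person',
    Detail: JSON.stringify(person), // Detail must be a JSON string
  };
}

// const AWS = require('aws-sdk');
// const eventBridge = new AWS.EventBridge();
// await eventBridge.putEvents({
//   Entries: [buildCreatePersonEvent({ personId: '123', clientId: 'abc' })],
// }).promise();
```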

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/07/refactor_dist_mon_03_EVentBridge.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2020%2F07%2Frefactor_dist_mon_03_EVentBridge-1024x522.png"&gt;&lt;/a&gt; &lt;br&gt;New system (Click to enlarge) &lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;client service&lt;/strong&gt; has not changed much. It is still fronted with API Gateway (GW). It now emits an event onto the bus whenever a client is created. A new Lambda function is added that listens to the &lt;em&gt;create-person&lt;/em&gt; events. This increments the person counter for that specific client. This feature was previously on the common service but has now moved to the client service.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;person service&lt;/strong&gt; is working exactly as before. Just like the client service, it also emits an event onto the event bus, specifically the &lt;em&gt;create-person&lt;/em&gt; event.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;common service&lt;/strong&gt; no longer needs to be fronted by API GW. Instead it listens to both the &lt;em&gt;create-client&lt;/em&gt; and &lt;em&gt;create-person&lt;/em&gt; events. The common service stores the client data in its own DynamoDB table. It uses this to look the client up within itself (locally), rather than calling an HTTP API to get the data for a specific client. The common service still sends an email when a new person is added.&lt;/p&gt;

&lt;p&gt;From an &lt;strong&gt;external integration point of view&lt;/strong&gt;, all API endpoints stayed exactly the same. The diagrams below clearly illustrate that each service is only concerned with its own data and responsibilities. The event that is emitted onto the bus is also added for convenience.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/07/refactor_dist_mon_04_PartialPerson.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2020%2F07%2Frefactor_dist_mon_04_PartialPerson-1024x711.png"&gt;&lt;/a&gt; &lt;br&gt;New system – partial create person flow (Click to enlarge) &lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/07/refactor_dist_mon_05_PartialClient.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2020%2F07%2Frefactor_dist_mon_05_PartialClient-1024x715.png"&gt;&lt;/a&gt; &lt;br&gt;New system – partial client person flow (Click to enlarge) &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Internally&lt;/strong&gt; the common service has an Event Bridge Rule with the Lambda function as target. It listens to the &lt;em&gt;create-client&lt;/em&gt; events and then stores only the client-id and client name fields within its own DynamoDB table. This &lt;strong&gt;removes the need for it to do an HTTP API call&lt;/strong&gt; to the client service as it can now just do the lookup locally against its own data store.&lt;/p&gt;
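&lt;p&gt;The Event Bridge Rule mentioned above is driven by an event pattern, which is just a filter document. Below is a minimal sketch of such a pattern; the source and detail-type values are hypothetical examples, not the exact names used in the repo.&lt;/p&gt;

```javascript
// Sketch of an EventBridge rule event pattern that delivers only the
// create-client events to the common service's Lambda target.
// The source and detail-type values are hypothetical examples.
const createClientPattern = {
  source: ['service.client'],
  'detail-type': ['create-client'],
};

// A rule with this pattern (and the Lambda function as target) means the
// common service's table-sync handler is invoked only for client creations,
// never for the create-person events also flowing on the bus.
```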

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/07/refactor_dist_mon_06_FullClient.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2020%2F07%2Frefactor_dist_mon_06_FullClient-1024x529.png"&gt;&lt;/a&gt; &lt;br&gt;New system – full create client flow (Click to enlarge) &lt;/p&gt;

&lt;p&gt;The common service also listens to the &lt;em&gt;create-person&lt;/em&gt; event. It &lt;strong&gt;looks up the client information in its own DynamoDB table&lt;/strong&gt; and then sends the SNS message. &lt;strong&gt;At the same time&lt;/strong&gt;, the client service also listens to the &lt;em&gt;create-person&lt;/em&gt; events. It uses the client-id that comes with the event to increment the person counter for that specific client in the client service DynamoDB table.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/07/refactor_dist_mon_07_FullPerson.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2020%2F07%2Frefactor_dist_mon_07_FullPerson-1024x522.png"&gt;&lt;/a&gt; &lt;br&gt;New system – full create person flow (Click to enlarge) &lt;/p&gt;

&lt;h2&gt;
  
  
  What has changed?
&lt;/h2&gt;

&lt;p&gt;We used Event Bridge to &lt;strong&gt;remove direct HTTP API calls&lt;/strong&gt; between microservices. It also allowed us to &lt;strong&gt;move some logic&lt;/strong&gt; to where it belongs. The common service should not be responsible to increment the specific client’s person counter in the first place. That functionality is now contained within the client service, where it belongs.&lt;/p&gt;

&lt;p&gt;We basically borrowed two of the &lt;strong&gt;SOLID OOP principles&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single Responsibility – Each service is only concerned with its own core functionality.&lt;/li&gt;
&lt;li&gt;Open Closed – Each service is now open for extension, but the core functionality is closed for modification.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then we introduced &lt;strong&gt;BASE consistency&lt;/strong&gt; into the system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;B&lt;/strong&gt;asic &lt;strong&gt;A&lt;/strong&gt;vailability – Even if the client service is down the common service can still operate as it has a copy of the data. Thus the data layer/plane is still operational.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;S&lt;/strong&gt;oft State – Stores don’t have to be write-consistent or mutually consistent all the time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;E&lt;/strong&gt;ventual Consistency – The system will become consistent over time, given that the system doesn’t receive input during that time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A BASE system is almost always going to be an &lt;strong&gt;AP system&lt;/strong&gt; if you look at the CAP theorem, meaning it favors &lt;strong&gt;A&lt;/strong&gt;vailability and &lt;strong&gt;P&lt;/strong&gt;artition tolerance over &lt;strong&gt;C&lt;/strong&gt;onsistency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/07/refactor_dist_mon_08_CAP.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2020%2F07%2Frefactor_dist_mon_08_CAP.png"&gt;&lt;/a&gt;CAP theorem applied to Databases (Click to enlarge)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DynamoDB&lt;/strong&gt; is an example of an &lt;strong&gt;AP system&lt;/strong&gt; as well. Your data is stored on multiple 10GB partition drives spread over multiple Availability Zones. That replication takes a few milliseconds, up to a second or two. Obviously, there are cases where you need to read the data back directly after writing. That is why you can specify a Strongly Consistent read, which just goes back to the writer node and queries the data there instead of waiting for the data to have propagated to all nodes.&lt;/p&gt;
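&lt;p&gt;The strongly consistent read described above is just a flag on the request. A minimal sketch, with a hypothetical table name and key:&lt;/p&gt;

```javascript
// Sketch of a DynamoDB GetItem request that asks for a strongly consistent
// read. The table name and key attribute are hypothetical examples.
function buildClientLookup(clientId) {
  return {
    TableName: 'client-table',
    Key: { clientId },
    ConsistentRead: true, // read via the leader instead of any replica
  };
}

// const AWS = require('aws-sdk');
// const doc = new AWS.DynamoDB.DocumentClient();
// const { Item } = await doc.get(buildClientLookup('abc')).promise();
```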

&lt;p&gt;&lt;strong&gt;S3&lt;/strong&gt; has &lt;strong&gt;read-after-write consistency for PUTs of new objects&lt;/strong&gt; but all &lt;strong&gt;other commands are subject to eventual consistency.&lt;/strong&gt; This means that after updating an object, the old content may still be returned for a short amount of time until the new content has propagated to all the storage nodes. There are more examples such as these all over the AWS ecosystem.&lt;/p&gt;

&lt;p&gt;Event-driven architectures come with their own &lt;strong&gt;pros and cons&lt;/strong&gt;. Firstly, you need to &lt;strong&gt;approach the problem with a different distributed mindset&lt;/strong&gt; that needs to be present within the company/team as well. Event &lt;strong&gt;versioning and communication&lt;/strong&gt; between teams that own microservices are also crucial. It is a good idea to have a service that keeps a &lt;strong&gt;ledger of all events that happen&lt;/strong&gt; in the system. Worst case, this history of events can be replayed to fix any processing errors in the downstream processing.&lt;/p&gt;

&lt;p&gt;Event bridge, like SNS and SQS, guarantees delivery of a message at least once. This means your &lt;strong&gt;system needs to be idempotent&lt;/strong&gt;. An example of an idempotent flow is when the client is created and the common service does a PUT command into the DynamoDB table. If that event gets delivered more than once, it just overwrites the current client with the exact same data.&lt;/p&gt;

&lt;p&gt;An example of a non-idempotent flow is when the person is created. If the client service gets more than one message, it increments the client-person counter more than once. There are ways to make this call idempotent, but we’ll leave that for a different blog.&lt;/p&gt;
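&lt;p&gt;The difference between the two flows can be shown with a toy in-memory store: replaying the PUT-style write is harmless, while replaying the increment is not. The event shapes below are illustrative only.&lt;/p&gt;

```javascript
// Toy demonstration of at-least-once delivery. Event shapes are illustrative.
const clients = new Map();

// Idempotent: a duplicate create-client event overwrites the row with the
// exact same data, so processing it twice changes nothing.
function onCreateClient(event) {
  clients.set(event.clientId, { clientId: event.clientId, name: event.name });
}

const personCounts = new Map();

// Not idempotent: a duplicate create-person event increments the counter
// a second time, corrupting the count.
function onCreatePerson(event) {
  personCounts.set(event.clientId, (personCounts.get(event.clientId) || 0) + 1);
}
```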

&lt;p&gt;Another thing to consider is that not all systems can accept the delay that eventual consistency introduces into a system. It is perfectly acceptable in our system as a person will probably not be created immediately within one second after a client has been created. Thus, whenever the common service does the client lookup locally, it can be assured that the client data is always populated.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resilience to Failure
&lt;/h2&gt;

&lt;p&gt;One of the benefits that we have achieved by refactoring to a microservice system is that we are now &lt;strong&gt;resilient to complete service failure&lt;/strong&gt;. The client service can still operate if the person and common services are down. Similarly, the person service can operate on its own and is not dependent on the other services.&lt;/p&gt;

&lt;p&gt;We also &lt;strong&gt;removed any timeout problems that were introduced&lt;/strong&gt; by the previous architecture that synchronously chained API calls. If an Event Bridge target service (like Lambda, in our case) is down, it will retry sending the message with exponential backoff for up to 24 hours.&lt;/p&gt;

&lt;p&gt;All Lambda functions that process the asynchronous events from Event Bridge have &lt;strong&gt;Dead Letter Queues&lt;/strong&gt; (DLQ) attached.  The event will be moved to the DLQ after three unsuccessful processing attempts by the Lambda function. We can then inspect the message later, fix any errors and replay the message if necessary.&lt;/p&gt;

&lt;p&gt;Basic &lt;strong&gt;chaos can be introduced&lt;/strong&gt; into the system to test its resilience. This is built into the code and can be toggled on the Lambda functions with environment variables. If &lt;em&gt;ENABLE_CHAOS&lt;/em&gt; is true, then the following environment variables are applied:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;INJECT_LATENCY&lt;/em&gt; – Number || false&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;INJECT_ERROR&lt;/em&gt; – String; two possible values: &lt;u&gt;error &lt;/u&gt;will throw a hard error and &lt;u&gt;handled &lt;/u&gt;will throw a soft error&lt;/li&gt;
&lt;/ul&gt;
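&lt;p&gt;A minimal sketch of how these toggles could be interpreted inside a handler (the exact implementation in the repo may differ):&lt;/p&gt;

```javascript
// Sketch of interpreting the chaos environment variables described above.
// The real code in the repo may differ; this only shows the toggle logic.
function chaosDecision(env) {
  if (env.ENABLE_CHAOS !== 'true') return { latencyMs: 0, error: null };
  const latencyMs = Number(env.INJECT_LATENCY) || 0;
  const error = env.INJECT_ERROR === 'error' || env.INJECT_ERROR === 'handled'
    ? env.INJECT_ERROR
    : null;
  return { latencyMs, error };
}

// Inside the handler:
// const chaos = chaosDecision(process.env);
// if (chaos.latencyMs) await new Promise((r) => setTimeout(r, chaos.latencyMs));
// if (chaos.error === 'error') throw new Error('chaos');            // hard error
// if (chaos.error === 'handled') return { ok: false, error: 'chaos' }; // soft error
```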

&lt;p&gt;A Lambda service failure can be simulated by setting Reserved Concurrency to 0. Event Bridge will then retry delivering the event for up to 24 hours.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion (TL;DR)
&lt;/h2&gt;

&lt;p&gt;We refactored a distributed monolith to a microservice architecture using Event Bridge and broke dependencies using BASE consistency. The code for this blog can be found here: &lt;a href="https://github.com/rehanvdm/MicroService" rel="noopener noreferrer"&gt;https://github.com/rehanvdm/MicroService&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>cdk</category>
      <category>microservices</category>
    </item>
    <item>
      <title>AWS Serverless: you might not need third party monitoring</title>
      <dc:creator>Rehan van der Merwe</dc:creator>
      <pubDate>Wed, 15 Jul 2020 08:15:00 +0000</pubDate>
      <link>https://dev.to/aws-heroes/aws-serverless-you-might-not-need-third-party-monitoring-2pmo</link>
      <guid>https://dev.to/aws-heroes/aws-serverless-you-might-not-need-third-party-monitoring-2pmo</guid>
      <description>&lt;p&gt;I hardly ever find myself reaching for third party monitoring services these days. I rather use the AWS native observability, monitoring and alerting services. The primary reasons being that I can use my favorite Infrastructure as Code (IaC) tool to define the infrastructure as well as the monitoring, observability and dashboards for every project in one place. I also only pay for what I use; there are no monthly subscriptions.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This blog is also available as a presentation. Reach out if you would like me to present it at an event. It consists of about 30% slides and 70% live demo.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this two-part series, we’ll first build a bad microservice system and add observability, monitoring and alerting. The second part will focus on refactoring the code and go into more details on the decisions made.&lt;/p&gt;

&lt;p&gt;Like most of my blogs, this one is also accompanied by code. I decided to create three microservices, each in their own repositories, with the fourth one used to reference all of them. The code is available on github: &lt;a href="https://github.com/rehanvdm/MicroService" rel="noopener noreferrer"&gt;https://github.com/rehanvdm/MicroService&lt;/a&gt;. These microservices were designed poorly for demo purposes and to explain the importance of certain points, like structured logging. Below are all three services:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/07/no_third_monit_01_SystemArchitecture.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2020%2F07%2Fno_third_monit_01_SystemArchitecture-1024x715.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Within each project you can find an &lt;strong&gt;OpenAPI&lt;/strong&gt; (&lt;a href="https://github.com/rehanvdm/MicroServicePerson/blob/master/part1/src/lambda/api/api-definition.yaml" rel="noopener noreferrer"&gt;part1/src/lambda/api/api-definition.yaml&lt;/a&gt;) file that defines the API definition for each service. &lt;strong&gt;AWS CDK&lt;/strong&gt; is used and they all follow the similar stock standard CDK project layout: Typescript for the CDK and &lt;strong&gt;ES6 JS&lt;/strong&gt; for the application code. NPM commands have been written to do deployments and it also contains &lt;strong&gt;end-to-end tests&lt;/strong&gt; using Mocha and Chai. In addition, each service contains a detailed README inside the /part1 path. Note that I only have a single Lambda for the API endpoint and do internal routing. Yes, I believe in a &lt;strong&gt;Lambalith for the API&lt;/strong&gt;!😊 and also prefer JSON POST over REST (more about this later).&lt;/p&gt;

&lt;p&gt;The client service stores clients and has basic &lt;em&gt;create-client&lt;/em&gt; and &lt;em&gt;find-client&lt;/em&gt; functionalities as well as an endpoint to increment the person count for a specific client. The person service also has basic &lt;em&gt;create-person&lt;/em&gt; and &lt;em&gt;find-person&lt;/em&gt; endpoints. When a person is created, it calls the common service which notifies me by email about the new person that was added using an SNS subscription. The common service first needs to do a lookup on the client service so that it can enrich the email. It also increments the counter on the client. Click on the image below to see the step-by-step path for creating a person:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/07/no_third_monit_02_SystemArchitecturePathFocus.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2020%2F07%2Fno_third_monit_02_SystemArchitecturePathFocus-1024x553.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In Part 2 we will focus on refactoring and decoupling this system. That brings me to the reason why the current system is poorly designed. I &lt;strong&gt;purposefully created a distributed monolith&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The create-person call is highly dependent on the common service and does not even know that the common service is dependent on the client service. As a result, the person service is also dragged down if either the common or the client service is down. Not to mention that it now has to wait for the completion of every step in the synchronous chain. This wastes money and increases the probability of hitting the API Gateway timeout of 29 seconds.&lt;/p&gt;

&lt;p&gt;Let’s first look at a few generic concepts that are referenced throughout the post. Then we will look at the AWS native services.&lt;/p&gt;

&lt;h2&gt;
  
  
  Structured logging, types of errors and metrics
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Errors
&lt;/h3&gt;

&lt;p&gt;&lt;u&gt;Hard Errors&lt;/u&gt; are infrastructure and runtime errors. You should always have alerts on these. Ex. timeouts, unexpected errors and runtime errors not caught by try-catch blocks.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Soft Errors&lt;/u&gt; are completely software-defined. This is when your infrastructure and services are working but the &lt;strong&gt;result was undesired&lt;/strong&gt;. An example would be that your API returned an HTTP status code 200 with a validation error message in the body.&lt;/p&gt;

&lt;h3&gt;
  
  
  Metrics
&lt;/h3&gt;

&lt;p&gt;&lt;u&gt;Business Metrics &lt;/u&gt;– Key performance indicators (KPIs) that you use to measure your application performance against. Ex. orders placed.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Customer Experience Metrics&lt;/u&gt; – Percentiles and perceived latencies that the user is experiencing. A typical scenario would be page load times. Another would be that even though your API is fast, the front-end needs to make 10 concurrent API calls when the application starts. The browser then queues these concurrent requests and the user waits at least two or three times longer.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;System Metrics&lt;/u&gt; – Application level metrics that indicate system health. Ex. number of API requests, queue length, etc.&lt;/p&gt;

&lt;h3&gt;
  
  
  Structured logging
&lt;/h3&gt;

&lt;p&gt;All microservices write logs in the format below. This is done by wrapping around the &lt;em&gt;console&lt;/em&gt; class and writing in JSON format. Levels will include all your basics, like info, log, debug, error, warning, with the only new one being audit.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/07/no_third_monit_10_StructuredLogging1.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2020%2F07%2Fno_third_monit_10_StructuredLogging1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A single audit record is written per Lambda execution and gives a summary of the result of that execution. The image below shows an audit record that contains the API path, run time, status code, reason and many more fields used in the Log Insight queries later on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/07/no_third_monit_10_StructuredLogging2.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2020%2F07%2Fno_third_monit_10_StructuredLogging2.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The image below shows an unsuccessful execution:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/07/no_third_monit_10_StructuredLogging3.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2020%2F07%2Fno_third_monit_10_StructuredLogging3.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that the runtime is not the one reported by Lambda. There is an environment variable on the Lambda itself that indicates the function timeout value. We then subtract &lt;code&gt;context.getRemainingTimeInMillis()&lt;/code&gt; from it to get a close estimate of the actual reported runtime.&lt;/p&gt;
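&lt;p&gt;A sketch of that calculation, assuming we set the timeout value ourselves as an environment variable at deploy time (Lambda does not expose it by default; the variable name here is made up):&lt;/p&gt;

```javascript
// Estimate how long the handler has been running without waiting for the
// platform's own report. TIMEOUT_MILLIS is assumed to be set on the function
// by our deploy scripts.
function estimateRuntimeMillis(context, timeoutMillis = Number(process.env.TIMEOUT_MILLIS)) {
  return timeoutMillis - context.getRemainingTimeInMillis();
}

// Example with a mocked Lambda context: a 30 000 ms timeout with 28 500 ms
// remaining means the handler has been running for ~1 500 ms.
const mockContext = { getRemainingTimeInMillis: () => 28500 };
console.log(estimateRuntimeMillis(mockContext, 30000)); // 1500
```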

&lt;p&gt;Let’s take a closer look at the AWS Native services that we will use.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS CloudWatch Logs
&lt;/h2&gt;

&lt;p&gt;Logs are crucial to any application. CloudWatch stores logs in the form of log groups and log streams. Each log group corresponds to a Lambda function, and each log stream contains the executions of that function. The real magic happens when you run Log Insights over your structured logs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/07/no_third_monit_03_CloudWatchLogs.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2020%2F07%2Fno_third_monit_03_CloudWatchLogs-1024x402.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS CloudWatch Metrics
&lt;/h2&gt;

&lt;p&gt;Metrics are best described as the logging of discrete data points for a system against time. These metrics can then be displayed on graphs, such as database CPU versus time or the types of API responses over time. They are at the heart of many services, like dashboards, alerts and auto scaling. If you write a log line in a specific format, called the &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format.html" rel="noopener noreferrer"&gt;Embedded Metric Format&lt;/a&gt;, CloudWatch automatically transforms it into a metric. Find the client libraries that help write this format &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format_Libraries.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
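&lt;p&gt;A hand-rolled sketch of an EMF log line is shown below; printing this JSON to stdout is enough for CloudWatch to pick it up. The namespace and dimension names are made up for the example — in practice you would use one of the client libraries linked above:&lt;/p&gt;

```javascript
// Build an Embedded Metric Format (EMF) log line by hand. The `_aws` block
// tells CloudWatch which root-level keys to extract as metric values.
function emfLine(namespace, serviceName, metricName, unit, value) {
  return JSON.stringify({
    _aws: {
      Timestamp: Date.now(),
      CloudWatchMetrics: [{
        Namespace: namespace,
        Dimensions: [['Service']],
        Metrics: [{ Name: metricName, Unit: unit }],
      }],
    },
    Service: serviceName,        // dimension value
    [metricName]: value,         // metric value, keyed by metric name
  });
}

console.log(emfLine('MicroService', 'client', 'ApiRequests', 'Count', 1));
```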

&lt;p&gt;Below shows the number of API calls summed over 1-minute intervals for the client API after the &lt;a href="https://artillery.io/" rel="noopener noreferrer"&gt;artillery.io&lt;/a&gt; load test was run.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/07/no_third_monit_04_CloudWatchMetric.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2020%2F07%2Fno_third_monit_04_CloudWatchMetric.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS CloudWatch Alarms
&lt;/h2&gt;

&lt;p&gt;Alarms perform an action when certain conditions on Metrics are met.  For example, CPU more than 50% for 3 minutes. Actions include emailing a person or sending the message to an SNS topic. This topic can then be subscribed to by other services. We subscribe to this topic with AWS Chatbot to deliver the notifications to a Slack channel.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/07/no_third_monit_05_CloudWatchAlarm.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2020%2F07%2Fno_third_monit_05_CloudWatchAlarm-1024x637.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is important to subscribe to both the ALERT and the OKAY actions. Otherwise you will never know if your system stabilized after an alert unless you log into the console and inspect the system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/07/no_third_monit_05_CloudWatchAlarm2.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2020%2F07%2Fno_third_monit_05_CloudWatchAlarm2.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Composite Alarms are great when you want to string together some sort of logic to give a higher order of alarm/event. For example, you can create an alarm if the database CPU is more than 50% and the API hit count is less than 1000 requests per minute. This will set off an event/alarm informing you that your database might be crunching away at a difficult query and that it might be a result of a person executing a heavy analytical query rather than your application.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/07/no_third_monit_05_CloudWatchAlarm4-1024x259.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2020%2F07%2Fno_third_monit_05_CloudWatchAlarm4-1024x259.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/07/no_third_monit_05_CloudWatchAlarm3.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2020%2F07%2Fno_third_monit_05_CloudWatchAlarm3-1024x593.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Anomaly detection uses machine learning (random cut forest) to train on up to two weeks of metric data. This creates upper and lower bands around your metric which are defined by standard deviations. Alerts can then be created whenever your metric is outside or within these bands. They are great at monitoring predictable periodic metrics, like API traffic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/07/no_third_monit_05_CloudWatchAlarm5.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2020%2F07%2Fno_third_monit_05_CloudWatchAlarm5.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS CloudWatch Metric filters
&lt;/h2&gt;

&lt;p&gt;CloudWatch Metric filter will &lt;strong&gt;search the Logs for patterns and publish the search results as Metrics&lt;/strong&gt;. For example, we can search for the word ‘retry’ in the logs and then publish it as a metric that we can view on a dashboard or &lt;strong&gt;create an alarm from&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is &lt;strong&gt;how we count soft errors&lt;/strong&gt;, which are errors that don’t crash the Lambda but return an undesired result to the caller. In our example, all the API Lambdas always return HTTP status code 200. Within the body of the response is our request status code: 2000 – Success, 5000 – Unexpected, 5001 – Handled, 5002 – Validation, 3001 – Auth. Structured logging always writes the audit record in a specific format. We use this to create metrics and then alarms based on those metrics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/07/no_third_monit_05_CloudWatchAlarm6.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2020%2F07%2Fno_third_monit_05_CloudWatchAlarm6-1024x730.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS CloudWatch Dashboards
&lt;/h2&gt;

&lt;p&gt;Dashboards are great to get an overview of the operational status of your system. &lt;strong&gt;In the example services, each one also deploys their own dashboard&lt;/strong&gt; to monitor basic metrics of the Lambda, API Gateway and DynamoDB table. Everything is defined as IaC using the &lt;strong&gt;AWS CDK&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2020%2F07%2Fno_third_monit_06_CloudWatchDashboards1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2020%2F07%2Fno_third_monit_06_CloudWatchDashboards1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A manual dashboard can also be created to combine all the services onto one dashboard. I tend to make it less granular, displaying just the overall status of each microservice and other useful information.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/07/no_third_monit_06_CloudWatchDashboards2.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2020%2F07%2Fno_third_monit_06_CloudWatchDashboards2-1024x771.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The dashboard above contains three CloudWatch Log Insight query widgets. We can even write basic markup to create links/buttons, as seen in the bottom right corner.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS CloudWatch Log Insights
&lt;/h2&gt;

&lt;p&gt;Log Insights enables us to do SQL-like querying over one or more Log Groups. This tool is &lt;strong&gt;extremely powerful&lt;/strong&gt; for getting insights out of your structured logs. It also has basic grouping functionality that can graph results. For example, we use a single query to &lt;strong&gt;query the audit records of all three microservices&lt;/strong&gt; over the last 2 weeks (see below).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/07/no_third_monit_07_CloudWatchLogInsights1.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2020%2F07%2Fno_third_monit_07_CloudWatchLogInsights1-1024x634.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/07/no_third_monit_07_CloudWatchLogInsights4.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2020%2F07%2Fno_third_monit_07_CloudWatchLogInsights4.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can also compare the latencies of all the microservice API calls and visually graph it in a bar chart.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/07/no_third_monit_07_CloudWatchLogInsights2.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2020%2F07%2Fno_third_monit_07_CloudWatchLogInsights2-1024x762.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/07/no_third_monit_07_CloudWatchLogInsights5.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2020%2F07%2Fno_third_monit_07_CloudWatchLogInsights5.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Lastly, I want to highlight the &lt;strong&gt;most impactful API calls&lt;/strong&gt;. This takes the number of calls and multiplies it by the 95th-percentile latency. It gives a quick indication of which API calls, if optimized, will have the biggest impact on the client calling the API.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/07/no_third_monit_07_CloudWatchLogInsights3.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2020%2F07%2Fno_third_monit_07_CloudWatchLogInsights3-1024x813.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are many more queries such as these that you can do to &lt;strong&gt;help identify if caching will work on a specific API endpoint&lt;/strong&gt; and also which type, client or server side, will work best. Other queries are documented in the GitHub README file here: &lt;a href="https://github.com/rehanvdm/MicroService" rel="noopener noreferrer"&gt;https://github.com/rehanvdm/MicroService&lt;/a&gt;. A quick summary of what we can find:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All Audit Records&lt;/li&gt;
&lt;li&gt;All Audit Records for a specific TraceID&lt;/li&gt;
&lt;li&gt;All Audit Records for a specific User&lt;/li&gt;
&lt;li&gt;All Hard errors&lt;/li&gt;
&lt;li&gt;All Soft errors&lt;/li&gt;
&lt;li&gt;All log lines for TraceID&lt;/li&gt;
&lt;li&gt;Most expensive API calls&lt;/li&gt;
&lt;li&gt;Most impactful API calls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These queries would not be possible without structured logging. When APIs call each other, they also send the TraceID/CorrelationID downstream. This ID is then used in the logs by the service receiving the request. It enables us to &lt;strong&gt;query a single TraceID and find the complete execution path over all three services and their logs&lt;/strong&gt; , saving a ton of time when you’re debugging.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS X-Ray
&lt;/h2&gt;

&lt;p&gt;Distributed Tracing is not something new, but it is a must-have for any distributed, microservice-based application. X-Ray allows you to easily see interactions between services and identify problems at a glance. Segments are represented by circles, with the colour indicating the status of that segment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/07/no_third_monit_08_Xray1.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2020%2F07%2Fno_third_monit_08_Xray1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each service that runs the X-Ray agent sends trace data to the X-Ray service. This agent is already installed on the container that runs the Lambda function and uses less than 3% of your memory or 16MB, whichever is greater. Tracing is made possible by passing a TraceID downstream to all the services it calls. Each service that has X-Ray enabled uses the received TraceID and continues to attach segments to the trace. The X-Ray service collects and orders all these traces. The traces can be viewed in the Service Map (image above) or as Traces (image below).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/07/no_third_monit_08_Xray2.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2020%2F07%2Fno_third_monit_08_Xray2-1024x644.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The X-Ray service isn’t perfect. For some services it requires manual wrapping and passing of the TraceID in order to get a fully traced execution. One of these services is SQS and it is documented &lt;a href="https://github.com/aws/aws-xray-sdk-node/issues/208" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;X-Ray also has &lt;strong&gt;sampling options&lt;/strong&gt; to not trace every request. This is helpful if you have high throughput applications and tracing every request would just be too expensive. By default, the X-Ray SDK records the first request each second, and five percent of any additional requests. This sampling rate can be adjusted by creating rules on the console.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Chatbot
&lt;/h2&gt;

&lt;p&gt;AWS Chatbot is an interactive bot that makes it easy to monitor and interact with your AWS resources from within your Slack channels and Amazon Chime chatrooms.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/07/no_third_monit_09_ChatBot1.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2020%2F07%2Fno_third_monit_09_ChatBot1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We only use it for &lt;strong&gt;sending alarms to a Slack channel&lt;/strong&gt; because it is such an eyesore to look at the alarm emails. AWS Chatbot can do a lot more though; you can directly run a Lambda function or log a support ticket when interacting with the bot.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/07/no_third_monit_09_ChatBot2.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2020%2F07%2Fno_third_monit_09_ChatBot2.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick word about the API Lambda
&lt;/h2&gt;

&lt;p&gt;As mentioned before, the API function is a &lt;strong&gt;Lambdalith&lt;/strong&gt;. Just to summarize the reasoning behind this madness:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;By having 1 Lambda, I save $$ by setting provisioned concurrency on 1 Lambda as opposed to setting it on each endpoint.&lt;/li&gt;
&lt;li&gt;Fewer downstream calls are made, like fetching secrets.&lt;/li&gt;
&lt;li&gt;Fewer cold starts occur.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A few points about the overall structure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All &lt;strong&gt;API calls are proxied to a single Lambda&lt;/strong&gt; function. Then the Lambda ‘routes’ to a certain file within its code base for that endpoint logic.&lt;/li&gt;
&lt;li&gt;The common directory has the &lt;em&gt;data_schemas&lt;/em&gt; which are 1-to-1 mappings of how data is stored in DynamoDB. The &lt;em&gt;v1&lt;/em&gt; directory that handles the endpoints does all the business logic.&lt;/li&gt;
&lt;li&gt;There is an OpenAPI 3 doc to describe the API.&lt;/li&gt;
&lt;li&gt;There is &lt;strong&gt;structured logging&lt;/strong&gt; done by a basic helper class that wraps the console.log function.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistent error handling&lt;/strong&gt; is forced.&lt;/li&gt;
&lt;li&gt;There is &lt;strong&gt;one audit record for every execution&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Part 2 – coming soon
&lt;/h2&gt;

&lt;p&gt;As mentioned above, Part 2 will fix the coupling between the microservices. We will look at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creating BASE (&lt;strong&gt;B&lt;/strong&gt;asically &lt;strong&gt;A&lt;/strong&gt;vailable, &lt;strong&gt;S&lt;/strong&gt;oft state, &lt;strong&gt;E&lt;/strong&gt;ventually consistent) consistency over the whole system.&lt;/li&gt;
&lt;li&gt;Decoupling the services with EventBridge.&lt;/li&gt;
&lt;li&gt;Adding a Queue between the Common and Client API to increment the counter.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Basically, the common service will listen to the client and person create events. When a client is created, the common service stores a local copy of that client in its own database, so that it does not have to do the lookup on an external service but can rather use its own service data. The write from the common service to the client service to increment the person count can also be decoupled using a queue.&lt;/p&gt;

&lt;p&gt;This changes our distributed monolith into a microservice-based architecture; it looks something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/07/no_third_monit_11_Part2.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2020%2F07%2Fno_third_monit_11_Part2-1024x525.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion (TL;DR)
&lt;/h2&gt;

&lt;p&gt;In this post, we focused on the strengths and capabilities of AWS native services for monitoring distributed applications. In this first instalment of the series, we created a distributed monolith consisting of three services using the AWS CDK. The code can be found here: &lt;a href="https://github.com/rehanvdm/MicroService" rel="noopener noreferrer"&gt;https://github.com/rehanvdm/MicroService&lt;/a&gt;. The second part will focus on refactoring the monolith into decoupled microservices.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This post has been edited by&lt;/em&gt; &lt;a href="https://www.nuanced.co.za/" rel="noopener noreferrer"&gt;&lt;em&gt;Nuance Editing &amp;amp; Writing&lt;/em&gt;&lt;/a&gt;&lt;em&gt;. Check them out for all your editing and writing needs.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>cdk</category>
      <category>lambda</category>
    </item>
    <item>
      <title>An unexpected journey with Lambda &amp; OracleDB</title>
      <dc:creator>Rehan van der Merwe</dc:creator>
      <pubDate>Tue, 19 May 2020 05:58:44 +0000</pubDate>
      <link>https://dev.to/rehanvdm/an-unexpected-journey-with-lambda-oracledb-fb7</link>
      <guid>https://dev.to/rehanvdm/an-unexpected-journey-with-lambda-oracledb-fb7</guid>
      <description>&lt;p&gt;This blog serves to document the unexpected struggles of connecting to an Oracle database with a Lambda NodeJS function. This turned out to be more complex than a single package installation. We will create a Lambda layer for the NodeJS Lambda function to consume; this consists of the Oracle Instant Client Basic Lite v19.x libs + the libaio.so.1 file. While we (developers) will manually install the Instant Client libraries as dev dependencies to locally run and test the application.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This article can also be found on my &lt;a href="https://www.rehanvdm.com/serverless/an-unexpected-journey-with-lambda-oracledb/index.html"&gt;blog&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We will be using the AWS CDK for the infrastructure definitions and deployments. As a bonus we will look at the cold start times and discover how a higher memory setting can reduce connection times.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you need to know
&lt;/h2&gt;

&lt;p&gt;There are packages out there, such as &lt;a href="https://github.com/nalbion/node-oracledb-for-lambda"&gt;this one&lt;/a&gt;, that work, but they are old and unmaintained. The last build was for Node 8, which means the oracledb package is 2 major versions behind; all the variations of that repo and its tutorials use v12 of the Instant Client libraries, while the current version is v19. That repo will also not work for local testing on all operating systems, which will be the cause of many headaches if you’re doing team development.&lt;/p&gt;

&lt;p&gt;Thus we took the plunge and decided to do this properly. We are using the &lt;a href="https://www.npmjs.com/package/oracledb"&gt;official oracledb package&lt;/a&gt; from Oracle themselves. There is not a lot in the npm documentation, so we jump over to the &lt;a href="https://oracle.github.io/node-oracledb/INSTALL.html#quickstart"&gt;quick start guide&lt;/a&gt; and immediately run into an enormous wall of text.&lt;/p&gt;

&lt;p&gt;It turns out that to talk to an OracleDB you need the &lt;strong&gt;operating system specific&lt;/strong&gt; Oracle Instant Client Basic libraries. After giving the docs a quick (it wasn’t quick) scan you find the download page, hit download and extract. We ran into the first problem: size. These libraries are about 230MB, so if you add the AWS SDK package, which is approximately 50MB, you won’t be able to deploy your Lambda due to the &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html"&gt;hard size limit of 250MB extracted&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--km3_ganB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/05/lambda_oracle_01-1-1024x141.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--km3_ganB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/05/lambda_oracle_01-1-1024x141.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After further investigation we find the Oracle Instant Client Basic &lt;u&gt;&lt;strong&gt;Lite&lt;/strong&gt;&lt;/u&gt; libraries, which after extraction are about 110MB. This is better, but still huge for a Lambda.&lt;/p&gt;

&lt;p&gt;The second ‘gotcha’ buried away in the “quick” start guide is that you will need an extra package: libaio, or libaio.so.1. This needs to be downloaded and placed into the Instant Client Basic Lite directory along with the other .so files.&lt;/p&gt;

&lt;p&gt;The Instant Client libraries + the libaio file need to be in your Lambda root directory under /lib. Alternatively, you can place them in /opt/lib, as the node oracledb package will still detect them without the need to change any environment variables. The specific environment variable in question is &lt;code&gt;LD_LIBRARY_PATH&lt;/code&gt;, &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/configuration-envvars.html"&gt;which already defaults to&lt;/a&gt; &lt;code&gt;/lib64:/usr/lib64:$LAMBDA_RUNTIME_DIR:$LAMBDA_RUNTIME_DIR/lib:$LAMBDA_TASK_ROOT:$LAMBDA_TASK_ROOT/lib:/opt/lib&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get it working on AWS
&lt;/h2&gt;

&lt;p&gt;The full source code can be found here -&amp;gt; &lt;a href="https://github.com/rehanvdm/lambda-oracle-instant-client-blog"&gt;https://github.com/rehanvdm/lambda-oracle-instant-client-blog&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Instant Client libraries can live either in the Lambda package or in a Lambda layer. By removing these approximately 110MB libraries from the Lambda package, we benefit from faster deployments and less bandwidth usage, and we avoid eating into the 75GB combined Lambda storage limit per account. &lt;strong&gt;Using a Lambda layer is a no-brainer.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We are using AWS CDK and can create this layer from lines 10 to 15. Consuming this layer is just as easy and can be seen on line 33.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/05/lambda_oracle_02_cdk_code.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sw3LcCJr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/05/lambda_oracle_02_cdk_code-1024x894.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The instant client libraries are in the /lib path and includes the libaio.so file as described above. The directory structure for the layer then looks like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1QdMk47E--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/05/lambda_oracle_03_contents_of_layer.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1QdMk47E--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/05/lambda_oracle_03_contents_of_layer.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Below is our quick test Lambda function that creates a pool with 1 connection to query the system date and time from the database. The DB credentials are passed in through environment variables (please don’t do this in production, this is just a quick experiment).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/05/lambda_oracle_04_lambda_test_code.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cGoNOBNY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/05/lambda_oracle_04_lambda_test_code-635x1024.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To deploy, we can simply use the following command, which compiles our CDK TypeScript to JS and then deploys to our account using the .aws/credentials profile called rehan.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# tsc &amp;amp;&amp;amp; cdk deploy --profile rehan
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  Get it working locally on multiple operating systems
&lt;/h2&gt;

&lt;p&gt;Operating system specific files are usually a pain when a team collaborates on a project. The oracledb NodeJS package is treated like any other NodeJS package; it is just the OS dependencies that we need to install as developers to run and test our application locally.&lt;/p&gt;

&lt;h4&gt;
  
  
  Windows:
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Download the Instant Client Basic Lite v19.x libs &lt;a href="https://www.oracle.com/database/technologies/instant-client/winx64-64-downloads.html"&gt;here&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Extract them to C:\oracle\instantclient_19_6&lt;/li&gt;
&lt;li&gt;Add this new directory to your PATH variable. Restart your IDE or just the computer.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Linux:
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Download the Instant Client Basic Lite v19.x libs &lt;a href="https://www.oracle.com/database/technologies/instant-client/linux-x86-64-downloads.html"&gt;here&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Extract to any path, say &lt;code&gt;/opt/oracle/instantclient_19_6&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Make sure that path is in your &lt;code&gt;LD_LIBRARY_PATH&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Download libaio (libaio.so.1), which can be done with &lt;em&gt;sudo yum install libaio -y&lt;/em&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Other:
&lt;/h4&gt;

&lt;p&gt;Please refer to the “quick” start guide &lt;a href="https://oracle.github.io/node-oracledb/INSTALL.html#-3-node-oracledb-installation-instructions"&gt;here&lt;/a&gt; as I haven’t done the installs on all platforms, but the gist of it should be the same.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running the application
&lt;/h2&gt;

&lt;p&gt;We can test locally by setting the environment variables at the top of the test file &lt;em&gt;/tests/lambda/oracle-test/test-connection.js&lt;/em&gt; and then running the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# node ./node\_modules/mocha/bin/mocha --ui bdd./tests/lambda/oracle-test/test-connection.js --grep "^Test Success TestConnect$"  
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;If everything was set up successfully, you should see the DB system time.&lt;/p&gt;
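&lt;p&gt;For reference, the heart of that test is just a query for the DB system time. A minimal sketch of the shape (illustrative names, with a stub standing in for a real oracledb connection; the actual code lives in the repo):&lt;/p&gt;

```javascript
// Hypothetical sketch of the system-time query; a stub object stands in
// for a real oracledb connection so no database is needed to run this.
async function getDbSystemTime(connection) {
  // oracledb's execute() resolves with { rows: [...] } for a SELECT
  const result = await connection.execute("SELECT SYSDATE FROM DUAL");
  return result.rows[0][0];
}

// Stub standing in for a connection obtained from a real connection pool
const stubConnection = {
  execute: async () => ({ rows: [[new Date("2020-05-17T13:47:10Z")]] }),
};

getDbSystemTime(stubConnection).then((t) => console.log(t.toISOString()));
```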

&lt;p&gt;To test on AWS, navigate to the Lambda console and create a test with an empty JSON event. Click test and you should see logs similar to these:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt; 2020-05-17T13:47:06.252Z INFO Pool
 2020-05-17T13:47:10.436Z INFO Pool created
 2020-05-17T13:47:10.436Z INFO Connection
 2020-05-17T13:47:10.817Z INFO Connection created
 2020-05-17T13:47:10.836Z INFO Query
 2020-05-17T13:47:10.897Z INFO Query returned
 2020-05-17T13:47:10.917Z INFO 2020-05-17T13:47:10.000Z
 2020-05-17T13:47:10.936Z INFO Connection
 2020-05-17T13:47:10.936Z INFO Connection closed
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  Going deeper down the rabbit hole
&lt;/h2&gt;

&lt;p&gt;There were some initial concerns surrounding the large size of the Lambda package and how that will influence the cold start times. Below are some X-Ray screenshots and console logs for different memory settings.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;128MB took a TOTAL time of ~5,400ms&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/05/lambda_oracle_06_lambda_128MB.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aS3fAmLj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/05/lambda_oracle_06_lambda_128MB-1024x597.png" alt=""&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt; 2020-05-17T13:47: **06.252** Z INFO Pool
 2020-05-17T13:47: **10.436** Z INFO Pool created
 2020-05-17T13:47:10.436Z INFO Connection
 2020-05-17T13:47:10.817Z INFO Connection created
 2020-05-17T13:47:10.836Z INFO Query
 2020-05-17T13:47:10.897Z INFO Query returned
 2020-05-17T13:47:10.917Z INFO 2020-05-17T13:47:10.000Z
 2020-05-17T13:47:10.936Z INFO Connection
 2020-05-17T13:47:10.936Z INFO Connection closed
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;strong&gt;256MB took a TOTAL time of ~3,600ms&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Logs and X-Ray screenshot omitted.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;512MB took a TOTAL time of ~2,300ms&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Logs and X-Ray screenshot omitted.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;1024MB took a TOTAL time of ~1,500ms&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2020/05/lambda_oracle_07_lambda_1024MB.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xyvNWhDR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2020/05/lambda_oracle_07_lambda_1024MB-1024x604.png" alt=""&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt; 2020-05-17T13:52: **52.547** Z INFO Pool
 2020-05-17T13:52: **53.037** Z INFO Pool created
 2020-05-17T13:52:53.037Z INFO Connection
 2020-05-17T13:52:53.091Z INFO Connection created
 2020-05-17T13:52:53.091Z INFO Query
 2020-05-17T13:52:53.097Z INFO Query returned
 2020-05-17T13:52:53.098Z INFO 2020-05-17T13:52:53.000Z
 2020-05-17T13:52:53.098Z INFO Connection
 2020-05-17T13:52:53.099Z INFO Connection closed
 2020-05-17T13:52:53.099Z INFO Pool
 2020-05-17T13:52:53.101Z INFO Pool closed
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Yan Cui wrote an excellent post on &lt;a href="https://theburningmonk.com/2019/03/just-how-expensive-is-the-full-aws-sdk/"&gt;how expensive the full aws sdk is&lt;/a&gt;, which provides a good baseline of what to expect. If you require the full AWS SDK you are looking at around 245ms of cold start. Since we only require two packages, the full aws-sdk and oracledb, we can calculate the time oracledb adds: 482ms (from X-Ray) - 245ms = 237ms. It sounds bad, but it is not as bad as originally thought; we can live with this.&lt;/p&gt;

&lt;p&gt;The real surprise was that the Lambda finished at different times for different memory settings even though &lt;strong&gt;the initialization times stay reasonably constant&lt;/strong&gt;. The 1024MB setting is approximately 4 times faster than 128MB, which means that the memory setting correlates linearly with how fast an extremely simple query runs, while having no effect on how long it takes to initialize the big packages.&lt;/p&gt;

&lt;p&gt;To find the pain point in the test application, we added good old-fashioned console.log statements. The logs above show that the connection pool creation is the culprit. There also seems to be a bug in the oracledb package: even if the poolMin property is set to 0, it still opens at least 1 connection. This is confirmed by setting poolMin to 1 and observing the exact same Lambda execution times and logs.&lt;/p&gt;
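&lt;p&gt;The logging itself is nothing fancy. A sketch of the kind of helper that produces the timestamped INFO lines shown above (a hypothetical wrapper; plain console.log calls work just as well):&lt;/p&gt;

```javascript
// Hypothetical helper producing "ISO-timestamp INFO message" lines, used
// to bracket each step (Pool, Connection, Query) and eyeball the timings.
function logInfo(message) {
  const line = `${new Date().toISOString()} INFO ${message}`;
  console.log(line);
  return line;
}

logInfo("Pool");
// ... await the oracledb pool creation here ...
logInfo("Pool created");
```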

&lt;p&gt;This &lt;strong&gt;correlation between memory setting and connection time&lt;/strong&gt; seems to stabilize around the 1024MB memory setting. Observing the logs, we can see that it takes &lt;strong&gt;~4,200ms on 128MB&lt;/strong&gt; and &lt;strong&gt;~500ms on 1024MB&lt;/strong&gt; to open a single connection to the OracleDB, which is about 8 times faster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;p&gt;If you get an error like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"DPI-1047: Cannot locate a 64-bit Oracle Clientlibrary: \"/var/task/lib/libclntsh.so: file too short\"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;It means one of two things: either your path variable to the instant client libs is incorrect and they cannot be detected, OR the libaio file that needs to be manually added is missing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion (TL;DR)
&lt;/h2&gt;

&lt;p&gt;We created a Lambda layer that hosts the Oracle Instant Client Basic Lite libraries and included the libaio.so file as well. For local development, each developer needs to manually install these operating system specific libraries on their machine. Watch out for the initial connection to the database, which seems to be tied to the amount of memory you specify for the function, up until about 1024MB.&lt;/p&gt;

&lt;p&gt;If you told me a year ago that I would be connecting to an OracleDB using a Lambda function, I would most probably have laughed at you. But hey, here we are. Joke's on me.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>database</category>
      <category>serverless</category>
      <category>lambda</category>
    </item>
    <item>
      <title>DynamoDB Importer</title>
      <dc:creator>Rehan van der Merwe</dc:creator>
      <pubDate>Sun, 10 Nov 2019 20:00:09 +0000</pubDate>
      <link>https://dev.to/rehanvdm/dynamodb-importer-23k3</link>
      <guid>https://dev.to/rehanvdm/dynamodb-importer-23k3</guid>
      <description>&lt;p&gt;This blog will demonstrate the high throughput rate that DynamoDB can handle by writing &lt;strong&gt;1 million records in 60 seconds with a single Lambda&lt;/strong&gt; , that is approximately &lt;strong&gt;17k writes per second&lt;/strong&gt;. This is all done with less than 250 lines of code and less than 70 lines of CloudFormation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This article can also be found on my &lt;a href="https://www.rehanvdm.com/serverless/dynamodb-importer/index.html"&gt;blog&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We will also show how to reach &lt;strong&gt;40k writes per second (2.4 million per minute)&lt;/strong&gt; by running a few of the importer Lambdas concurrently to observe the DynamoDB burst capacity in action.&lt;/p&gt;

&lt;h2&gt;
  
  
  Knowing the Limits
&lt;/h2&gt;

&lt;p&gt;It is always nice to talk about the limits and capabilities of DynamoDB, but few people get to actually test them and walk the walk. This blog &lt;strong&gt;originated from&lt;/strong&gt; &lt;a href="https://dev.to/djviolin/how-to-import-a-big-delimited-datatable-into-aws-dynamodb-without-opening-your-piggy-bank-ncb"&gt;&lt;strong&gt;this&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;dev.to&lt;/strong&gt; &lt;strong&gt;discussion&lt;/strong&gt; in the comments. It will also hopefully be one of many DynamoDB blogs to come.&lt;/p&gt;

&lt;p&gt;DynamoDB is an Online Transactional Processing (OLTP) database that is built for massive scale. It takes a different type of mindset to develop for NoSQL and particularly DynamoDB, working with and around limitations but when you hit that sweet spot, the sky is the limit. This post will &lt;strong&gt;test some of those limits.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The fastest write throughput that I have heard of was in this &lt;a href="https://aws.amazon.com/blogs/database/amazon-dynamodb-auto-scaling-performance-and-cost-optimization-at-any-scale/"&gt;AWS blog post&lt;/a&gt; where throughput was pushed to &lt;strong&gt;1.1 million records per second!&lt;/strong&gt; We won’t be going that high, as we would need to contact AWS for a quota increase; for now we just want to demonstrate the ability and show some code. Enough chit chat, let’s get to the good stuff.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pseudo code
&lt;/h2&gt;

&lt;p&gt;This app consists of a NodeJS Lambda function that streams an S3 file and then imports it into DynamoDB using the Batch API.&lt;/p&gt;

&lt;p&gt;What’s happening behind the scenes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Streaming the file from S3&lt;/strong&gt; allows us to read the file in chunks and not fill the memory. It still counts as a single S3 read operation with the added benefit of not storing the full file in memory.&lt;/li&gt;
&lt;li&gt;The data can be generated by running a script in /data-generator/generate.js. It will output a CSV file of 3 million lines of person records with a size of about 250MB.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rows are read until 25 records are accumulated&lt;/strong&gt; ; these are then &lt;strong&gt;used to create a DynamoDB batch write promise&lt;/strong&gt;.  X amount of DynamoDB batch write &lt;strong&gt;promises are stored in an array&lt;/strong&gt; and will be &lt;strong&gt;executed in parallel but limited to only Y concurrent executions.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DynamoDB API Retries are set to 0 on the SDK client&lt;/strong&gt;. This is so that we can &lt;strong&gt;handle partial batch throttles&lt;/strong&gt;. This is done by adding the unprocessed items returned from the batch write call into an array and then calling the same batch write function recursively. This retry mechanism is hard coded to stop at 15 calls and to have exponential back off with a 50ms jitter between retries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HTTP Keep Alive on the AWS SDK is turned on&lt;/strong&gt; with the environment variable: AWS_NODEJS_CONNECTION_REUSE_ENABLED = 1. DynamoDB API operations are usually short lived and the latency to open the TCP connection is greater than the actual API call. Check out Yan Cui’s post &lt;a href="https://theburningmonk.com/2019/02/lambda-optimization-tip-enable-http-keep-alive/"&gt;here.&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;
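&lt;p&gt;Point 4 above can be sketched as follows. This is a simplified, hypothetical version of the repo's retry function; batchWrite here is any function that resolves with an UnprocessedItems array, standing in for the DynamoDB batch write call:&lt;/p&gt;

```javascript
// Sketch of the recursive retry: re-submit only the unprocessed items,
// stop after 15 attempts, exponential backoff plus up to 50ms of jitter.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function batchWriteWithRetry(batchWrite, items, attempt = 1) {
  const { UnprocessedItems } = await batchWrite(items);
  if (!UnprocessedItems || UnprocessedItems.length === 0) return true; // all written
  if (attempt >= 15) return false; // give up and surface the failure

  await sleep(2 ** attempt + Math.random() * 50); // back off before retrying
  return batchWriteWithRetry(batchWrite, UnprocessedItems, attempt + 1);
}
```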

&lt;p&gt;For example; if &lt;strong&gt;X&lt;/strong&gt; (batch write promises) &lt;strong&gt;= 40&lt;/strong&gt; and &lt;strong&gt;Y&lt;/strong&gt; (the parallel execution limit) &lt;strong&gt;= 20&lt;/strong&gt;, it means that the stream will be read until it has 25*40 = 1000 records and then execute 20 batch write operations at a time until all 40 are complete. Then it will repeat by reading another 40 batches and so on. The &lt;strong&gt;S3 stream pausing&lt;/strong&gt; gives the DynamoDB rate limiting a bit of time to recover, compared to reading ALL the data at once and then just hammering the Dynamo API nonstop. The pausing &lt;strong&gt;gives it a bit of a breather.&lt;/strong&gt;&lt;/p&gt;
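&lt;p&gt;The X/Y scheme can be sketched as below (hypothetical helper names; runBatch stands in for the DynamoDB batch write promise):&lt;/p&gt;

```javascript
// Split records into DynamoDB-sized batches of 25, then run the batch
// promises Y at a time: each group of Y runs in parallel, and the next
// group only starts once the current group has settled.
function chunk(records, size = 25) {
  const batches = [];
  for (let i = 0; i < records.length; i += size) batches.push(records.slice(i, i + size));
  return batches;
}

async function runWithConcurrency(batches, runBatch, limit = 20) {
  for (let i = 0; i < batches.length; i += limit) {
    await Promise.all(batches.slice(i, i + limit).map(runBatch));
  }
}
```

&lt;p&gt;With X = 40 and Y = 20 this reads 1000 records, fires two groups of 20 batch writes, and only then lets the stream resume.&lt;/p&gt;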

&lt;p&gt;The rest is just boilerplate to log the throughput, provide arguments through environment variables, add comments and handle permissions. The actual code is far shorter with this removed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code walk through
&lt;/h2&gt;

&lt;p&gt;The full source code can be found here -&amp;gt; &lt;a href="https://github.com/rehanvdm/DynamoDBImporter"&gt; https://github.com/rehanvdm/DynamoDBImporter.&lt;/a&gt; I am only going to highlight the few pieces that are of importance.&lt;/p&gt;

&lt;p&gt;To prevent a HOT partition while writing to Dynamo, a UUID v4 is used as the Partition Key (PK) with no Sort Key (SK).&lt;/p&gt;

&lt;p&gt;A single data record/row is kept small and fits within 1KB so that 1 record equals 1 WCU. This makes the tests and calculations easier. &lt;a href="https://zaccharles.github.io/dynamodb-calculator/"&gt;This is a nice tool&lt;/a&gt; that gives you the size of a single record as seen below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--J0Z6eV_4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2019/11/dynamo_importer_1_record_size.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--J0Z6eV_4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2019/11/dynamo_importer_1_record_size.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Below is the batch writing function as explained in the pseudo code, with manual retries to handle partial batch failures.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2019/11/dynamo_importer_2_batch_writing.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tVNwSCLp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2019/11/dynamo_importer_2_batch_writing-885x1024.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then the streaming function that reads from S3 and builds the array of batches.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2019/11/dynamo_importer_3_s3_Streaming.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CmNn2UKr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2019/11/dynamo_importer_3_s3_Streaming.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The proof is in the pudding
&lt;/h2&gt;

&lt;p&gt;A single Lambda running at full power (Memory = 3GB) can write 1 million records into a DynamoDB table (first smaller spike on the graph). Running a few concurrent Lambdas, we can test the burst capacity of the table (right side, larger spike).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2019/11/dynamo_importer_4_dynamo_cap.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IFE_rFfL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2019/11/dynamo_importer_4_dynamo_cap-1024x519.png" alt=""&gt;&lt;/a&gt; &lt;br&gt; From the DynamoDB Metric page we can observe the WCU per second &lt;/p&gt;

&lt;p&gt;The CloudWatch metrics below shows total capacity consumed for the duration of the test per minute.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2019/11/dynamo_importer_5_total_cap_consumed.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qcD4qi9b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2019/11/dynamo_importer_5_total_cap_consumed-1024x335.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What is amazing is that the Batch Write API &lt;strong&gt;latency stayed below 10ms&lt;/strong&gt; for all of the writes. There were some throttles, but the program handled and retried them correctly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2019/11/dynamo_importer_6_put_latency.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mnRZkl8P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2019/11/dynamo_importer_6_put_latency-1024x518.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Things to keep in mind
&lt;/h2&gt;

&lt;p&gt;A few things that I did/noticed while coding and doing the experiment.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The first time the table's WCUs were scaled to 40k, it took about 1 hour to provision; I hope I didn’t get charged for that hour. After that, switching between a lower capacity and back to 40k took about 1 minute.&lt;/li&gt;
&lt;li&gt;The AWS SDK TCP keep alive makes a huge difference.&lt;/li&gt;
&lt;li&gt;Provisioned capacity billing mode can go from zero to full, almost instantaneously. On demand takes a while to warm up, Yan Cui wrote a great post on this behavior &lt;a href="https://theburningmonk.com/2019/03/understanding-the-scaling-behaviour-of-dynamodb-ondemand-tables/"&gt;here&lt;/a&gt;. There is a “cheat” to get a warm (40k) table right from the get-go with on demand pricing. You can set your table to provisioned 40k WCU and then after it has been provisioned, change back to on demand billing.&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>aws</category>
      <category>database</category>
      <category>serverless</category>
      <category>dynamodb</category>
    </item>
    <item>
      <title>3 Ways to Autoscale on AWS</title>
      <dc:creator>Rehan van der Merwe</dc:creator>
      <pubDate>Tue, 24 Sep 2019 19:17:08 +0000</pubDate>
      <link>https://dev.to/rehanvdm/3-ways-to-autoscale-on-aws-5ff4</link>
      <guid>https://dev.to/rehanvdm/3-ways-to-autoscale-on-aws-5ff4</guid>
      <description>&lt;p&gt;In this article, we will be looking at three different methods of Autoscaling applications. We’ll also try to leverage AWS manged services as much as possible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is an article written for &lt;a href="https://afonza.com/" rel="noopener noreferrer"&gt;Afonza&lt;/a&gt;, it can be found &lt;a href="https://afonza.com/3-ways-to-autoscale-on-aws/" rel="noopener noreferrer"&gt;here&lt;/a&gt; or on my &lt;a href="https://www.rehanvdm.com/aws/3-ways-to-autoscale-on-aws/index.html" rel="noopener noreferrer"&gt;blog&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Scaling is when similar resources are added or removed depending on demand, this is usually done manually. An &lt;strong&gt;Autoscaling&lt;/strong&gt; application will monitor key performance metrics and when these cross certain thresholds, either &lt;strong&gt;scale out&lt;/strong&gt; (add similar resources) or &lt;strong&gt;scale in&lt;/strong&gt; (remove similar resources) to &lt;strong&gt;meet demand&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Compared to scaling, Autoscaling can happen &lt;strong&gt;almost&lt;/strong&gt;  &lt;strong&gt;instantaneously&lt;/strong&gt; depending on the setup. This can help maintain Service Level Agreements (SLAs) by having a Highly Available (HA) setup spread across more than 1 geographic location.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2019/09/AutoScale_Intro.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2019%2F09%2FAutoScale_Intro.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some definitions that you might find around scaling include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scale out&lt;/strong&gt;  – Addition of similar resource to meet demand&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale in&lt;/strong&gt;  – Removal of similar resource to meet demand&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autoscaling Group&lt;/strong&gt; – A group of similar resources that are/will be Autoscaled&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Desired capacity&lt;/strong&gt; – This is the count of resources in the Autoscaling Group&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Max and min capacity&lt;/strong&gt; – These define the limits for the number of resources that Autoscaling can add or remove from the Autoscaling group so that it meets demand. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalable Dimension&lt;/strong&gt; – The dimension of the resource that will be used to scale the Autoscaling group, if the resource is a service, this might be the CPU usage or network in.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale in and scale out cool down&lt;/strong&gt; – The amount of time before another scale in or scale out can take place&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Health Check&lt;/strong&gt; – A Health Check is an API call to the application/resource to determine if it can still receive traffic. If it cannot, it will be marked as unhealthy and removed. This is usually with regard to the Load Balancer that sends traffic to the individual resources in the Autoscaling group.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Health Check grace period&lt;/strong&gt; – This gives the resource/application time to start up, only after this time will the checks begin&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is important to note that not every application can use autoscaling. The application must be designed with scaling in mind. It must be &lt;strong&gt;stateless&lt;/strong&gt; to handle new resources and any failures that might prevent your Highly Available setup from keeping its SLAs.&lt;/p&gt;

&lt;p&gt;With most of the terminology out of the way, let’s start looking at our examples. They are ordered by their age and ability to scale.&lt;/p&gt;

&lt;h1&gt;
  
  
  1. Elastic beanstalk
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;The CloudFormation template can be found here: &lt;a href="https://github.com/rehanvdm/awsautoscaling/blob/master/ElasticBeanstalk/cf.yaml" rel="noopener noreferrer"&gt;https://github.com/rehanvdm/awsautoscaling/blob/master/ElasticBeanstalk/cf.yaml&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Elastic Beanstalk (EB) is one of the earliest AWS &lt;strong&gt;orchestration services&lt;/strong&gt;. It is easy to configure and orchestrates a lot of other AWS services like EC2, SQS, RDS, S3, SNS, Autoscaling, CloudWatch Alarms, Load Balancers, etc. to bring your whole application together as one. Under the hood, the configuration provided to &lt;strong&gt;Elastic Beanstalk&lt;/strong&gt; &lt;strong&gt;writes a CloudFormation&lt;/strong&gt; template to orchestrate and manage all these services for you.&lt;/p&gt;

&lt;p&gt;Elastic Beanstalk is often overlooked. It is &lt;strong&gt;still a great entry&lt;/strong&gt; &lt;strong&gt;point&lt;/strong&gt; for anyone starting out on AWS; some might argue that it is outdated and should not be used. It is just so easy to do complex setups while &lt;strong&gt;requiring little knowledge of the underlying infrastructure&lt;/strong&gt;; this to me is not a sign of age but of maturity. That is why Elastic Beanstalk is first on the list: it is old and wise, perfect for large websites (my opinion), and shouldn’t be used for any compute-intensive work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2019%2F09%2FAutoScale_0_ELasticBeanstalk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2019%2F09%2FAutoScale_0_ELasticBeanstalk.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The architecture above visualizes the CloudFormation template to define an HA Elastic Beanstalk environment and comes in &lt;strong&gt;under 85 lines&lt;/strong&gt; of YAML. Elastic Beanstalk then creates and &lt;strong&gt;manages a much bigger CloudFormation template&lt;/strong&gt; on your behalf that is &lt;strong&gt;about 2500 lines&lt;/strong&gt;, containing more than 15 resources that span about 1500 lines. The gist of this architecture can be seen below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2019/09/AutoScale_1_ELasticBeanstalk.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2019%2F09%2FAutoScale_1_ELasticBeanstalk-1024x762.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Important points to mention here is that you get a &lt;strong&gt;Classic Load balancer&lt;/strong&gt; that balances over your subnets which could be either in public or private depending on your external VPC setup. Then Elastic Beanstalk creates all the Security Groups, Classic Load Balancer, Target Group, Autoscaling groups, etc.&lt;/p&gt;

&lt;p&gt;It will &lt;strong&gt;pull the ZIP code from S3&lt;/strong&gt; and deploy it to the Elastic Beanstalk environment. Each environment has specific configuration options depending on what programming language you choose. Elastic Beanstalk also has &lt;strong&gt;script hooks&lt;/strong&gt; that can be used to set up the instance and container. These can easily be customized to install all kinds of applications and create custom configurations on the actual instances. Also noteworthy is the Elastic Beanstalk CLI, which has many features and commands that can be extremely helpful when migrating to Elastic Beanstalk.&lt;/p&gt;

&lt;p&gt;Many may start off with Elastic Beanstalk and then &lt;strong&gt;later move to a pure EC2 Autoscaling or ECS&lt;/strong&gt;. This is to get extra benefits that Elastic Beanstalk cannot provide out of the box. Some of these include; making use of Spot Instances to optimize cost, multi-region deployments and reusing resources like sharing a load balancer for multiple apps.&lt;/p&gt;

&lt;p&gt;The whole point of Elastic Beanstalk is to &lt;strong&gt;abstract complexity away from you&lt;/strong&gt; , so that you can only bring your application in a ZIP and then have it autoscaled over multi AZs with minimal setup and knowledge of the underlying infrastructure.&lt;/p&gt;

&lt;h1&gt;
  
  
  2. ECS Fargate
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;The CloudFormation template can be found here:&lt;/em&gt; &lt;a href="https://github.com/rehanvdm/awsautoscaling/blob/master/Fargate/cf.yaml" rel="noopener noreferrer"&gt;https://github.com/rehanvdm/awsautoscaling/blob/master/Fargate/cf.yaml&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Elastic Container Service (ECS)&lt;/strong&gt; is one of the options AWS provides for running your containers. ECS offers &lt;strong&gt;two modes of operation&lt;/strong&gt;: either you manage the underlying EC2 instances that run your Docker images yourself, or you let AWS do it for you; the latter is known as &lt;strong&gt;ECS Fargate&lt;/strong&gt;. Fargate only cares about how many resources (CPU, memory, etc.) and how many containers it needs to create for a service.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Fargate manages the underlying EC2 host instances and container placement.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Fargate, like Elastic Beanstalk, can be run with or without Autoscaling but is &lt;strong&gt;more complex&lt;/strong&gt; than the Elastic Beanstalk implementation, because we need to define every resource manually. The CloudFormation template specifies 14 resources to run a &lt;strong&gt;minimal Fargate setup&lt;/strong&gt;, coming in at around &lt;strong&gt;200 lines&lt;/strong&gt; of code without comments. The architecture diagram can be seen below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2019/09/AutoScale_2_Fargate.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2019%2F09%2FAutoScale_2_Fargate-1024x783.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It resembles &lt;strong&gt;much of the same components as Elastic Beanstalk&lt;/strong&gt;. The major &lt;strong&gt;differences&lt;/strong&gt; are that the applications run on &lt;strong&gt;containers&lt;/strong&gt; orchestrated by ECS Fargate and that the application is now created from a &lt;strong&gt;Docker image&lt;/strong&gt;. Fargate also requires a load balancer to distribute traffic to the service, which is the grouping of individual Docker applications that may scale in and out depending on demand.&lt;/p&gt;

&lt;p&gt;Fargate is &lt;strong&gt;great&lt;/strong&gt; when you already have your applications dockerized and &lt;strong&gt;do not want to manage container hosts.&lt;/strong&gt; Cost-wise it is also perfect for &lt;strong&gt;spiky and underutilized&lt;/strong&gt; Docker applications that might not use the full EC2 instance host. When you have high-throughput, constantly resource-intensive applications that can be tightly packed on a single instance, then ECS on EC2 will be the better option.&lt;/p&gt;

&lt;h1&gt;
  
  
  3. Application Load Balancer and Lambda
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;The CloudFormation template can be found here:&lt;/em&gt; &lt;a href="https://github.com/rehanvdm/awsautoscaling/blob/master/LambdaALB/cf.yaml" rel="noopener noreferrer"&gt;https://github.com/rehanvdm/awsautoscaling/blob/master/LambdaALB/cf.yaml&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We are sticking with the load balancer approach of the previous solutions. AWS Lambda could just as well have been invoked by API Gateway, which is the most popular method to call Lambda. Both invocation methods &lt;strong&gt;scale&lt;/strong&gt; Lambda &lt;strong&gt;near real-time&lt;/strong&gt;, depending on the configuration. AWS Lambda &lt;strong&gt;scales the fastest by far&lt;/strong&gt; but has its own unique set of limitations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2019/09/AutoScale_3_LambdaALB.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.rehanvdm.com%2Fcontents%2Fdata%2F2019%2F09%2FAutoScale_3_LambdaALB.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;*Note the architecture diagram excludes any mention of VPCs on purpose, that is a bit out of scope for this topic.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API Gateway&lt;/strong&gt; is the most &lt;strong&gt;cost-effective when you have spiky, low usage for the API&lt;/strong&gt;, while the Application Load Balancer (&lt;strong&gt;ALB&lt;/strong&gt;) trumps API Gateway &lt;strong&gt;when the API requires high/constant throughput.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;API Gateway offers other sidecars like authentication, VTL templates, stages, usage plans and throttling that ALB does not. ALB also has its strengths, the biggest being that a connection is not limited to 30 seconds like API Gateway but rather the Lambda limit of 15 minutes.&lt;/p&gt;

&lt;p&gt;Whether ALB or API Gateway is used, the Lambda compute engine uses &lt;strong&gt;Firecracker&lt;/strong&gt; (blazing-fast &lt;strong&gt;micro VMs&lt;/strong&gt;), which &lt;strong&gt;starts and scales almost instantly&lt;/strong&gt;. Lambda definitely requires the &lt;strong&gt;least amount of setup&lt;/strong&gt; and hardware babysitting compared to the other solutions. This can be seen when looking at the CloudFormation template needed to define the minimal setup for ALB and Lambda. The ALB and Lambda CloudFormation template comes in &lt;strong&gt;under 100 lines&lt;/strong&gt; of YAML and requires only 6 other AWS resources, compared to the more than 15 resources the other two methods required.&lt;/p&gt;

&lt;h1&gt;
  
  
  Tests
&lt;/h1&gt;

&lt;p&gt;It is a bit difficult to test and compare three different architectures against each other, each with its own unique set of parameters. A simple load test is done with Artillery, which can also be found in the GitHub repo. The test just calls &lt;em&gt;/fibonaci.php&lt;/em&gt; for 10 minutes at 4 calls per second. The results can be seen in CloudWatch Metrics; depending on the solution, you can watch EC2 instances for EB and tasks for Fargate scaling in and out.&lt;/p&gt;
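
The actual test script lives in the repo; a minimal Artillery config for such a test might look like this (the target endpoint is a placeholder):

```yaml
config:
  target: "http://my-load-balancer.example.com"  # placeholder endpoint
  phases:
    - duration: 600   # run for 10 minutes
      arrivalRate: 4  # 4 new requests per second
scenarios:
  - flow:
      - get:
          url: "/fibonaci.php"
```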

&lt;p&gt;For Elastic Beanstalk and Fargate the outcome is pretty close. &lt;strong&gt;Elastic Beanstalk is slower&lt;/strong&gt; because it needs to &lt;strong&gt;create the actual EC2 instance&lt;/strong&gt; and do a bunch of work to get the instance, and the container within it, ready. Baked AMIs will bring a significant improvement here; this is the recommended process for EC2 Auto Scaling. &lt;strong&gt;Fargate also uses&lt;/strong&gt; the same &lt;strong&gt;Firecracker micro VMs&lt;/strong&gt; as Lambda, so it pulls and deploys a Docker image &lt;strong&gt;much faster than Elastic Beanstalk can&lt;/strong&gt; with EC2 instances.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lambda is obviously the winner&lt;/strong&gt; when it comes to scale-in and scale-out timing. It scales near real-time; to be honest, it still baffles me that on a cold start it can pull and deploy your code sub-second (about 500 ms for this repo’s example).&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;TL;DR&lt;/strong&gt; Choose a scaling method that suits your application. Lambda will always scale the fastest.
&lt;/h4&gt;

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;The table below gives a quick summary of what we discussed.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Complexity&lt;/th&gt;
&lt;th&gt;Extensibility&lt;/th&gt;
&lt;th&gt;Compute Engine&lt;/th&gt;
&lt;th&gt;Time to Scale&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Elastic Beanstalk&lt;/td&gt;
&lt;td&gt;Easy&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Servers&lt;/td&gt;
&lt;td&gt;+++++++++&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ECS Fargate&lt;/td&gt;
&lt;td&gt;Moderate – High&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Docker&lt;/td&gt;
&lt;td&gt;+++&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lambda&lt;/td&gt;
&lt;td&gt;Low – Moderate&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Serverless Lambda&lt;/td&gt;
&lt;td&gt;+&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;Elastic Beanstalk is a good fit if you have legacy apps that require scaling with minimal knowledge of AWS.&lt;/li&gt;
&lt;li&gt;Use ECS Fargate if your apps are already dockerized and you don’t already have too many sidecars helping with Docker orchestration.&lt;/li&gt;
&lt;li&gt;AWS Lambda will be the best choice if you are starting something from scratch that has low to moderate traffic. It has other limitations that need to be kept in mind, though.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>alb</category>
      <category>autoscaling</category>
      <category>elasticbeanstalk</category>
    </item>
    <item>
      <title>13 AWS Lambda design considerations you need to know about – Part 2</title>
      <dc:creator>Rehan van der Merwe</dc:creator>
      <pubDate>Sat, 03 Aug 2019 14:03:19 +0000</pubDate>
      <link>https://dev.to/rehanvdm/13-aws-lambda-design-considerations-you-need-to-know-about-part-2-1lfa</link>
      <guid>https://dev.to/rehanvdm/13-aws-lambda-design-considerations-you-need-to-know-about-part-2-1lfa</guid>
      <description>&lt;p&gt;The first installment mainly covered the technical side of things, like the limits and configuration options. This next part looks at how to use all the technical considerations we checked out in part one to effectively design serverless and Lambda systems.&lt;/p&gt;

&lt;p&gt;By the end of this post, you should have a good understanding of the key considerations to keep in mind when designing around AWS Lambda. Let’s dive in. If you haven’t checked out &lt;a href="https://www.rehanvdm.com/uncategorized/13-aws-lambda-design-considerations-you-need-to-know-about-part-1/"&gt;&lt;strong&gt;Part one&lt;/strong&gt;&lt;/a&gt; yet, pop over and have a read before getting stuck into part two.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is an article written for&lt;/strong&gt; &lt;a href="https://www.jeffersonfrank.com/"&gt;&lt;strong&gt;Jefferson Frank&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;, it can be found &lt;a href="https://www.jeffersonfrank.com/aws-blog/aws-lambda-design-considerations-continued/"&gt;here&lt;/a&gt; or on my &lt;a href="https://www.rehanvdm.com/serverless/13-aws-lambda-design-considerations-you-need-to-know-about-part-2/index.html"&gt;blog&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  General design considerations
&lt;/h1&gt;

&lt;h2&gt;
  
  
  9) Types of AWS Lambda errors
&lt;/h2&gt;

&lt;p&gt;How you handle errors and failures all depends on the use case and the Lambda service that invoked the Lambda.&lt;/p&gt;

&lt;p&gt;There are different types of errors that can occur:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Configuration:&lt;/strong&gt; These happen when you’ve incorrectly specified the file or handler, or have incorrect directory structures, missing dependencies, or the function has insufficient privileges. Most of these won’t be a surprise and can be caught and fixed after deployment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runtime:&lt;/strong&gt; These are usually related to the code running in the function, unforeseen bugs that we introduce ourselves and will be caught by the runtime environment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Soft errors:&lt;/strong&gt; Usually not a bug, but an action that our code identifies as an error. One example being when the Lambda code intentionally throws an error after retrying three times to call a third-party API.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timeouts and memory:&lt;/strong&gt; These fall into a special category, as our code usually runs without problems but might receive a bigger event than we expected and have to do more work than we budgeted for. They can be fixed by changing either the code or the configuration values.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Remember: certain errors can’t be caught by the runtime environment.&lt;/strong&gt; As an example, in NodeJS, if you throw an error inside a promise without using the reject callback, the whole runtime will crash. It won’t even report the error to CloudWatch Logs or Metrics; it just ends. These errors must be caught by your code, as most runtimes emit events on exit that report the exit code and reason &lt;em&gt;before&lt;/em&gt; exiting.&lt;/p&gt;
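
To make this concrete, here is a small Node.js sketch of that failure mode and the process-level hooks mentioned above; callFlakyApi is an illustrative stand-in for any promise-returning call:

```javascript
// An error thrown inside a promise that is never caught becomes an
// unhandled rejection; in older Node.js runtimes this crashes the process
// without the error ever reaching CloudWatch. Catch it yourself:
process.on('unhandledRejection', (reason) => {
  console.error('Unhandled rejection, logging before exit:', reason);
});

process.on('exit', (code) => {
  console.error('Process exiting with code', code);
});

// Safer still: always attach a .catch() so the error never goes unhandled.
function callFlakyApi() {
  return Promise.reject(new Error('third-party API down'));
}

callFlakyApi().catch((err) => {
  console.error('Caught API error:', err.message);
});
```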

&lt;p&gt;When it comes to &lt;strong&gt;SQS, a message can be delivered more than once, and if it fails, it’ll be re-queued&lt;/strong&gt; after the visibility timeout and then retried. When your function has a concurrency of less than five, the AWS polling function will still take messages from the queue and try to invoke your function. &lt;strong&gt;This will return a concurrency limit reached exception,&lt;/strong&gt; and the message will then be marked as unsuccessful and returned to the queue; this is unofficially called &lt;strong&gt;“over-polling.”&lt;/strong&gt; If you have a DLQ configured on your function, messages might be sent there without being processed, but we’ll say more about this later.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;SQS “over-polling”: the Lambda is being throttled, yet messages are still pulled from the queue and can be sent to the DLQ without being processed.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Then for stream-based services like DynamoDB and Kinesis streams, you have to handle the error within the function or it’ll be retried indefinitely; you can’t use the built-in Lambda DLQs here.&lt;/p&gt;

&lt;p&gt;For all other async invocations, if the first invocation fails it will be retried two or more times. These retries mostly happen within three minutes of each other, but in rare cases it may take up to six hours, and there might also be more than three retries.&lt;/p&gt;

&lt;h2&gt;
  
  
  10) Handling Errors
&lt;/h2&gt;

&lt;p&gt;Dead Letter Queues (DLQs) to the rescue. Maybe not: DLQs only apply to async invocations; they don’t work for services like SQS, DynamoDB streams and Kinesis streams. For SQS, use a Redrive Policy on the SQS queue and specify the Dead Letter Queue settings there. It’s important to set the visibility timeout to at least six times the timeout of your function and the &lt;em&gt;maxReceiveCount&lt;/em&gt; value to at least five. This helps prevent over-polling: messages being throttled and then sent to the DLQ while the Lambda concurrency is low.&lt;/p&gt;
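
These rules of thumb can be captured in a small helper. This is a hedged sketch: the function name and example DLQ ARN are illustrative, with the returned map shaped like SQS queue attributes:

```javascript
// Derive SQS queue settings from a Lambda function's timeout, following
// the rules of thumb above: visibility timeout >= 6x the function timeout,
// and maxReceiveCount >= 5 so throttled messages are retried before
// landing on the DLQ.
function redriveSettings(functionTimeoutSeconds, dlqArn) {
  return {
    VisibilityTimeout: String(functionTimeoutSeconds * 6),
    RedrivePolicy: JSON.stringify({
      deadLetterTargetArn: dlqArn,
      maxReceiveCount: 5,
    }),
  };
}

// Example: a 30-second function gets a 180-second visibility timeout.
const settings = redriveSettings(30, 'arn:aws:sqs:eu-west-1:123456789012:my-dlq');
console.log(settings.VisibilityTimeout); // "180"
```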

&lt;p&gt;Alternatively, you could handle all errors in your code with &lt;strong&gt;a try-catch-finally block&lt;/strong&gt;. You get more control over your error handling this way and can send the error to a DLQ yourself.&lt;/p&gt;

&lt;p&gt;Now that the events/messages are in the DLQ and the error is fixed, &lt;strong&gt;these events have to be replayed so that they’re processed&lt;/strong&gt;. They need to be taken off the DLQ, and then that event must be sent to the Lambda once again so that it can be processed successfully.&lt;/p&gt;

&lt;p&gt;There are different methods to do this, and it might not happen often, so &lt;strong&gt;a small script to pull the messages and invoke the Lambda will do the trick&lt;/strong&gt;. Replaying functionality can also be built into the Lambda, so that if it receives a message from the DLQ it knows to extract the original message and run the function. The trigger between the DLQ and the Lambda stays disabled, and is only enabled after the code is fixed so the messages can be reprocessed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2019/08/LambdaConsiderations_2-1_DLQ_Replay.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pqIr7-iu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2019/08/LambdaConsiderations_2-1_DLQ_Replay.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS Step Functions also give you granular control over how errors are handled. We can control how many times it needs to be retried, the delay between retries, and the next state.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2019/08/LambdaConsiderations_2-2_StepFunctionsConfig.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MaaV8nkp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2019/08/LambdaConsiderations_2-2_StepFunctionsConfig.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With all these methods available, &lt;strong&gt;it’s crucial that your function is idempotent&lt;/strong&gt;. Even something complex like a credit card transaction can be made idempotent by first checking whether a transaction with your stored transaction callback ID exists or has been successful, and only carrying out the credit deduction if it doesn’t.&lt;/p&gt;

&lt;p&gt;If you can’t get your functions to be idempotent, consider the Saga pattern. For each action, there must also be a rollback action. Taking the credit card example again, the Lambda that has a Create Transaction function must also have a Reverse Transaction function, so that if an error happens after the transaction has been created, it can propagate back and the reverse transaction can be fired, leaving the state exactly as it was before the transaction began. Of course, it’s never this straightforward when working with money, but it’s a solid example.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you can’t get your functions to be idempotent, consider the Saga pattern. For each action, there must also be a rollback action.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Duplicate messages can be identified by looking at the &lt;em&gt;context.awsRequestId&lt;/em&gt; inside the Lambda&lt;/strong&gt;. If a function cannot be made idempotent, use this ID to de-dupe messages: store it in a cache like Redis or a DB and check it in the de-dupe logic. This introduces a new level of complexity to the code, so keep it as a last resort and always try to code your functions to be idempotent.&lt;/p&gt;

&lt;p&gt;A Lambda can also call the &lt;strong&gt;&lt;em&gt;context.getRemainingTimeInMillis()&lt;/em&gt; function to know how much time is left before the function will end.&lt;/strong&gt; If processing takes longer than usual, it can stop, gracefully run some end-of-function logic, and return a soft error to the caller.&lt;/p&gt;
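
Putting the last two points together, here is a hedged sketch of a handler that de-dupes on awsRequestId and watches its remaining time; the in-memory Set stands in for Redis or a DB table, and processRecord is an illustrative work function:

```javascript
// Retried async invocations carry the same request ID, so a hit in the
// "seen" store means we already processed (or started processing) this event.
const seen = new Set();

async function handler(event, context) {
  if (seen.has(context.awsRequestId)) {
    return { status: 'duplicate-skipped' };
  }
  seen.add(context.awsRequestId);

  for (const record of event.records) {
    // Stop before the runtime kills us: leave a 10-second safety margin
    // to flush state and report a soft error to the caller.
    if (context.getRemainingTimeInMillis() < 10000) {
      return { status: 'partial', error: 'ran out of time' };
    }
    await processRecord(record);
  }
  return { status: 'ok' };
}

async function processRecord(record) {
  /* real work would go here */
}
```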

&lt;h2&gt;
  
  
  11) Coupling
&lt;/h2&gt;

&lt;p&gt;Coupling goes beyond Lambda design considerations; it’s more about the system as a whole. Lambdas within a microservice are sometimes tightly coupled, but &lt;strong&gt;this is nothing to worry about as long as the data passed between Lambdas within their little black box of a microservice is &lt;em&gt;not&lt;/em&gt; over pure HTTP and isn’t synchronous&lt;/strong&gt;. Lambdas shouldn’t be coupled to one another directly in a Request Response fashion, but asynchronously. Consider the scenario where an S3 event invokes a Lambda function, that Lambda then needs to call another Lambda within the same microservice, and so on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2019/08/LambdaConsiderations_2-3_Coupling_1.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gThHIGmY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2019/08/LambdaConsiderations_2-3_Coupling_1-1024x241.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You might be tempted to implement direct coupling, like allowing Lambda 1 to use the AWS SDK to call Lambda 2 and so on. This introduces some of the following problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If Lambda 1 is invoking Lambda 2 synchronously, it needs to wait for the latter to be done first. Lambda 1 might not know that Lambda 2 also called Lambda 3 synchronously, and Lambda 1 may now need to wait for both Lambda 2 and 3 to finish successfully. Lambda 1 might timeout as it needs to wait for all the Lambdas to complete first, and you’re also paying for each Lambda while they wait.&lt;/li&gt;
&lt;li&gt;What if Lambda 3 has a concurrency limit set and is also called by another service? The call between Lambda 2 and 3 will fail until concurrency is available again. The error can be returned all the way back to Lambda 1, but what does Lambda 1 then do with the error? It has to store that the S3 event was unsuccessful and that it needs to replay it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;This process can be redesigned to be event driven:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2019/08/LambdaConsiderations_2-3_Coupling_2.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7tBwO8ZY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2019/08/LambdaConsiderations_2-3_Coupling_2-1024x362.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not only is this the solution to all the problems introduced by the direct coupling method, it also:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provides a method of replaying the DLQ if an error occurred for each Lambda.&lt;/li&gt;
&lt;li&gt;Ensures no message is lost or needs to be stored externally.&lt;/li&gt;
&lt;li&gt;Decouples demand from processing. The direct coupling method would have failed if more than 1,000 objects were uploaded at once and generated events to invoke the first Lambda. This way, Lambda 1 can set its concurrency to five and use the batch size to take only X records from the queue, and thus control maximum throughput.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Going beyond a single microservice: when events are passed between microservices, both need to understand and agree upon the structure of the data. Most of the time both microservices can’t be updated at the exact same time, so be sure to version all your events. This way you can first change all the microservices that listen for event version 1 and add the code to handle version 2, then update the emitting microservice to emit version 2 instead of 1, always with backwards compatibility in mind.&lt;/p&gt;
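
As a sketch of what such versioning can look like on the listening side (the event shape and field names here are invented for illustration):

```javascript
// Version-tolerant event handling: the listener understands both v1 and v2
// while the emitting microservice is migrated.
function handleOrderEvent(event) {
  switch (event.version) {
    case 1:
      // Version 1 carried a flat amount in cents.
      return { orderId: event.orderId, amountCents: event.amount };
    case 2:
      // Version 2 split the amount into value + currency.
      return {
        orderId: event.orderId,
        amountCents: event.amount.value,
        currency: event.amount.currency,
      };
    default:
      throw new Error('Unknown event version: ' + event.version);
  }
}
```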

&lt;h2&gt;
  
  
  12) AWS Lambda batching
&lt;/h2&gt;

&lt;p&gt;Batching is particularly useful in high-transaction environments. SQS and Kinesis streams are two services that offer message batching, sending the batch to the Lambda function instead of each and every message separately. By batching messages in groups of around 10 instead of one, you might reduce your AWS Lambda bill by a factor of 10 and see an increase in system throughput.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;By batching messages in groups of around 10 instead of one, you might reduce your AWS Lambda bill by a factor of 10 and see an increase in system throughput.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;One of the downsides to batching is that it makes error handling complex. For example, one message might throw an error while the other nine are processed successfully. The Lambda then needs to either manually put that one failure on the DLQ, or return an error so that the external error-handling mechanisms, like the Lambda DLQ, do their jobs. It could be the case that a whole batch of messages needs to be reprocessed; here, being idempotent is again the key to success.&lt;/p&gt;

&lt;p&gt;If you’re taking things to the next level, sometimes batching is not enough. &lt;strong&gt;Consider a use case where you’ve got a CSV file with millions of records that need to be inserted into DynamoDB.&lt;/strong&gt; The file is too big to load into the Lambda memory, so instead you stream it within your Lambda from S3. The Lambda can then put the data on an SQS queue, and another Lambda can take the rows in batches of 10 and write them to DynamoDB using the batch interface.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This sounds okay, right?&lt;/strong&gt; The thing is, a much higher throughput and lower cost can actually be achieved if the Lambda function that streams the data also writes to DynamoDB in parallel. Start building groups of batch-write API calls, where each can hold a maximum of 25 records. These can then be started and limited to roughly 40 parallel/concurrent batch writes; without much tuning, you will be able to reach 2,000 writes per second.&lt;/p&gt;
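
A hedged sketch of that approach: writeBatch stands in for the real DynamoDB BatchWriteItem call, and the 25/40 numbers follow the text above:

```javascript
// Chunk rows into groups of 25 (the DynamoDB batch-write maximum).
function chunk(rows, size) {
  const out = [];
  for (let i = 0; i < rows.length; i += size) out.push(rows.slice(i, i + size));
  return out;
}

// Run a bounded number of batch writes concurrently: each worker keeps
// claiming the next unclaimed batch until none remain. The claim
// (read-and-increment of `next`) is synchronous, so no batch is written twice.
async function writeAll(rows, writeBatch, concurrency = 40) {
  const batches = chunk(rows, 25);
  let next = 0;
  const workers = Array.from({ length: concurrency }, async () => {
    while (next < batches.length) {
      const batch = batches[next++];
      await writeBatch(batch);
    }
  });
  await Promise.all(workers);
}
```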

&lt;h2&gt;
  
  
  13) Monitoring and Observability
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Modern problems require modern solutions.&lt;/strong&gt; Since traditional tools won’t work for a Lambda and serverless environment, it’s difficult to gain visibility into and monitor the system. There are many tools out there to help with this, including internal &lt;strong&gt;AWS services like X-Ray, CloudWatch Logs, CloudWatch Alarms, and CloudWatch Insights.&lt;/strong&gt; You could also turn to third-party tools like Epsagon, IOpipe, Lumigo, Thundra, and Datadog, to name just a few.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EufTpiTe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2019/08/LambdaConsiderations_2-4_XRayConfig.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EufTpiTe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2019/08/LambdaConsiderations_2-4_XRayConfig.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All these tools deliver valuable insights in the form of logs and charts to help evaluate and monitor the serverless environment. One of the best things you can do is to &lt;strong&gt;get visibility early on and fine-tune your Lambda architecture&lt;/strong&gt;. Finding the root cause of a problem and tracing the event as it goes through the whole system can be extremely valuable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bonus tips and tricks!
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Within a Lambda your code can &lt;strong&gt;do parallel work&lt;/strong&gt;. If you’re receiving a batch of 10 messages, instead of doing 10 downstream API calls synchronously, do them in parallel. It reduces the run time as well as the cost of the function. This should always be considered first before moving on to the more complex &lt;em&gt;fan out-fan in.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Lambdas doing work in parallel will usually benefit from an increase in memory.&lt;/li&gt;
&lt;li&gt;All functions must be &lt;strong&gt;idempotent&lt;/strong&gt;, and &lt;em&gt;do&lt;/em&gt; consider making them &lt;strong&gt;stateless&lt;/strong&gt;. One example is when functions need to keep a small amount of session data for API calls. Let the caller send the SessionData with the SessionID and make sure both fields are encrypted, then decrypt their values and use them in the Lambda; this can spare you from repeated external calls or using a cache.&lt;/li&gt;
&lt;li&gt;You &lt;strong&gt;might not need an external cache&lt;/strong&gt;; a small amount of data can be stored above the Lambda function handler method in memory. This data will remain for the duration of the Lambda container.&lt;/li&gt;
&lt;li&gt;Alternatively, each Lambda is allowed &lt;strong&gt;512 MB of file storage in /tmp&lt;/strong&gt;, and data can be cached here, as a file system call will always be faster than a network call.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep data inside the same region&lt;/strong&gt;, to avoid paying for data to be transferred beyond that region.&lt;/li&gt;
&lt;li&gt;Only put your Lambda inside a VPC if it needs to access private services or needs to go through a NAT for downstream services that whitelist IPs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remember NAT data transfer costs&lt;/strong&gt; money, and services like S3 and DynamoDB are publicly available. All data that flows to your Lambdas inside the VPC will need to go through the NAT.&lt;/li&gt;
&lt;li&gt;Consider using &lt;strong&gt;S3 and DynamoDB VPC Gateway Endpoints—they’re free&lt;/strong&gt; and you only pay for Interface Endpoints.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batching messages can increase throughput&lt;/strong&gt; and reduce Lambda invocations, but it also increases the complexity of error handling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Big files can be streamed from S3&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step Functions are expensive&lt;/strong&gt;, so try to avoid them for high-throughput systems.&lt;/li&gt;
&lt;li&gt;If you have a monorepo where all the Lambda functions live for that microservice, consider creating a directory that gets symlinked into each of those Lambda directories. Shared code then only needs to be managed in one place; alternatively, you could look into putting shared code into a Lambda layer.&lt;/li&gt;
&lt;li&gt;A lot of AWS Services make use of encryption to protect data, so consider using that instead of encrypting on an application level.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The less code you write the less technical debt you build up.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Where possible use the AWS Services as they are intended to be used&lt;/strong&gt; , or you could end up needing to build tooling around your specific use case and manipulate those services. This once again adds to your technical debt. AWS Lambda scales quickly and integration with other systems—like a MySQL DB with a limited amount of open connections—can quickly become a problem. There is no silver bullet for integrating a service that scales and one that doesn’t. The best thing that can be done is to either limit the scaling or implement queuing mechanisms between the two.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Environment variables should not hold security credentials&lt;/strong&gt;, so try using AWS SSM’s Parameter Store instead. It’s free and great for most use cases; when you need higher concurrency, consider Secrets Manager. It also supports secret rotation for RDS, but comes at a higher cost than Parameter Store.&lt;/li&gt;
&lt;li&gt;Consider using &lt;strong&gt;Infrastructure as Code (IaC)&lt;/strong&gt; if you aren’t.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API Gateway has the ability to proxy the HTTP request to other AWS Services, eliminating the need for an intermediate Lambda function.&lt;/strong&gt; Using Velocity Mapping Templates you can change the normal POST parameters into the arguments that DynamoDB requires for the PUT action. This is great for simple logic where the Lambda would have just transformed the request before doing the DynamoDB command.&lt;/li&gt;
&lt;/ul&gt;
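
For example, the module-scope caching tip above can be sketched like this in Node.js; fetchConfig is an illustrative stand-in for an expensive network call (SSM, S3, a DB):

```javascript
// Data declared outside the handler survives for the lifetime of the warm
// Lambda container, so it can serve as a tiny cache.
let cachedConfig = null;
let fetchCount = 0;

async function fetchConfig() {
  fetchCount += 1; // pretend this is an expensive network call
  return { featureFlag: true };
}

async function handler(event) {
  if (!cachedConfig) {
    cachedConfig = await fetchConfig(); // only runs on a cold start
  }
  return cachedConfig;
}
```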

&lt;h2&gt;
  
  
  So what’s the bottom line?
&lt;/h2&gt;

&lt;p&gt;Always try to &lt;strong&gt;make Lambda functions stateless and idempotent&lt;/strong&gt;, regardless of the invocation type and model. Lambdas aren’t designed to work on a single big task, so break it down into smaller tasks and process them in parallel. After that, the single best thing to do is to measure three times and cut once; &lt;strong&gt;do a lot of upfront planning, experiments, and research&lt;/strong&gt; and you’ll soon develop a good intuition for serverless design.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>lambda</category>
      <category>architecture</category>
    </item>
    <item>
      <title>13 AWS Lambda design considerations you need to know about – Part 1</title>
      <dc:creator>Rehan van der Merwe</dc:creator>
      <pubDate>Sat, 03 Aug 2019 13:32:02 +0000</pubDate>
      <link>https://dev.to/rehanvdm/13-aws-lambda-design-considerations-you-need-to-know-about-part-1-1k4k</link>
      <guid>https://dev.to/rehanvdm/13-aws-lambda-design-considerations-you-need-to-know-about-part-1-1k4k</guid>
      <description>&lt;p&gt;When you hear the word ‘serverless’, AWS Lambda is most likely the first thing you think about. That’s no surprise; the tech hit our industry by storm and brings with it a whole new paradigm of solutions. AWS Lambda was the first Function as a Service (FaaS) technology I was exposed to, and like others, I was highly critical at first. &lt;strong&gt;There are no servers to manage, it auto-scales, has fault tolerance built-in, and is pay per usage&lt;/strong&gt; —all of which sounds like a dream.&lt;/p&gt;

&lt;p&gt;With great power comes great responsibility. &lt;strong&gt;Serverless design requires knowledge of different services and how they interact with each other.&lt;/strong&gt; Just like any other technology, there are some tricky waters to navigate, but they are far outweighed by the power of what serverless has to offer. To stop this dream from turning into a nightmare, here are a few things to keep in mind when designing with AWS Lambda.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;In this two-part series, we’ll be diving into the technical details, like configuration options and any limitations you need to know about. The second part will focus on how to use the technical considerations we cover in part one to effectively design serverless and Lambda systems. At the end of it all, you should have a clearer understanding of the key considerations you need to bear in mind when designing around AWS Lambda.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is an article written for &lt;a href="https://www.jeffersonfrank.com/"&gt;Jefferson Frank&lt;/a&gt;, it can be found &lt;a href="https://www.jeffersonfrank.com/aws-blog/aws-lambda-design-considerations-part-1/"&gt;here&lt;/a&gt; or on my &lt;a href="https://www.rehanvdm.com/uncategorized/13-aws-lambda-design-considerations-you-need-to-know-about-part-1/index.html"&gt;blog&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Technical considerations
&lt;/h1&gt;

&lt;h2&gt;
  
  
  1) Function Memory
&lt;/h2&gt;

&lt;p&gt;The memory setting of your Lambda determines both the amount of power and the unit of billing. There are 44 options to choose from, between the lowest, 128 MB, and the largest, 3,008 MB. This gives you quite a variety to choose from! &lt;strong&gt;If you allocate too little memory, your program might take longer to execute and might even exceed the time limit, which stands at 15 minutes.&lt;/strong&gt; On the other hand, if you assign &lt;em&gt;too much&lt;/em&gt; memory, your function might not even use a quarter of all that power and end up costing you a fortune.&lt;/p&gt;

&lt;p&gt;It’s crucial to find your function’s sweet spot. AWS states that if you assign 1,792 MB, you get the equivalent of 1 vCPU, which is a thread of either an Intel Xeon core or an AMD EPYC core. That’s about as much as they say about the relationship between the memory setting and CPU power. A few people have experimented and come to the conclusion that after 1,792 MB of memory you &lt;em&gt;do&lt;/em&gt; indeed get a second CPU core, and so on; however, the utilization of these cores can’t be determined.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2019/08/LambdaConsiderations_1-1_Memory.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sEBcUdW8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2019/08/LambdaConsiderations_1-1_Memory.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cheaper isn’t always better&lt;/strong&gt;: sometimes choosing a higher memory option that is more expensive upfront can reduce the overall execution time. The same amount of work gets done in a smaller time period, so &lt;strong&gt;by fine-tuning the memory setting and finding the optimal point, you can make your functions execute faster than at a low memory setting.&lt;/strong&gt; You may end up paying the same, or even less, for your function than with the lower alternative.&lt;/p&gt;

&lt;p&gt;The bottom line is that &lt;strong&gt;CPU and memory should not be high on your design consideration list&lt;/strong&gt;. AWS Lambda, just like other serverless technologies, is meant to scale horizontally. Breaking the problem into smaller, more manageable pieces and processing them in parallel is faster than many vertically scaled applications. &lt;strong&gt;Design the function first and fine-tune the memory setting later&lt;/strong&gt; as needed.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Breaking the problem into smaller manageable pieces and processing them in parallel is faster than many vertically scaled applications.&lt;/p&gt;
&lt;/blockquote&gt;
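
To see why fine-tuning can pay for itself, a quick cost sketch: the per-GB-second price below is the approximate 2019 figure and the 100 ms billing granularity is ignored, so treat the numbers as illustrative:

```javascript
// Rough Lambda cost comparison: a higher memory setting that finishes the
// work faster can cost the same as a low setting that runs longer.
const PRICE_PER_GB_SECOND = 0.00001667; // approximate 2019 price, an assumption

function invocationCost(memoryMb, durationSeconds) {
  return (memoryMb / 1024) * durationSeconds * PRICE_PER_GB_SECOND;
}

// 128 MB taking 8 s vs 1,024 MB taking 1 s: both are 1 GB-second of work,
// so they cost the same, but the second returns 8x faster.
const low = invocationCost(128, 8);
const high = invocationCost(1024, 1);
console.log(low === high); // true
```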

&lt;h2&gt;
  
  
  2) Invocation
&lt;/h2&gt;

&lt;p&gt;AWS Lambda has two invocation models and three invocation types. What this means is that there are two methods of acquiring the data and three methods through which the data is sent to the Lambda function. The invocation model and type determine the characteristics behind how the function responds to things like failures, retries and scaling that we’ll use later on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Invocation models:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Push: when another service sends information to Lambda.&lt;/li&gt;
&lt;li&gt;Pull: an AWS managed Lambda polls data from another service and sends the information to Lambda.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The sending part can then be done in one of three ways, and is known as the &lt;strong&gt;invocation type&lt;/strong&gt; :&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Request Response: this is a &lt;strong&gt;synchronous&lt;/strong&gt; action; meaning that the request will be sent and the response will be waited on. This way, the caller can receive the status for the processing of the data.&lt;/li&gt;
&lt;li&gt;Event: this is an &lt;strong&gt;asynchronous&lt;/strong&gt; action; the request data will be sent and Lambda only acknowledges that it received the event. In this case, the caller doesn’t care about the success of processing that particular event; its only job was to deliver the data.&lt;/li&gt;
&lt;li&gt;Dry Run: this is just a testing function to check that the caller is permitted to invoke the function.&lt;/li&gt;
&lt;/ul&gt;
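&lt;p&gt;These three types map directly onto the &lt;code&gt;InvocationType&lt;/code&gt; parameter of the Lambda Invoke API. A minimal sketch that builds the arguments you would pass to boto3’s &lt;code&gt;client("lambda").invoke&lt;/code&gt; (the function name and payload are hypothetical):&lt;/p&gt;

```python
import json

# Maps the three invocation types onto the InvocationType parameter of
# the Lambda Invoke API. Function name and payload are hypothetical.
def build_invoke_args(function_name, payload, invocation_type="RequestResponse"):
    assert invocation_type in ("RequestResponse", "Event", "DryRun")
    return {
        "FunctionName": function_name,
        "InvocationType": invocation_type,
        "Payload": json.dumps(payload).encode("utf-8"),
    }

# Synchronous: the caller waits for, and can inspect, the result.
sync_args = build_invoke_args("my-function", {"id": 1})
# Asynchronous: Lambda only acknowledges that it received the event.
async_args = build_invoke_args("my-function", {"id": 1}, "Event")
# e.g. boto3.client("lambda").invoke(**async_args)
```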

&lt;p&gt;Below are a few examples that showcase the different models and invocation types available:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;API Gateway Request&lt;/strong&gt; is a &lt;strong&gt;Push model&lt;/strong&gt; and by default has a &lt;strong&gt;Request Response invocation&lt;/strong&gt;. The HTTP request is sent through to the Lambda function, and API Gateway then waits for the Lambda function to return the response.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;S3 Event notifications, SNS Messages and CloudWatch Events&lt;/strong&gt; are &lt;strong&gt;Push model&lt;/strong&gt; with &lt;strong&gt;Event invocation&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SQS Message&lt;/strong&gt; is a &lt;strong&gt;Pull model&lt;/strong&gt; with a &lt;strong&gt;Request Response invocation&lt;/strong&gt;. AWS has a Lambda function that pulls messages from the queue and then sends them to your Lambda function. If your function returns successfully, the AWS-managed polling Lambda will remove the message from the queue.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DynamoDB Streams and Kinesis Streams&lt;/strong&gt; are &lt;strong&gt;Pull models&lt;/strong&gt; with a &lt;strong&gt;Request Response invocation&lt;/strong&gt;. This one is particularly interesting as it pulls data from the stream and then invokes &lt;em&gt;our&lt;/em&gt; Lambda synchronously. Later, you’ll see that if the Lambda fails it will try to process that message indefinitely (or until it expires), keeping other messages from being processed as a result.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2019/08/LambdaConsiderations_1-2_Invocation.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jwVzFz-b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2019/08/LambdaConsiderations_1-2_Invocation.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To my knowledge, there are no Pull models that do Event type invocations. Pull models are further divided into two sections, stream-based and non-stream based. Also, note that the API Gateway invocation type can be changed to Event (async) by adding a header before sending the data to the Lambda.&lt;/p&gt;

&lt;h2&gt;
  
  
  3) Failure and retry behavior
&lt;/h2&gt;

&lt;p&gt;This is probably one of the most important considerations: how a Lambda fails and retries depends on the invocation type. For all &lt;strong&gt;&lt;em&gt;Event&lt;/em&gt;&lt;/strong&gt;-based invocations, if the Lambda throws an error it will be invoked two more times, so three times in total, separated by a delay. If a Dead Letter Queue (DLQ) is configured, the failed event will then be sent to the configured SQS queue or SNS topic; otherwise the error is only recorded in CloudWatch.&lt;/p&gt;

&lt;p&gt;With the &lt;strong&gt;&lt;em&gt;RequestResponse&lt;/em&gt;&lt;/strong&gt; invocation type, the caller needs to act on the error returned. For API Gateway (Push + Request Response) the caller can log the failure and retry. Kinesis Streams (Pull stream-based + Request Response) act as a FIFO queue/stream, which means that if the first message errors in the Lambda, it blocks the whole stream from being processed until that message either expires or is processed successfully.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Idempotent system: a system that always produces the same result for the same input, no matter how many times it is invoked.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It’s important to &lt;strong&gt;understand the failure and retry behavior of each invocation type&lt;/strong&gt;, and as a general rule of thumb, design all your functions to be idempotent. This means that if the function is invoked multiple times with the same input data, the output will/must always be the same. When you design like this, the retry behavior will not be a problem in your system.&lt;/p&gt;

&lt;h2&gt;
  
  
  4) Versions and Aliases
&lt;/h2&gt;

&lt;p&gt;AWS provides Versions and Aliases out of the box for your Lambda code. This might not be as straightforward and useful as you would think. A few things to keep in mind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Versioning only applies to the Lambda code, not to the Infrastructure that it uses and depends on.&lt;/li&gt;
&lt;li&gt;Once a version is published, it basically becomes read-only.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;There are three ways in which you can use versioning and aliases&lt;/strong&gt;. The first is a single Lambda function that gets a new version number whenever there is a change to code or configuration. An alias is used per stage and pointed to the correct version of the Lambda function.&lt;/p&gt;

&lt;p&gt;Again, it’s imperative to note that older versions cannot be changed. If, for example, version 3 (now the Live alias/stage) needs a change, you can’t even quickly increase the timeout setting. In order to change it, you would need to redeploy version 3 as version 5 with the new setting and then point the Live alias to version 5. Keeping in mind that version 5 actually contains older code than version 4, this gets unnecessarily complex very quickly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2019/08/LambdaConsiderations_1-3_VersionAliases_1.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--X0Ie0V3A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2019/08/LambdaConsiderations_1-3_VersionAliases_1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The second method that comes to mind is a blue-green deployment, which is a little less complex. You would have three different Lambdas, one for each stage, with blue being the old version and green being the new version. Just like before, each new deployment of a Lambda is versioned. When you are ready to make the new code changes live, you configure the alias so that, for example, 90% of traffic uses the old version and the remaining 10% of requests go to the new version. AWS doesn’t label it as such, but this is a canary deployment: it allows you to gradually shift traffic to the new version.&lt;/p&gt;
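&lt;p&gt;The 90/10 split maps onto the &lt;code&gt;RoutingConfig&lt;/code&gt; of Lambda’s UpdateAlias API. A sketch that builds the arguments for boto3’s &lt;code&gt;update_alias&lt;/code&gt; (the function name, alias name and version numbers are hypothetical):&lt;/p&gt;

```python
# Builds the UpdateAlias arguments for a weighted (canary) alias.
# Function name, alias and versions are hypothetical placeholders.
def canary_alias_args(function_name, alias, old_version, new_version, new_weight):
    assert 0.0 < new_weight < 1.0
    return {
        "FunctionName": function_name,
        "Name": alias,
        "FunctionVersion": old_version,  # receives 1 - new_weight of traffic
        "RoutingConfig": {
            # the additional version receives new_weight of the traffic
            "AdditionalVersionWeights": {new_version: new_weight}
        },
    }

# 90% of traffic stays on version 3, 10% goes to version 4.
args = canary_alias_args("my-function", "live", "3", "4", 0.1)
# e.g. boto3.client("lambda").update_alias(**args)
```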

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2019/08/LambdaConsiderations_1-3_VersionAliases_2.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SwmhMllW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2019/08/LambdaConsiderations_1-3_VersionAliases_2.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The third method is the simplest and plays nicely with IaC (Infrastructure as Code) tools like CloudFormation, SAM and CICD (Continuous Integration Continuous Deployment) pipelines. It’s based on the principle that &lt;strong&gt;each Lambda is “tightly” coupled with its environment/infrastructure.&lt;/strong&gt; The whole environment and Lambda are deployed together, any rollback will mean that a previous version of the infrastructure and Lambda needs to be deployed again. This offloads the responsibility of versioning to the IaC tool being used. Each Lambda function name includes the stage and is deployed as a whole, with the infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2019/08/LambdaConsiderations_1-3_VersionAliases_3.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--p6qkQSZI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2019/08/LambdaConsiderations_1-3_VersionAliases_3-1024x232.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  5) VPC
&lt;/h2&gt;

&lt;p&gt;The main reason to place a Lambda inside a VPC is so that it can access other AWS resources inside the VPC on their internal IP addresses/endpoints. If the function does not need to access any resources inside the VPC, it is strongly advised to leave it outside the VPC. &lt;del&gt;The reason being that inside the VPC each Lambda container will create a new Elastic Network Interface (ENI) and IP address. Your Lambda will be constrained by how fast this can scale and the amount of IP addresses and ENIs you have.&lt;/del&gt; &lt;strong&gt;[UPDATE: See end for: Improved VPC networking]&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;As soon as you place the Lambda inside the VPC, it loses all connectivity to the public internet.&lt;/strong&gt; This is because the ENIs attached to the Lambdas only have private IP addresses. It is therefore best practice to assign the Lambda to three private subnets inside the VPC, then route the private subnets through a NAT in one of the public subnets. The NAT has a public IP and sends all traffic to the Internet Gateway. This also has the benefit that the egress traffic from all Lambdas comes from a single IP address, but it introduces a single point of failure; this is mitigated by using a NAT Gateway rather than a NAT instance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2019/08/LambdaConsiderations_1-4_VPC.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--myucBR2Z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2019/08/LambdaConsiderations_1-4_VPC.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  6) Security
&lt;/h2&gt;

&lt;p&gt;As with all AWS services, the principle of least privilege should be applied to the IAM Roles of Lambda functions. When creating IAM Roles, don’t set the Resource to all (*), set the specific resource. Setting and assigning IAM roles this way can be annoying, but is worth the effort in the end. By glancing at the IAM Role you will then be able to know what resources are being accessed by the Lambda and then also how they are being used (from the Action attribute). It can also be used for discovering service dependencies at a glance.&lt;/p&gt;
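&lt;p&gt;A least-privilege policy sketch for such a role, expressed as the policy document a Lambda role might carry. The table name, region and account number are hypothetical; the point is that both &lt;code&gt;Action&lt;/code&gt; and &lt;code&gt;Resource&lt;/code&gt; are pinned down instead of &lt;code&gt;*&lt;/code&gt;:&lt;/p&gt;

```python
import json

# Least-privilege IAM policy sketch: the Resource is a specific ARN, not "*".
# Table name, region and account id are hypothetical placeholders.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["dynamodb:GetItem", "dynamodb:PutItem"],
        "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/Orders",
    }],
}

# Glancing at the role now tells you which resource the function touches
# (Resource) and how it uses it (Action).
print(json.dumps(policy, indent=2))
```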

&lt;h2&gt;
  
  
  7) Concurrency and scaling
&lt;/h2&gt;

&lt;p&gt;&lt;del&gt;If your function is inside a VPC, there must be enough IP addresses and ENIs for scaling. A Lambda can potentially scale to such an extent that it depletes all the IPs and/or ENIs for the subnets/VPC it is placed in.&lt;/del&gt; &lt;strong&gt;[UPDATE: See end for: Improved VPC networking]&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To prevent this, set the concurrency of the Lambda to something reasonable. By default, AWS sets a limit of 1000 concurrent executions for all the Lambdas combined in your account, of which you can assign 900 and the other 100 is reserved for functions with no limits.&lt;/p&gt;

&lt;p&gt;For Push model invocations (ex: S3 Events), Lambda scales with the number of incoming requests until the concurrency or account limit is reached. For all Pull model invocation types, scaling is not instant. For the stream-based Pull model with Request Response invocation types (ex: DynamoDB Streams and Kinesis), the number of concurrent Lambdas running will be the same as the number of shards in the stream.&lt;/p&gt;

&lt;p&gt;With the non-stream-based Pull model with Request Response invocation types (ex: SQS), Lambdas are gradually spun up to clear the queue as quickly as possible: starting with five concurrent Lambdas, then increasing by 60 per minute up to 1000 in total, or until the limits are reached.&lt;/p&gt;
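&lt;p&gt;The SQS ramp described above can be sketched as simple arithmetic: start at 5 concurrent Lambdas, add 60 per minute, and cap at 1000 (or at whatever concurrency limit applies first):&lt;/p&gt;

```python
# Sketch of the SQS scaling ramp: 5 initial concurrent Lambdas,
# +60 per minute, capped at 1000 or the configured concurrency limit.
def sqs_concurrency(minutes_elapsed, limit=1000):
    return min(5 + 60 * minutes_elapsed, limit)

print(sqs_concurrency(0))   # 5
print(sqs_concurrency(10))  # 605
print(sqs_concurrency(30))  # capped at 1000
```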

&lt;h2&gt;
  
  
  8) Cold starts
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Each Lambda is an actual container&lt;/strong&gt; on a server. When your Lambda is invoked it will try to send the data to a warm Lambda, a Lambda container that is already started and just sitting there waiting for event data. If it does not find any warm Lambda containers, it will start/launch a new Lambda container, wait for it to be ready and then send the event data. This wait time can be significant in certain cases.&lt;/p&gt;

&lt;p&gt;Take note that if a container does not receive event data for a certain period it will be destroyed, reducing the number of warm Lambdas for your function. &lt;strong&gt;Compiled languages like Java can take many seconds to cold start,&lt;/strong&gt; whereas interpreted languages like JavaScript and Python usually take milliseconds. When your Lambda is inside a VPC, the cold start time increases even more as it needs to wait for an ENI (private IP address) before being ready.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Even milliseconds can be significant in certain environments.&lt;/strong&gt; The only method to keep a container warm is to manually ping it. This is usually done with a CloudWatch Event Rule (cron) and another Lambda; the cron can be set to, for example, every five minutes. The CloudWatch rule invokes the Lambda that pings the function you want to keep warm. Keep in mind that one ping will only keep one warm Lambda container alive: if you want to keep three Lambda containers warm, the ping Lambda must invoke the function three times in parallel.&lt;/p&gt;
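&lt;p&gt;On the receiving side, the warmed function needs to recognise the ping and short-circuit before doing any real work. A sketch of that handler; the &lt;code&gt;warmup&lt;/code&gt; marker field and the event shape are assumptions, pick whatever convention your ping Lambda sends:&lt;/p&gt;

```python
# Warm-up pattern sketch: the scheduled "ping" event carries a marker
# field (the name "warmup" is an assumption) and the handler returns
# immediately, keeping the container alive without doing real work.
def handler(event, context=None):
    if event.get("warmup"):
        return {"warm": True}          # short-circuit: no real work
    return {"result": process(event)}  # normal invocation path

def process(event):
    return event["value"] * 2          # placeholder for the real work

# The CloudWatch-rule Lambda would invoke this N times in parallel
# (InvocationType="Event") to keep N containers warm.
print(handler({"warmup": True}))
print(handler({"value": 21}))
```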

&lt;p&gt;&lt;a href="https://www.rehanvdm.com/contents/data/2019/08/LambdaConsiderations_1-5_ColdStart_Ping.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ISg5yeXG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.rehanvdm.com/contents/data/2019/08/LambdaConsiderations_1-5_ColdStart_Ping-1024x206.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;[UPDATE: Improved VPC networking]&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS Hyperplane now creates a shared network interface when your Lambda function is first created or when its VPC settings are updated, improving function setup performance and scalability. This one-time setup can take up to 90 seconds to complete.&lt;/li&gt;
&lt;li&gt;Functions in the same account that share the same &lt;strong&gt;security group:subnet pairing&lt;/strong&gt; use the same network interfaces. This means that there is no longer a direct correlation between concurrent Lambdas and ENIs.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>lambda</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
