<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Claudio Taverna</title>
    <description>The latest articles on DEV Community by Claudio Taverna (@ctaverna).</description>
    <link>https://dev.to/ctaverna</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1077550%2F05ec5e62-80c4-45c5-a7d4-90dfa2f1b424.jpeg</url>
      <title>DEV Community: Claudio Taverna</title>
      <link>https://dev.to/ctaverna</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ctaverna"/>
    <language>en</language>
    <item>
      <title>Stop Drawing, Start Coding: Modern AWS Architecture with LikeC4</title>
      <dc:creator>Claudio Taverna</dc:creator>
      <pubDate>Wed, 17 Dec 2025 11:44:19 +0000</pubDate>
      <link>https://dev.to/aws-builders/stop-drawing-start-coding-modern-aws-architecture-with-likec4-31fc</link>
      <guid>https://dev.to/aws-builders/stop-drawing-start-coding-modern-aws-architecture-with-likec4-31fc</guid>
      <description>&lt;h1&gt;
  
  
  What is the C4 Model?
&lt;/h1&gt;

&lt;p&gt;The &lt;a href="https://c4model.com" rel="noopener noreferrer"&gt;C4 model&lt;/a&gt; is an approach for structuring and visualizing software architecture design. It was created by &lt;a href="https://simonbrown.je" rel="noopener noreferrer"&gt;Simon Brown&lt;/a&gt;.&lt;br&gt;&lt;br&gt;
It aims to provide different levels of detail, much like maps: from a high-level view down to the code.&lt;br&gt;
The name stands for Context, Containers, Components, and Code.&lt;br&gt;
These are the four hierarchical levels of abstraction:&lt;/p&gt;
&lt;h3&gt;
  
  
  Context (Level 1)
&lt;/h3&gt;

&lt;p&gt;The Big Picture. Shows the software system and its relationships to users and other systems.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Focus&lt;/strong&gt;: Why the system exists and who it interacts with.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Audience&lt;/strong&gt;: Technical and non-technical stakeholders (everyone).&lt;/p&gt;
&lt;h3&gt;
  
  
  Containers (Level 2)
&lt;/h3&gt;

&lt;p&gt;The Deployable Units. Breaks down the system into its major building blocks (e.g., web app, mobile app, database, microservice, serverless function).&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Focus&lt;/strong&gt;: How responsibilities are distributed across separate running processes.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Audience&lt;/strong&gt;: Technical people (Architects, Developers, Ops).  &lt;/p&gt;
&lt;h3&gt;
  
  
  Components (Level 3)
&lt;/h3&gt;

&lt;p&gt;The Modules inside a Container. Zooms into a single Container to show its internal structure.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Focus&lt;/strong&gt;: How the code is organized into logical groups (e.g., services, repositories, controllers).&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Audience&lt;/strong&gt;: Developers and Architects.  &lt;/p&gt;
&lt;h3&gt;
  
  
  Code (Level 4)
&lt;/h3&gt;

&lt;p&gt;The Deepest Detail. Focuses on the implementation details, often using UML, ERDs, or generated code-level diagrams.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Focus&lt;/strong&gt;: Specific classes, interfaces, and functions.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Audience&lt;/strong&gt;: Developers.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0knes01f8rj2h6os87b4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0knes01f8rj2h6os87b4.png" alt="Overview of C4 layers" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The key advantages of this approach:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Standardization&lt;/strong&gt;: A common language for discussing architecture.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Progressive Detail&lt;/strong&gt;: Stakeholders can choose the level of detail they need.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clarity&lt;/strong&gt;: It focuses on boxes and lines (simple notation) rather than complex UML shapes, making it easier to read.&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;
  
  
  Advantages of "Diagram as Code"
&lt;/h1&gt;

&lt;p&gt;When documenting complex modern systems, treating your architecture diagrams as code (Diagram as Code, or "DaC") offers significant benefits over traditional drawing tools.&lt;/p&gt;
&lt;h2&gt;
  
  
  Speed and Efficiency
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Faster Creation&lt;/strong&gt;: You define the architecture using a text-based language instead of dragging and dropping shapes. This is typically much quicker.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rapid Updates&lt;/strong&gt;: Changes to the architecture can be implemented and propagated across multiple diagrams instantly by editing a single source file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduced Friction&lt;/strong&gt;: No need to open external desktop applications. Diagrams can be edited directly in your IDE (e.g., VS Code).&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Consistency and Accuracy
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single Source of Truth&lt;/strong&gt;: The architecture definition is in one place. Every diagram level (Context, Container, Component) is guaranteed to be consistent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No Stale Diagrams&lt;/strong&gt;: By integrating the diagram source files into your version control system, the diagrams are versioned alongside the code they represent. They are always up-to-date with the codebase.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Easy Review&lt;/strong&gt;: Diagrams can be reviewed in a Pull Request just like source code. This enforces architectural rigor and collaboration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ADR friendly&lt;/strong&gt;: Keeping diagrams updated is a great way to spot when a new &lt;a href="https://dev.to/ctaverna/a-practical-overview-on-architecture-decision-records-adr-d7d"&gt;Architectural Decision Record&lt;/a&gt; (ADR) should be created. Once your diagrams are aligned, every PR touching your diagram code should &lt;em&gt;probably&lt;/em&gt; be the consequence of a new ADR. And vice versa, of course.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Culture and Developer Ownership
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Closer to Code&lt;/strong&gt;: Defining architecture in a text file (rather than a GUI tool) makes documentation feel like code. This lowers the barrier to entry for developers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Increased Ownership&lt;/strong&gt;: Developers are comfortable with text files, IDEs, and Git. This integration helps the development team take greater ownership of the architecture diagrams, ensuring they are maintained actively.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Source of Truth Shift&lt;/strong&gt;: The architectural definition lives within the repository, making it a first-class citizen alongside the source code.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Maintainability and Automation
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Version Control&lt;/strong&gt;: You get a full history of architectural changes, allowing you to track who changed what and when. You can easily revert to previous versions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Text-Based Diff&lt;/strong&gt;: Comparing architectural versions becomes simple; Git shows the specific lines of text that changed, not just a binary file difference.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automation&lt;/strong&gt;: The text files can be processed and rendered as part of your CI/CD pipeline. This means the documentation is automatically published upon every merge.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Searchable&lt;/strong&gt;: The architectural definition is plain text, making it easily searchable by developers.&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;
  
  
  Why LikeC4 is a Good Option
&lt;/h1&gt;

&lt;p&gt;As a software architect working extensively with AWS and complex distributed systems, I have found that certain tools dramatically improve documentation efficiency.&lt;br&gt;&lt;br&gt;
This is my opinionated list of the core advantages that make &lt;a href="https://likec4.dev" rel="noopener noreferrer"&gt;LikeC4&lt;/a&gt; the preferred choice for C4 modeling.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Relationship Focus&lt;/strong&gt;: Core definition focuses on explicit relationships, accurately mapping communication paths.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interactive Web Viewer&lt;/strong&gt;: Generates a fast, modern web application for viewing. Stakeholders can easily navigate and drill down through C4 levels.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Easy Deployment&lt;/strong&gt;: The output is a static web viewer, perfect for deployment to common cloud services like AWS S3/CloudFront.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embeddable&lt;/strong&gt;: Simple to integrate into existing documentation (Confluence, GitBook, etc.).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic Layout&lt;/strong&gt;: Architects focus only on the elements and relationships; LikeC4 handles the complex visual arrangement, saving time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IntelliSense/Validation&lt;/strong&gt;: Integrates with IDEs (e.g., VS Code) for auto-completion and syntax checking, minimizing errors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tags and Styling&lt;/strong&gt;: Easy tagging (e.g., tag:aws-lambda) allows for custom styling, enhancing readability and technology recognition.&lt;/li&gt;
&lt;/ul&gt;
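&lt;p&gt;As a quick illustration of that last point, tags are declared once in the specification and then attached to elements. This is only a hypothetical sketch: the tag name &lt;code&gt;aws_lambda&lt;/code&gt; and the element names are examples, not part of the article's files; check the LikeC4 docs for the styling options available in your version.&lt;/p&gt;

```
// Hypothetical sketch (names are examples)

// 1. Declare the tag once in the specification
specification {
  element container
  tag aws_lambda
}

// 2. Attach it to elements in the model
model {
  order_service = container "Order Lambda" {
    #aws_lambda
  }
}
```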
&lt;h1&gt;
  
  
  Let's Build an Example of Architecture Documentation with LikeC4
&lt;/h1&gt;

&lt;p&gt;To truly appreciate the power of Diagrams as Code and LikeC4, let's move from theory to practice.&lt;br&gt;&lt;br&gt;
We will document a common, modern, cloud-native architecture using LikeC4's DSL (Domain Specific Language).&lt;/p&gt;

&lt;p&gt;Imagine we are documenting a small, serverless e-commerce application running on AWS. The system needs to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Allow customers to browse products and place orders via a frontend.&lt;/li&gt;
&lt;li&gt;Provide a backend API for processing transactions.&lt;/li&gt;
&lt;li&gt;Store data in a highly scalable and resilient manner.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This scenario is perfect for illustrating the C4 Model's multiple views and LikeC4's ability to reuse component definitions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkh1k7iqhmk7iklf9r7lc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkh1k7iqhmk7iklf9r7lc.png" alt="Brainstorming handmade diagram" width="800" height="572"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 1: Setting up the Project Structure
&lt;/h2&gt;

&lt;p&gt;To build documentation that scales, we shouldn't put everything in one file. We will split our architecture into logical files: global definitions, our specific system, and the views.  &lt;/p&gt;

&lt;p&gt;My suggestion is to use VSCode with the great &lt;a href="https://marketplace.visualstudio.com/items?itemName=likec4.likec4-vscode" rel="noopener noreferrer"&gt;LikeC4 Extension&lt;/a&gt;, which lets you see the diagram preview in real time inside VSCode itself.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmitdavcrvygoctu94d4c.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmitdavcrvygoctu94d4c.jpg" alt="The VSCode LikeC4 extension" width="640" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First, we create a basic project structure. We will place all our LikeC4 definition files (using the .c4 extension) inside a dedicated &lt;code&gt;docs&lt;/code&gt; directory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my-git-repository/
├── docs/
│   ├── specs.c4             # Define the vocabulary
│   ├── workspace.c4         # Define how the company projects are organized
│   └── ecommerce/
│       ├── model.c4         # Define the system structure and relations
│       └── views.c4         # Define the views that we need
└── ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
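&lt;p&gt;If you like, you can scaffold this skeleton from the terminal (directory and file names match the tree above):&lt;/p&gt;

```shell
# Create the docs layout used in this article
mkdir -p docs/ecommerce
touch docs/specs.c4 docs/workspace.c4
touch docs/ecommerce/model.c4 docs/ecommerce/views.c4
```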



&lt;h2&gt;
  
  
  Step 2: Defining the Language
&lt;/h2&gt;

&lt;p&gt;First, we define our "vocabulary." This ensures consistent styling across all diagrams.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;// docs/specs.c4

specification &lt;span class="o"&gt;{&lt;/span&gt;
  element actor &lt;span class="o"&gt;{&lt;/span&gt;
    style &lt;span class="o"&gt;{&lt;/span&gt; shape person &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
  element system
  element container
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 3: The corporate landscape
&lt;/h2&gt;

&lt;p&gt;Here we define the "Global Context."&lt;br&gt;&lt;br&gt;
This includes, for example, an External Warehouse System that provides stock data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;// docs/workspace.c4

// External systems
model &lt;span class="o"&gt;{&lt;/span&gt;

  // External System defined at the corporate level
  warehouse_api &lt;span class="o"&gt;=&lt;/span&gt; system &lt;span class="s2"&gt;"Warehouse Management System"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    description &lt;span class="s2"&gt;"External legacy system providing stock availability."&lt;/span&gt;

    style &lt;span class="o"&gt;{&lt;/span&gt;
      color muted
    &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 4: The E-commerce Architecture
&lt;/h2&gt;

&lt;p&gt;Now, we define our internal system.&lt;br&gt;&lt;br&gt;
Notice, as a suggested best practice, how relationships are defined inside the source element (the active actor).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;// docs/ecommerce/model.c4

model &lt;span class="o"&gt;{&lt;/span&gt;

  customer &lt;span class="o"&gt;=&lt;/span&gt; actor &lt;span class="s2"&gt;"Customer"&lt;/span&gt; &lt;span class="s2"&gt;"A retail customer"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      description &lt;span class="s2"&gt;"A user who browses and purchases products via the e-commerce platform."&lt;/span&gt;

      // High level relationship to the ecommerce system
      -&amp;gt; ecommerce &lt;span class="s2"&gt;"Uses"&lt;/span&gt;
      // Define how the user enters the system
      -&amp;gt; ecommerce.static_assets &lt;span class="s2"&gt;"Browses via Browser"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;


  ecommerce &lt;span class="o"&gt;=&lt;/span&gt; system &lt;span class="s2"&gt;"E-commerce Platform"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    description &lt;span class="s2"&gt;"Allows customers to browse and purchase products online."&lt;/span&gt;

    // High level relationship to external warehouse system
    -&amp;gt; warehouse_api &lt;span class="s2"&gt;"Checks stock availability from"&lt;/span&gt;

    // 1. Frontend Layer
    spa &lt;span class="o"&gt;=&lt;/span&gt; container &lt;span class="s2"&gt;"Web Portal"&lt;/span&gt; &lt;span class="s2"&gt;"React Application"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      -&amp;gt; api_gateway &lt;span class="s2"&gt;"Sends requests to"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    static_assets &lt;span class="o"&gt;=&lt;/span&gt; container &lt;span class="s2"&gt;"S3 Bucket"&lt;/span&gt; &lt;span class="s2"&gt;"Static Web Hosting"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      -&amp;gt; spa &lt;span class="s2"&gt;"Serves JS/Assets to"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    // 2. API Layer
    api_gateway &lt;span class="o"&gt;=&lt;/span&gt; container &lt;span class="s2"&gt;"API Gateway"&lt;/span&gt; &lt;span class="s2"&gt;"REST Interface"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      -&amp;gt; order_service &lt;span class="s2"&gt;"Routes requests to"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    // 3. Logic Layer
    order_service &lt;span class="o"&gt;=&lt;/span&gt; container &lt;span class="s2"&gt;"Order Lambda"&lt;/span&gt; &lt;span class="s2"&gt;"Node.js / TypeScript"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      description &lt;span class="s2"&gt;"Handles order placement and validation."&lt;/span&gt;

      -&amp;gt; db &lt;span class="s2"&gt;"Reads/Writes data"&lt;/span&gt;
      -&amp;gt; inventory_proxy &lt;span class="s2"&gt;"Requests stock check"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    inventory_proxy &lt;span class="o"&gt;=&lt;/span&gt; container &lt;span class="s2"&gt;"Inventory Lambda"&lt;/span&gt; &lt;span class="s2"&gt;"Python"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      description &lt;span class="s2"&gt;"Integrates with external warehouse."&lt;/span&gt;

      // Relationship to external system defined here &lt;span class="o"&gt;(&lt;/span&gt;the &lt;span class="nb"&gt;caller&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
      -&amp;gt; warehouse_api &lt;span class="s2"&gt;"Fetches stock levels via REST API"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    // 4. Data Layer
    db &lt;span class="o"&gt;=&lt;/span&gt; container &lt;span class="s2"&gt;"Products &amp;amp; Orders"&lt;/span&gt; &lt;span class="s2"&gt;"DynamoDB"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      description &lt;span class="s2"&gt;"Stores product catalog and order information."&lt;/span&gt;
      style &lt;span class="o"&gt;{&lt;/span&gt;
        shape cylinder
      &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 5: Visualizing the Architecture
&lt;/h2&gt;

&lt;p&gt;Finally, we create some sample views.&lt;br&gt;&lt;br&gt;
LikeC4 allows for very clean view definitions with powerful filtering.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;// docs/ecommerce/views.c4

views &lt;span class="o"&gt;{&lt;/span&gt;

  // 1. System Context View
  view index &lt;span class="o"&gt;{&lt;/span&gt;
    title &lt;span class="s2"&gt;"System Context - E-commerce Platform"&lt;/span&gt;
    include &lt;span class="k"&gt;*&lt;/span&gt;
    exclude ecommerce.&lt;span class="k"&gt;*&lt;/span&gt; // Only show the high-level boundaries
  &lt;span class="o"&gt;}&lt;/span&gt;

  // 2. Backend Container View &lt;span class="o"&gt;(&lt;/span&gt;AWS focus&lt;span class="o"&gt;)&lt;/span&gt;
  view containerView of ecommerce &lt;span class="o"&gt;{&lt;/span&gt;
    title &lt;span class="s2"&gt;"AWS Backend Architecture"&lt;/span&gt;

    // Include backend containers within ecommerce plus the external warehouse

    include ecommerce
    include ecommerce.api_gateway
    include ecommerce.order_service
    include ecommerce.inventory_proxy
    include ecommerce.db
    include warehouse_api

    autoLayout LeftRight
  &lt;span class="o"&gt;}&lt;/span&gt;

  // 3. Specific focus on Inventory Integration
  view inventoryFlow of ecommerce &lt;span class="o"&gt;{&lt;/span&gt;
    title &lt;span class="s2"&gt;"Integration Detail: Inventory Check"&lt;/span&gt;
    include 
      ecommerce.order_service,
      ecommerce.inventory_proxy,
      warehouse_api
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  LikeC4 CLI and local testing
&lt;/h1&gt;

&lt;p&gt;If you have used VSCode with the LikeC4 extension, you have already seen the live preview in action; now it's time to see the real power of this solution.&lt;/p&gt;

&lt;p&gt;With just these four files in our local directory, we have defined a full, navigable C4 model.&lt;br&gt;
Now, let's install the LikeC4 CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-D&lt;/span&gt; @likec4/cli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the LikeC4 installation and version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx likec4 &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And finally start a local dev server to preview LikeC4 views:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx likec4 start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will start the local server and open your browser at the address &lt;code&gt;http://localhost:5173&lt;/code&gt;.&lt;br&gt;&lt;br&gt;
The first web version of our DaC is now available and ready to be explored:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fckkk0a1f62pey7wmox25.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fckkk0a1f62pey7wmox25.png" alt="The main view of your diagrams" width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Feel free to play with your local deployment and enjoy the features of LikeC4.&lt;br&gt;&lt;br&gt;
This is just a first version of our diagrams. In future articles we will improve the styling and dive deeper into best practices, but for now it's enough to understand the potential.&lt;/p&gt;

&lt;p&gt;If everything works as expected, we are finally ready to deploy our brand-new DaC on AWS.&lt;/p&gt;
&lt;h1&gt;
  
  
  Shipping your Architecture Docs to AWS (The Manual Way)
&lt;/h1&gt;

&lt;p&gt;Now that we have our .c4 files ready, it’s time to move from "Code" to "Cloud".&lt;br&gt;&lt;br&gt;
In this section, we will manually build our documentation and host it on AWS.&lt;br&gt;&lt;br&gt;
We’ll use Amazon S3 for storage and Amazon CloudFront as our Content Delivery Network (CDN) to serve our diagrams securely via HTTPS.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 1: Build the Static Site Locally
&lt;/h2&gt;

&lt;p&gt;First, we need to transform our DSL into a set of web-ready files. The LikeC4 CLI can generate an optimized Single Page Application.&lt;/p&gt;

&lt;p&gt;Run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Generate the static site in the 'dist' folder&lt;/span&gt;
npx likec4 build &lt;span class="nt"&gt;-o&lt;/span&gt; ./dist
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Inside the &lt;code&gt;./dist&lt;/code&gt; folder, you will now see an &lt;code&gt;index.html&lt;/code&gt; file and an &lt;code&gt;assets&lt;/code&gt; folder.&lt;br&gt;
This is your entire architecture portal, ready to be hosted.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 2: Create a Private S3 Bucket
&lt;/h2&gt;

&lt;p&gt;Even though we are hosting a website, we won't use the "S3 Static Website Hosting" feature.&lt;br&gt;&lt;br&gt;
Instead, we’ll keep the bucket private and serve the content exclusively through CloudFront for better security and performance.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Log into the AWS Management Console, navigate to S3 and create a new bucket.&lt;/li&gt;
&lt;li&gt;Choose a globally unique bucket name and your preferred region&lt;/li&gt;
&lt;li&gt;Object Ownership: Leave as "&lt;strong&gt;ACLs disabled.&lt;/strong&gt;"&lt;/li&gt;
&lt;li&gt;Block Public Access settings: Keep "&lt;strong&gt;Block all public access&lt;/strong&gt;" checked.&lt;/li&gt;
&lt;li&gt;Bucket versioning: not needed&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Step 3: Upload your Files
&lt;/h2&gt;

&lt;p&gt;Now open your newly created bucket.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Click Upload&lt;/li&gt;
&lt;li&gt;Drag and drop all the contents of your local dist folder (not the folder itself, just the files inside) into the S3 console.&lt;/li&gt;
&lt;li&gt;Confirm and upload the files (~5MB)&lt;/li&gt;
&lt;/ul&gt;
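&lt;p&gt;If you prefer the terminal, the same upload can be done with the AWS CLI. The bucket name below is a placeholder for your own:&lt;/p&gt;

```shell
# Sync the build output to the bucket; --delete removes stale files from previous builds
aws s3 sync ./dist s3://YOUR-BUCKET-NAME --delete
```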
&lt;h2&gt;
  
  
  Step 4: Configure CloudFront for Secure Delivery
&lt;/h2&gt;

&lt;p&gt;Now, we’ll set up CloudFront to act as the gateway to our private S3 bucket.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Navigate to the CloudFront console and click Create distribution (Free plan is ok).&lt;/li&gt;
&lt;li&gt;Type a name and continue&lt;/li&gt;
&lt;li&gt;Origin type: S3&lt;/li&gt;
&lt;li&gt;S3 Origin: Select your S3 bucket from the dropdown&lt;/li&gt;
&lt;li&gt;Default root object: "index.html"&lt;/li&gt;
&lt;li&gt;Private S3 bucket access: Leave the default settings&lt;/li&gt;
&lt;li&gt;Security options: Leave the default settings&lt;/li&gt;
&lt;li&gt;Cache settings: Leave the default settings&lt;/li&gt;
&lt;li&gt;Click Create distribution and wait for the deployment to complete&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Step 5: View Your Architecture
&lt;/h2&gt;

&lt;p&gt;It will take a few minutes for the CloudFront distribution to deploy (Status: "Enabled").&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Find and copy your distribution domain name (e.g., abcdd123456789.cloudfront.net).&lt;/li&gt;
&lt;li&gt;Paste it into your browser.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Congratulations! You now have a professional, secure, and globally available architecture documentation site.&lt;br&gt;&lt;br&gt;
Your stakeholders can now navigate through the layers of your AWS system, from the high-level Context down to specific Lambda components.&lt;/p&gt;

&lt;p&gt;Here is my &lt;a href="https://d39cysq6ourkgr.cloudfront.net" rel="noopener noreferrer"&gt;LikeC4 demo deployment&lt;/a&gt; if you want to take a look.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 6: Embedding the Diagram (The "Live" Documentation)
&lt;/h2&gt;

&lt;p&gt;Now that our architecture is live on CloudFront, we can do something powerful: embed it directly into our technical documentation, wiki, or even this blog article.&lt;/p&gt;

&lt;p&gt;Unlike a static PNG export, embedding the interactive viewer allows your readers to zoom, pan, and click to navigate between the views without leaving your page.&lt;br&gt;&lt;br&gt;
Plus, since it points to your CloudFront URL, whenever you update the code and redeploy, your blog post will be automatically updated with the latest architecture!&lt;/p&gt;
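&lt;p&gt;One caveat: CloudFront caches content at the edge, so after a redeploy you may need to invalidate the cache before the new version becomes visible. The distribution ID below is a placeholder:&lt;/p&gt;

```shell
# Force CloudFront to fetch the new build from S3
aws cloudfront create-invalidation --distribution-id YOUR_DISTRIBUTION_ID --paths "/*"
```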

&lt;p&gt;Here is the simplest way to do it using a standard HTML iframe:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;iframe&lt;/span&gt; 
  &lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"https://abcd123456789.cloudfront.net/index.html?view=myView"&lt;/span&gt; 
  &lt;span class="na"&gt;width=&lt;/span&gt;&lt;span class="s"&gt;"100%"&lt;/span&gt; 
  &lt;span class="na"&gt;height=&lt;/span&gt;&lt;span class="s"&gt;"500px"&lt;/span&gt; 
  &lt;span class="na"&gt;style=&lt;/span&gt;&lt;span class="s"&gt;"border: 1px solid #e5e7eb; border-radius: 8px;"&lt;/span&gt;
&lt;span class="nt"&gt;&amp;gt;&amp;lt;/iframe&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Result
&lt;/h3&gt;

&lt;p&gt;The DEV platform does not allow iframes or any JavaScript, but you can find a working example of this live integration in the &lt;a href="https://ctaverna.github.io/likec4-deploy-aws/#the-result" rel="noopener noreferrer"&gt;original post&lt;/a&gt; on my blog.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why this matters
&lt;/h3&gt;

&lt;p&gt;As mentioned above, this approach has several advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Interactivity&lt;/strong&gt;: Stakeholders can explore the model themselves.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single Source of Truth&lt;/strong&gt;: You don't need to copy-paste screenshots every time you rename, add or update something.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contextual Deep Links&lt;/strong&gt;: You can link directly to specific views (e.g., ?view=myView) to highlight specific discussions in your online documentation, wiki or article.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  What's Next?
&lt;/h1&gt;

&lt;p&gt;Manual deployment is great for learning, but in real-world "Architecture as Code" or "Diagram as Code" workflows, we want this to happen automatically whenever we commit changes.&lt;br&gt;&lt;br&gt;
This blog post was just a first taste of LikeC4. In the next articles, we will dive deep into all of LikeC4's features, improve the styling and file structure, and automate the entire Diagram as Code process using tools like Terraform and GitHub Actions.&lt;/p&gt;

</description>
      <category>documentation</category>
      <category>diagrams</category>
      <category>architecture</category>
      <category>c4model</category>
    </item>
    <item>
      <title>OpenSearch metrics challenge: can you spot the performance flaw?</title>
      <dc:creator>Claudio Taverna</dc:creator>
      <pubDate>Tue, 07 Jan 2025 13:14:52 +0000</pubDate>
      <link>https://dev.to/aws-builders/opensearch-metrics-challenge-can-you-spot-the-performance-flaw-3imd</link>
      <guid>https://dev.to/aws-builders/opensearch-metrics-challenge-can-you-spot-the-performance-flaw-3imd</guid>
      <description>&lt;p&gt;&lt;strong&gt;How a deep analysis of metrics can reveal a hidden issue&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;If you run an OpenSearch cluster, performance and cost optimization are a must, as with any other cloud service.&lt;br&gt;
By carefully analyzing performance metrics, we can uncover hidden issues and significantly boost the efficiency of our systems.&lt;/p&gt;

&lt;p&gt;In this article, we will explore a specific real-world case involving an error in the code, and how a deep analysis of the metrics led to a surprising revelation and a subsequent performance improvement.&lt;/p&gt;

&lt;p&gt;To make the reading experience a little more interactive, instead of starting from the final revelation, let's relive the experience our team had while managing the system.&lt;br&gt;
  If you don't want to play this little game, skip ahead to the section "The hidden anomaly" below.&lt;/p&gt;

&lt;h2&gt;
  
  
  The project context
&lt;/h2&gt;

&lt;p&gt;The OpenSearch cluster was used in our system for different application features, but mostly as a regular document store, with documents added and updated at a high frequency and queries executed mostly by ID.&lt;br&gt;
The indexes were configured with 5 shards and 1 replica, and all documents had a consistent schema.&lt;br&gt;
As often happens, the adoption of OpenSearch started with a single small feature and gradually more features were added.&lt;/p&gt;

&lt;p&gt;In the previous months, growing performance requirements and high CPU usage had pushed the original 3-node cluster up to 12 nodes (3 master nodes + 9 data nodes), and worryingly, performance seemed unable to improve any further, even when adding more nodes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three high-level metrics
&lt;/h2&gt;

&lt;p&gt;This is the recurring daily shape we see in many metrics, produced by a load pattern that follows business hours, with the lunch break carving a dip in the middle of the hill.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F50rgxdtczhdo5ov1wxfs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F50rgxdtczhdo5ov1wxfs.png" alt="Recurring shape" width="400" height="257"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS offers many different metrics, but in this article we will focus on just three of them.&lt;br&gt;
All the following charts show the average value of each metric over a three-day time range.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Indexing Data Rate&lt;/strong&gt; (IndexingRate): &lt;em&gt;The number of indexing operations per minute.&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9uo9h3f4n29vpoh2iawj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9uo9h3f4n29vpoh2iawj.png" alt="Indexing rate" width="800" height="217"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;HTTP requests&lt;/strong&gt; by response code: &lt;em&gt;The number of requests to a domain.&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgtt893j90rxnb4rx0v3z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgtt893j90rxnb4rx0v3z.png" alt="HTTP requests" width="800" height="266"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Search rate&lt;/strong&gt; (SearchRate): &lt;em&gt;The number of search operations per minute for all shards in the cluster.&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2aqff0kxv4pa2uur7rl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2aqff0kxv4pa2uur7rl.png" alt="Search rate" width="800" height="218"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
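As a hedged sketch of how these charts could be reproduced programmatically: Amazon OpenSearch Service publishes its metrics to CloudWatch under the `AWS/ES` namespace, with `DomainName` and `ClientId` dimensions. The helper below only builds the `GetMetricData` request (the domain name and account id are hypothetical placeholders), so it can be inspected before being sent with boto3.

```python
# Build a CloudWatch GetMetricData request for the three metrics above.
# Namespace "AWS/ES" and the DomainName/ClientId dimensions are what
# Amazon OpenSearch Service publishes to; the domain name and account id
# used below are hypothetical placeholders.

def build_metric_queries(domain, account_id,
                         metrics=("IndexingRate", "SearchRate", "2xx")):
    """Return the MetricDataQueries list for cloudwatch.get_metric_data."""
    return [
        {
            "Id": f"m{i}",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ES",
                    "MetricName": name,
                    "Dimensions": [
                        {"Name": "DomainName", "Value": domain},
                        {"Name": "ClientId", "Value": account_id},
                    ],
                },
                "Period": 300,      # 5-minute resolution
                "Stat": "Average",  # the charts above plot averages
            },
        }
        for i, name in enumerate(metrics)
    ]

queries = build_metric_queries("my-domain", "123456789012")
# To actually fetch the data (requires AWS credentials):
# import boto3, datetime
# cw = boto3.client("cloudwatch", region_name="eu-west-1")
# resp = cw.get_metric_data(MetricDataQueries=queries,
#                           StartTime=datetime.datetime(2025, 1, 1),
#                           EndTime=datetime.datetime(2025, 1, 4))
```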

&lt;p&gt;&lt;strong&gt;The Challenge:&lt;/strong&gt; Looking at these charts and knowing the context given above, can you spot anything strange?&lt;/p&gt;

&lt;h2&gt;
  
  
  Need a little help?
&lt;/h2&gt;

&lt;p&gt;If you have already found the anomaly, congratulations, you're a true master of OpenSearch monitoring!&lt;br&gt;
Otherwise, try to read the following clues, which will gradually lead you to the solution.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4a9krn9d1hjx96s3v2qh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4a9krn9d1hjx96s3v2qh.png" alt="Question" width="180" height="167"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Clue n.1
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;The anomaly is not related to the shape of the metric charts.
Okay, this clue is important, but it probably doesn't help much.
Let's move on.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Clue n.2
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Focus on the metric SearchRate.
Anything yet?&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Clue n.3
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Compare the SearchRate and HTTP requests metrics.
Hmm... yes, the imbalance between the morning hill and the afternoon hill is odd, but it's not what matters here.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Clue n.4
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Think about this detail given in the context: "queries are mostly executed by Id".
Still nothing?&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Clue n.5
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Consider the fact that &lt;em&gt;searches per Id&lt;/em&gt; should actually contribute to the &lt;em&gt;SearchRate&lt;/em&gt; metric, &lt;strong&gt;but&lt;/strong&gt; should be performed &lt;strong&gt;only on the shard&lt;/strong&gt; that is relevant, not all shards.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The hidden anomaly
&lt;/h2&gt;

&lt;p&gt;The interesting point here, which had been right in front of our eyes yet went unnoticed for months, is &lt;strong&gt;the scale&lt;/strong&gt; of the SearchRate metric.&lt;/p&gt;

&lt;p&gt;The shape of the metric was fairly regular and overlapped well with the others, but we had missed an important detail.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fryam4p37ugpqaj3t3hls.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fryam4p37ugpqaj3t3hls.png" alt="Idea" width="180" height="169"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's look at the top values during peak hours for these metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;8,000 HTTP requests per minute&lt;/li&gt;
&lt;li&gt;80,000 searches per minute&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1bagblp6zeb4n4horbfu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1bagblp6zeb4n4horbfu.png" alt="Metrics comparison" width="800" height="217"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Why is the search rate ~10 times higher than the request rate?&lt;/p&gt;

&lt;p&gt;As mentioned, the indexes were configured with 5 shards, so each HTTP request should trigger &lt;strong&gt;at most 5 search operations&lt;/strong&gt;, one for each shard of the index: 8,000 requests per minute should mean no more than 40,000 searches per minute, yet here we see an unexplained 10x ratio.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F860zynqx7g00ctnmlr50.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F860zynqx7g00ctnmlr50.png" alt="Think" width="180" height="171"&gt;&lt;/a&gt;&lt;br&gt;
Moreover, these are not full-text searches but simple &lt;em&gt;per-Id&lt;/em&gt; queries, so for each request we should see &lt;strong&gt;just one search&lt;/strong&gt;: after hashing the document id and applying its internal routing algorithm, the OpenSearch engine should go straight to the single relevant shard and ask it for the document.&lt;/p&gt;
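The routing logic just described can be sketched in a few lines. OpenSearch actually hashes the routing value (by default the document id) with murmur3 and takes it modulo the shard count; the toy model below uses an md5 hash purely as a stand-in, to illustrate why a per-Id query should touch a single shard while an unrouted search fans out to all of them.

```python
import hashlib

NUM_SHARDS = 5  # matches the index configuration described above

def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Pick the shard that owns a document id.

    OpenSearch uses a murmur3 hash of the routing value; md5 is used here
    only as a deterministic stand-in to illustrate the idea."""
    digest = hashlib.md5(doc_id.encode()).hexdigest()
    return int(digest, 16) % num_shards

def shards_touched(doc_id, routed):
    """A routed per-Id query hits one shard; an unrouted search fans out."""
    if routed:
        return [shard_for(doc_id)]
    return list(range(NUM_SHARDS))  # query sent to every shard

print(shards_touched("order-42", routed=True))   # a single shard
print(shards_touched("order-42", routed=False))  # all five shards
```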

&lt;p&gt;So, what are we missing here?&lt;/p&gt;

&lt;h2&gt;
  
  
  Identifying the Root Cause
&lt;/h2&gt;

&lt;p&gt;Our investigation revealed that the root cause of the anomaly lay in how our queries were executed.&lt;br&gt;
Our components executed queries through a common library shared by many of our Java microservices. Because of a bug introduced years earlier, probably when the engine was still Elasticsearch, &lt;strong&gt;the queries were always run without specifying the index pattern&lt;/strong&gt;.&lt;br&gt;
The calling software duly specified &lt;strong&gt;both&lt;/strong&gt; the document Id &lt;strong&gt;and&lt;/strong&gt; the index pattern on every query, but the index pattern parameter was silently dropped inside the shared library, and the resulting query sent to OpenSearch lacked it.&lt;/p&gt;

&lt;p&gt;The default behavior of OpenSearch, not at all obvious to people used to a traditional relational data model, is that if you don't specify an index name or index pattern, the search is performed on the whole cluster.&lt;br&gt;
As a result, every single query, instead of being routed to the right index and the right shard, was addressed to every index with a compatible schema, with a significant and unintended consumption of resources. &lt;/p&gt;
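A minimal sketch of the kind of bug described above (the function names and request shapes are hypothetical, not our actual shared library's API): the caller passes the index pattern, the broken helper silently drops it, and a search without an index targets every index in the cluster.

```python
# Hypothetical illustration of the shared-library bug: an optional
# 'index' parameter that is accepted but never forwarded.

def buggy_search(query, index=None):
    """Broken behavior: 'index' is silently ignored, so the request path
    defaults to /_search, i.e. a cluster-wide search."""
    return {"path": "/_search", "body": query}  # index pattern lost

def fixed_search(query, index=None):
    """The fix: forward the index pattern so only its shards are queried."""
    path = f"/{index}/_search" if index else "/_search"
    return {"path": path, "body": query}

query = {"query": {"term": {"_id": "order-42"}}}
print(buggy_search(query, index="orders-*"))  # path is '/_search' anyway
print(fixed_search(query, index="orders-*"))  # path is '/orders-*/_search'
```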

&lt;h2&gt;
  
  
  Fixing the Issue
&lt;/h2&gt;

&lt;p&gt;Once we identified the root cause, we corrected our code to ensure that the index pattern was always specified in the queries.&lt;br&gt;
This small yet crucial change led to a significant improvement in performance and resource utilization. &lt;/p&gt;

&lt;p&gt;Let's see what happened when we deployed the new software version of the shared library that fixed the error.&lt;/p&gt;

&lt;h4&gt;
  
  
  HTTP Requests
&lt;/h4&gt;

&lt;p&gt;This is the &lt;strong&gt;HTTP requests&lt;/strong&gt; metric, where we can see the usual system load and the interruption created by the maintenance window during which the deployment was made:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F64slxkblongajafafngi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F64slxkblongajafafngi.png" alt="HTTP requests" width="800" height="249"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  SearchRate
&lt;/h4&gt;

&lt;p&gt;Immediately after the deployment, we could see an &lt;strong&gt;amazing drop&lt;/strong&gt; in the number of search operations:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvk0f3qn0tx4e6fx7gpg5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvk0f3qn0tx4e6fx7gpg5.png" alt="Search rate" width="800" height="204"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Average Request Time
&lt;/h4&gt;

&lt;p&gt;The highest-level representation of the real impact of the performance improvement, as perceived by the calling clients, is the request time measured by Datadog APM.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7b91puwmp9h4zpuaa3sp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7b91puwmp9h4zpuaa3sp.png" alt="Average Request Time" width="800" height="194"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  ThreadPoolSearchQueue
&lt;/h4&gt;

&lt;p&gt;This is another metric that clearly highlighted the load reduction on the cluster. It represents &lt;em&gt;the number of queued tasks in the search thread pool&lt;/em&gt;.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdy6wwyvy6uagyjvjkk7d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdy6wwyvy6uagyjvjkk7d.png" alt="ThreadPool Search Queue" width="800" height="205"&gt;&lt;/a&gt;&lt;br&gt;
The documentation also states that &lt;em&gt;if the queue size is consistently high, consider scaling your cluster&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scaling-in the cluster
&lt;/h2&gt;

&lt;p&gt;Of course, given the situation, we could now consider an unplanned horizontal scale-in of our cluster to reduce our cloud costs. The natural next step was a progressive reduction of the number of nodes.&lt;br&gt;
This is what happened over a two-week window in which we gradually reduced the number of data nodes from 9 to 6, and eventually to 3.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2qhobe2my8her0960sug.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2qhobe2my8her0960sug.png" alt="OpenSearch nodes reduction" width="800" height="250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And here is the last chart, where we can see that the impact on the cluster &lt;strong&gt;CPU utilization&lt;/strong&gt; was almost unnoticeable, even though the queries were now running on a cluster whose data node pool had been reduced by 66%. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8mfrs0thvyqe6eb5fee.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8mfrs0thvyqe6eb5fee.png" alt="OpenSearch CPU utilization" width="800" height="211"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;This experience taught us several valuable lessons.&lt;/p&gt;

&lt;p&gt;Firstly, specifying the index name or index pattern in per-Id search queries is crucial to prevent wasting resources.&lt;br&gt;
Quite an obvious concept, but not so easy to notice when the missing index reference was introduced by mistake.&lt;br&gt;
OpenSearch's good performance can hide very well the additional load created by unoptimized queries.&lt;/p&gt;

&lt;p&gt;Secondly, monitoring performance metrics can reveal hidden issues that might otherwise go unnoticed and have serious consequences. But this kind of monitoring requires great attention to detail and a deep understanding of metrics and logs, and digging through hundreds of metrics is very time-consuming.&lt;br&gt;
It is really easy to be deceived by a superficial analysis.&lt;/p&gt;

&lt;p&gt;Lastly, as a developer, never forget that a simple bug born of distraction, like forgetting to pass an optional parameter, can cost far more than you imagine.&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>devops</category>
      <category>aws</category>
      <category>performance</category>
    </item>
    <item>
      <title>Hands-on introduction to Apache Iceberg</title>
      <dc:creator>Claudio Taverna</dc:creator>
      <pubDate>Mon, 28 Oct 2024 22:12:39 +0000</pubDate>
      <link>https://dev.to/aws-builders/hands-on-introduction-to-apache-iceberg-2hd2</link>
      <guid>https://dev.to/aws-builders/hands-on-introduction-to-apache-iceberg-2hd2</guid>
      <description>&lt;p&gt;&lt;strong&gt;Lessons learned through a PoC for a challenging use-case&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The Apache Iceberg format is a great solution that enables a variety of interesting options for cost-effective data collection.&lt;br&gt;
The learning curve can be quite steep, but with the right knowledge and a few starting tips it can become a very interesting tool.&lt;/p&gt;
&lt;h2&gt;
  
  
  What is Iceberg
&lt;/h2&gt;

&lt;p&gt;Apache &lt;a href="https://iceberg.apache.org" rel="noopener noreferrer"&gt;Iceberg&lt;/a&gt; is an open source table format, designed to store large data tables. It is based on simple files that can be stored anywhere, and works well with data processing and analytics engines like Apache Spark, Hive, Trino and similar tools.&lt;br&gt;&lt;br&gt;
Iceberg tables are designed to be scalable, durable, and high-performing.&lt;br&gt;&lt;br&gt;
The format supports a variety of underlying file formats, including Parquet, Avro, and ORC.&lt;br&gt;&lt;br&gt;
Tables can be partitioned, which allows for more efficient storage and retrieval of data, also enabling parallel processing when needed.&lt;/p&gt;
&lt;h3&gt;
  
  
  How it works and how data is stored
&lt;/h3&gt;

&lt;p&gt;The basic concept on which Iceberg is based is dividing the dataset into smaller chunks, each of which is represented as a separate &lt;strong&gt;data file&lt;/strong&gt; (the lower layer in the diagram below).&lt;br&gt;&lt;br&gt;
Each data file is assigned a unique ID, and can be referenced by one or more &lt;strong&gt;manifest files&lt;/strong&gt;, each containing one row per underlying data file, together with its metrics.&lt;br&gt;&lt;br&gt;
Groups of manifest files are then stored into &lt;strong&gt;manifest list files&lt;/strong&gt;, each one containing references to multiple manifest files, together with stats and data file counts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkwxl6dere2pv8bmx7bez.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkwxl6dere2pv8bmx7bez.png" alt="Iceberg diagram" width="445" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These layers of metadata are used like a sort of index to avoid reading manifests and data files that are not required for an operation.&lt;br&gt;
Both storage usage and retrieval times are reduced.&lt;/p&gt;

&lt;p&gt;Let's look at how it works with a couple of examples:&lt;/p&gt;

&lt;p&gt;If a user wants to read a specific record in a dataset, they can specify the record's ID and Iceberg will only read some metadata files and then retrieve only the data file that contains the record.&lt;/p&gt;

&lt;p&gt;If a query needs to perform a count operation, or any other aggregation value like min, max, average and so on, it is likely that only metadata files will be read, avoiding an expensive full table-scan. This is because, if properly configured, the metadata layer itself stores statistics for each data column in each data file.&lt;/p&gt;
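The pruning described in these two examples can be illustrated with a toy model (the file names and statistics below are made up; real manifests carry much richer metadata): each data file entry records min/max statistics and a row count for a column, so a reader opens only the files whose range can contain the requested value, and some aggregations can be answered from the stats alone.

```python
# Toy model of Iceberg metadata pruning: per-file column stats let a
# reader skip data files whose [min, max] range cannot match a predicate.
# File names, ranges, and row counts are invented for illustration.
MANIFEST = [
    {"file": "data/a.parquet", "id_min": 1,   "id_max": 100, "rows": 100},
    {"file": "data/b.parquet", "id_min": 101, "id_max": 200, "rows": 100},
    {"file": "data/c.parquet", "id_min": 201, "id_max": 300, "rows": 100},
]

def files_to_read(record_id, manifest=MANIFEST):
    """Return only the data files whose stats range covers record_id."""
    return [entry["file"] for entry in manifest
            if entry["id_min"] <= record_id <= entry["id_max"]]

def count_rows(manifest=MANIFEST):
    """A count answered from metadata alone: no data files are read,
    because each manifest entry already stores its row count."""
    return sum(entry["rows"] for entry in manifest)

print(files_to_read(150))  # ['data/b.parquet'] — the other files are skipped
print(count_rows())        # 300 rows, computed without touching data files
```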
&lt;h2&gt;
  
  
  Other advantages of Iceberg
&lt;/h2&gt;

&lt;p&gt;Apache Iceberg offers many advantages that make it a great solution for data warehouses and data lakes.&lt;/p&gt;
&lt;h3&gt;
  
  
  Schema evolution
&lt;/h3&gt;

&lt;p&gt;When the schema of the data is updated, these changes are just &lt;strong&gt;&lt;em&gt;metadata changes&lt;/em&gt;&lt;/strong&gt;, and existing data files don't need to be rewritten to consolidate the update.&lt;/p&gt;
&lt;h3&gt;
  
  
  Hidden partitioning
&lt;/h3&gt;

&lt;p&gt;Iceberg supports a variety of partitioning strategies, and tables can be partitioned on multiple columns. Partitioning makes querying and updating the data more efficient, because the engine can quickly locate the relevant subset of data files.&lt;br&gt;&lt;br&gt;
In addition, Iceberg handles &lt;em&gt;hidden&lt;/em&gt; partitioning, meaning that the client doesn't need to be aware of it and can query the data normally, without explicitly referencing partition-dedicated columns.&lt;br&gt;&lt;br&gt;
Also, if for any reason you need to change the partitioning, you can adjust your table without planning an expensive migration: the partition scheme can evolve over time, and Iceberg handles this without ever touching the existing data.&lt;/p&gt;
&lt;h3&gt;
  
  
  Time travel
&lt;/h3&gt;

&lt;p&gt;Iceberg stores data as a series of &lt;em&gt;snapshots&lt;/em&gt;.&lt;br&gt;
Each one is a point-in-time view of the data, which can be used to restore its state at that specific instant. This pattern makes time travel possible, meaning that you can query the data as it was at any point in time, not just its current content at query time.&lt;br&gt;&lt;br&gt;
It's up to you to decide how many snapshots you want to keep, configuring Iceberg according to your needs.&lt;/p&gt;
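Engines expose time travel through SQL extensions; in Athena, for example, Iceberg tables can be queried with a `FOR TIMESTAMP AS OF` clause (or `FOR VERSION AS OF` with a snapshot id). The small helper below just assembles such a query string; the table name and timestamp are hypothetical.

```python
# Assemble an Athena time-travel query for an Iceberg table.
# Athena supports FOR TIMESTAMP AS OF (and FOR VERSION AS OF with a
# snapshot id); 'ts' must be a valid SQL timestamp literal.

def time_travel_query(table, ts):
    """Build a SELECT that reads the table as it was at timestamp 'ts'."""
    return f"SELECT * FROM {table} FOR TIMESTAMP AS OF TIMESTAMP '{ts}'"

print(time_travel_query("mydb.events", "2024-10-01 00:00:00"))
# SELECT * FROM mydb.events FOR TIMESTAMP AS OF TIMESTAMP '2024-10-01 00:00:00'
```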
&lt;h2&gt;
  
  
  A challenging use case
&lt;/h2&gt;

&lt;p&gt;In the context of a web application for monitoring a fleet of vehicles, we decided to try Apache Iceberg for a challenging use-case.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2vi1m6wjm3ru08wk5v8c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2vi1m6wjm3ru08wk5v8c.png" alt="tightrope walker" width="487" height="221"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Size of data
&lt;/h3&gt;

&lt;p&gt;Our need was to store about 1 million records every day, and to make them available for at least 10 years. Each record is a medium-length data structure that can vary from a few to several kilobytes.&lt;/p&gt;
&lt;h3&gt;
  
  
  Near-real-time data feeding
&lt;/h3&gt;

&lt;p&gt;The goal was to let clients retrieve their data on a per-day basis, but also in near-real-time, with new data becoming available just &lt;em&gt;a few minutes&lt;/em&gt; after it is produced in the main platform.&lt;/p&gt;
&lt;h3&gt;
  
  
  Reading pattern
&lt;/h3&gt;

&lt;p&gt;The data must be selectively accessible by its owner, ideally through an API.&lt;br&gt;&lt;br&gt;
A remarkable point is that some clients own &lt;em&gt;tens&lt;/em&gt; of records per day, while others can have &lt;em&gt;tens of thousands&lt;/em&gt;. And of course the response time must always be acceptable.&lt;/p&gt;
&lt;h2&gt;
  
  
  The plan
&lt;/h2&gt;

&lt;p&gt;As often happens in the IT world, cost and complexity should ideally be kept as low as possible, while at the same time performance and scalability should inexplicably tend to infinity.&lt;br&gt;&lt;br&gt;
Our plan was to find a compromise between those requirements by using Iceberg on S3 as a data store, and then exposing the data through an API with mandatory parameters, forcing clients to issue queries that leverage the partitioning columns on every request.&lt;br&gt;&lt;br&gt;
To validate our plan, we started a PoC that could later easily be converted into a production feature.&lt;br&gt;&lt;br&gt;
It was our first experience with Iceberg, so we needed to learn by doing.&lt;/p&gt;
&lt;h3&gt;
  
  
  Create a new Iceberg table
&lt;/h3&gt;

&lt;p&gt;We created a new AWS Glue table on an S3 bucket, configured to use &lt;em&gt;Apache Iceberg&lt;/em&gt; as &lt;em&gt;Table Format&lt;/em&gt;, and &lt;em&gt;Parquet&lt;/em&gt; for &lt;em&gt;File Format&lt;/em&gt;.&lt;br&gt;&lt;br&gt;
We configured the partitioning on 2 data columns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the company that owns the entity&lt;/li&gt;
&lt;li&gt;the calendar day from the record timestamp&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Feeding the table with data
&lt;/h3&gt;

&lt;p&gt;We implemented a writing process this way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data is produced in the platform, through many decoupled different pipelines&lt;/li&gt;
&lt;li&gt;A Flink job produces the aggregate that represents our final denormalized record&lt;/li&gt;
&lt;li&gt;Records are saved into the S3 Iceberg table defined in AWS Glue&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  The PoC
&lt;/h2&gt;

&lt;p&gt;We built and deployed the artifacts, and at every data flush performed by Flink we started seeing new metadata files like these:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;metadata/&amp;lt;version&amp;gt;-&amp;lt;UUID&amp;gt;.metadata.json
metadata/snap-&amp;lt;number&amp;gt;-&amp;lt;UUID&amp;gt;.metadata.json
metadata/&amp;lt;UUID&amp;gt;.metadata.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and a lot of new data files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data/&amp;lt;randomstring&amp;gt;/companyId=123/timestamp_day=YYYY-MM-DD/&amp;lt;randomfilename&amp;gt;.parquet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything seemed to be working as expected.&lt;br&gt;
The query pattern used by our clients, based on company and date, could leverage the partitioning, and delivered extraordinarily good performance while being very cost-effective.&lt;br&gt;&lt;br&gt;
For example, launching a query with Athena we got very good read performance, and Athena scanned only an incredibly small amount of data in proportion to the whole table size.&lt;/p&gt;
&lt;h3&gt;
  
  
  An expected problem
&lt;/h3&gt;

&lt;p&gt;We expected this first release to have a problem due to our feeding strategy. Indeed, after a few days, as the table grew, we started seeing a slight degradation in performance.&lt;br&gt;&lt;br&gt;
Even though the amount of scanned data was still very low, the performance kept getting worse.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkb5nolvtncq7daqsa00.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkb5nolvtncq7daqsa00.jpg" alt="fragmentation" width="300" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our writing process was configured to write the data on average &lt;em&gt;every 5 minutes&lt;/em&gt;, and the partitioning was &lt;em&gt;per day&lt;/em&gt;.&lt;br&gt;
As a consequence we had a lot of small files, and a lot of metadata references to them.&lt;br&gt;&lt;br&gt;
After a few days of feeding the table, the metadata had grown so much that it was about 90% of the size of the data folder itself.&lt;br&gt;
Analyzing the metadata became far more complex and costly, even though the number of actual data files eventually read was very small.&lt;/p&gt;

&lt;p&gt;The problem was that we were continuously adding data, and with this basic configuration, Iceberg was creating a new snapshot for each data insertion, and was retaining all of them.&lt;br&gt;&lt;br&gt;
In addition to that, the data files were very small and fragmented, because a new file was added at every insert transaction.&lt;/p&gt;
&lt;h3&gt;
  
  
  The easy (but just partial) solution
&lt;/h3&gt;

&lt;p&gt;We replaced the table with a new one, then enabled the managed &lt;em&gt;compaction&lt;/em&gt; feature on the Glue table, confident that this would resolve the problem.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffqtec89jhhowvqzt3kn4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffqtec89jhhowvqzt3kn4.png" alt="Compression" width="200" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After a few days we checked the number of manifests with Athena, running this query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;"dbname"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"tablename$manifests"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The number of manifests was indeed smaller than before, and performance was good, showing no sign of degradation.&lt;br&gt;&lt;br&gt;
So far so good.&lt;br&gt;&lt;br&gt;
But then we double-checked by launching a "Calculate total size" operation from the AWS console on the metadata and data folders.&lt;br&gt;&lt;br&gt;
Interestingly, the S3 folders were not smaller than before; they were far bigger. The increase in the metadata folder was a scary 500% in number of files, and 150% in size.&lt;br&gt;&lt;br&gt;
Clearly we were missing something, and the cost of the S3 storage would have been unsustainable.&lt;/p&gt;

&lt;h3&gt;
  
  
  What we missed
&lt;/h3&gt;

&lt;p&gt;The problem was an oversight in reading the documentation: the managed compaction, by design, is only responsible for rewriting data files into a more optimized layout, based on their size and number.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnokdrwkczoodg4q5v8q8.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnokdrwkczoodg4q5v8q8.jpg" alt="Vacuum" width="300" height="229"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In other words, if a partition contains a high number of small files (the exact number and size depend on configuration parameters), the compaction job rewrites them as a single data file, also simplifying the metadata references to that chunk of data. This is the only intended meaning of &lt;em&gt;"compacted"&lt;/em&gt;.&lt;br&gt;&lt;br&gt;
What we missed was that &lt;em&gt;compacting&lt;/em&gt; the data doesn't imply that the orphan files are also deleted. This is exactly what happens with some RDBMSs, where deleting records doesn't mean that you are actually freeing disk space.&lt;br&gt;&lt;br&gt;
So, even though the metadata complexity was lower, performance was good, and the compaction process was effectively optimizing the data by reducing its fragmentation, the system was leaving behind a huge quantity of junk files on S3.&lt;/p&gt;
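&lt;p&gt;To make the rewrite concrete: Athena also exposes this kind of file rewrite as an explicit &lt;code&gt;OPTIMIZE ... REWRITE DATA USING BIN_PACK&lt;/code&gt; statement. Here is a minimal sketch that builds such a statement; the database, table, and partition filter are hypothetical placeholders, not values from our system:&lt;/p&gt;

```python
# Sketch: build the Athena OPTIMIZE statement that bin-packs small data
# files of an Iceberg table into larger ones. All names below are
# illustrative placeholders.
def build_optimize_query(database, table, partition_filter=None):
    """Return an Athena OPTIMIZE statement for an Iceberg table."""
    query = f'OPTIMIZE "{database}"."{table}" REWRITE DATA USING BIN_PACK'
    if partition_filter:
        # Restrict the rewrite to one partition to limit S3 API traffic
        query += f" WHERE {partition_filter}"
    return query

print(build_optimize_query("events_db", "clicks", "day = DATE '2025-01-01'"))
```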

&lt;h3&gt;
  
  
  The final solution
&lt;/h3&gt;

&lt;p&gt;We ended up creating a Lambda function that continuously executes a &lt;a href="https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg-data-optimization.html#querying-iceberg-vacuum" rel="noopener noreferrer"&gt;VACUUM&lt;/a&gt; command leveraging Athena, so that all expired snapshots and all orphan files (both data and metadata) are physically removed from S3.&lt;/p&gt;
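&lt;p&gt;As a rough illustration, such a Lambda can be little more than a function that submits the statement through the Athena API. This is a hedged sketch, not our production code: the database, table, and result-bucket names are placeholders, and you would schedule the handler with something like an EventBridge rule.&lt;/p&gt;

```python
# Sketch of a Lambda that asks Athena to VACUUM an Iceberg table, so that
# expired snapshots and orphan files are physically removed from S3.
# All names are hypothetical placeholders.
def build_vacuum_query(database, table):
    return f'VACUUM "{database}"."{table}"'

def handler(event, context):
    import boto3  # imported lazily so the query builder stays testable offline
    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString=build_vacuum_query("events_db", "clicks"),
        QueryExecutionContext={"Database": "events_db"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    return resp["QueryExecutionId"]

print(build_vacuum_query("events_db", "clicks"))
```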

&lt;p&gt;Here is a comparison of some key metrics collected while using the three different configurations described above:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Just write&lt;/th&gt;
&lt;th&gt;+Compaction&lt;/th&gt;
&lt;th&gt;+Compaction +Vacuum&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Number of snapshots&lt;/td&gt;
&lt;td&gt;~320/day&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Number of &lt;strong&gt;metadata&lt;/strong&gt; files per day&lt;/td&gt;
&lt;td&gt;2,200&lt;/td&gt;
&lt;td&gt;10,500&lt;/td&gt;
&lt;td&gt;67&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Size of &lt;strong&gt;metadata&lt;/strong&gt; files per day&lt;/td&gt;
&lt;td&gt;5,200 MB&lt;/td&gt;
&lt;td&gt;7,900 MB&lt;/td&gt;
&lt;td&gt;100 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Number of &lt;strong&gt;data&lt;/strong&gt; files per day&lt;/td&gt;
&lt;td&gt;~300K&lt;/td&gt;
&lt;td&gt;~800K&lt;/td&gt;
&lt;td&gt;50K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Size of &lt;strong&gt;data&lt;/strong&gt; files per day&lt;/td&gt;
&lt;td&gt;5,600 MB&lt;/td&gt;
&lt;td&gt;8,500 MB&lt;/td&gt;
&lt;td&gt;900 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Number of files referenced per day (excluding orphan files)&lt;/td&gt;
&lt;td&gt;300K&lt;/td&gt;
&lt;td&gt;50K&lt;/td&gt;
&lt;td&gt;50K&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The improvement and the savings in terms of disk usage are clear.&lt;/p&gt;

&lt;p&gt;The downside is that &lt;strong&gt;compaction&lt;/strong&gt; and &lt;strong&gt;vacuum&lt;/strong&gt; operations perform a lot of API calls on S3, so it is necessary to tune the parameters to find a balance that works for each use case: cost of storage, cost of API calls, and the optimal file size for Iceberg to work well with your access pattern.&lt;/p&gt;
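&lt;p&gt;As a back-of-the-envelope way to reason about that balance, here is a small sketch. It is not a pricing tool: the unit prices are illustrative S3 Standard list prices and the workload numbers are made up. It compares the storage cost of a month of new data with the PUT-request cost generated by rewriting files of a given target size:&lt;/p&gt;

```python
# Rough sketch of the trade-off: smaller target files mean more objects
# (more PUT calls from compaction and vacuum rewrites); larger files mean
# fewer calls but coarser compaction. Prices are illustrative.
STORAGE_USD_PER_GB_MONTH = 0.023  # S3 Standard, first tier
PUT_USD_PER_1000 = 0.005          # S3 Standard PUT requests

def monthly_cost(daily_gb, target_file_mb, rewrites_per_day):
    files_per_day = (daily_gb * 1024) / target_file_mb
    puts_per_month = files_per_day * rewrites_per_day * 30
    return {
        # Storage cost of one month of newly written data
        "storage_usd": daily_gb * 30 * STORAGE_USD_PER_GB_MONTH,
        # Request cost of repeatedly rewriting those files
        "put_usd": puts_per_month / 1000 * PUT_USD_PER_1000,
    }

# e.g. 10 GB/day, 128 MB target files, each chunk rewritten 3 times/day
print(monthly_cost(10, 128, 3))
```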

&lt;h2&gt;
  
  
  TLDR - Key takeaways
&lt;/h2&gt;

&lt;p&gt;The road to mastering Apache Iceberg is long and hard for newbies, but it's definitely a game-changing technology.&lt;br&gt;&lt;br&gt;
Writing big batches of data is far easier than adding it little by little. If you have to append data incrementally, be prepared to master the compaction and vacuum processes.&lt;/p&gt;

</description>
      <category>awsdatalake</category>
      <category>iceberg</category>
      <category>aws</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>I challenged Amazon Q with 3 cloud architecture questions</title>
      <dc:creator>Claudio Taverna</dc:creator>
      <pubDate>Wed, 31 Jan 2024 01:50:37 +0000</pubDate>
      <link>https://dev.to/aws-builders/i-challenged-amazon-q-with-3-cloud-architecture-questions-2jmd</link>
      <guid>https://dev.to/aws-builders/i-challenged-amazon-q-with-3-cloud-architecture-questions-2jmd</guid>
      <description>&lt;p&gt;&lt;strong&gt;And compared the result with Copilot and ChatGPT&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After the release announcement of &lt;em&gt;Amazon Q&lt;/em&gt; at re:Invent 2023, I immediately started to wonder if it could &lt;strong&gt;really&lt;/strong&gt; be a useful tool to help me in my daily job as a cloud architect using AWS services.&lt;br&gt;&lt;br&gt;
Is it just a cool toy or a game-changer?  &lt;/p&gt;
&lt;h3&gt;
  
  
  The expectation
&lt;/h3&gt;

&lt;p&gt;My basic expectation is that Amazon Q can be a good assistant for the boring, time-consuming daily task of searching for information in the AWS documentation. After all, it is 2024, we are talking about AWS, we are used to seeing and using generative AI tools, and the wow effect is (almost) gone. Anything less would not be enough.&lt;br&gt;&lt;br&gt;
I very much appreciate AWS's idea of creating a &lt;em&gt;“generative AI–powered assistant designed for work”&lt;/em&gt;, designed for real-world users and not just for demonstrations. &lt;br&gt;
My objective is to understand whether Q is something more than other general-purpose generative AI tools.  &lt;/p&gt;
&lt;h3&gt;
  
  
  How I ran the test
&lt;/h3&gt;

&lt;p&gt;So, focusing on its general ability to act as an AWS expert, I ran my tests using the Amazon Q assistant in the sidebar of the AWS Management Console. This is just one of the capabilities of Amazon Q, but it is surely the most immediate to interact with.&lt;br&gt;&lt;br&gt;
I started by chatting with Q about a simple architectural design question, then I asked for help with a technical issue, and finally for a cost evaluation.&lt;br&gt;&lt;br&gt;
As an additional quality check, I also submitted each question to Copilot and ChatGPT 3.5, and then evaluated the quality of each response.&lt;/p&gt;

&lt;p&gt;This is the result of my far-from-analytical analysis.&lt;/p&gt;
&lt;h2&gt;
  
  
  Question 1 - A general architectural question
&lt;/h2&gt;

&lt;p&gt;I asked a high-level architectural question, not too difficult but not too simple, with many possible right options.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;I have a system with very unpredictable load.&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;



&lt;p&gt;&lt;code&gt;I need to implement a feature to show simple polls to the users and collect their feedbacks.&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;



&lt;p&gt;&lt;code&gt;Users should be able to express one vote in a 2 minutes time window.&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;



&lt;p&gt;&lt;code&gt;The results should be hidden for the first minute and then they should see the results being updated in real time.&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;



&lt;p&gt;&lt;code&gt;Can you suggest me a good architecture?&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon Q Answer
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;I apologize, but your request seems to be outside my domain of expertise. However, I'm happy to try discussing related topics that I may have more information on. How can I help further our conversation productively?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Okay, for some unknown reason this approach is not working.&lt;br&gt;&lt;br&gt;
Might a more precise question sound better to Q?&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;I have to collect votes from the users of my platforms.&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;



&lt;p&gt;&lt;code&gt;What services could be a good solution to ingest this data and at the same time be able to read at a high frequency?&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon Q Answer
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Here are a few options on AWS that would be suitable for ingesting vote data at high frequency while being able to read the data:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Amazon Kinesis Data Streams - This service allows you to continuously capture and store terabytes of data per hour from hundreds of thousands of sources. It supports stream processing frameworks like Kinesis Client Library, Apache Storm, and Apache Spark Streaming. The ingested data can then be processed in real-time. &lt;/li&gt;
&lt;li&gt;Amazon DynamoDB - Being a fully managed NoSQL database service, DynamoDB provides fast and predictable performance with seamless scalability. It can handle millions of requests per second and can scale to handle any workload. The vote data can be ingested into DynamoDB tables and queried for aggregation and analysis at high frequency.&lt;/li&gt;
&lt;li&gt;Amazon ElastiCache for Redis - Using ElastiCache which provides in-memory data store solutions like Redis, the vote data can be written to Redis very quickly due to its in-memory speed. The data can then be aggregated and read in real-time for high frequency needs. ElastiCache also auto scales the cache capacity as needed.&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;This answer is pretty good.&lt;br&gt;&lt;br&gt;
Of course, it's just a starting point, and it's necessary to dig into many details.&lt;br&gt;&lt;br&gt;
But in my opinion, this is absolutely a good answer: it explores the problem of ingesting the votes, where to store the data, and how to read the data at high frequency. Amazon Q does not mention any computing service, but this is probably not a crucial point at this level of design.&lt;/p&gt;

&lt;p&gt;Let's check what other competitors can do.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;NB: Being outside AWS, in these cases I used identical wording but I added a short premise in the beginning "I have a project in AWS."&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Copilot answer
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Searching for: &lt;code&gt;AWS services for collecting votes from users&lt;/code&gt;&lt;br&gt;&lt;br&gt;
If you’re looking for an AWS service to collect votes from users, you might want to consider using &lt;strong&gt;Amazon Simple Queue Service (SQS)&lt;/strong&gt;. SQS is a fully managed message queuing service that enables you to decouple and scale microservices, distributed systems, and serverless applications. It can be used to transmit any volume of data, at any level of throughput, without losing messages or requiring other services to be available. You can use SQS to collect votes from your users and then process them at a high frequency. SQS is also highly available and fault-tolerant, so you can rely on it to handle your data with minimal downtime  &lt;/p&gt;

&lt;p&gt;Another AWS service that could be useful for your use case is &lt;strong&gt;Amazon Kinesis Data Streams&lt;/strong&gt;. Kinesis Data Streams is a fully managed service that enables you to build custom applications that process or analyze streaming data for specialized needs. It can be used to collect and process large streams of data records in real time. You can use Kinesis Data Streams to ingest votes from your users and then process them in real time. Kinesis Data Streams is also highly scalable and can handle millions of data records per second.&lt;br&gt;&lt;br&gt;
I hope this helps! Let me know if you have any other questions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This answer is not completely wrong, but it is certainly not complete.&lt;br&gt;&lt;br&gt;
Copilot shortens and simplifies the question, moving the focus onto the ingestion services only, SQS and Kinesis.&lt;/p&gt;
&lt;h3&gt;
  
  
  ChatGPT answer
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;For collecting votes from users on your platform and handling high-frequency reads, several AWS services can be employed to create a robust and scalable solution:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Amazon API Gateway&lt;/strong&gt;: Use API Gateway to create RESTful APIs that can receive and process incoming votes from your users. API Gateway can handle high-frequency requests and integrate seamlessly with other AWS services.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Kinesis&lt;/strong&gt;: If you're dealing with a high volume of incoming data (votes), consider using Amazon Kinesis. Kinesis Data Streams can ingest and store data in real-time and scale elastically to accommodate varying loads.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon DynamoDB&lt;/strong&gt;: DynamoDB is a fully managed NoSQL database service that offers low-latency performance at any scale. You can use DynamoDB to store and retrieve the votes submitted by your users. Its flexible schema and automatic scaling capabilities make it ideal for handling high-frequency reads and writes.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon S3&lt;/strong&gt;: For storing large volumes of data or for backup purposes, you can utilize Amazon S3. S3 is highly scalable, durable, and secure, making it suitable for storing raw or processed data related to the votes.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Lambda&lt;/strong&gt;: AWS Lambda can be used to execute code in response to events such as incoming votes. You can trigger Lambda functions to process and analyze the votes, perform validation checks, and store the data in DynamoDB or S3.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon CloudWatch&lt;/strong&gt;: Monitor the performance of your AWS resources and set up alarms using Amazon CloudWatch. This service can help you track metrics such as API Gateway request rates, DynamoDB throughput, and Lambda function invocations.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By combining these AWS services, you can design a highly available, scalable, and resilient architecture for collecting and processing votes from users on your platform. Make sure to consider factors like data consistency, security, and cost optimization while designing your solution.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At first glance, this answer seems even better than the one given by Amazon Q. But if you pay attention to the details, you can see that ChatGPT is just aggregating generic service descriptions.&lt;br&gt;&lt;br&gt;
This is okay, and there is certainly considerable value in a good selection of services that could be used in this architecture.&lt;br&gt;
However, in the response given by Amazon Q, each service description is tailored to the specific use case of vote ingestion. I don't know if this is just a cosmetic difference, but it seems to me an indicator of a deeper "understanding" of the problem.&lt;/p&gt;

&lt;p&gt;In my opinion, Amazon Q is the winner of this test.&lt;br&gt;&lt;br&gt;
These are the points I decided to assign:&lt;/p&gt;

&lt;p&gt;AmazonQ: &lt;strong&gt;3 points&lt;/strong&gt; / Copilot: &lt;strong&gt;0 points&lt;/strong&gt; / ChatGPT: &lt;strong&gt;2 points&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Question 2 - A technical issue
&lt;/h2&gt;

&lt;p&gt;For my experiment, I now want Q to support me in understanding a particular behavior of a service.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;In the last few days I can see an increase in the metric "Read throughput exceeded" of a Kinesis Data stream. What's the problem?&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon Q Answer
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;It looks like you need help with network connectivity issues. Amazon Q works with VPC Reachability Analyzer to provide an interactive generative AI experience for troubleshooting network connectivity issues. You can try [...]&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;No good.&lt;br&gt;&lt;br&gt;
I tried to rephrase the question in many ways, but without success: always the same dumb answer.&lt;br&gt;&lt;br&gt;
In the end, after many tries, I got a meaningful answer using this format:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Explain me the Read throughput exceeded metric exposed by Kinesis Data Stream&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon Q Answer
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;The "Read throughput exceeded" metric indicates that the read capacity limits of the Kinesis data stream shard have been exceeded.&lt;/p&gt;

&lt;p&gt;Some potential reasons for this to occur are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The consumer application is not able to read data from the stream as fast as records are being produced. Each Kinesis shard can support a maximum of 2MB/s of read throughput on average.&lt;/li&gt;
&lt;li&gt;Multiple consumer applications are reading from the same shard and their combined read throughput exceeds 2MB/s.&lt;/li&gt;
&lt;li&gt;The GetRecords API call that the consumer uses to read data is retrieving too many records at once (over 10,000 records or 10MB of data) from the shard. This can throttle subsequent reads for 5 seconds.&lt;/li&gt;
&lt;li&gt;The processing of each record within the consumer application is taking too long, causing a backlog to build up.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To resolve this, the consumer application should check if it can optimize record processing to keep pace with the incoming data rate. Other options are adding more shards to scale up throughput, or distributing reads across applications/shards in a way that balances utilization. Referring to the AWS documentation on Kinesis data stream quotas and limits would also help determine the appropriate configuration.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I think Amazon Q is giving a barely sufficient answer.&lt;br&gt;&lt;br&gt;
Reasons 1 and 4 are both wrong; we have read throttling, not a high iterator age: the consumer is too fast, not too slow.&lt;/p&gt;
&lt;h3&gt;
  
  
  Copilot Answer
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;The “Read throughput exceeded” metric of a Kinesis Data stream indicates that the number of read operations on the stream has exceeded the provisioned read capacity. This can happen when the number of consumers reading from the stream increases, or when the size of the data being read increases. To identify the root cause of the issue, you can monitor the Amazon Kinesis Data Streams Service with Amazon CloudWatch.&lt;br&gt;&lt;br&gt;
Reviewing the following metrics in CloudWatch can help you identify the root cause of the error:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ReadProvisionedThroughputExceeded&lt;/strong&gt;: This metric indicates that the number of read operations on the stream has exceeded the provisioned read capacity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GetRecords.Bytes&lt;/strong&gt;: This metric indicates the number of bytes read from the stream.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GetRecords.Latency&lt;/strong&gt;: This metric indicates the time taken to read data from the stream.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once you have identified the root cause of the issue, you can take appropriate actions to resolve it. For example, you can increase the provisioned read capacity of the stream, or reduce the number of consumers reading from the stream. I hope this helps!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I copied only the text of the Copilot answer here, but the output was full of good links to various sources, official AWS documentation pages and other sites.&lt;br&gt;&lt;br&gt;
Good job from Copilot: concise but relevant.&lt;/p&gt;
&lt;h3&gt;
  
  
  ChatGPT Answer
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;An increase in the "Read throughput exceeded" metric for a Kinesis Data stream typically indicates that your application is attempting to read more data from the stream than what it can handle. This could be due to various reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Increased Traffic&lt;/strong&gt;: A sudden surge in incoming data to the Kinesis Data stream might be causing the read throughput to exceed its provisioned capacity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inefficient Consumer Code&lt;/strong&gt;: If your consumer application is inefficient or experiencing bottlenecks, it may not be able to keep up with the rate at which data is being written to the stream.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scaling Issues&lt;/strong&gt;: If your application is not properly scaled to handle the increased load, it may struggle to process the incoming data from the stream efficiently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Misconfigured Shard Count&lt;/strong&gt;: If the shard count for your Kinesis Data stream is not properly configured to handle the incoming data rate, you may encounter throughput issues.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To address the problem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Monitor Metrics&lt;/strong&gt;: Continuously monitor the metrics for your Kinesis Data stream to understand the patterns and identify any anomalies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Increase Shard Count&lt;/strong&gt;: If the increased throughput is sustained and not just a temporary spike, consider increasing the number of shards in your Kinesis Data stream to handle the higher load.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimize Consumer Code&lt;/strong&gt;: Review and optimize your consumer application code to ensure it can efficiently process the data from the stream.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement Backpressure Mechanisms&lt;/strong&gt;: Implement backpressure mechanisms in your consumer application to regulate the rate at which data is read from the stream, allowing it to keep up with the incoming data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consider AWS Lambda&lt;/strong&gt;: If applicable, consider using AWS Lambda functions to process data from the Kinesis Data stream. AWS Lambda can automatically scale to handle varying workloads.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By addressing these factors, you can effectively manage the increased read throughput on your Kinesis Data stream and ensure the smooth operation of your AWS project.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This answer again seems very good at first glance, but if you go deeper you can find the same errors Amazon Q made. Reasons 2, 3 and 4 are wrong, and so are most of the suggested solutions.  &lt;/p&gt;

&lt;p&gt;The second round goes to Copilot.&lt;/p&gt;

&lt;p&gt;AmazonQ: &lt;strong&gt;1 point&lt;/strong&gt; / Copilot: &lt;strong&gt;3 points&lt;/strong&gt; / ChatGPT: &lt;strong&gt;1 point&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Question 3 - A cost evaluation
&lt;/h2&gt;



&lt;p&gt;&lt;code&gt;I have an S3 bucket where kinesis firehose is writing 1,000 files every day, each one weighting on average 10 MB.&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;



&lt;p&gt;&lt;code&gt;The probability to have the need of accessing the data is very low, and the data should be retained for 2 years.&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;



&lt;p&gt;&lt;code&gt;How much could I save each month switching from Stardard tier to Glacier Deep Archive tier?&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon Q Answer
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Based on the information provided, here is how you could estimate the potential savings:&lt;br&gt;
Kinesis Data Firehose is writing 1000 files to your S3 bucket every day. Each file is approximately 10MB in size.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;So daily incoming data is 1000 files x 10MB per file = 10GB per day&lt;/li&gt;
&lt;li&gt;Monthly incoming data is approximately 10GB x 30 days = 300GB per month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since the data needs to be retained for 2 years, you could consider archiving it to Glacier Deep Archive tier.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Standard tier pricing is $0.023 per GB per month&lt;/li&gt;
&lt;li&gt;Glacier Deep Archive tier pricing is $0.00099 per GB per month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To calculate estimated monthly savings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monthly data stored in Standard tier would cost 300GB x $0.023 = $6.90&lt;/li&gt;
&lt;li&gt;Monthly data stored in Glacier Deep Archive tier would cost 300GB x $0.00099 = $0.30&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Estimated monthly savings by switching to Glacier Deep Archive tier = $6.90 - $0.30 = $6.60&lt;/p&gt;

&lt;p&gt;For the most accurate pricing information, you should check the AWS documentation or pricing pages.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The calculation is not completely wrong, but Q is not considering the data accumulation.&lt;br&gt;&lt;br&gt;
In the first month the storage needed is 300 GB, but the second month it will be 600 GB, the third 900 GB, and so on, up to the 24th month, when it will stop growing at about 7.2 TB. From that point on, the total storage will stay constant, with the oldest data being replaced by the new each month.  &lt;/p&gt;
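&lt;p&gt;A few lines of code make the accumulation explicit. This is a sketch using the per-GB list prices quoted above; a real bill would also include request and lifecycle-transition costs:&lt;/p&gt;

```python
# Sketch of the accumulation Q missed: with a 2-year retention, storage
# grows by ~300 GB/month for 24 months, then plateaus at ~7.2 TB.
STANDARD = 0.023        # USD per GB-month, S3 Standard (illustrative)
DEEP_ARCHIVE = 0.00099  # USD per GB-month, Glacier Deep Archive (illustrative)

def stored_gb(month, monthly_gb=300, retention_months=24):
    # Data accumulates until the retention window is full, then stays flat
    return monthly_gb * min(month, retention_months)

for m in (1, 12, 24, 36):
    gb = stored_gb(m)
    saving = (STANDARD - DEEP_ARCHIVE) * gb
    print(f"month {m:2d}: {gb:6.0f} GB stored, saving {saving:7.2f} USD/month")
```

At month 24 the storage-only saving is roughly $158/month, in the same ballpark as the calculator result below.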

&lt;p&gt;At steady state, the AWS Pricing Calculator, configured with 7.2 TB and 30,000 monthly PUT requests / lifecycle transitions, gives this estimate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monthly cost of ~$169 for S3 Standard
&lt;/li&gt;
&lt;li&gt;Monthly cost of ~$11 for S3 Glacier Deep Archive&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The monthly saving after 2 years will be somewhere around $150.&lt;br&gt;&lt;br&gt;
In this scenario, it would probably have been more useful to ask something simpler, like: "Give me a comparison between the cost of S3 Standard tier and Glacier Deep Archive".&lt;br&gt;
For the purpose of this test, it's interesting to point out that the answer from Amazon Q is good but not great.&lt;/p&gt;

&lt;h3&gt;
  
  
  Copilot Answer
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Searching for: &lt;code&gt;AWS S3 bucket kinesis firehose 1000 files 10 MB 2 years cost comparison&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Based on the information you provided, you could save a significant amount of money by switching from the Standard tier to the Glacier Deep Archive tier. The Standard tier is designed for frequent access to data, while the Glacier Deep Archive tier is intended for long-term data retention with infrequent access.&lt;/p&gt;

&lt;p&gt;The cost of storing 1,000 files per day, each weighing 10 MB, for 2 years in the Standard tier would be approximately &lt;strong&gt;$6.912&lt;/strong&gt;.  &lt;br&gt;
However, if you switch to the Glacier Deep Archive tier, the cost would be reduced to approximately &lt;strong&gt;$1.152&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
This means that you could save around &lt;strong&gt;$5.760&lt;/strong&gt; per month by switching to the Glacier Deep Archive tier.&lt;/p&gt;

&lt;p&gt;Please note that the cost estimates are based on the current AWS pricing as of January 2024 and are subject to change.  Also, keep in mind that the retrieval time for data stored in the Glacier Deep Archive tier is longer than that of the Standard tier. If you need to access your data frequently, the Standard tier might be a better option for you.&lt;/p&gt;

&lt;p&gt;I hope this helps! Let me know if you have any other questions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The same problem shown by Q, but without any useful detail about the calculations made to reach the estimate.&lt;br&gt;&lt;br&gt;
Not sufficient.&lt;/p&gt;

&lt;h3&gt;
  
  
  ChatGPT Answer
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;To calculate the potential savings from switching from the Standard tier to the Glacier Deep Archive tier for your S3 storage, we need to consider the difference in pricing between the two storage classes.&lt;/p&gt;

&lt;p&gt;Here are the main differences in pricing between Standard S3 storage and Glacier Deep Archive:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Standard S3 Storage Pricing&lt;/strong&gt;: Standard S3 storage has a higher storage cost compared to Glacier Deep Archive. As of my last update, the standard S3 storage cost was around $0.023 per GB-month for the first 50 TB tier.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Glacier Deep Archive Pricing&lt;/strong&gt;: Glacier Deep Archive is significantly cheaper but incurs additional costs for data retrieval. As of my last update, the storage cost for Glacier Deep Archive was around $0.00099 per GB-month.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Given that you have 1,000 files per day, each averaging 10 MB, let's calculate the monthly storage usage:&lt;br&gt;
Each file: 10 MB = 0.01 GB&lt;br&gt;&lt;br&gt;
Total daily storage: 1,000 files * 0.01 GB = 10 GB&lt;br&gt;
Monthly storage: 10 GB * 30 days = 300 GB&lt;/p&gt;

&lt;p&gt;Now, let's calculate the monthly storage cost for both tiers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Standard S3 Storage Cost&lt;/strong&gt;: Monthly cost = 300 GB * $0.023 per GB = $6.90&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Glacier Deep Archive Storage Cost&lt;/strong&gt;: Monthly cost = 300 GB * $0.00099 per GB = $0.297&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now, let's calculate the potential monthly savings:&lt;/p&gt;

&lt;p&gt;Savings = Standard S3 Storage Cost - Glacier Deep Archive&lt;br&gt;
Storage Cost Savings = $6.90 - $0.297 = $6.603&lt;/p&gt;

&lt;p&gt;So, by switching from the Standard tier to the Glacier Deep Archive tier, you could potentially save around $6.603 per month on storage costs.&lt;/p&gt;

&lt;p&gt;Please note that this calculation does not include additional costs for data retrieval from Glacier Deep Archive, which would be incurred if you need to access the data. Make sure to consider those costs if you anticipate needing to retrieve data from Glacier Deep Archive.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The result is very similar to the one from Amazon Q.&lt;br&gt;
So these are my votes for the third test:&lt;br&gt;
AmazonQ: &lt;strong&gt;2 points&lt;/strong&gt; / Copilot: &lt;strong&gt;1 point&lt;/strong&gt; / ChatGPT: &lt;strong&gt;2 points&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This is my final &lt;strong&gt;personal&lt;/strong&gt; ranking:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test&lt;/th&gt;
&lt;th&gt;Amazon Q&lt;/th&gt;
&lt;th&gt;Copilot&lt;/th&gt;
&lt;th&gt;ChatGPT&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Architecture question&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Technical issue&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost evaluation&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In my opinion, &lt;strong&gt;the winner is Amazon Q&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
It has been fun to run this experiment, and the result is not far from what I was expecting.  &lt;/p&gt;

&lt;p&gt;My experience with Amazon Q as a companion while architecting on AWS is absolutely positive. Like all other generative AI tools, it's far from perfect, but it can be an invaluable ally in my everyday work as a software architect.&lt;/p&gt;

&lt;p&gt;I am sure that in the future it will be improved and will become smarter, but &lt;strong&gt;today&lt;/strong&gt; it is already a useful tool to retrieve information, validate options and get good suggestions.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ai</category>
      <category>cloud</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Introducing Mr-Know-It-All-O-Matic</title>
      <dc:creator>Claudio Taverna</dc:creator>
      <pubDate>Fri, 17 Nov 2023 00:17:34 +0000</pubDate>
      <link>https://dev.to/aws-builders/introducing-mr-know-it-all-o-matic-3bpp</link>
      <guid>https://dev.to/aws-builders/introducing-mr-know-it-all-o-matic-3bpp</guid>
      <description>&lt;p&gt;&lt;strong&gt;An AI-driven dissenter built with AWS PartyRock&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What's PartyRock?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://partyrock.aws/"&gt;&lt;strong&gt;PartyRock&lt;/strong&gt;&lt;/a&gt; is a space where you can build &lt;em&gt;AI-generated apps&lt;/em&gt; in a playground powered by &lt;em&gt;Amazon Bedrock&lt;/em&gt;. You can use it to experiment and create applications at the speed of light.  &lt;/p&gt;

&lt;p&gt;Any attempt to explain how this application works would probably take longer than simply opening it and starting to play. So I won’t spend my time or yours explaining something that basically doesn’t need an explanation. And, of course, there is the &lt;a href="https://partyrock.aws/guide/getStarted"&gt;official guide&lt;/a&gt; for that.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;PartyRock&lt;/em&gt; is not a tool for technicians or generative AI experts. Even those who have never written a single line of code can create and deploy a web application with AI capabilities in minutes. In this case, the expression &lt;em&gt;“in minutes”&lt;/em&gt; is not just a slogan, it's real: I created my first &lt;strong&gt;online and working&lt;/strong&gt; generative AI application in literally &lt;strong&gt;10 minutes&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mr-Know-It-All-O-Matic
&lt;/h2&gt;

&lt;p&gt;While driving and thinking about an idea for a test application, I started reflecting on the hundreds of implementations I had seen in recent months. What do almost all generative AI applications have in common? They are &lt;strong&gt;repetitive&lt;/strong&gt;, a little bit &lt;strong&gt;opinionated&lt;/strong&gt;, &lt;strong&gt;pedantic&lt;/strong&gt;, and have a slightly &lt;strong&gt;tedious&lt;/strong&gt; know-it-all attitude.&lt;br&gt;&lt;br&gt;
So, what about taking these defects and turning them into a strength?&lt;/p&gt;

&lt;p&gt;Here we are.&lt;br&gt;&lt;br&gt;
I am proud to present to the world the great and revolutionary &lt;a href="https://partyrock.aws/u/ctaverna/GUOuGJTz3/Mr-Know-It-All-O-Matic"&gt;&lt;strong&gt;Mr-Know-It-All-O-Matic&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Whatever your job, whoever you talk to, your ideas will be questioned.&lt;br&gt;
If you suspect that an idea you are about to present in a slide is a little weak, you are probably right.&lt;br&gt;
If you expect that a friend, or anyone else, will be hard to convince about something, be sure the discussion will be long and passionate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option one&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
You are a good person. You are wise, and every day you strive to be better. This tool will help you think through the possible weaknesses of your argument and improve it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option two&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
You are a bad person, and your only goal in life is to annoy people with irritating objections.&lt;br&gt;
This tool will be an invaluable ally in your infinite disputes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://partyrock.aws/u/ctaverna/GUOuGJTz3/Mr-Know-It-All-O-Matic"&gt;The choice is yours.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://partyrock.aws/u/ctaverna/GUOuGJTz3/Mr-Know-It-All-O-Matic"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TNZR61Eb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b0aqdxb8h5z2pev26e4x.jpg" alt="" width="800" height="589"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Some examples
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;"Swimming is a healthy activity"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here is a condensed list of potential objections to the claim that swimming is a healthy activity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can be dangerous if proper precautions are not taken (e.g. risk of drowning, especially for non-swimmers)&lt;/li&gt;
&lt;li&gt;Risk of injury if not done with proper form/technique&lt;/li&gt;
&lt;li&gt;Risk of illness from swimming in contaminated water&lt;/li&gt;
&lt;li&gt;Overexertion or strain on joints/muscles if swum excessively&lt;/li&gt;
&lt;li&gt;Requires access to a pool or body of water suitable for swimming&lt;/li&gt;
&lt;li&gt;Time commitment to travel to a pool and change clothes can be a barrier&lt;/li&gt;
&lt;li&gt;Chlorine and other pool chemicals can irritate eyes, skin and lungs&lt;/li&gt;
&lt;li&gt;Can be costly if pool membership or swimming lessons are required&lt;/li&gt;
&lt;li&gt;Weather constraints for outdoor swimming&lt;/li&gt;
&lt;li&gt;Social/body image concerns about wearing a swimsuit&lt;/li&gt;
&lt;li&gt;Not a practical form of exercise for some people with disabilities or medical conditions&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;"The fire is hot"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here is a condensed list of potential objections to the rationale that "the fire is hot":&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fires can have varying temperatures, so not all fires are necessarily hot. A small, dying fire may not be hot.&lt;/li&gt;
&lt;li&gt;"Hot" is subjective and depends on one's heat tolerance. What's hot to one person may not be considered hot by someone else.&lt;/li&gt;
&lt;li&gt;The sensation of heat depends on how close one is to the fire. A fire may not feel hot from a distance.&lt;/li&gt;
&lt;li&gt;Different parts of a fire may be hotter than others. The center may be hotter than the edges, for example.&lt;/li&gt;
&lt;li&gt;"Hot" is a relative term. Compared to an oven, a campfire may not be considered hot.&lt;/li&gt;
&lt;li&gt;Certain types of fires, like alcohol fires, burn cooler than wood fires. So those types of fires may not be classified as hot.&lt;/li&gt;
&lt;li&gt;The rationale assumes all fires produce high temperatures, though in reality fire temperature varies greatly.&lt;/li&gt;
&lt;li&gt;Simply saying "the fire is hot" is too vague. More specifics are needed to properly evaluate the claim.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;"The addition of two whole numbers results in the total amount or sum of those values combined."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here is a condensed list of potential objections to the rationale that the addition of two whole numbers results in the total amount or sum of those values combined:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It assumes all numbers being added are whole numbers, but does not account for fractions or decimals.&lt;/li&gt;
&lt;li&gt;It does not specify that the numbers need to be positive - adding a positive and negative number does not result in their total amount.&lt;/li&gt;
&lt;li&gt;It assumes a simple addition of two numbers, but does not address more complex operations like adding three or more numbers.&lt;/li&gt;
&lt;li&gt;It does not define what a "whole number" is. Integers? Non-negative integers?&lt;/li&gt;
&lt;li&gt;It describes the basic principle of addition but does not provide any deeper mathematical foundation or proof.&lt;/li&gt;
&lt;li&gt;It does not specify the mathematical system - addition in abstract algebra has more nuance than simple whole number addition.&lt;/li&gt;
&lt;li&gt;The words "total amount" and "sum" are vague and informal mathematical terms. More precise language is needed in a formal proof.&lt;/li&gt;
&lt;li&gt;It does not address potential limits like overflow when adding very large numbers.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>aws</category>
      <category>webdev</category>
      <category>partyrockplayground</category>
    </item>
    <item>
      <title>How to run Jekyll on Kubernetes</title>
      <dc:creator>Claudio Taverna</dc:creator>
      <pubDate>Mon, 02 Oct 2023 15:40:22 +0000</pubDate>
      <link>https://dev.to/aws-builders/how-to-run-jekyll-on-kubernetes-aen</link>
      <guid>https://dev.to/aws-builders/how-to-run-jekyll-on-kubernetes-aen</guid>
      <description>&lt;p&gt;&lt;strong&gt;A step-by-step guide executing Jekyll inside a local k8s cluster&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I created my blog using &lt;a href="https://jekyllrb.com/"&gt;Jekyll&lt;/a&gt;, a great open-source tool that can transform your markdown content into a simple, old-fashioned-but-trendy, static site.&lt;br&gt;&lt;br&gt;
What are the advantages of this approach?&lt;br&gt;&lt;br&gt;
The site is super-light, super-fast, super-secure and SEO-friendly. Of course, it’s not always the best solution, but for some use cases, like a simple personal blog, it’s really a good option.&lt;/p&gt;
&lt;h2&gt;
  
  
  Aim of this guide
&lt;/h2&gt;

&lt;p&gt;Running Jekyll locally can be a little bit tricky, at least for me, as I’m not very comfortable with Ruby. &lt;br&gt;
So I decided to go for a containerized solution using Rancher Desktop.&lt;/p&gt;

&lt;p&gt;This guide can be a good starting point for getting familiar with both Jekyll and Kubernetes. If you know something about Kubernetes but have never used it, it can be an effective hands-on exercise.&lt;/p&gt;

&lt;p&gt;The idea is to make it easy to manage your blog using this pretty simple flow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write your blog posts locally as a bunch of simple markdown text files&lt;/li&gt;
&lt;li&gt;Jekyll generates the static site in real time and serves it locally&lt;/li&gt;
&lt;li&gt;When you are satisfied with the result, just commit and push the changes&lt;/li&gt;
&lt;li&gt;An automated workflow will update the online "production" static website&lt;/li&gt;
&lt;/ul&gt;
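&lt;p&gt;As a rough sketch, the local side of this flow boils down to plain git commands. Everything below is a hypothetical placeholder (demo path, post name, commit identity); the remote and CI wiring are omitted:&lt;/p&gt;

```shell
# Hypothetical local flow: write a post, commit it, and (in real life)
# push to trigger the automated deploy. Remote and CI wiring omitted.
cd "${TMPDIR:-/tmp}"
rm -rf myblog-demo && mkdir myblog-demo && cd myblog-demo
git init -q .
mkdir -p docs/_posts
echo "# Hello world" > docs/_posts/2023-10-02-hello.md   # a new markdown post
git add docs
git -c user.email=demo@example.com -c user.name=Demo commit -q -m "New post"
# git push origin main   # would trigger the deploy workflow
```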

&lt;p&gt;The workflow is now up and running, and I am very happy with it, but I had some trouble getting it to work.&lt;br&gt;
I’m on a Mac with macOS Sonoma 14.0.&lt;br&gt;
So, here is the step-by-step guide to get it working.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 1 - Install Rancher Desktop
&lt;/h2&gt;

&lt;p&gt;Nothing special to say here. I used brew, but you can of course install it as you prefer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;rancher 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I configured Rancher Desktop to use &lt;strong&gt;&lt;em&gt;dockerd&lt;/em&gt;&lt;/strong&gt; as the container engine because I am more familiar with Docker.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OP-XdWtd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zmojkqtjhun2mvwms0zt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OP-XdWtd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zmojkqtjhun2mvwms0zt.png" alt="Containerd selected as current container engine" width="540" height="302"&gt;&lt;/a&gt;&lt;br&gt;
 &lt;/p&gt;
&lt;h2&gt;
  
  
  Step 2 - Check that everything is ok
&lt;/h2&gt;

&lt;p&gt;Check with &lt;strong&gt;&lt;em&gt;kubectl&lt;/em&gt;&lt;/strong&gt; that your local k8s cluster has been configured properly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;johndoe@macbook ~ % kubectl config get-contexts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;CURRENT   NAME                CLUSTER            AUTHINFO          NAMESPACE
*         rancher-desktop     rancher-desktop    rancher-desktop
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check that your container engine can download and run images.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;johndoe@macbook ~ % docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; hello-world
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
70f5ac315c5a: Pull complete
Digest: sha256:4f53e2564790c8e7856ec08e384732aa38dc43c52f02952483e3f003afbf23db
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (arm64v8)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
&lt;/span&gt;&lt;span class="gp"&gt; $&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;docker run &lt;span class="nt"&gt;-it&lt;/span&gt; ubuntu bash
&lt;span class="go"&gt;
Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;During this step I had some network trouble with the Docker daemon pulling images.&lt;br&gt;&lt;br&gt;
I had to add some rules to my firewall so that Docker could download images properly.  &lt;/p&gt;

&lt;p&gt;If you have an output like the one above, you can go on.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 3 - Create the environment for your Jekyll site
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Main folder
&lt;/h3&gt;

&lt;p&gt;Create a folder on your local machine. It will be the root folder of your git repo.&lt;br&gt;&lt;br&gt;
For example: &lt;code&gt;/Users/johndoe/myblog&lt;/code&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Kubernetes manifests folder
&lt;/h3&gt;

&lt;p&gt;Create another folder for your YAML files. &lt;br&gt;
For example: &lt;code&gt;/Users/johndoe/myblog/k8s&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This is the folder where we will place our &lt;code&gt;.yaml&lt;/code&gt; files and execute &lt;code&gt;kubectl&lt;/code&gt; commands.&lt;br&gt;
I created it inside the main folder, but it would be better to use a separate location, because it's not a good idea to push these files to your blog's future public repository. Remember to move it outside before your first commit.&lt;/p&gt;
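&lt;p&gt;As a minimal sketch, the folder layout above can be created like this (the &lt;code&gt;BLOG_ROOT&lt;/code&gt; path is a hypothetical placeholder; use your own, e.g. &lt;code&gt;/Users/johndoe/myblog&lt;/code&gt;):&lt;/p&gt;

```shell
# Hypothetical layout; replace BLOG_ROOT with your own path.
BLOG_ROOT="${TMPDIR:-/tmp}/myblog"
mkdir -p "$BLOG_ROOT"        # git repo root, will contain the Jekyll site
mkdir -p "$BLOG_ROOT/k8s"    # manifests folder, keep it out of the public repo
```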
&lt;h2&gt;
  
  
  Step 4 - Kubernetes entities
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Create the namespace
&lt;/h3&gt;

&lt;p&gt;Create a new file named &lt;code&gt;namespace.yaml&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Namespace&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jekyll&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and apply it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;johndoe@macbook k8s % kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; namespace.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;namespace/jekyll created
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Create PV and PVC
&lt;/h3&gt;

&lt;p&gt;Now we can create a &lt;strong&gt;&lt;em&gt;persistent volume&lt;/em&gt;&lt;/strong&gt; and a &lt;strong&gt;&lt;em&gt;persistent volume claim&lt;/em&gt;&lt;/strong&gt;, to map a local folder into the container that will execute Jekyll.&lt;br&gt;&lt;br&gt;
Create a new file named &lt;code&gt;volume.yaml&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PersistentVolume&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jekyll-pv&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jekyll&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;local&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;storageClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hostpath&lt;/span&gt;
  &lt;span class="na"&gt;capacity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;256Mi&lt;/span&gt;
  &lt;span class="na"&gt;accessModes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ReadWriteMany&lt;/span&gt;
  &lt;span class="na"&gt;hostPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/Users/johndoe/myblog&lt;/span&gt;
  &lt;span class="na"&gt;persistentVolumeReclaimPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Retain&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PersistentVolumeClaim&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jekyll-pvc&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jekyll&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;storageClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hostpath&lt;/span&gt;
  &lt;span class="na"&gt;accessModes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ReadWriteMany&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;256Mi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and apply it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;johndoe@macbook k8s % kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; volume.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;persistentvolume/jekyll-pv created
persistentvolumeclaim/jekyll-pvc created
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scaffold a new Jekyll site
&lt;/h3&gt;

&lt;p&gt;In this step we will create a &lt;strong&gt;&lt;em&gt;job&lt;/em&gt;&lt;/strong&gt; that runs the &lt;code&gt;jekyll&lt;/code&gt; command to scaffold a brand-new site.&lt;/p&gt;

&lt;p&gt;The Docker image used is the official one: &lt;a href="https://hub.docker.com/r/jekyll/jekyll/"&gt;https://hub.docker.com/r/jekyll/jekyll/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Create a new file named &lt;code&gt;job-create-jekyll.yaml&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;batch/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Job&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jekyll-create&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jekyll&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jekyll-create-job&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jekyll/jekyll:latest&lt;/span&gt;
        &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jekyll-volume&lt;/span&gt;
            &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/srv/jekyll&lt;/span&gt;
        &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/bin/sh"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jekyll&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;new&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;docs"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;restartPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Never&lt;/span&gt;
      &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jekyll-volume&lt;/span&gt;
          &lt;span class="na"&gt;persistentVolumeClaim&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;claimName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jekyll-pvc&lt;/span&gt;
  &lt;span class="na"&gt;backoffLimit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and apply it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;johndoe@macbook k8s % kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; job-create-jekyll.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;job.batch/jekyll-create created
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If everything works fine, the job should log something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;Running bundle install in /srv/jekyll/docs...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;NB: to read k8s logs you can use kubectl, the Kubernetes dashboard, k9s, Lens, Rancher... if you don't know how, take some time to get comfortable with one of these tools.&lt;/p&gt;

&lt;p&gt;Check your local folder: you should now find a new &lt;code&gt;docs&lt;/code&gt; subfolder containing a brand-new Jekyll site.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--o-ZrFD0h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0hq6un66snsd2eyz84da.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--o-ZrFD0h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0hq6un66snsd2eyz84da.png" alt="The new docs folder created by Jekyll" width="608" height="176"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you like keeping your k8s cluster clean, delete the job before going on, for example with &lt;code&gt;kubectl delete job jekyll-create -n jekyll&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5 - Jekyll deployment
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Update the PV path
&lt;/h3&gt;

&lt;p&gt;Now that the site is ready to be served, we need to change the path of the persistent volume, because the folder that should be mounted in &lt;code&gt;/srv/jekyll&lt;/code&gt; is the &lt;code&gt;docs&lt;/code&gt; folder that has just been created.&lt;/p&gt;

&lt;p&gt;Edit the file &lt;code&gt;volume.yaml&lt;/code&gt; and add &lt;code&gt;/docs&lt;/code&gt; at the end of the path.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;...&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;hostPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/Users/johndoe/myblog/docs&lt;/span&gt;
&lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;...&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using your preferred Kubernetes management tool, delete the volume and the claim.&lt;br&gt;&lt;br&gt;
For example with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl delete pvc jekyll-pvc &lt;span class="nt"&gt;-n&lt;/span&gt; jekyll
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl delete pv jekyll-pv &lt;span class="nt"&gt;-n&lt;/span&gt; jekyll
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And apply it again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;johndoe@macbook k8s % kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; volume.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;persistentvolume/jekyll-pv created
persistentvolumeclaim/jekyll-pvc created
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Create the deployment
&lt;/h3&gt;

&lt;p&gt;Create a new file named &lt;code&gt;deployment.yaml&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jekyll-preview&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jekyll&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jekyll-preview&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jekyll-preview&lt;/span&gt;  
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;website&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jekyll/jekyll:latest&lt;/span&gt;
        &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8081&lt;/span&gt;
        &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jekyll-volume&lt;/span&gt;
            &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/srv/jekyll&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;JEKYLL_ENV&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production"&lt;/span&gt;
        &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/bin/sh"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jekyll&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;serve&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;--trace&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;--watch&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;--force_polling&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;--port&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;8080&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;--livereload&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;--livereload-port&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;8081"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jekyll-volume&lt;/span&gt;
          &lt;span class="na"&gt;persistentVolumeClaim&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;claimName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jekyll-pvc-2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and apply it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;johndoe@macbook k8s % kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; deployment.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;deployment.apps/jekyll-preview created
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 6 - Fix problems, if any
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Possible problem #1
&lt;/h3&gt;

&lt;p&gt;When the pod started for the first time, I got this error message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt; /usr/gem/gems/jekyll-4.2.2/lib/jekyll/commands/serve/servlet.rb:3:in `require': cannot load such file -- webrick (LoadError)
&lt;/span&gt;&lt;span class="gp"&gt; from /usr/gem/gems/jekyll-4.2.2/lib/jekyll/commands/serve/servlet.rb:3:in `&amp;lt;top (required)&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;'
&lt;/span&gt;&lt;span class="go"&gt; from /usr/gem/gems/jekyll-4.2.2/lib/jekyll/commands/serve.rb:179:in `require_relative'
 from /usr/gem/gems/jekyll-4.2.2/lib/jekyll/commands/serve.rb:179:in `setup'
 from /usr/gem/gems/jekyll-4.2.2/lib/jekyll/commands/serve.rb:100:in `process'
 from /usr/gem/gems/jekyll-4.2.2/lib/jekyll/command.rb:91:in `block in process_with_graceful_fail'
 from /usr/gem/gems/jekyll-4.2.2/lib/jekyll/command.rb:91:in `each'
 from /usr/gem/gems/jekyll-4.2.2/lib/jekyll/command.rb:91:in `process_with_graceful_fail'
 from /usr/gem/gems/jekyll-4.2.2/lib/jekyll/commands/serve.rb:86:in `block (2 levels) in init_with_program'
 from /usr/gem/gems/mercenary-0.4.0/lib/mercenary/command.rb:221:in `block in execute'
 from /usr/gem/gems/mercenary-0.4.0/lib/mercenary/command.rb:221:in `each'
 from /usr/gem/gems/mercenary-0.4.0/lib/mercenary/command.rb:221:in `execute' 
 from /usr/gem/gems/mercenary-0.4.0/lib/mercenary/program.rb:44:in `go' 
 from /usr/gem/gems/mercenary-0.4.0/lib/mercenary.rb:21:in `program' 
&lt;/span&gt;&lt;span class="gp"&gt; from /usr/gem/gems/jekyll-4.2.2/exe/jekyll:15:in `&amp;lt;top (required)&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;' 
&lt;/span&gt;&lt;span class="go"&gt; from /usr/gem/bin/jekyll:25:in `load' 
&lt;/span&gt;&lt;span class="gp"&gt; from /usr/gem/bin/jekyll:25:in `&amp;lt;main&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;'
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This error is related to a known issue: &lt;a href="https://github.com/jekyll/jekyll/issues/8523"&gt;https://github.com/jekyll/jekyll/issues/8523&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It can be resolved by simply adding &lt;code&gt;gem "webrick"&lt;/code&gt; at the bottom of the &lt;code&gt;Gemfile&lt;/code&gt; located in the root folder of the site:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;...&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="c1"&gt;# Lock `http_parser.rb` gem to `v0.6.x` on JRuby builds since newer versions of the gem&lt;/span&gt;
&lt;span class="c1"&gt;# do not have a Java counterpart.&lt;/span&gt;
&lt;span class="s"&gt;gem "http_parser.rb", "~&amp;gt; 0.6.0", :platforms =&amp;gt; [:jruby]&lt;/span&gt;

&lt;span class="s"&gt;gem "webrick"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And of course, you then have to delete the pod so that Kubernetes starts a new one.&lt;/p&gt;
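&lt;p&gt;A minimal sketch of that restart, assuming the &lt;code&gt;jekyll&lt;/code&gt; namespace and the &lt;code&gt;app: jekyll-preview&lt;/code&gt; label used elsewhere in this guide (adjust to your names):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl delete pod -n jekyll -l app=jekyll-preview
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The Deployment controller immediately schedules a replacement pod, which picks up the updated &lt;code&gt;Gemfile&lt;/code&gt; from the persistent volume.&lt;/p&gt;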

&lt;h3&gt;
  
  
  Possible problem #2
&lt;/h3&gt;

&lt;p&gt;After overcoming the previous problem, the pod started logging this error message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;[...]
chown: .jekyll-cache/Jekyll/Cache: Permission denied
chown: .jekyll-cache/Jekyll: Permission denied
chown: .jekyll-cache/Jekyll: Permission denied
chown: .jekyll-cache: Permission denied
chown: .jekyll-cache: Permission denied
[...]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is probably a cleaner solution, but I worked around the issue with a rough but effective manual deletion of two folders before starting Jekyll:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;echo&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Cleaning&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;temporary&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;files...;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;rm&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;-r&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;_site;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;rm&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;-r&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;.jekyll-cache;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;echo&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;---Done---&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;jekyll&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;serve&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;--trace&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;--watch&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;--force_polling&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;--port&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;8080&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;--livereload&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span 
class="s"&gt;--livereload-port&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;8081"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 7 - Create the service or the port forward
&lt;/h2&gt;

&lt;p&gt;Now you should finally have this output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hLRRqbXD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xc072ixhe3j7e9x22n05.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hLRRqbXD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xc072ixhe3j7e9x22n05.png" alt="Jekyll logs after a successful start" width="800" height="497"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For a quick check that the site is working, you can now create a port forward from the pod to localhost:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl port-forward jekyll-preview-xyz123xyz-xyz12 8080:8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A better approach is, of course, to create a Service, which gives you a stable way to reach your deployment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Create the service
&lt;/h3&gt;

&lt;p&gt;Create a new file named &lt;code&gt;service.yaml&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jekyll-svc&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jekyll&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jekyll-preview&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http&lt;/span&gt; 
    &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http-livereload&lt;/span&gt;
    &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8081&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LoadBalancer&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and apply it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;johndoe@macbook k8s % kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; service.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;service/jekyll-svc created
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can now point your browser to &lt;code&gt;http://localhost:8080&lt;/code&gt; and here we are!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BpMcNQtj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ciuf8ubb1l5q7sxkyg3s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BpMcNQtj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ciuf8ubb1l5q7sxkyg3s.png" alt="Jekyll output in your local browser" width="800" height="558"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Enjoy livereload
&lt;/h3&gt;

&lt;p&gt;It's time to try the complete workflow, with the cool live-reload feature taking you from the original Markdown file to the locally served static site.&lt;br&gt;&lt;br&gt;
Edit a file, for example the sample post that Jekyll created for you, and change something.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KO8JcG9G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/87ntr34x2mw3wq355bu4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KO8JcG9G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/87ntr34x2mw3wq355bu4.png" alt="Markdown file with a little text change" width="800" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Save the file and, after a few seconds, you should see your browser automatically refresh the page and show your change.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jXh647Q_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/33l1uab2at31j30jgpqx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jXh647Q_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/33l1uab2at31j30jgpqx.png" alt="Jekyll output with the change just made" width="800" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I find this setup perfect if you work with two screens: edit the Markdown on one while checking the final result on the other.&lt;/p&gt;

&lt;h3&gt;
  
  
  Build your site
&lt;/h3&gt;

&lt;p&gt;Now that you have a working local deployment of Jekyll, you can play with it and build your site: add themes, change styles, and add content.&lt;br&gt;&lt;br&gt;
When you are satisfied with the result, you are ready to put it online, accessible to everybody.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 8 - Go online
&lt;/h2&gt;

&lt;p&gt;Since it's a simple static site, you have many options for exposing it on the public web, and the interesting point is that you don't need costly computing resources.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option A - Deploy on Amazon S3
&lt;/h3&gt;

&lt;p&gt;If you are familiar with AWS, this is definitely one of the simplest options to put your site online.  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a new S3 bucket, disabling the option &lt;em&gt;Block all public access&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NolYmpsN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1f9vs6u13bdorfk7l9nj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NolYmpsN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1f9vs6u13bdorfk7l9nj.png" alt="'Block all public access' deselected" width="800" height="250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Upload all the content of the &lt;code&gt;_site&lt;/code&gt; folder into the bucket.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Open the &lt;em&gt;Properties&lt;/em&gt; tab and enable the &lt;em&gt;Static website hosting&lt;/em&gt; feature.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NnOWIklG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/abb46z1g7utrj4vys911.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NnOWIklG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/abb46z1g7utrj4vys911.png" alt="'Static website hosting' enabled" width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Confirm the index document filename as &lt;code&gt;index.html&lt;/code&gt; and save.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Open the &lt;em&gt;Permissions&lt;/em&gt; tab and edit the bucket policy as follows (use your bucket name in the &lt;em&gt;Resource&lt;/em&gt; field):&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Sid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"PublicReadGetObject"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Principal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"s3:GetObject"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:s3:::johndoe-blog/*"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Now you can point your browser to the bucket web address:&lt;br&gt;&lt;br&gt;
&lt;code&gt;http://johndoe-blog.s3-website.eu-north-1.amazonaws.com&lt;/code&gt;.&lt;br&gt;&lt;br&gt;
(You can find it at the bottom of the &lt;em&gt;Properties&lt;/em&gt; tab)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The last action is to create or update a CNAME record in your DNS provider's configuration console, in order to map your custom domain name to the S3 website URL:&lt;br&gt;&lt;br&gt;
&lt;code&gt;johndoe.com CNAME johndoe-blog.s3-website.eu-north-1.amazonaws.com&lt;/code&gt;&lt;br&gt;&lt;br&gt;
(Note that S3 website hosting resolves requests by host name, so for a custom domain the bucket must be named after the host you map)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
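&lt;p&gt;Once the record has propagated, you can verify it with a quick DNS lookup (domain name taken from the hypothetical example above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dig +short johndoe.com CNAME
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;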

&lt;p&gt;There are many options for automating the deployment.  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The simplest is probably to use the AWS CLI command &lt;code&gt;aws s3 sync&lt;/code&gt;. In this case you simply sync your output folder &lt;code&gt;_site&lt;/code&gt; with the bucket root.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A cleaner solution could be using CodeCommit and CodePipeline to automate the S3 bucket update when you push your changes to your git repository. You can follow &lt;a href="https://docs.aws.amazon.com/codepipeline/latest/userguide/tutorials-s3deploy.html"&gt;this tutorial&lt;/a&gt; for more information about this procedure.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
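&lt;p&gt;As a sketch, the sync could look like this (bucket name taken from the example above; the &lt;code&gt;--delete&lt;/code&gt; flag also removes remote files that no longer exist locally):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws s3 sync _site/ s3://johndoe-blog --delete
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;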

&lt;h3&gt;
  
  
  Option B - GitHub pages
&lt;/h3&gt;

&lt;p&gt;Another interesting option is using GitHub + &lt;a href="https://pages.github.com/"&gt;GitHub Pages&lt;/a&gt; + GitHub Actions.&lt;br&gt;&lt;br&gt;
In this case you can leverage this interesting workflow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Put all your source code in a repository named &lt;code&gt;&amp;lt;your-github-username&amp;gt;.github.io&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Commit and push your changes&lt;/li&gt;
&lt;li&gt;GitHub Actions will do all the magic&lt;/li&gt;
&lt;li&gt;Point your browser to &lt;code&gt;https://&amp;lt;your-github-username&amp;gt;.github.io&lt;/code&gt; and enjoy&lt;/li&gt;
&lt;/ul&gt;
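&lt;p&gt;If you prefer an explicit workflow over the default Pages build, a minimal sketch could look like the following (action versions, the Ruby version, and the branch name are assumptions; check GitHub's official starter workflows):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;name: Build and deploy Jekyll site
on:
  push:
    branches: ["main"]
permissions:
  contents: read
  pages: write
  id-token: write
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: ruby/setup-ruby@v1
        with:
          ruby-version: "3.2"   # assumed; match your local Ruby version
          bundler-cache: true   # installs gems from the Gemfile and caches them
      - run: bundle exec jekyll build
      - uses: actions/upload-pages-artifact@v3
        with:
          path: _site
  deploy:
    needs: build
    runs-on: ubuntu-latest
    environment:
      name: github-pages
    steps:
      - uses: actions/deploy-pages@v4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;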

&lt;p&gt;If you want to try GitHub Actions, this could be a good way to start familiarizing yourself with it.&lt;br&gt;&lt;br&gt;
An important point to highlight, which can be considered good or bad depending on your point of view, is that in this case the static site is generated by another Jekyll instance managed by GitHub, so you will have to check that everything works and matches the output of your local Jekyll instance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I hope this guide is useful to someone; for me it has been an interesting learn-by-doing experience.&lt;br&gt;&lt;br&gt;
If you see errors or have suggestions for handling the process in a better or cleaner way, I will be happy to incorporate them and improve the solution.&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>devops</category>
      <category>aws</category>
      <category>github</category>
    </item>
    <item>
      <title>Study Guide for AWS Associate Level Certifications - Part 1</title>
      <dc:creator>Claudio Taverna</dc:creator>
      <pubDate>Sun, 09 Jul 2023 13:45:12 +0000</pubDate>
      <link>https://dev.to/ctaverna/study-guide-for-aws-associate-level-certifications-part-1-b5c</link>
      <guid>https://dev.to/ctaverna/study-guide-for-aws-associate-level-certifications-part-1-b5c</guid>
      <description>&lt;h2&gt;
  
  
  My recap notes for the 3 certification exams
&lt;/h2&gt;

&lt;p&gt;While studying for the 3 AWS Associate level certifications I started taking notes in markdown format.&lt;/p&gt;

&lt;p&gt;This is because, to remember things, I have to write them down, producing short summaries of the information I am reading; it's the only method that works for me.&lt;br&gt;&lt;br&gt;
I later realized that this &lt;strong&gt;recap content&lt;/strong&gt;, in addition to being a support to me as a refresher before the exams, might also be useful to others, so I decided to publish it on my blog.&lt;br&gt;
I think and hope that it could be used as an additional study source, as a refresher, or even as a super-fast introduction to some AWS topics that you are approaching.&lt;/p&gt;

&lt;p&gt;I will release the content gradually &lt;strong&gt;over the coming weeks&lt;/strong&gt;, after a small review.&lt;br&gt;&lt;br&gt;
The content of the three exams is different, but there is a lot of overlap, and a general overview can be useful in any case, whether you are studying for &lt;a href="https://aws.amazon.com/certification/certified-solutions-architect-associate/"&gt;AWS Certified Solutions Architect - Associate&lt;/a&gt;, &lt;a href="https://aws.amazon.com/certification/certified-developer-associate/"&gt;AWS Certified Developer - Associate&lt;/a&gt;, or &lt;a href="https://aws.amazon.com/certification/certified-sysops-admin-associate/"&gt;AWS Certified SysOps Administrator - Associate&lt;/a&gt;.&lt;br&gt;
In the end, I will also share a matrix showing which services each exam focuses on.&lt;/p&gt;

&lt;p&gt;This is the content that comes with &lt;strong&gt;Part 1&lt;/strong&gt;, released today:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://ctaverna.github.io/aws-study-guide/"&gt;Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concepts&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://ctaverna.github.io/aws-study-guide/concepts/definitions/"&gt;Definitions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ctaverna.github.io/aws-study-guide/concepts/aws-infrastructure/"&gt;The AWS infrastructure&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ctaverna.github.io/aws-study-guide/concepts/services-overview/"&gt;Services overview&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ctaverna.github.io/aws-study-guide/concepts/dr-and-support/"&gt;DR &amp;amp; Support&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Networking&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://ctaverna.github.io/aws-study-guide/networking/vpc/"&gt;VPC&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ctaverna.github.io/aws-study-guide/networking/route53/"&gt;Route 53&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ctaverna.github.io/aws-study-guide/networking/elb/"&gt;ELB&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ctaverna.github.io/aws-study-guide/networking/cloudfront-and-caching/"&gt;Cloudfront&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Management&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://ctaverna.github.io/aws-study-guide/management/cloudformation-and-sam/"&gt;CloudFormation and SAM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ctaverna.github.io/aws-study-guide/management/systems-manager/"&gt;Systems Manager (SSM)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ctaverna.github.io/aws-study-guide/management/opsworks-beanstalk-lightsail/"&gt;OpsWorks, Beanstalk, Lightsail&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ctaverna.github.io/aws-study-guide/management/cloudwatch/"&gt;CloudWatch&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ctaverna.github.io/aws-study-guide/management/cloud-trail-vpc-flow-logs/"&gt;CloudTrail, VPC Flow Logs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security, Identity and compliance&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://ctaverna.github.io/aws-study-guide/security-identity-compliance/authentication-and-authorization/"&gt;Authentication and authorization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ctaverna.github.io/aws-study-guide/security-identity-compliance/iam/"&gt;IAM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ctaverna.github.io/aws-study-guide/security-identity-compliance/secrets/"&gt;Secrets&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the next part I will share my notes about computing, databases, storage, developer tools, and more.&lt;br&gt;&lt;br&gt;
I will also release a collection of the most interesting &lt;strong&gt;sample exam questions&lt;/strong&gt; that I have come across while studying and practicing.&lt;/p&gt;

&lt;p&gt;Please let me know if you find this useful in any way or if you find errors or inaccuracies in the content.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>certification</category>
      <category>cloud</category>
      <category>learning</category>
    </item>
    <item>
      <title>How to deal with 40 million documents per day with AWS OpenSearch</title>
      <dc:creator>Claudio Taverna</dc:creator>
      <pubDate>Mon, 03 Jul 2023 14:28:02 +0000</pubDate>
      <link>https://dev.to/ctaverna/how-to-deal-with-40-million-documents-per-day-with-aws-opensearch-40pp</link>
      <guid>https://dev.to/ctaverna/how-to-deal-with-40-million-documents-per-day-with-aws-opensearch-40pp</guid>
      <description>&lt;p&gt;&lt;strong&gt;An approach to compress data without losing (almost) any data&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Context
&lt;/h2&gt;

&lt;p&gt;This story is about an IoT project, hosted on AWS, ingesting data produced by some devices. &lt;br&gt;
The system is a distributed architecture where, after the ingestion stage, the raw data passes through an &lt;em&gt;Amazon Kinesis Data Stream&lt;/em&gt;.&lt;br&gt;&lt;br&gt;
This stream is the source for a bunch of services, each one responsible for different processing that we need to perform on the data.&lt;br&gt;&lt;br&gt;
Kinesis is a key point of the system, because it allows each consumer to read events and process them &lt;strong&gt;at its own pace&lt;/strong&gt;, making each processing pipeline independent of the others. Kinesis can also absorb traffic peaks and allows ingestion to continue when one or more processing pipelines are temporarily disabled, for any reason.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7d_bOdNp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lpvysabailz57glm181p.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7d_bOdNp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lpvysabailz57glm181p.jpg" alt="Kinesis Data Stream is the source of all processing pipelines" width="800" height="263"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The cross-checking need
&lt;/h2&gt;

&lt;p&gt;During the project's evolution, it became important to validate the processing performed on the data stream. The idea was to find a way to save the raw data &lt;em&gt;somewhere&lt;/em&gt;, so that it would become possible to cross-check the results, troubleshoot quickly, and validate the processing.&lt;br&gt;&lt;br&gt;
When the quantity of data is huge, you cannot simply save it in a relational database and query it in all possible ways, present and future. But OpenSearch (and &lt;em&gt;ElasticSearch&lt;/em&gt; too, of course) can do it pretty well, and it is also a great tool to query and filter data, run advanced aggregations, and visualize the results, literally &lt;strong&gt;in seconds&lt;/strong&gt;, without the need for preliminary &lt;em&gt;data preparation&lt;/em&gt;.&lt;br&gt;&lt;br&gt;
So we added a sort of &lt;strong&gt;sniffer component&lt;/strong&gt; that started to consume events from Kinesis and save them into OpenSearch, after a quick enrichment/denormalization step.&lt;br&gt;&lt;br&gt;
The result was an index with all the data from all devices, and OpenSearch has proven to be a great tool for our troubleshooting needs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BkgsintQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/motugd7wa8badg5wjh8m.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BkgsintQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/motugd7wa8badg5wjh8m.jpg" alt='Data "sniffing" to enable cross-checking on processed data' width="800" height="318"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance and cost considerations
&lt;/h2&gt;

&lt;p&gt;OpenSearch is very good with big quantities of data, but in our scenario each document is about 2KB and we have to store 40 million documents every day.&lt;br&gt;&lt;br&gt;
We started using daily indexes. From a performance perspective, OpenSearch worked very well, but each index was about 22 Gigabytes, which means something like 8 Terabytes per year of data. A bit too much for a service that after all is not strictly &lt;em&gt;part of the system&lt;/em&gt;, but is something like a "data observability tool".&lt;br&gt;&lt;br&gt;
The cost soon became too high, and the first option we considered was to delete old data, keeping only the most recent, for example the last two or three months.&lt;br&gt;&lt;br&gt;
It was certainly an option, but sometimes it would have been interesting to query long-term data, for example the data from a single device over a long period of time, one year or even more.&lt;br&gt;&lt;br&gt;
A better alternative was to find a way to aggregate the data, reducing its size so we could keep more history at a similar cost.&lt;/p&gt;
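&lt;p&gt;For reference, the retention-only option can be expressed with an OpenSearch Index State Management (ISM) policy; a minimal sketch, where the policy id, the index pattern, and the 60-day threshold are hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;PUT _plugins/_ism/policies/delete_old_raw_data
{
  "policy": {
    "description": "Delete raw indexes older than 60 days",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [],
        "transitions": [
          { "state_name": "delete", "conditions": { "min_index_age": "60d" } }
        ]
      },
      {
        "name": "delete",
        "actions": [ { "delete": {} } ],
        "transitions": []
      }
    ],
    "ism_template": {
      "index_patterns": ["raw-data-*"],
      "priority": 100
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We preferred the aggregation approach, but a policy like this is the simplest way to cap the index footprint.&lt;/p&gt;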

&lt;h2&gt;
  
  
  Rollup jobs
&lt;/h2&gt;

&lt;p&gt;The cleanest way seemed to be using &lt;a href="https://opensearch.org/docs/latest/im-plugin/index-rollups/index/"&gt;Index rollups&lt;/a&gt;, a feature that automatically reduces data granularity by computing metric aggregations, in a completely transparent way.&lt;br&gt;&lt;br&gt;
After some tests, we ran into errors with our data. Perhaps they were caused by a problem in the data itself, or maybe by a bug, since OpenSearch was relatively young. We never discovered the reason behind those failures: we had no useful logs and no time to spend on the investigation.&lt;br&gt;&lt;br&gt;
In addition, this solution is limited in the types of aggregations it supports, because &lt;em&gt;rollup jobs&lt;/em&gt; can perform only &lt;em&gt;All&lt;/em&gt;, &lt;em&gt;Min&lt;/em&gt;, &lt;em&gt;Max&lt;/em&gt;, &lt;em&gt;Sum&lt;/em&gt;, &lt;em&gt;Avg&lt;/em&gt;, and &lt;em&gt;Value Count&lt;/em&gt;.&lt;br&gt;&lt;br&gt;
In our scenario more freedom in the aggregation logic would have been useful, for example to collect the &lt;em&gt;distribution&lt;/em&gt; of a metric.&lt;br&gt;&lt;br&gt;
So we explored the option of using an &lt;a href="https://opensearch.org/docs/latest/im-plugin/index-transforms/index/"&gt;index transform&lt;/a&gt;, and through this feature we finally reached our goal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Transform job
&lt;/h2&gt;

&lt;p&gt;The solution is based on the idea of having 2 different sets of data, for different needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;sliding window&lt;/strong&gt; of uncompressed &lt;strong&gt;raw data&lt;/strong&gt;
(in our case configured to be the last 2 months)&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;historical set&lt;/strong&gt; of &lt;strong&gt;compressed data&lt;/strong&gt;, going into the past as long as possible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jdPZkoXV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y37ov5l7tefxcu5gngqm.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jdPZkoXV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y37ov5l7tefxcu5gngqm.jpg" alt="The timeline with the two datasetsn" width="800" height="230"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In short, this is how the solution works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An &lt;em&gt;EventBridge&lt;/em&gt; rule is scheduled to run every day and trigger a Lambda function&lt;/li&gt;
&lt;li&gt;The Lambda creates a monthly index and its &lt;a href="https://opensearch.org/docs/2.4/opensearch/mappings/"&gt;mapping&lt;/a&gt; for the compressed data (in OpenSearch these operations are idempotent)&lt;/li&gt;
&lt;li&gt;The Lambda deletes the daily indexes of uncompressed data older than the configured sliding window (in our case, the daily index from 61 days before the execution time is deleted)&lt;/li&gt;
&lt;li&gt;The Lambda creates the daily transform job, scheduled for immediate execution&lt;/li&gt;
&lt;li&gt;The transform job runs, processing the data of the previous day&lt;/li&gt;
&lt;li&gt;The data is aggregated in blocks of 1 hour, per single device&lt;/li&gt;
&lt;li&gt;Some metrics are aggregated using the available standard aggregations (&lt;em&gt;min&lt;/em&gt;, &lt;em&gt;max&lt;/em&gt;, &lt;em&gt;average&lt;/em&gt; and so on)&lt;/li&gt;
&lt;li&gt;Other metrics are processed with custom &lt;a href="https://opensearch.org/docs/latest/aggregations/metric-agg/#scripted_metric"&gt;scripted_metric aggregations&lt;/a&gt;, leveraging the flexibility of map/combine/reduce scripting; in our case the data is reduced into custom distribution reports&lt;/li&gt;
&lt;/ul&gt;
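&lt;p&gt;As a rough sketch of what the Lambda submits, the body of such a daily transform job could be built as follows. The index names, field names, and the elided Painless scripts are hypothetical, and the exact job schema should be checked against the OpenSearch index-transform documentation:&lt;/p&gt;

```python
def build_daily_transform_body(day: str) -> dict:
    """Build the body of a daily OpenSearch transform job (sketch).

    `day` is an ISO date like "2023-06-16"; index and field names are
    made up for the example.
    """
    return {
        "transform": {
            "enabled": True,
            "description": f"Aggregate raw metrics for {day}",
            "source_index": f"metrics-raw-{day}",      # daily raw index
            "target_index": f"metrics-agg-{day[:7]}",  # monthly index, e.g. metrics-agg-2023-06
            "page_size": 1000,
            # One output document per device and per hour
            "groups": [
                {"terms": {"source_field": "device_id",
                           "target_field": "device_id"}},
                {"date_histogram": {"source_field": "timestamp",
                                    "fixed_interval": "1h",
                                    "target_field": "hour"}},
            ],
            "aggregations": {
                "temperature_avg": {"avg": {"field": "temperature"}},
                "temperature_max": {"max": {"field": "temperature"}},
                # A scripted_metric reduces values into a custom
                # distribution report via map/combine/reduce scripts
                # (Painless scripts elided here)
                "signal_distribution": {
                    "scripted_metric": {
                        "init_script": "state.buckets = new HashMap()",
                        "map_script": "...",
                        "combine_script": "return state.buckets",
                        "reduce_script": "...",
                    }
                },
            },
        }
    }
```

&lt;p&gt;The Lambda would then submit this body to the transform REST API and start the job for immediate execution.&lt;/p&gt;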

&lt;h2&gt;
  
  
  Advantages and disadvantages
&lt;/h2&gt;

&lt;p&gt;Depending on the investigation we need to perform, we can decide whether it is better to use the most recent raw data as-is, or the historical compressed data.&lt;br&gt;&lt;br&gt;
If, for example, we need to check exactly what a specific device sent last week, we can use the daily indexes of the uncompressed raw data.&lt;br&gt;&lt;br&gt;
If instead we need to study a trend over the last year, the compressed monthly indexes are the right data source.  &lt;/p&gt;

&lt;p&gt;A negative aspect of this solution is that when we start working on a visualization, we have to decide in advance which of the two data sources to use. The schemas are different, so a visualization created on the daily indexes will not work on the monthly indexes, and vice versa.&lt;br&gt;&lt;br&gt;
It's not really a major problem, because you usually know which source fits a specific need. That said, the index rollup feature is definitely better from this point of view, because there is a single data source and you don't have to handle this duality.&lt;/p&gt;

&lt;h2&gt;
  
  
  Some quick calculations
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Daily raw-data uncompressed indexes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each JSON document is about 1.6 KB&lt;/li&gt;
&lt;li&gt;Each index contains on average 40 million documents&lt;/li&gt;
&lt;li&gt;The daily index size is on average &lt;strong&gt;23 GB&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;One month of data is about &lt;strong&gt;700 GB&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Monthly aggregated indexes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each JSON document is about 2.2 KB&lt;/li&gt;
&lt;li&gt;Each index contains on average 15 million documents&lt;/li&gt;
&lt;li&gt;The monthly index size is on average &lt;strong&gt;6 GB&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
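&lt;p&gt;A couple of lines of arithmetic (assuming 30-day months) show where the headline figures come from:&lt;/p&gt;

```python
# Daily raw-data indexes
raw_daily_gb = 23
raw_monthly_gb = raw_daily_gb * 30          # 690 GB, "about 700 GB"
raw_yearly_tb = raw_monthly_gb * 12 / 1000  # ~8.3 TB per year

# Monthly aggregated indexes
agg_monthly_gb = 6

# One month of raw data vs the same month aggregated
compression_ratio = raw_monthly_gb / agg_monthly_gb  # 115, roughly 100:1

# With 1 replica the aggregated indexes double in size; how many years of
# history fit in the disk budget of a single month of raw data?
years_in_raw_month_budget = raw_monthly_gb / (agg_monthly_gb * 2) / 12  # ~4.8
```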

&lt;h2&gt;
  
  
  Final considerations
&lt;/h2&gt;

&lt;p&gt;It's easy to see that the resulting &lt;strong&gt;compression ratio&lt;/strong&gt; is roughly &lt;strong&gt;100:1&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
This means that, even configuring the historical indexes with 1 replica and thus doubling their size, the disk space needed to hold &lt;strong&gt;one month&lt;/strong&gt; of raw data allows us to store more than &lt;strong&gt;4 years&lt;/strong&gt; of aggregated data.&lt;br&gt;&lt;br&gt;
In addition, OpenSearch is much faster at querying the aggregated indexes.&lt;/p&gt;

&lt;p&gt;Before implementing this solution we stored 1 year of raw data, but due to performance problems we had to scale the OpenSearch cluster out to 6 nodes. This generated a considerable cost, both for compute and storage resources.&lt;br&gt;&lt;br&gt;
With this &lt;strong&gt;serverless solution&lt;/strong&gt; that automates the data aggregation, we were able to &lt;strong&gt;reduce the cluster size&lt;/strong&gt; to just 3 nodes, each one with &lt;strong&gt;smaller storage&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In other words, with a small and acceptable loss of data, today we &lt;strong&gt;spend less money&lt;/strong&gt; while we can afford to &lt;strong&gt;retain all historical data&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>database</category>
      <category>architecture</category>
      <category>nosql</category>
      <category>aws</category>
    </item>
    <item>
      <title>Design for failure by using Dead Letter Queues (DLQ)</title>
      <dc:creator>Claudio Taverna</dc:creator>
      <pubDate>Sat, 17 Jun 2023 08:19:07 +0000</pubDate>
      <link>https://dev.to/ctaverna/design-for-failure-by-using-dead-letter-queues-dlq-2g65</link>
      <guid>https://dev.to/ctaverna/design-for-failure-by-using-dead-letter-queues-dlq-2g65</guid>
      <description>&lt;p&gt;&lt;strong&gt;What are dead letters and why they are essential in any distributed system&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a dead letter
&lt;/h2&gt;

&lt;p&gt;Let's start with a short definition, effective in the context of distributed systems and messaging platforms:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A &lt;em&gt;dead letter&lt;/em&gt; is simply a container where &lt;em&gt;messages&lt;/em&gt; that cannot be delivered are collected and stored.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This container is usually a queue, but not necessarily: the concept is broader and not tied to a specific data structure.&lt;br&gt;&lt;br&gt;
This article is not focused on a specific broker or technology, but addresses the subject from a general perspective. Each system has its own specifics, and the official documentation is definitely the way to go.&lt;/p&gt;

&lt;h2&gt;
  
  
  A short off-topic about the name
&lt;/h2&gt;

&lt;p&gt;The term "dead letter" is due to an expression borrowed by the postal world.&lt;br&gt;&lt;br&gt;
There are strong analogies between real-world mail deliveries and how, in distributed systems, messages are delivered between software components.&lt;br&gt;&lt;br&gt;
In the real world, there are many reasons for a letter or a package to be marked as "undeliverable": both recipient and sender addresses are incorrect, the envelope is damaged and addresses are no more readable, both recipient and sender are no more available, the content is not compliant with postal regulations, and many others.&lt;br&gt;&lt;br&gt;
All postal companies in different countries have faced the problem of handling all undeliverable mail, usually with a dedicated office having the critical responsibility of opening private letters and packages, searching for clues about the recipient or the sender, to finally becoming able to deliver it or move to the extreme solution of destroying or auctioning the content.&lt;br&gt;&lt;br&gt;
If you like to waste some time reading fun facts, consider googling about the story and the statistics of &lt;em&gt;Dead Letter Offices&lt;/em&gt; all around the world, you'll find many astonishing numbers and funny stories about strange contents, from alive rattlesnakes to smelly dead fishes, human skulls, bags full of money, drugs, and weapons.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Design for failure
&lt;/h2&gt;

&lt;p&gt;Jumping back to distributed systems: when a messaging system is responsible for asynchronous communication between components, it's essential to provide a mechanism to handle failures.&lt;br&gt;&lt;br&gt;
This is good advice in general. As Amazon CTO Werner Vogels wisely said: &lt;strong&gt;"Everything fails, all the time"&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
In other words, the point is not &lt;em&gt;IF&lt;/em&gt; something will fail, but only &lt;em&gt;WHEN&lt;/em&gt; it will happen.&lt;br&gt;&lt;br&gt;
Hence we need to design systems so that they continue to work, as much as possible, both &lt;em&gt;during&lt;/em&gt; and &lt;em&gt;after&lt;/em&gt; these foreseeable failures.&lt;br&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwofcbqzxwmp6sr0nh0tz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwofcbqzxwmp6sr0nh0tz.jpg" alt="A fire estinguisher"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How to handle errors
&lt;/h2&gt;

&lt;p&gt;Any asynchronous communication can potentially encounter different kinds of problems related to delivery:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The recipient is not available&lt;/li&gt;
&lt;li&gt;The recipient explicitly refuses the message&lt;/li&gt;
&lt;li&gt;The recipient does not acknowledge the message&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The problem can be transient or persistent, and the consequent action should change accordingly.&lt;br&gt;&lt;br&gt;
But when a delivery failure happens, the broker between two components has only a few options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Try again&lt;/li&gt;
&lt;li&gt;Lock the queue&lt;/li&gt;
&lt;li&gt;Discard the message and continue&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Option 1: Retry
&lt;/h3&gt;

&lt;p&gt;Retrying is a good option for transient errors, but it is useless, if not counterproductive, when the error is persistent.&lt;br&gt;&lt;br&gt;
Retries should wait some time between attempts, typically following an &lt;a href="https://en.wikipedia.org/wiki/Exponential_backoff" rel="noopener noreferrer"&gt;exponential backoff algorithm&lt;/a&gt;. The retry process can then take a long time, which effectively results in a temporary lock; this can be unacceptable for high-load systems, as discussed in the next option.&lt;br&gt;&lt;br&gt;
Also, even a transient error, if repeated many times, should in practice be treated as a persistent one. An infinite loop is never a good idea, and at a certain point it is advisable to stop trying something that is not working.&lt;br&gt;&lt;br&gt;
Generally speaking, retrying is good, but it's usually not enough.&lt;/p&gt;
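&lt;p&gt;As a minimal, broker-agnostic sketch (the &lt;code&gt;deliver&lt;/code&gt; callable and the delay values are hypothetical), a retry loop with exponential backoff and a hard attempt limit could look like this:&lt;/p&gt;

```python
import random
import time

def deliver_with_backoff(deliver, message, max_attempts=5, base_delay=0.5):
    """Try to deliver a message, waiting exponentially longer between attempts.

    `deliver` is any callable that raises on failure. After `max_attempts`
    the caller should dead-letter the message instead of retrying forever.
    """
    for attempt in range(max_attempts):
        try:
            return deliver(message)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up: the caller moves the message to the DLQ
            # Exponential backoff with jitter: roughly 0.5s, 1s, 2s, 4s ...
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            time.sleep(delay)
```

&lt;p&gt;A transient error is absorbed by the retries; a persistent one eventually bubbles up after the last attempt, which is the moment to put the message aside.&lt;/p&gt;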

&lt;h3&gt;
  
  
  Option 2: Lock
&lt;/h3&gt;

&lt;p&gt;Locking the queue is mandatory when there is a strict constraint on message ordering, but fortunately this is not the most common scenario. Most distributed systems are designed to handle unordered messages, and &lt;em&gt;poison messages&lt;/em&gt; can simply be put aside for retry or future investigation.&lt;br&gt;&lt;br&gt;
Locking can sometimes be a reasonable option, but in many cases, especially when a critical process has a huge quantity of data to handle, it is simply not acceptable.&lt;br&gt;&lt;br&gt;
A locked queue keeps growing indefinitely, which is not sustainable for long and can lead to other, even worse infrastructure problems. In addition, consumers cannot proceed on a locked queue, which means that someone (a user or another piece of software) is not receiving the expected data and experiences a malfunction or an unexpected delay.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 3: Discard and continue
&lt;/h3&gt;

&lt;p&gt;And so we come to the third option: discard the message and continue processing the others. Here, "discard" doesn't mean deleting the message; it means removing it from the queue and putting it somewhere else.&lt;br&gt;&lt;br&gt;
This is usually the best option, because the consumer can forget the problematic message and keep working, while the message is not lost and can be analyzed and/or recovered later on.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Dead letter queue vs other options
&lt;/h2&gt;

&lt;p&gt;The general concept of &lt;em&gt;dead lettering&lt;/em&gt; a message can be carried out in multiple ways.&lt;br&gt;&lt;br&gt;
Just to list a few: &lt;em&gt;poison messages&lt;/em&gt; could simply be written to log files, inserted into a database, saved to disk as text files, or stored in an object storage service like AWS S3 or Azure Blob Storage.&lt;br&gt;&lt;br&gt;
In some specific cases, one of these strategies might even be the best option, but using a dead letter queue has many advantages.&lt;/p&gt;

&lt;h3&gt;
  
  
  Faster retry and purging
&lt;/h3&gt;

&lt;p&gt;A &lt;em&gt;dead letter queue&lt;/em&gt; is, after all, a normal queue, with all the advantages offered by this kind of entity. For example, it's usually very easy to "&lt;em&gt;forward&lt;/em&gt;" the messages from the DLQ to the original queue. If, for example, the consumer had a bug that has since been fixed, we can simply re-enqueue the dead-lettered messages, and the new version of the consumer will process the "old" dead-lettered messages together with the ordinary flow of real-time messages.&lt;br&gt;&lt;br&gt;
Conversely, if an investigation shows that the messages are wrong and can be deleted, we can purge the entire queue with one click.&lt;/p&gt;
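&lt;p&gt;The forwarding itself is usually a built-in feature of the broker, but the underlying logic can be sketched in a broker-agnostic way. Here &lt;code&gt;receive&lt;/code&gt;, &lt;code&gt;send&lt;/code&gt;, and &lt;code&gt;delete&lt;/code&gt; are hypothetical stand-ins for the broker's operations:&lt;/p&gt;

```python
def redrive(receive, send, delete):
    """Move every message from a DLQ back to its original queue.

    `receive` returns the next DLQ message, or None when the DLQ is empty;
    `send` re-enqueues it on the main queue; `delete` removes it from the
    DLQ only after the send succeeded, so no message can be lost.
    """
    moved = 0
    while True:
        message = receive()
        if message is None:
            return moved
        send(message)    # re-enqueue on the original queue first...
        delete(message)  # ...then remove from the DLQ
        moved += 1
```

&lt;p&gt;The send-then-delete order is the important design choice: a crash between the two steps can at worst duplicate a message, never lose one.&lt;/p&gt;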

&lt;h3&gt;
  
  
  Observability for free
&lt;/h3&gt;

&lt;p&gt;If you are using a message broker, you likely already have a monitoring platform.&lt;br&gt;&lt;br&gt;
In that case, applying observability principles to the DLQ as well is almost immediate, with very little additional effort. There is no need to generate custom metrics in your code or analyze log files to understand and count errors, because the broker, if configured properly, can do it for you.&lt;br&gt;&lt;br&gt;
The rate of messages entering the DLQ, compared with the rate of the main queue, is a good metric for the health of the process run by the consumers. The absolute number of messages in the dead letter queue tells us how many failures occurred since the last purge. If the rate has a predictable trend, or the DLQ is usually empty, these metrics are also good candidates for automatic alerts.  &lt;/p&gt;
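&lt;p&gt;As a toy illustration of such an alert condition (the 1% threshold is an arbitrary example, not a recommendation):&lt;/p&gt;

```python
def dlq_alarm(main_rate, dlq_rate, max_failure_ratio=0.01):
    """Return True when the DLQ ingestion rate is suspiciously high.

    Rates are messages per minute; the threshold is a made-up example and
    should be tuned on the normal trend of your own queues.
    """
    if main_rate == 0:
        return dlq_rate > 0  # failures with no traffic are always suspicious
    return dlq_rate / main_rate > max_failure_ratio
```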

&lt;h3&gt;
  
  
  No custom code
&lt;/h3&gt;

&lt;p&gt;Most brokers, when sending a message to the dead letter queue, add a header field with the reason why this happened. It may be a broker-generated error like "too many retries" or "message expired", or a consumer-generated, broker-forwarded error like "explicitly nacked with this error message: unknown id".&lt;br&gt;&lt;br&gt;
The reason for dead-lettering travels inside the message itself, and this comes for free, without writing code, because it is implemented &lt;em&gt;in the broker&lt;/em&gt;, not in the consumer software.&lt;/p&gt;

&lt;h3&gt;
  
  
  Not so good for analysis
&lt;/h3&gt;

&lt;p&gt;The only small disadvantage I have experienced with DLQs, compared to other solutions such as a database, is that messages in a queue cannot easily be queried or counted; they can only be peeked one by one. With high volumes and considerable quantities of messages ending up in the dead letter queue, it can be a good idea to attach a consumer and move them into a data store where they can be analyzed statistically.  &lt;/p&gt;

&lt;h2&gt;
  
  
  TLDR
&lt;/h2&gt;

&lt;p&gt;Dead letter queues are definitely the best place to collect error messages and poison messages. Monitoring is richer and easier, re-publishing and purging are super fast, and the broker itself owns the responsibility for this process.  &lt;/p&gt;

</description>
    </item>
    <item>
      <title>A practical overview on Architecture Decision Records (ADR)</title>
      <dc:creator>Claudio Taverna</dc:creator>
      <pubDate>Fri, 26 May 2023 01:33:44 +0000</pubDate>
      <link>https://dev.to/ctaverna/a-practical-overview-on-architecture-decision-records-adr-d7d</link>
      <guid>https://dev.to/ctaverna/a-practical-overview-on-architecture-decision-records-adr-d7d</guid>
      <description>&lt;h2&gt;
  
  
  How to start and why this could be your most valuable action as a software architect
&lt;/h2&gt;

&lt;p&gt;I have been using ADRs for some years.&lt;br&gt;&lt;br&gt;
In this article I'll give an introduction, share my feedback, and encourage you to start writing them if you don't already.&lt;/p&gt;
&lt;h2&gt;
  
  
  A little about the architecture decisional process
&lt;/h2&gt;

&lt;p&gt;Taking architecture decisions is tough.&lt;br&gt;&lt;br&gt;
It can be the result of a bunch of technical and non-technical evaluations, made of tricky balances between technology, cost, time, and stakeholders' requests.&lt;br&gt;&lt;br&gt;
There is rarely one single, optimal solution. Instead, there are many reasonable solutions, no true/false, only better/worse options.&lt;br&gt;&lt;br&gt;
The final decision may seem like a small thing, but a lot of work may have been required to come to that decision.&lt;br&gt;&lt;br&gt;
For this reason, architectural decisions are usually considered &lt;a href="https://en.wikipedia.org/wiki/Wicked_problem"&gt;"wicked problems"&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;Nevertheless, most teams come to a decision verbally during meetings, and those decisions are sooner or later lost or forgotten. And all that remains is just the state of things.&lt;/p&gt;

&lt;p&gt;This is a shame, because the real value of a decision lies in what the "&lt;em&gt;Second Law of Software Architecture&lt;/em&gt;" states:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why is more important than how&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you are curious about the first law, I’d wholeheartedly recommend that you get and read the great book &lt;a href="http://fundamentalsofsoftwarearchitecture.com"&gt;Fundamentals of Software Architecture&lt;/a&gt; written by Mark Richards &amp;amp; Neal Ford.&lt;/p&gt;

&lt;p&gt;Before going deep into considerations and tips, it's useful to clarify the discussion by giving a quick definition:  &lt;/p&gt;

&lt;p&gt;&lt;em&gt;An ADR is simply a concise document that explains &lt;strong&gt;why&lt;/strong&gt; something is going to be made &lt;strong&gt;this way&lt;/strong&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The structure of an ADR
&lt;/h2&gt;

&lt;p&gt;This is a basic example of an ADR template, in markdown format, as suggested by &lt;a href="http://thinkrelevance.com/blog/2011/11/15/documenting-architecture-decisions"&gt;Michael Nygard&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Title&lt;/span&gt;

&lt;span class="gu"&gt;## Date&lt;/span&gt;

&lt;span class="gu"&gt;## Status&lt;/span&gt;
What is the status, such as proposed, accepted, rejected, deprecated, superseded, etc.?

&lt;span class="gu"&gt;## Context&lt;/span&gt;
What is the issue that we're seeing that is motivating this decision or change?

&lt;span class="gu"&gt;## Decision&lt;/span&gt;
What is the change that we're proposing and/or doing?

&lt;span class="gu"&gt;## Consequences&lt;/span&gt;
What becomes easier or more difficult to do because of this change?

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are a lot of other templates, with different levels of complexity, but in my opinion it is better to keep it short and simple, and Nygard's template is perfect for that.&lt;/p&gt;

&lt;p&gt;If you want to see some real examples, you will find at the end of the article some links to pages with many detailed resources and examples.&lt;/p&gt;

&lt;p&gt;Every decision should have a unique code, usually in the filename and in the title, resulting in something like:&lt;br&gt;&lt;br&gt;
&lt;strong&gt;ADR0001 - Use an SQS queue to manage outgoing emails&lt;/strong&gt;  &lt;/p&gt;

&lt;p&gt;Every ADR should explain &lt;strong&gt;one single decision&lt;/strong&gt;, not many of them.&lt;br&gt;&lt;br&gt;
It's common to write multiple ADRs after a long analysis, but each one should have its own code and its own lifecycle, because some of them will change status in the future, while others will stay in the &lt;em&gt;accepted&lt;/em&gt; state.&lt;/p&gt;

&lt;h2&gt;
  
  
  WHY are ADRs important?
&lt;/h2&gt;

&lt;p&gt;Now that we know what an ADR is, it's important to understand in depth the advantages of adopting this practice.  &lt;/p&gt;

&lt;h3&gt;
  
  
  Tracking the knowledge
&lt;/h3&gt;

&lt;p&gt;Architecture decisions are pieces of &lt;em&gt;knowledge&lt;/em&gt;, among the most important, in my opinion.&lt;br&gt;&lt;br&gt;
People move between teams and companies, and it's important not to lose that knowledge when someone leaves the team.&lt;br&gt;&lt;br&gt;
On the other hand, when a new member joins the team, they will face exactly the same lack of knowledge, and will surely ask themselves: &lt;em&gt;"Why the hell was the system designed in this way?"&lt;/em&gt;.&lt;br&gt;&lt;br&gt;
We developers are opinionated guys, and an &lt;em&gt;architecture decisions log&lt;/em&gt; can be a great way to share knowledge and prevent frustrations.&lt;br&gt;&lt;br&gt;
When given to new team members it can be a really powerful document, whose reading can save days, during the onboarding process.&lt;/p&gt;

&lt;h3&gt;
  
  
  Empowering the team culture
&lt;/h3&gt;

&lt;p&gt;Writing down architectural decisions is the first fundamental step to making them first-class entities of the team's work.&lt;br&gt;&lt;br&gt;
It's the most structured way to officially align the whole team on the work behind important decisions.&lt;br&gt;&lt;br&gt;
If the team is big, it can reduce the distance between &lt;em&gt;design&lt;/em&gt; and &lt;em&gt;development&lt;/em&gt;. This way, everybody can understand decisions that otherwise, from the bottom, without an overall view, could seem incomprehensible.&lt;br&gt;
Tracking decisions is also something that contributes positively to the team culture, because between the lines it says &lt;em&gt;"for this team, the design counts"&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  A matter of responsibility
&lt;/h3&gt;

&lt;p&gt;As a software architect, or as a technical leader, or even as a team with shared leadership, one of your most important responsibilities is making these decisions. Not writing them down could dangerously suggest that things happen without a reason, driven by pure instinct or, worse, randomly.&lt;/p&gt;

&lt;p&gt;Your decisions could have an impact for many years, even after you leave the project. It's a matter of professionalism, seriousness, and respect for those who will end up with the system in the future.&lt;br&gt;&lt;br&gt;
Also, being able in the future to go back to old decisions and understand why they were made has many advantages, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You will be able to recall details that you will undoubtedly have forgotten.&lt;/li&gt;
&lt;li&gt;If the decision, seen from the new context, still looks like a good one, that's good news for your self-esteem: you were forward-looking, or at least you're a lucky person.&lt;/li&gt;
&lt;li&gt;If it looks like a bad decision, well, you can dig into it, run a valuable &lt;strong&gt;retrospective&lt;/strong&gt;, and understand what happened:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Option 1:&lt;/strong&gt; The context was completely different, and now you can say that &lt;em&gt;in that context&lt;/em&gt; it was the best option, or at least a good one. You did your job well. You are a software engineer, not a fortune-teller.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Option 2:&lt;/strong&gt; The context is &lt;em&gt;pretty much&lt;/em&gt; the same, but you are now able to consider other variables, that you missed in the past, and that change your point of view. You are growing, you have improved your judgment, and your next decision will be more accurate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Option 3:&lt;/strong&gt; The context is &lt;em&gt;exactly&lt;/em&gt; the same, and at this point, it's official what was already clear to everybody: you made a mistake. Anyway, it was a difficult decision, and at least you have a document that shows that you did your best, you’re not just a fool. Learning from your mistake, you're already a better professional now.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Other indirect advantages
&lt;/h3&gt;

&lt;p&gt;An ADR usually also documents, in more or less detail, the findings from options that were investigated but not implemented. When describing the option you identify as the best, it's almost inevitable to also explain why the other options are not as good.&lt;br&gt;&lt;br&gt;
This information can have huge value in the future, but it is frequently lost, because there is no evident, direct reason to track a solution that seems inappropriate.  &lt;/p&gt;

&lt;p&gt;Another last-but-not-least advantage is that writing down an important decision is a way to force ourselves to be objective and wise.&lt;br&gt;&lt;br&gt;
No one would ever write something like "We do this because we want to play with that sexy technology", but this is something that happens often. If you are a developer you know that it's true.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is just-plain-text mandatory?
&lt;/h2&gt;

&lt;p&gt;In this field there are no absolute laws.&lt;br&gt;&lt;br&gt;
If a decision involves a complex relationship that could easily be described with a diagram, I think that it's okay to add it.&lt;br&gt;&lt;br&gt;
Anyway, this should only be considered an option when strictly necessary to avoid confusion, because in most cases it's not needed.&lt;br&gt;&lt;br&gt;
On the contrary, if you feel the need for a diagram, it's likely that something is wrong: a wake-up call that you are mixing multiple decisions into one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The lifecycle of an ADR
&lt;/h2&gt;

&lt;p&gt;Every decision should start from a draft version, usually in &lt;em&gt;proposed&lt;/em&gt; status, then move to &lt;em&gt;accepted&lt;/em&gt; or &lt;em&gt;rejected&lt;/em&gt;, and so on.&lt;/p&gt;

&lt;p&gt;The document should be managed with an append-only approach. Of course, you can edit any part of the decision to fix a &lt;em&gt;typo&lt;/em&gt;, but the only part of the document that should evolve is the STATUS paragraph.&lt;br&gt;&lt;br&gt;
If the decision is no longer relevant, it should simply change its status to &lt;em&gt;deprecated&lt;/em&gt;. If the context changes, a new ADR should be created and the old one should move to the &lt;em&gt;superseded&lt;/em&gt; status, and so on.&lt;/p&gt;

&lt;p&gt;In my experience, it's useful to add cross-links between ADRs. I'm used to creating links both when the relation is direct (for example "Superseded by ADR0027") and when there is simply some kind of relation between two decisions (for example the second one is a consequence of the first one).&lt;br&gt;&lt;br&gt;
If you've ever used Jira or a similar issue-tracking tool you probably get my point.&lt;/p&gt;

&lt;h2&gt;
  
  
  WHERE should ADRs be stored?
&lt;/h2&gt;

&lt;p&gt;I think that there is no right answer, it depends on the team culture.&lt;br&gt;&lt;br&gt;
The most common solutions are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a &lt;em&gt;wiki&lt;/em&gt; platform&lt;/li&gt;
&lt;li&gt;a &lt;em&gt;git&lt;/em&gt; repository&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I think that there are pros and cons for each solution. By the way, this decision could be a good candidate as a subject for the first ADR... ;-)  &lt;/p&gt;

&lt;h3&gt;
  
  
  Store into a wiki
&lt;/h3&gt;

&lt;p&gt;If you think that ADRs should be easily accessible to everyone in your team, including non-technical people, this is for sure the best option.&lt;br&gt;
In addition to that, having them in the wiki is good because it's easier to link the ADR from other project documentation pages, without leaving the platform.&lt;br&gt;&lt;br&gt;
In my opinion, this advantage is also a drawback, because it puts architectural decisions in the realm of "documentation", which is not exactly every developer's passion.&lt;/p&gt;

&lt;h3&gt;
  
  
  Store in a &lt;em&gt;git&lt;/em&gt; repository
&lt;/h3&gt;

&lt;p&gt;If you want the developers to perceive ADRs as something important, using &lt;em&gt;git&lt;/em&gt; is for sure a good starting point. Developers love code and love git, and in their mind, the subliminal message is "if it goes into a repo, it's important".&lt;br&gt;
Git is a great tool also to enable collaborative management of the release process of a new ADR, leveraging &lt;em&gt;pull requests&lt;/em&gt; for decisions in &lt;em&gt;proposed&lt;/em&gt; state.&lt;br&gt;
There are also some good command line tools that could work well to automate the process. I still haven't tried them, because for the time being I have not felt the need, but it could be an interesting option in the future.&lt;/p&gt;

&lt;p&gt;There is also an interesting tool called &lt;a href="https://github.com/thomvaill/log4brains"&gt;log4brains&lt;/a&gt;, which can turn your markdown ADRs into a static website. It also integrates with the &lt;a href="https://adr.github.io/madr/"&gt;MADR template&lt;/a&gt;.&lt;br&gt;&lt;br&gt;
I haven't had time to try it yet, but it could be something to look into in the future.&lt;/p&gt;

&lt;h2&gt;
  
  
  WHO should be in charge of tracking ADRs?
&lt;/h2&gt;

&lt;p&gt;It doesn't matter.&lt;br&gt;&lt;br&gt;
It simply depends on the team and its culture. It could be a software architect, an entire team, a single developer, an ivory-tower architect, a technical leader, a senior developer, or anyone else.&lt;br&gt;&lt;br&gt;
To be clear, who writes the ADRs matters in practice, but from this perspective it is irrelevant.&lt;br&gt;
What really matters is WHEN.&lt;/p&gt;

&lt;h2&gt;
  
  
  WHEN should an ADR be created?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;When the decision will affect the work of the developers&lt;/li&gt;
&lt;li&gt;When the decision will be hard to reverse&lt;/li&gt;
&lt;li&gt;When the decision is not obvious to the entire team
(but bear in mind, the team will change sooner or later)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Some final tips
&lt;/h2&gt;

&lt;p&gt;While writing the ADR, be very &lt;strong&gt;detailed&lt;/strong&gt; about the &lt;strong&gt;context&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
This is the part most likely to change over time, so it's important not to be vague and not to assume anything.&lt;br&gt;
Describe the context with just &lt;strong&gt;facts&lt;/strong&gt;, not opinions.&lt;/p&gt;

&lt;p&gt;Take your time &lt;strong&gt;thinking&lt;/strong&gt; about the &lt;strong&gt;consequences&lt;/strong&gt;, because that is when the deepest reflections come to light.&lt;br&gt;
You might even completely reconsider a decision that just a few minutes earlier seemed so good.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;This is the foundational &lt;a href="http://thinkrelevance.com/blog/2011/11/15/documenting-architecture-decisions"&gt;article by Michael Nygard&lt;/a&gt;, published in 2011, that first introduced the idea of ADRs.&lt;/li&gt;
&lt;li&gt;The GitHub repo &lt;a href="https://github.com/joelparkerhenderson/architecture-decision-record"&gt;joelparkerhenderson/architecture-decision-record&lt;/a&gt; is a good source to get an overview of tools, articles and examples.&lt;/li&gt;
&lt;li&gt;The &lt;a href="https://adr.github.io/"&gt;ADR GitHub organization&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The &lt;a href="https://adr.github.io/madr/"&gt;MADR template&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  TLDR
&lt;/h2&gt;

&lt;p&gt;If you are responsible for architecture decisions and don't already document them, start. Now.&lt;br&gt;&lt;br&gt;
It's fast, easy, not time-consuming, and there are tons of good reasons to do it.&lt;/p&gt;

</description>
      <category>softwareengineering</category>
      <category>webdev</category>
      <category>architecture</category>
      <category>documentation</category>
    </item>
    <item>
      <title>How to pick the right managed service</title>
      <dc:creator>Claudio Taverna</dc:creator>
      <pubDate>Mon, 22 May 2023 18:19:09 +0000</pubDate>
      <link>https://dev.to/ctaverna/how-to-pick-the-right-managed-service-2928</link>
      <guid>https://dev.to/ctaverna/how-to-pick-the-right-managed-service-2928</guid>
      <description>&lt;h2&gt;
  
  
  A wide range of options is great but an informed choice can be quite time-consuming
&lt;/h2&gt;

&lt;p&gt;During feature development, it may become advisable to introduce a new managed service, for example a new database, a new storage type, or a compute or integration service.&lt;/p&gt;

&lt;p&gt;In some cases the choice is quite easy, because in the context of the project there are no equally valid alternatives, or because that service is universally recognized as optimal, or at least preferable.&lt;br&gt;
To be clear, there are &lt;em&gt;always&lt;/em&gt; alternatives to consider, but in some cases it doesn't make sense to spend a lot of time evaluating all the options because one of them is very clearly recommended.&lt;/p&gt;

&lt;h2&gt;
  
  
  A &lt;em&gt;fairly&lt;/em&gt; easy example
&lt;/h2&gt;

&lt;p&gt;Let's take the example of a use case where a software component needs a cache to reduce the number of requests made to the database and increase throughput. It is an application hosted on a cloud provider, and there are no tight budget constraints. Data volumes are not extremely heavy, and we need a simple in-memory key-value store. The best strategy is almost obvious: use &lt;em&gt;Redis&lt;/em&gt; as a managed service. Minimal infrastructure impact for the DevOps team, adoption of an established, standard tool known and loved by developers, and so on.&lt;br&gt;
We could talk for hours about all the possible alternatives, configuration nuances and different options, but all in all, I think 90% of people would agree and suggest the same strategy.&lt;/p&gt;
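&lt;p&gt;For illustration, the cache-aside pattern behind that strategy can be sketched in a few lines of Python. Here a plain dict with expiry stands in for Redis so the sketch is self-contained; with a managed Redis you would swap it for a real client (for example redis-py, using GET and SETEX):&lt;/p&gt;

```python
import time

class CacheAside:
    """Minimal cache-aside sketch: a dict with expiry stands in for Redis."""

    def __init__(self, loader, ttl_seconds=60):
        self.loader = loader          # e.g. a function that queries the database
        self.ttl = ttl_seconds
        self._store = {}              # key -> (expires_at, value)
        self.db_hits = 0

    def get(self, key):
        entry = self._store.get(key)
        if entry and entry[0] > time.time():
            return entry[1]           # cache hit: no database round trip
        value = self.loader(key)      # cache miss: go to the database...
        self.db_hits += 1
        self._store[key] = (time.time() + self.ttl, value)  # ...and populate
        return value
```

&lt;p&gt;The second read of the same key within the TTL never touches the database, which is exactly the throughput gain we are after.&lt;/p&gt;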

&lt;h2&gt;
  
  
  A &lt;em&gt;not-so-easy&lt;/em&gt; example
&lt;/h2&gt;

&lt;p&gt;The choice is not always so easy; sometimes there are &lt;em&gt;many&lt;/em&gt; managed services that would fit the use case perfectly, or that, in slightly different ways, could let us achieve the desired result.&lt;br&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--s5Ql-FOo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9bnhobvuuhu1j8rjted5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--s5Ql-FOo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9bnhobvuuhu1j8rjted5.jpg" alt="A woman with a headache" width="300" height="222"&gt;&lt;/a&gt;&lt;br&gt;
In these cases there are many variables to consider, and it's not easy to put them all on the table and make the right decision.&lt;/p&gt;

&lt;p&gt;An example I recently faced: which storage should I use for metrics coming from IoT devices? This is a project on AWS, and there are many possible options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amazon Timestream? With what lifecycle policy?
&lt;/li&gt;
&lt;li&gt;Amazon DynamoDB? With what primary key configuration?
&lt;/li&gt;
&lt;li&gt;Could OpenSearch or Elasticsearch work well too?
&lt;/li&gt;
&lt;li&gt;Or Amazon Managed Service for Prometheus?
&lt;/li&gt;
&lt;li&gt;And what about a good old partitioned table on Amazon RDS for PostgreSQL?
&lt;/li&gt;
&lt;li&gt;In addition to the write load, do we have to manage intensive reading?
&lt;/li&gt;
&lt;li&gt;What reading patterns should we use?
&lt;/li&gt;
&lt;li&gt;What will be the data life cycle?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The issue is not simple, and the effect of this choice can have a huge impact on costs, on the time needed to achieve the result, and on its final value.&lt;/p&gt;
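&lt;p&gt;To make the "primary key configuration" question concrete, here is one &lt;em&gt;hypothetical&lt;/em&gt; DynamoDB key design for this kind of time-series data, a sketch rather than a recommendation: bucket the partition key by device and day to bound each hot partition, and use an ISO-8601 sort key so time-range reads map to a single Query:&lt;/p&gt;

```python
from datetime import datetime, timezone

def metric_keys(device_id: str, ts: datetime) -> dict:
    """Build a hypothetical composite key for an IoT metric item.

    The partition key bundles device and day so one device's writes
    are spread over many partitions; the ISO-8601 sort key keeps
    items in time order for range queries (begins_with / BETWEEN).
    """
    ts = ts.astimezone(timezone.utc)
    return {
        "pk": f"DEVICE#{device_id}#DAY#{ts:%Y-%m-%d}",
        "sk": f"TS#{ts:%Y-%m-%dT%H:%M:%S}",
    }
```

&lt;p&gt;Whether this beats Timestream or a partitioned PostgreSQL table depends exactly on the read patterns and data life cycle questions listed above.&lt;/p&gt;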

&lt;h2&gt;
  
  
  A rational analytic approach
&lt;/h2&gt;

&lt;p&gt;The most natural approach, since after all we are software engineers, is to ask ourselves some questions, pull out something measurable, and weigh the pros and cons.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dDrwYZEE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4x5d1camunxuswrd4wyc.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dDrwYZEE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4x5d1camunxuswrd4wyc.jpg" alt="An abacus" width="285" height="300"&gt;&lt;/a&gt; &lt;br&gt;
If it's a new service, for example, it would be right to ask ourselves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is the cost model? Is it sustainable?&lt;/li&gt;
&lt;li&gt;Having no historical cost data, are we sure there are no hidden costs?&lt;/li&gt;
&lt;li&gt;How much time can we invest in a POC to validate the theory and test it?&lt;/li&gt;
&lt;li&gt;What is the delta value that the service would provide in the short and long term?&lt;/li&gt;
&lt;li&gt;Have we considered all the extra costs of an extra service? For example those related to additional effort for DevOps processes, monitoring, and maintenance?&lt;/li&gt;
&lt;li&gt;Are there any indirect benefits to its adoption? Would it bring value to the project also for other future features?&lt;/li&gt;
&lt;li&gt;Could this new technology become a problem in the future because of the lack of skills in the team or in the company?&lt;/li&gt;
&lt;/ul&gt;
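&lt;p&gt;One lightweight way to "pull out something measurable" from questions like these is a weighted scoring matrix. The criteria, weights and scores below are made up for illustration; the point is the mechanics, not the numbers:&lt;/p&gt;

```python
def score_options(options, weights):
    """Rank candidate services by a weighted sum of criterion scores.

    `options` maps a service name to {criterion: score 1-5};
    `weights` maps each criterion to its relative importance.
    """
    totals = {
        name: sum(weights[c] * s for c, s in scores.items())
        for name, scores in options.items()
    }
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical criteria and scores for the IoT-metrics storage choice
weights = {"cost": 3, "team_skills": 2, "read_patterns": 2, "maintenance": 1}
options = {
    "Timestream": {"cost": 4, "team_skills": 2, "read_patterns": 5, "maintenance": 4},
    "DynamoDB":   {"cost": 3, "team_skills": 4, "read_patterns": 3, "maintenance": 4},
    "RDS table":  {"cost": 3, "team_skills": 5, "read_patterns": 3, "maintenance": 2},
}
ranking = score_options(options, weights)
```

&lt;p&gt;The scores are subjective, of course, but writing them down forces the trade-offs into the open and makes the comparison repeatable when an option changes.&lt;/p&gt;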

&lt;p&gt;If we are evaluating an already used service, the questions will be slightly different, but still quite complicated, for example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is the cost history we already have still fully valid in this new scenario?&lt;/li&gt;
&lt;li&gt;Are we making compromises on alternatives?&lt;/li&gt;
&lt;li&gt;Are we sure that we’ve considered all of the alternatives?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's easy to opt for a &lt;strong&gt;good&lt;/strong&gt; risk-free solution instead of a &lt;strong&gt;great&lt;/strong&gt; one with some risk margin, but it may not always be a winning strategy, especially in the long run.&lt;/p&gt;

&lt;p&gt;There are also many other general questions, not so exciting for a typical developer, that are often forgotten:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is the service mature and production-ready?&lt;/li&gt;
&lt;li&gt;Is it already listed as &lt;em&gt;generally available&lt;/em&gt;?&lt;/li&gt;
&lt;li&gt;Is the support service suitable for our production environment?&lt;/li&gt;
&lt;li&gt;Is the service geographically available in the region where the project will be deployed?&lt;/li&gt;
&lt;li&gt;Is it or will it be available also in the regions where an expansion is planned?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not meant to be an exhaustive list; I'm definitely forgetting several points, and of course the specifics of a single project would produce more detailed questions.&lt;/p&gt;

&lt;p&gt;Worse still, to follow such an analytical approach we should answer all these questions &lt;em&gt;for each&lt;/em&gt; of the possible options.&lt;br&gt;
Also, when the choice involves more than one component, the number of possible variables grows exponentially, and with it the number of assessments the decision maker has to carry out.&lt;br&gt;
In these cases, this type of approach, which is always preferable in principle, can become very challenging, or even unsustainable.&lt;/p&gt;

&lt;h2&gt;
  
  
  An intuitive approach - the expert's opinion
&lt;/h2&gt;

&lt;p&gt;In order not to get bogged down in a never-ending analysis, at this point the only way out is to use experience, built mostly on past mistakes and successes, to make a choice dictated &lt;em&gt;also&lt;/em&gt; by a small instinctive component.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Z185b729--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zkn56c6ycr5jqdr0zndo.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Z185b729--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zkn56c6ycr5jqdr0zndo.jpg" alt="Plato" width="300" height="169"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Opinion is the medium between knowledge and ignorance.&lt;br&gt;&lt;br&gt;
          &lt;strong&gt;--Plato&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The aim is certainly not to make irrational, non-data-driven choices. But from my point of view, it is sometimes crucial to simplify the problem by removing some options from the table without major time investments, for example by looking for just a single reason to discard them. Sometimes these options are possible, reasonable, and might even bring some remarkable strengths, but we can still assume with a certain degree of confidence that they will not be the ones with the best cost/benefit balance.&lt;br&gt;
A few examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The service costs &lt;em&gt;much more&lt;/em&gt; than other options, and for the project cost is a critical point (has there ever been a project where it wasn't?)&lt;/li&gt;
&lt;li&gt;The service demands very specific knowledge that is not available in the team, and there is no time to acquire that competence&lt;/li&gt;
&lt;li&gt;High performance is required, and other solutions are definitely better&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reducing the number of options by relying on experience is the best way to bring the remaining candidates down to an acceptable number, so that an analytical approach becomes sustainable.&lt;br&gt;
When it comes to software architecture, it is always good to be flexible and find a balance, sometimes accepting a compromise.&lt;/p&gt;

&lt;p&gt;Last but not least, once the decision is made, it is &lt;em&gt;fundamental&lt;/em&gt; to document it as an ADR (architectural decision record), for a long list of good reasons that I may cover in a dedicated article in the future.&lt;/p&gt;

&lt;h2&gt;
  
  
  TLDR
&lt;/h2&gt;

&lt;p&gt;Making a conscious choice from the rich offering of cloud providers is tough, and requires a multi-disciplinary approach. But there are some questions we can ask ourselves to simplify the process.&lt;br&gt;
It's good to be analytical, but to be fast enough for the business it's equally important to be pragmatic and leverage your experience to reduce the number of evaluations.&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>architecture</category>
      <category>aws</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Software architecture diagram arrows</title>
      <dc:creator>Claudio Taverna</dc:creator>
      <pubDate>Sat, 13 May 2023 21:40:53 +0000</pubDate>
      <link>https://dev.to/ctaverna/software-architecture-diagram-arrows-30lb</link>
      <guid>https://dev.to/ctaverna/software-architecture-diagram-arrows-30lb</guid>
      <description>&lt;p&gt;&lt;strong&gt;How to manage a relationship that sometimes "it's complicated"&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  It's just an arrow!
&lt;/h2&gt;

&lt;p&gt;Arrows are used in almost every type of diagram, and software architecture diagrams are no exception.&lt;/p&gt;

&lt;p&gt;Although it may seem obvious, almost elementary, the meaning of arrows is by no means a universal concept.&lt;/p&gt;

&lt;p&gt;Let’s take a minimal, clear and simple diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LYsGEsd8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://ctaverna.github.io/assets/img/diagram-arrows/ab.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LYsGEsd8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://ctaverna.github.io/assets/img/diagram-arrows/ab.jpg" alt="Component A has *some relation* with component B" width="800" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Component A has &lt;em&gt;some relation&lt;/em&gt; with component B&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Ok, that's not just an arrow
&lt;/h2&gt;

&lt;p&gt;The clarity is only apparent: observing these two components connected by an arrow, we can imagine many different interpretations, each of which makes sense, but with very different meanings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;calls&lt;/strong&gt; B&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;depends&lt;/strong&gt; on B&lt;/li&gt;
&lt;li&gt;A is a &lt;strong&gt;client&lt;/strong&gt; of B&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;calls an API&lt;/strong&gt; exposed by B&lt;/li&gt;
&lt;li&gt;There is &lt;strong&gt;data stream&lt;/strong&gt; from A to B&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;sends a message&lt;/strong&gt; to B&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I can stop here because the point is clear. Some of these interpretations partly overlap, but setting aside the nuances of meaning, we can reduce them to two fundamental, yet diametrically opposed, relationships:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A &lt;strong&gt;sends data to&lt;/strong&gt; B&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;makes a request to&lt;/strong&gt; B&lt;br&gt;
&lt;em&gt;(therefore assuming that B will send some data back to A)&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If the &lt;em&gt;signifier&lt;/em&gt; assumes an opposite &lt;em&gt;signified&lt;/em&gt; for different recipients, it's clear that we have a problem. The primary purpose of drawing a diagram is usually to clarify the relationships between different components, so it is not acceptable to communicate in such an ambiguous way.&lt;/p&gt;

&lt;p&gt;There are also other aspects, definitely no less important, that an arrow by itself cannot express: is the communication synchronous or asynchronous? Is it a single call, or are there dozens of them? If the recipient of the arrow is a database, is it a query or an update? If it’s a message, is it an event, a publish/subscribe stream, or a double channel for request/response? Is it an HTTP or gRPC call? Or maybe a TCP socket connection?&lt;/p&gt;

&lt;p&gt;A single diagram cannot provide all the answers, and this is the reason why I think that when you draw a good diagram you should try to provide only &lt;strong&gt;some&lt;/strong&gt; information, as clearly as possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  So what?
&lt;/h2&gt;

&lt;p&gt;To add details there are only a couple of possibilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Add a &lt;em&gt;label&lt;/em&gt; to each arrow&lt;/strong&gt;
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VFXIHPzZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7nfds84u5n8fwy0nwxsj.jpg" alt="Labels on each arrow" width="300" height="139"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This has the advantage of making the meaning quite clear, but it's not always the best option because the diagram will inevitably lose readability.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Introduce a &lt;em&gt;legend&lt;/em&gt; to the diagram&lt;/strong&gt;
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nUyYv2Sn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2ph0j2wh7x31xfjl9b4m.jpg" alt="Introduce a legend" width="300" height="191"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this way, each graphical variation of the arrow has a different meaning. This is a great option to keep the diagram clean, but it can be tiring for the viewer to learn the linguistic convention.&lt;/p&gt;

&lt;p&gt;In both cases, there is a considerable trade-off between readability and explicitness.&lt;/p&gt;

&lt;p&gt;In addition, we can leverage a good &lt;strong&gt;title&lt;/strong&gt; to clearly define the purpose of the diagram and, as a consequence, the most likely meaning of the arrows.&lt;br&gt;
For example, in a diagram titled "&lt;em&gt;Data ingestion flow&lt;/em&gt;", it's fairly easy to infer that the arrows explain the direction of the data through the system, and not whether a single call is a &lt;em&gt;pull&lt;/em&gt; or a &lt;em&gt;push&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In my experience it is also useful to attach a short &lt;strong&gt;description&lt;/strong&gt; to help the reader interpret the diagram, providing a high-level overview that describes its purpose.&lt;/p&gt;

&lt;p&gt;Obviously, writing a description or a legend is a time-consuming activity, so you have to evaluate the cost/benefit ratio.&lt;/p&gt;

&lt;p&gt;Fortunately, in some cases the &lt;strong&gt;context&lt;/strong&gt; can be very useful to clarify the meaning of an arrow, or any other symbol within a diagram, but it's risky to assume that the context is clear to everyone and that any recipient will always be able to understand what we have in mind.&lt;/p&gt;

&lt;p&gt;Of course, if the recipients of the diagram are people you usually work with, there is likely a kind of &lt;strong&gt;&lt;em&gt;shared common language&lt;/em&gt;&lt;/strong&gt;, so in most cases it's not necessary to be super specific. Also, in case of doubt, it's easy to ask for clarification.&lt;br&gt;&lt;br&gt;
But when the diagram is going to be shared with another team, it is definitely a good idea to spend some extra time making it as clear as possible.&lt;br&gt;&lt;br&gt;
This can minimize the likelihood of future misunderstandings, which could cost much, much more.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--osqZydhD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m7pcyts1cof335slcne5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--osqZydhD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m7pcyts1cof335slcne5.jpg" alt="A road sign" width="185" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TLDR
&lt;/h2&gt;

&lt;p&gt;There's no secret recipe.&lt;br&gt;&lt;br&gt;
An arrow by itself just tells you that &lt;strong&gt;probably&lt;/strong&gt; at least one of the connected parts is aware that the other one exists.&lt;br&gt;&lt;br&gt;
To say more you have to add details, and the tough part is deciding how much detail makes sense.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>diagrams</category>
      <category>softwareengineering</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
