<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ayman Patel</title>
    <description>The latest articles on DEV Community by Ayman Patel (@aymanapatel).</description>
    <link>https://dev.to/aymanapatel</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F122490%2F721bb366-fcd8-401c-be64-e59e4bb7242e.jpg</url>
      <title>DEV Community: Ayman Patel</title>
      <link>https://dev.to/aymanapatel</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aymanapatel"/>
    <language>en</language>
    <item>
      <title>HTMX for Python: Introducing FastHTML</title>
      <dc:creator>Ayman Patel</dc:creator>
      <pubDate>Tue, 08 Apr 2025 06:26:42 +0000</pubDate>
      <link>https://dev.to/aymanapatel/htmx-for-python-introducing-fasthtml-37hm</link>
      <guid>https://dev.to/aymanapatel/htmx-for-python-introducing-fasthtml-37hm</guid>
      <description>&lt;p&gt;🔥 A new minimalistic Python framework for creating web apps!&lt;/p&gt;

&lt;p&gt;FastHTML can be started with a 6-line Python script, yet it scales to complex web applications with no full-page reloads, thanks to HTMX.&lt;/p&gt;

&lt;p&gt;🚀 It is small and fast thanks to leveraging:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;HTMX: enhances HTML to build rich client applications. HTMX also brings HATEOAS as a first-class citizen from an end-to-end application perspective.
&lt;/li&gt;
&lt;li&gt;ASGI: the ASGI spec provides concurrency; it is implemented by Uvicorn and Starlette, which FastHTML builds on.&lt;/li&gt;
&lt;/ol&gt;
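
&lt;p&gt;As a sketch, the minimal app from the FastHTML README looks like this (requires &lt;code&gt;pip install python-fasthtml&lt;/code&gt;; the &lt;code&gt;/change&lt;/code&gt; route and the text are illustrative, not part of any real app):&lt;/p&gt;

```python
from fasthtml.common import *  # third-party: python-fasthtml

app, rt = fast_app()           # a Starlette/Uvicorn-backed ASGI app

@rt("/")
def get():
    # Returning Python objects renders HTML; hx_get swaps content in place,
    # so no full-page reload happens
    return Div(P("Hello World!"), hx_get="/change")

serve()                        # runs the app with Uvicorn
```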

&lt;p&gt;🏢 Deployment Targets:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Vercel
&lt;/li&gt;
&lt;li&gt;Railway
&lt;/li&gt;
&lt;li&gt;Hugging Face, and more&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;📖 Readings:&lt;br&gt;&lt;br&gt;
FastHTML &lt;a href="https://fastht.ml/" rel="noopener noreferrer"&gt;https://fastht.ml/&lt;/a&gt;&lt;br&gt;&lt;br&gt;
FastHTML Docs: &lt;a href="https://docs.fastht.ml/" rel="noopener noreferrer"&gt;https://docs.fastht.ml/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>htmx</category>
      <category>python</category>
    </item>
    <item>
      <title>Upgrading Java and Spring without hassle</title>
      <dc:creator>Ayman Patel</dc:creator>
      <pubDate>Tue, 08 Apr 2025 06:22:34 +0000</pubDate>
      <link>https://dev.to/aymanapatel/upgrading-java-and-spring-without-hassle-3530</link>
      <guid>https://dev.to/aymanapatel/upgrading-java-and-spring-without-hassle-3530</guid>
      <description>&lt;p&gt;🤔 Want to upgrade Java and Spring Boot versions in a methodical way?&lt;/p&gt;

&lt;p&gt;There is a platform for that!&lt;/p&gt;

&lt;p&gt;💡 Moderne's OpenRewrite platform helps manage the complexities of upgrading your Java-based projects.&lt;/p&gt;

&lt;p&gt;Some common use cases include:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Upgrading to Java 17
&lt;/li&gt;
&lt;li&gt;Upgrading Spring Boot 2 to Spring Boot 3&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;🙋Why a tool?&lt;br&gt;&lt;br&gt;
Knowing the API changes in every dependency's libraries is difficult. Also, a lot of time can be wasted on a solution that does not work well with a particular dependency.&lt;/p&gt;

&lt;p&gt;🔧How does it work?&lt;/p&gt;

&lt;p&gt;You can import custom recipes as Maven/Gradle tasks and run them.&lt;br&gt;&lt;br&gt;
For example, if you want to migrate from JUnit 4 to JUnit 5, you can activate a recipe for it in Gradle like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;activeRecipe("org.openrewrite.java.testing.junit5.JUnit5BestPractices")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And run the gradle task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gradlew rewriteRun
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
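
&lt;p&gt;Putting the two snippets together, a &lt;code&gt;build.gradle&lt;/code&gt; sketch might look like this (the plugin and recipe-module versions below are placeholders; check the OpenRewrite docs for current ones):&lt;/p&gt;

```groovy
plugins {
    id("org.openrewrite.rewrite") version "6.0.0" // placeholder version
}

rewrite {
    activeRecipe("org.openrewrite.java.testing.junit5.JUnit5BestPractices")
}

dependencies {
    // Recipe modules ship separately; this one contains the JUnit 5 recipes
    rewrite("org.openrewrite.recipe:rewrite-testing-frameworks:2.0.0") // placeholder version
}
```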



&lt;p&gt;Since these recipes are chainable, you can add steps to make code changes incrementally.&lt;br&gt;&lt;br&gt;
Other features include:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Writing your own custom &lt;a href="https://docs.openrewrite.org/authoring-recipes/recipe-development-environment" rel="noopener noreferrer"&gt;recipes&lt;/a&gt; for your use-case
&lt;/li&gt;
&lt;li&gt;A dashboard to track which recipes have been run in your project, so you can gauge how complete the version migration is.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;📖 Resources:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Moderne's site: &lt;a href="https://www.moderne.io/2" rel="noopener noreferrer"&gt;https://www.moderne.io&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;OpenRewrite recipes: &lt;a href="https://docs.openrewrite.org/" rel="noopener noreferrer"&gt;https://docs.openrewrite.org/&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>java</category>
      <category>programming</category>
      <category>springboot</category>
    </item>
    <item>
      <title>ABCs of Databases</title>
      <dc:creator>Ayman Patel</dc:creator>
      <pubDate>Sun, 14 Jan 2024 16:39:17 +0000</pubDate>
      <link>https://dev.to/aymanapatel/abcs-of-databases-439f</link>
      <guid>https://dev.to/aymanapatel/abcs-of-databases-439f</guid>
      <description>&lt;p&gt;When we talk about creating APIs with a database for storage, we tend to think of the database as a storage layer whose job is to perform CRUD operations. But there is an unconscious assumption that the API serves a single user and a single transaction. Even when we accept that multiple simultaneous reads/writes happen, we assume the data will be in the state we want. In reality, there are a lot of nuances in how data gets persisted. To understand them, several topics need to be covered:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The ABCs of databases&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Which &lt;strong&gt;Transaction Isolation Level&lt;/strong&gt; is configured in the database? What is the impact of these settings on consistency vs latency?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Query planning and execution&lt;/strong&gt;: How does the query plan your database forms affect the performance of the system?&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All of these are high-level questions that need to be examined in depth. The first point (i.e. the ABCs of databases) is the starting point for understanding the rest, and it is the focus of this blog.&lt;/p&gt;

&lt;h1&gt;
  
  
  ABCs of Database
&lt;/h1&gt;

&lt;p&gt;Just putting the acronym out there:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Acronym&lt;/th&gt;&lt;th&gt;1st&lt;/th&gt;&lt;th&gt;2nd&lt;/th&gt;&lt;th&gt;3rd&lt;/th&gt;&lt;th&gt;4th&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;ACID&lt;/td&gt;&lt;td&gt;Atomicity&lt;/td&gt;&lt;td&gt;Consistency&lt;/td&gt;&lt;td&gt;Isolation&lt;/td&gt;&lt;td&gt;Durability&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;BASE&lt;/td&gt;&lt;td&gt;Basically ...&lt;/td&gt;&lt;td&gt;... Available&lt;/td&gt;&lt;td&gt;Soft State&lt;/td&gt;&lt;td&gt;Eventual Consistency&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;CAP&lt;/td&gt;&lt;td&gt;Consistency&lt;/td&gt;&lt;td&gt;Availability&lt;/td&gt;&lt;td&gt;Partition-tolerance&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;h1&gt;
  
  
  ACID
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--832xSvK5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1705249498944/2c94a09d-9982-40ef-99e3-c40b78f740c9.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--832xSvK5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1705249498944/2c94a09d-9982-40ef-99e3-c40b78f740c9.jpeg" alt="" width="394" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1.a. Atomicity
&lt;/h2&gt;

&lt;p&gt;It states that once a transaction has started, it should either be completed or rolled back if an error occurs. In DB terms, the transaction should be either &lt;code&gt;COMMITTED&lt;/code&gt; or &lt;code&gt;ABORTED&lt;/code&gt;. This mechanism is achieved by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;REDO/UNDO mechanisms&lt;/strong&gt;, such as REDO/UNDO logs, to bring data back to the correct atomic state&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Shadow Paging&lt;/strong&gt;: follows a &lt;strong&gt;Copy-on-write&lt;/strong&gt; mechanism where the parent process forks and creates a &lt;strong&gt;shadow page&lt;/strong&gt; for the &lt;em&gt;uncommitted transaction&lt;/em&gt;, which is either promoted to the current page on commit or discarded on abort&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;All in or Nothing&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
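
&lt;p&gt;A minimal sketch of this all-or-nothing behaviour, using Python's built-in sqlite3 module (the table and the mid-transaction failure are hypothetical, for illustration only):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transfers (id INTEGER PRIMARY KEY, amount INTEGER)")

try:
    with conn:  # opens a transaction: COMMIT on success, rollback on error
        conn.execute("INSERT INTO transfers (amount) VALUES (50)")
        raise RuntimeError("crash mid-transaction")
except RuntimeError:
    pass

# The partial write was rolled back: the table is empty
count = conn.execute("SELECT COUNT(*) FROM transfers").fetchone()[0]
print(count)  # 0
```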

&lt;h2&gt;
  
  
  1.b. Consistency
&lt;/h2&gt;

&lt;p&gt;This rule is a little confusing, especially when you pair it up with BASE (eventual consistency) and CAP. In spite of the confusion, Consistency from ACID is the clearest and most thought-out definition among the 3 acronyms (ACID, BASE and CAP). It states that if a transaction starts from a consistent state, it must end in a consistent state. Consistency is enforced by applications through integrity constraints.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Transaction starts in Consistent Manner; And Ends in Consistent Manner(All Integrity constraints are followed)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Consistency is enforced by &lt;strong&gt;integrity constraints&lt;/strong&gt;: primary keys, foreign keys, &lt;code&gt;NULL&lt;/code&gt; constraints, &lt;code&gt;CHECK&lt;/code&gt; value constraints such as &lt;code&gt;ACCOUNT &amp;gt;= 100&lt;/code&gt;, etc.&lt;/p&gt;
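
&lt;p&gt;A sketch of such a &lt;code&gt;CHECK&lt;/code&gt; constraint in sqlite3 (table and values are made up for illustration): a write that would violate the constraint is rejected, so the database never leaves a consistent state.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY,"
    " balance INTEGER CHECK (balance >= 100))"
)
conn.execute("INSERT INTO accounts (balance) VALUES (150)")  # satisfies the constraint

try:
    conn.execute("INSERT INTO accounts (balance) VALUES (50)")  # violates balance >= 100
    violated = False
except sqlite3.IntegrityError:
    violated = True  # the database rejected the inconsistent write

rows = conn.execute("SELECT COUNT(*) FROM accounts").fetchone()[0]
print(violated, rows)  # True 1
```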

&lt;h2&gt;
  
  
  1.c. Isolation
&lt;/h2&gt;

&lt;p&gt;It states the transactions should be isolated from each other.&lt;/p&gt;

&lt;p&gt;For example: given a bank account with &lt;em&gt;$100&lt;/em&gt;, when Alice withdraws &lt;em&gt;$25&lt;/em&gt; and Bob simultaneously starts a transaction and withdraws &lt;em&gt;$30&lt;/em&gt;, the account at the end holds 100 - (25 + 30) = &lt;em&gt;$45&lt;/em&gt;, as if the two transactions had run one after the other.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The Math checks out.&lt;/p&gt;
&lt;/blockquote&gt;
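
&lt;p&gt;The Alice-and-Bob example above can be sketched in sqlite3, where &lt;code&gt;BEGIN IMMEDIATE&lt;/code&gt; takes a write lock so that concurrent withdrawals serialize instead of stepping on each other (account and amounts are the example's, the serialization strategy is sqlite-specific):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manage transactions manually
conn.execute("CREATE TABLE account (balance INTEGER)")
conn.execute("INSERT INTO account VALUES (100)")

def withdraw(amount):
    # BEGIN IMMEDIATE acquires the write lock up front, so two withdrawals
    # cannot interleave their read-modify-write steps
    conn.execute("BEGIN IMMEDIATE")
    conn.execute("UPDATE account SET balance = balance - ?", (amount,))
    conn.execute("COMMIT")

withdraw(25)  # Alice
withdraw(30)  # Bob
balance = conn.execute("SELECT balance FROM account").fetchone()[0]
print(balance)  # 45 -- the math checks out
```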

&lt;h2&gt;
  
  
  1.d. Durability
&lt;/h2&gt;

&lt;p&gt;It states that once a transaction has completed (&lt;code&gt;COMMIT&lt;/code&gt; is done), its effects must persist even if there is a system failure. If a transaction had not completed, it must be rolled back to the previous completed state. It should be noted that databases commit first to the buffer pool and then to disk; durability is a guarantee at the disk level.&lt;br&gt;&lt;br&gt;
Similarly to Atomicity, Durability uses &lt;code&gt;REDO/UNDO&lt;/code&gt; logs or the &lt;strong&gt;Shadow Paging&lt;/strong&gt; mechanism to ensure a durable state.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Are you durable, even when you are down and out?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  BASE
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lTh_M6Y0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1705249530139/5be0e449-c8ff-4340-9e6b-13641b1a0f7c.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lTh_M6Y0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1705249530139/5be0e449-c8ff-4340-9e6b-13641b1a0f7c.jpeg" alt="" width="400" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before we go into BASE, we need to know the history of misnomers in databases. BASE and NoSQL are not really what they claim to be; I have added this section just for the sake of the blog. The acronym BASE should be taken with a grain of salt: it is partly a marketing gimmick that obscures the genuinely important concepts, such as CAP (and later PACELC) and the original ACID.&lt;/p&gt;

&lt;p&gt;Even the acronym is a mash of 3 things (Basically Available, Soft state and Eventual consistency) squeezed into a 4-letter word, chosen as the chemistry-world alternative to ACID.&lt;/p&gt;

&lt;h2&gt;
  
  
  2.a. Basic Availability
&lt;/h2&gt;

&lt;p&gt;As NoSQL prioritises scalability and availability over transactional correctness, the system needs to be available at all times, ideally at five nines (99.999%) availability.&lt;/p&gt;

&lt;h2&gt;
  
  
  2.b. Soft State
&lt;/h2&gt;

&lt;p&gt;This is related to eventual consistency. It is essentially a disclaimer that the data available in the database may not be the final state. Due to eventual consistency across the various nodes, the data is not guaranteed to be &lt;em&gt;write-consistent&lt;/em&gt; or &lt;em&gt;mutually consistent&lt;/em&gt; across nodes.&lt;/p&gt;

&lt;h2&gt;
  
  
  2.c. Eventual Consistency
&lt;/h2&gt;

&lt;p&gt;NoSQL databases run on multiple machines for reasons such as sharding and horizontal scaling, with the goal of being as fast as possible. Data is distributed across nodes, and whichever node answers first serves the API request; the data only becomes consistent across the machines eventually. In the early days, consensus was not the norm in NoSQL databases; the usual hack was to fan out requests and pick the first response. Consensus protocols such as Raft and Paxos were later integrated to provide consistency (at the expense of performance).&lt;/p&gt;

&lt;h1&gt;
  
  
  CAP
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2JtsmdGh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1705249570779/bfc0cf9b-f587-48e0-9ee8-2129ff306ccd.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2JtsmdGh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1705249570779/bfc0cf9b-f587-48e0-9ee8-2129ff306ccd.jpeg" alt="" width="386" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  3.a. Consistency
&lt;/h2&gt;

&lt;p&gt;This is not to be confused with Consistency in ACID.&lt;/p&gt;

&lt;p&gt;The C in ACID describes a transaction as a series of steps inside a single node, whereas Consistency in the CAP theorem considers a distributed environment with 2 or more machines/nodes. It more closely resembles &lt;strong&gt;linearizability&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Formal definition of linearizability: if operation B started after operation A successfully completed, then operation B must see the system in the same state as it was on completion of operation A, or a newer state.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zIUXnIXG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1705249583483/077ea9dd-8025-4bb0-be6a-2e57a3d34d95.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zIUXnIXG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1705249583483/077ea9dd-8025-4bb0-be6a-2e57a3d34d95.png" alt="" width="800" height="460"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the diagram, there are 5 steps which ensure data is distributed consistently.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Server A&lt;/em&gt; sets A's value as &lt;strong&gt;3&lt;/strong&gt; to the &lt;em&gt;primary database&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Primary database&lt;/em&gt; starts to propagate this info to the &lt;em&gt;Replica database&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;A=3&lt;/code&gt; travels across the network and reaches the &lt;em&gt;Replica database&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;em&gt;Primary database&lt;/em&gt; acknowledges that this info has been sent to the &lt;em&gt;Replica database&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Server B&lt;/em&gt; reads from &lt;em&gt;Replica database&lt;/em&gt; and gets &lt;code&gt;A=3&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If step #4 happens, i.e. the primary only acknowledges the transaction as &lt;strong&gt;committed&lt;/strong&gt; after the replica has received it, then the replicas have consistent data immediately (no eventual consistency).&lt;/p&gt;

&lt;p&gt;Consistency in CAP in simple terms:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Always return &lt;strong&gt;up-to-date&lt;/strong&gt; information&lt;/p&gt;
&lt;/blockquote&gt;
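
&lt;p&gt;The five steps above can be sketched as a toy model of synchronous replication (classes and names are made up for illustration; real replication is asynchronous, networked and far messier). The point is only that the write is acknowledged after the replica has it, so the subsequent read is linearizable:&lt;/p&gt;

```python
class Node:
    """A trivial stand-in for a database node: just a key-value dict."""
    def __init__(self):
        self.data = {}

primary, replica = Node(), Node()

def write_sync(key, value):
    primary.data[key] = value   # step 1: Server A writes to the primary
    replica.data[key] = value   # steps 2-3: primary propagates to the replica
    return True                 # step 4: acknowledged only after replication

write_sync("A", 3)
# step 5: Server B reads from the replica and is guaranteed to see A=3
print(replica.data["A"])  # 3
```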

&lt;h2&gt;
  
  
  3.b. Availability
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--diEWFfWV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1705249589564/846f1822-6dcb-4c83-a7d5-fa2aa9a7275a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--diEWFfWV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1705249589564/846f1822-6dcb-4c83-a7d5-fa2aa9a7275a.png" alt="" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Consider that one of the databases is down, and consider the diagram:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Server A&lt;/em&gt; sets A's value as &lt;strong&gt;3&lt;/strong&gt; to the &lt;em&gt;primary database&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;em&gt;Primary database&lt;/em&gt; starts to propagate this info to the &lt;em&gt;Replica database&lt;/em&gt;, but finds that the &lt;em&gt;Replica database&lt;/em&gt; is down&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Server B&lt;/em&gt; reads from &lt;em&gt;Primary database&lt;/em&gt; and gets &lt;code&gt;A=3&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Availability in CAP in simple terms:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The system &lt;strong&gt;must&lt;/strong&gt; return information, even if it is out-of-date/stale. Even if a node goes down, another node serves the information.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  3.c. Partition Tolerance
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--leCTqfSC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1705249602401/bad3d541-b9b0-4473-9abc-7434c9288bc2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--leCTqfSC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1705249602401/bad3d541-b9b0-4473-9abc-7434c9288bc2.png" alt="" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the diagram above, the network is down. In this case Server A and Server B are on their own. So what do they do? They read from their own databases. &lt;em&gt;Server B's&lt;/em&gt; database thinks of itself as the primary, since it cannot connect to the actual primary to get the latest info. When the network is re-established, the databases run a reconciliation process wherein the data is brought back to a "consistent" state. Note the quotes: that reconciliation can get messy. Some databases use Lamport clocks to figure out the last update. If things are even messier, the application code performs some voodoo magic to make the data consistent enough.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nnlfzKMd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1705249609356/26c5d9ca-03f8-4741-a816-dfaa9bcf42f2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nnlfzKMd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1705249609356/26c5d9ca-03f8-4741-a816-dfaa9bcf42f2.png" alt="" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Partition tolerance in CAP in simple terms:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The system continues operating even if the &lt;strong&gt;network link has been severed&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The CAP theorem states that a distributed system can guarantee only 2 of these 3 properties. The trade-off is always there once you have more than a single database. All in all, there is no free lunch.&lt;/p&gt;

&lt;h3&gt;
  
  
  CA
&lt;/h3&gt;

&lt;p&gt;CA means that reads see the latest COMMIT, which means there is no inconsistent data across the distributed data sources as a whole. Relational databases mostly follow this.&lt;/p&gt;

&lt;h3&gt;
  
  
  CP
&lt;/h3&gt;

&lt;p&gt;These systems will not always answer if they are not &lt;strong&gt;available&lt;/strong&gt;. But if they do, you can be sure that the answer is correct thanks to their &lt;strong&gt;consistency guarantees&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  AP
&lt;/h3&gt;

&lt;p&gt;These systems will always give an answer, even if it is not the latest. Early social media sites, aiming to be highly available and needing partitions across different regions/campuses (like Facebook), had loads of bugs where the site loaded but posts and comments would come and go.&lt;/p&gt;

&lt;p&gt;The following Venn diagram lists databases supporting the CA, CP and AP schemes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ilhVOPt5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1705249617984/f8ba5a7a-0ea4-4e07-983e-c5c9712a9367.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ilhVOPt5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1705249617984/f8ba5a7a-0ea4-4e07-983e-c5c9712a9367.png" alt="" width="748" height="790"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Critique of CAP theorem
&lt;/h3&gt;

&lt;p&gt;In spite of bringing a way of analysing trade-offs in distributed systems, CAP is not a sufficient model for understanding them in the real world. Partition tolerance is often required rather than optional in the age of multi-node deployments and on-soil regulations. Another critique is that assuming network failure is the only failure mode is wrong: databases also suffer from Murphy's law. Power outages, disk corruption etc. are not part of CAP but are real scenarios that need to be considered. Martin Kleppmann has a good blog on the critique of the CAP theorem, which can be found &lt;a href="https://martin.kleppmann.com/2015/05/11/please-stop-calling-databases-cp-or-ap.html"&gt;here&lt;/a&gt;. (A paper format with even more detail is &lt;a href="https://arxiv.org/pdf/1509.05393.pdf"&gt;here&lt;/a&gt;.)&lt;/p&gt;

&lt;h1&gt;
  
  
  PACELC: A better CAP
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;A more nuanced framework for comparing NoSQL databases&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It stands for: if Partitioned, choose Availability or Consistency; Else, choose Latency or Consistency.&lt;/p&gt;

&lt;p&gt;And yes, it is an acronym with an &lt;code&gt;IF/ELSE&lt;/code&gt; statement.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;If (P)artitioned: choose between (A)vailability and (C)onsistency&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;(E)lse: choose between (L)atency and (C)onsistency&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--teiE_YUw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1705249666664/d3d4bf51-c111-4e31-8aca-b19a55eac6d9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--teiE_YUw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1705249666664/d3d4bf51-c111-4e31-8aca-b19a55eac6d9.png" alt="" width="782" height="634"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Consider the left-hand side: when a network partition occurs, you follow the CAP theorem, albeit only its CP or AP choice.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For the right-hand side, when there is no partition, you have to make a trade-off between latency and consistency. If you want your system to be fast, you have to sacrifice consistent data. Conversely, if you want your data to be consistent, you have to wait for internal consensus (Raft, Paxos, whatever) on the correct data.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
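
&lt;p&gt;The branch above can be written out as a literal &lt;code&gt;IF/ELSE&lt;/code&gt; (a toy sketch; the function name and return labels are made up for illustration):&lt;/p&gt;

```python
def pacelc_tradeoff(partitioned, prefer):
    """Map the PACELC decision onto its four outcomes: PA, PC, EL, EC."""
    if partitioned:
        # During a partition: choose Availability or Consistency
        assert prefer in ("availability", "consistency")
        return "PA" if prefer == "availability" else "PC"
    else:
        # Else (normal operation): choose Latency or Consistency
        assert prefer in ("latency", "consistency")
        return "EL" if prefer == "latency" else "EC"

print(pacelc_tradeoff(True, "availability"))  # PA: stay up, possibly serve stale data
print(pacelc_tradeoff(False, "latency"))      # EL: answer fast from the nearest node
```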

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Database&lt;/th&gt;&lt;th&gt;P+A&lt;/th&gt;&lt;th&gt;P+C&lt;/th&gt;&lt;th&gt;E+L&lt;/th&gt;&lt;th&gt;E+C&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;BigTable&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;HBase&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Mongo&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;MySQL/Postgres (and other RDBMS)&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Cassandra&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Scylla (Cassandra fork)&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Google Spanner&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;CockroachDB&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Watch this video on PACELC by Dr Daniel Abadi (the author of the concept) on the &lt;a href="https://youtu.be/vnXXFpySYVE"&gt;ScyllaDB channel&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Closing thoughts
&lt;/h1&gt;

&lt;p&gt;This is just the start of understanding the basic terms of databases. There are many more concepts, such as memory management, buffer cache, DB data structures, query planning &amp;amp; execution, 2PC &amp;amp; quorum, MVCC, columnar storage, WAL &amp;amp; journaling and much more, which are covered in depth in these 2 courses:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;CMU 15-721: Advanced database systems: &lt;a href="https://www.youtube.com/watch?v=uikbtpVZS2s&amp;amp;list=PLSE8ODhjZXjaKScG3l0nuOiDTTqpfnWFf"&gt;Playlist&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CMU 15-445/645: Database Systems &lt;a href="https://www.youtube.com/watch?v=LWS8LEQAUVc&amp;amp;list=PLSE8ODhjZXjYzlLMbX3cR0sxWnRM7CLFn"&gt;Playlist&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>eBPF - Unleash the Linux kernel</title>
      <dc:creator>Ayman Patel</dc:creator>
      <pubDate>Sun, 07 Jan 2024 17:12:45 +0000</pubDate>
      <link>https://dev.to/aymanapatel/ebpf-unleash-the-linux-kernel-4jgf</link>
      <guid>https://dev.to/aymanapatel/ebpf-unleash-the-linux-kernel-4jgf</guid>
      <description>&lt;h2&gt;
  
  
  User Space vs Kernel Space
&lt;/h2&gt;

&lt;p&gt;In order to understand where eBPF comes into the picture, first we need to understand the basic difference of &lt;strong&gt;User Space vs Kernel Space&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AfgY_9kJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1704646362259/9fb24659-07df-4520-9152-bd2d9db04f68.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AfgY_9kJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1704646362259/9fb24659-07df-4520-9152-bd2d9db04f68.jpeg" alt="" width="800" height="650"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;eBPF fits into the kernel space, but to give a user-space analogy of its significance: eBPF is to the operating system what WASM was to the browser. WASM allowed applications written in languages other than JavaScript to run in the browser, and made products like Figma possible on the web; Sketch and InVision could not keep up with the power of Figma's editor. The good news is that eBPF, for all that it is revolutionary and novel, will aid in observing the kernel space without bringing any existing technologies to their knees.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why eBPF
&lt;/h2&gt;

&lt;p&gt;Say you want to look into what is happening inside the Linux kernel, I mean really look into the nuts and bolts. You might use an interface from user space that talks to kernel space and returns information such as &lt;code&gt;File IO&lt;/code&gt;, &lt;code&gt;Network Traces&lt;/code&gt; etc., but this causes 2 issues:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Any unexpected crash at the Kernel Space level can lead to lack of observability on root cause of failure&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;As it is only an interface, the hardware and other kernel-level details are not visible from user space&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That is where eBPF comes in and shines. It is a sort of virtual machine hooked up inside the kernel that gives user space visibility into the kernel. This opens up a Pandora's box of possibilities for logging, instrumentation and security applications that can leverage eBPF.&lt;/p&gt;

&lt;h2&gt;
  
  
  History of BPF and eBPF
&lt;/h2&gt;

&lt;p&gt;eBPF's name is derived from an old technology called BPF (Berkeley Packet Filter).&lt;/p&gt;

&lt;h3&gt;
  
  
  BPF
&lt;/h3&gt;

&lt;p&gt;BPF was first conceived in the early 1990s (before Linux became mainstream) as a way to intercept network traffic from the kernel itself instead of relying on user-level processes. It was a &lt;strong&gt;network tap&lt;/strong&gt; (monitoring network traffic) and a &lt;strong&gt;packet filter&lt;/strong&gt; (filtering out unwanted packets to reduce noise in network monitoring).&lt;/p&gt;

&lt;h3&gt;
  
  
  eBPF
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Video Recommendation : &lt;a href="https://www.youtube.com/watch?v=Wb_vD3XZYOA"&gt;eBPF: Unlocking the Kernel&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the early 2010s, there was a need for better observability tools that could be defined in software instead of hardware. Also, anything that crossed from user space to kernel space could crash unexpectedly without revealing what caused the crash at the kernel layer. Alexei Starovoitov wanted something that a user-space application could call: a hook/probe point in the kernel that would decide what information to send back to user space. This is where eBPF was born.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6ZWgBuJ4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1704646430951/f5ba72ff-7a6d-4875-92bc-01178e8b9ed3.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6ZWgBuJ4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1704646430951/f5ba72ff-7a6d-4875-92bc-01178e8b9ed3.jpeg" alt="" width="800" height="832"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  eBPF Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Us346NaY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1704646644895/1aff28db-90d4-419f-a21c-d3921f97337d.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Us346NaY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1704646644895/1aff28db-90d4-419f-a21c-d3921f97337d.jpeg" alt="" width="800" height="853"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;eBPF programs are called by hooks/probes, either from the kernel or from a user-land application. These hooks are pre-defined and include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;System calls&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Function entry/exit: custom programs can attach to function entry and exit points so that they run at those moments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Kernel tracepoints: tracepoints are lightweight hooks for calling a function at runtime, used for tracing and perf analysis in the kernel.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Network interfaces via XDP: eXpress Data Path (XDP) allows custom eBPF programs to run as soon as network packets are received.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;LSM (Linux Security Module) interface, etc.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If a pre-defined hook/probe is not available for a particular use-case, it is possible to attach custom probes at both the user and kernel levels. These are called &lt;code&gt;uprobe&lt;/code&gt; and &lt;code&gt;kprobe&lt;/code&gt; respectively.&lt;/p&gt;

&lt;h3&gt;
  
  
  Loading
&lt;/h3&gt;

&lt;p&gt;Before an eBPF program can run in the kernel, it must be &lt;strong&gt;loaded&lt;/strong&gt;, with the help of an eBPF loader library:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/cilium/ebpf"&gt;eBPF by Cilium&lt;/a&gt;: Go-based eBPF loading library&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/libbpf/libbpf"&gt;libbpf&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Verification
&lt;/h3&gt;

&lt;p&gt;The verifier ensures that an eBPF program is safe to run. It checks various conditions, such as:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The program is loaded by a privileged process (unless stated otherwise)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The program does not bring the system down&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The program &lt;strong&gt;always runs to completion&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  JIT Compilation
&lt;/h3&gt;

&lt;p&gt;Translating the eBPF bytecode (generated in &lt;strong&gt;user space&lt;/strong&gt;) into native machine code.&lt;/p&gt;

&lt;h3&gt;
  
  
  eBPF Maps
&lt;/h3&gt;

&lt;p&gt;Maps are how eBPF programs hold and retrieve data. They support a wide variety of data structures, such as:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Hash Tables&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Arrays&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Stack traces&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;etc...&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Helper calls
&lt;/h3&gt;

&lt;p&gt;eBPF programs do not call kernel functions directly; this keeps them loosely coupled to kernel versions. Instead, they call pre-defined helper functions exposed by the kernel. These helpers include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Current time and date&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Process/cgroup context&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Network packet manipulation and forwarding logic&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Random number generator&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Tail and Function Calls
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zlJ7XcHd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1704646469826/75b14211-f766-40aa-8be7-4d797f6843e6.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zlJ7XcHd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1704646469826/75b14211-f766-40aa-8be7-4d797f6843e6.jpeg" alt="" width="800" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;eBPF allows tail calls into other eBPF programs, which allows for composability and extensibility of eBPF programs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ensuring eBPF is safe
&lt;/h3&gt;

&lt;p&gt;As eBPF is a very powerful concept, which allows user-level programs to hook into kernel level details; it is imperative that the architecture of eBPF ensures that such a technology does not break the system when used.&lt;/p&gt;

&lt;h4&gt;
  
  
  Writing safe eBPF programs
&lt;/h4&gt;

&lt;p&gt;This includes&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;eBPF programs should be non-blocking. An eBPF program can contain a loop if and only if the &lt;strong&gt;Verifier&lt;/strong&gt; can prove that the loop exits!&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;eBPF programs must not use out-of-bounds memory&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;eBPF programs should be small&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Hardening eBPF
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Kernel memory accessed from an eBPF program is read-only.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Spectre mitigation: Spectre was a major CPU vulnerability in which speculative (out-of-order) branch execution could leak sensitive data to attackers. eBPF prevents Spectre-type attacks at the &lt;strong&gt;Verifier&lt;/strong&gt; level.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Constant blinding (a defence against JIT-spraying attacks)&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  eBPF Applications
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6Hnmby-e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1704646528599/9ae9fe11-e9e6-45de-8d95-7c1b2b2bfc16.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6Hnmby-e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1704646528599/9ae9fe11-e9e6-45de-8d95-7c1b2b2bfc16.jpeg" alt="" width="800" height="1075"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Networking
&lt;/h2&gt;

&lt;p&gt;Networking use-cases for eBPF include &lt;strong&gt;traffic control&lt;/strong&gt; and enforcing &lt;strong&gt;network policy&lt;/strong&gt; (via XDP). Tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.tigera.io/calico/latest/operations/ebpf/use-cases-ebpf"&gt;Calico Networking&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://cilium.io/use-cases/cni/"&gt;Cilium's CNI&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Observability
&lt;/h2&gt;

&lt;p&gt;Send kernel-level details to an observability platform:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/grafana/beyla"&gt;Grafana Beyla&lt;/a&gt; allows instrumentation of HTTP and HTTPS services from the Linux kernel to Grafana directly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/cilium/hubble/"&gt;Cilium's Hubble&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Tracing
&lt;/h2&gt;

&lt;p&gt;If you want to do production tracing and troubleshooting, &lt;a href="https://github.com/iovisor/bpftrace"&gt;bpftrace&lt;/a&gt; is the tool for you.&lt;/p&gt;

&lt;p&gt;Ever wanted to know how the &lt;code&gt;VACUUM&lt;/code&gt; process works under the hood in Postgres? bpftrace can help with that! Check this &lt;a href="https://github.com/iovisor/bpftrace"&gt;article on monitoring Postgres's VACUUM process&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;List of eBPF based tracing tools&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/iovisor/bcc"&gt;BCC&lt;/a&gt; (Toolkit for creating kernel tracing tools)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/iovisor/bpftrace"&gt;BPFTrace&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Security
&lt;/h2&gt;

&lt;p&gt;Traditionally, &lt;a href="https://linux.die.net/man/8/auditd"&gt;auditd&lt;/a&gt; was used for auditing what happens in a Linux OS; but as it is a user-space component, it carries a performance penalty.&lt;/p&gt;

&lt;p&gt;Low-level observability through eBPF can help detect kernel-level changes made by attackers. For example, when an application's privileges are changed, an eBPF program listening for that event can trigger an alert.&lt;/p&gt;

&lt;p&gt;Security libraries based on eBPF:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://tetragon.io/"&gt;Tetragon&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://kubearmor.com/"&gt;KubeArmour&lt;/a&gt;: K80based Security engine which used eBPF and Linux Security Modules (LSM)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Falco&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>The quest to improve Supply Chain Security</title>
      <dc:creator>Ayman Patel</dc:creator>
      <pubDate>Sun, 31 Dec 2023 03:30:10 +0000</pubDate>
      <link>https://dev.to/aymanapatel/the-quest-to-improve-supply-chain-security-37db</link>
      <guid>https://dev.to/aymanapatel/the-quest-to-improve-supply-chain-security-37db</guid>
      <description>&lt;h1&gt;
  
  
  Introduction with a sour taste
&lt;/h1&gt;

&lt;p&gt;Writing software is hard. Maintaining it is harder. Securing it is the hardest. The attack vectors keep increasing year on year as new features are introduced. For instance, the Log4j vulnerability seen a couple of years ago, in December 2021, was caused by a JNDI interface introduced in 2014 (a &lt;a href="https://www.youtube.com/watch?v=Y8a5nB-vy78"&gt;Black Hat talk&lt;/a&gt; covered its exploitability back in 2016!). Another example is the &lt;strong&gt;Panama Papers leak&lt;/strong&gt;, where the open-source CMS Drupal was the &lt;a href="https://www.prometsource.com/blog/drupal-security-updates-and-panama-papers"&gt;root cause of the hack.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A third example that drives the point home is the &lt;strong&gt;Equifax&lt;/strong&gt; hack, whose root cause was not patching an Apache Struts vulnerability for &lt;a href="https://www.bleepingcomputer.com/news/security/equifax-confirms-hackers-used-apache-struts-vulnerability-to-breach-its-servers/"&gt;6 months&lt;/a&gt; after a 0-day exploit was found.&lt;/p&gt;

&lt;p&gt;Recent high-profile hacks after the pandemic (Colonial Pipeline, Microsoft Exchange Server, Log4j, SolarWinds) have made people and governments wary of the security of software systems. Governments have come up with legislation in order to curb these hacks.&lt;/p&gt;

&lt;h1&gt;
  
  
  Application Testing Tools
&lt;/h1&gt;

&lt;h2&gt;
  
  
  SAST
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Static Application Security Testing&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The security of the application is determined by scanning the static source code. Because the tool has no information about the running application, it can produce false positives.&lt;/p&gt;
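&lt;p&gt;As a toy illustration of the static-scanning idea (the regex rules below are hypothetical; real SAST engines analyze ASTs and data flow, which is exactly why naive pattern matching produces false positives):&lt;/p&gt;

```python
import re

# Toy static scan: flag dangerous patterns without running the code.
# Real SAST tools build ASTs and data-flow graphs; this regex pass only
# illustrates the idea (and why false positives happen).
RULES = {
    "use of eval": re.compile(r"\beval\s*\("),
    "hardcoded secret": re.compile(r"(password|secret)\s*=\s*['\"]\w+['\"]", re.I),
}

def scan(source):
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for name, pattern in RULES.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings

code = 'password = "hunter2"\nresult = eval(user_input)\n'
print(scan(code))  # [(1, 'hardcoded secret'), (2, 'use of eval')]
```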

&lt;h2&gt;
  
  
  DAST
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Dynamic Application Security Testing&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The security of the application is determined by running tests against the running application, for example by sending malicious code or input. As DAST does not look into the code, it has no context on the root cause of a vulnerability. It is therefore difficult for a developer to understand and gauge which architectural or code issues are behind a reported security vulnerability.&lt;/p&gt;

&lt;h2&gt;
  
  
  IAST
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Interactive Application Security Testing&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The security of the application is determined by running an agent inside the application, similar to agents deployed for monitoring (such as Dynatrace or eBPF-based tools). IAST has the advantages of both DAST and SAST: it can run against the deployed application as well as see the source code.&lt;/p&gt;

&lt;p&gt;The following illustration shows what DAST, SAST and IAST test:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--U5zNEC92--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1703997703028/a2965239-b587-4ed8-bcda-f5d5d5746f5a.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--U5zNEC92--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1703997703028/a2965239-b587-4ed8-bcda-f5d5d5746f5a.jpeg" alt="" width="800" height="871"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  SCA
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Software Composition Analysis&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This provides information on dependencies and libraries, including:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Downstream vulnerabilities&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;License risks&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Library health (how well maintained is the library)&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;SCAs provide an interface with dashboards on the number of issues (license, vulnerability and package health), and also provide links to CVEs, which helps developers and security engineers get all the information about their software dependencies in a single place.&lt;/p&gt;
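&lt;p&gt;A toy sketch of the dependency-audit idea behind SCA; the vulnerability database below is a hard-coded stand-in for real feeds such as the NVD or OSV:&lt;/p&gt;

```python
# Toy SCA pass: check declared dependencies against a (hypothetical)
# vulnerability database. Real SCA tools resolve transitive dependencies
# and pull advisories from feeds such as the NVD or OSV.
KNOWN_VULNS = {
    ("log4j-core", "2.14.1"): ["CVE-2021-44228"],
    ("struts2-core", "2.3.31"): ["CVE-2017-5638"],
}

def audit(dependencies):
    report = {}
    for name, version in dependencies:
        cves = KNOWN_VULNS.get((name, version))
        if cves:
            report[name] = cves
    return report

deps = [("log4j-core", "2.14.1"), ("guava", "31.0")]
print(audit(deps))  # {'log4j-core': ['CVE-2021-44228']}
```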

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Type&lt;/th&gt;&lt;th&gt;Example&lt;/th&gt;&lt;th&gt;What&lt;/th&gt;&lt;th&gt;Cons&lt;/th&gt;&lt;th&gt;Pros&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;SAST&lt;/td&gt;&lt;td&gt;Checkmarx, SonarQube&lt;/td&gt;&lt;td&gt;Look at code to find vulnerabilities&lt;/td&gt;&lt;td&gt;False positives possible&lt;/td&gt;&lt;td&gt;Finds the exact security issue in the client code&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;IAST&lt;/td&gt;&lt;td&gt;Invicti, Semgrep, Contrast Security&lt;/td&gt;&lt;td&gt;Interactive testing by embedding into the running application and running security tests against it&lt;/td&gt;&lt;td&gt;Requires an agent, which may not be available for a particular language&lt;/td&gt;&lt;td&gt;Sees source code like SAST while running real security attacks like DAST&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;DAST&lt;/td&gt;&lt;td&gt;Bright Security, Veracode DAST&lt;/td&gt;&lt;td&gt;Run security tests from outside&lt;/td&gt;&lt;td&gt;Developers cannot get the root cause of a vulnerability; requires additional time for root-causing and fixing&lt;/td&gt;&lt;td&gt;Fast to run in a CI/CD pipeline&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;SCA&lt;/td&gt;&lt;td&gt;Snyk, Black Duck&lt;/td&gt;&lt;td&gt;Look at vulnerabilities of libraries, plus their licenses and how well-maintained they are&lt;/td&gt;&lt;td&gt;Very low signal-to-noise ratio&lt;/td&gt;&lt;td&gt;Dashboard provides details on security and package health in one place&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The following capability matrix by Veracode shows the capabilities of each tool:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--daDJ_IXw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1703996759226/6b5ac010-30c2-448b-b13a-8d013fe48473.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--daDJ_IXw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1703996759226/6b5ac010-30c2-448b-b13a-8d013fe48473.jpeg" alt="" width="800" height="599"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  SBOM, VEX and CSAF
&lt;/h1&gt;

&lt;p&gt;According to a US ruling, companies are mandated to include a software bill of materials (SBOM). A bill of materials is traditionally a list of items used to build a particular product. For example, you might have heard of car recalls due to a fault in the brake discs: the BOM allows auto manufacturers to trace from raw materials to the cars in which those materials were used. In software, too, there are standards to define the components of software, which are mainly libraries.&lt;/p&gt;

&lt;p&gt;SCA might sound similar to SBOM, but the goal of SBOM is to enable collaboration with other systems. SCA is vendor-specific, while SBOMs are driven by foundations such as OWASP and the Linux Foundation. This provides interoperability for further capabilities such as advisory frameworks, assessing the exploitability of vulnerabilities, and signing software packages to maintain their authenticity.&lt;/p&gt;
&lt;h2&gt;
  
  
  SBOM
&lt;/h2&gt;

&lt;p&gt;There are 2 standards for SBOM&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;SPDX (by Linux foundation)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CycloneDX (by OWASP)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  SPDX
&lt;/h3&gt;

&lt;p&gt;This was created by the Linux Foundation in 2011 to track software licenses. Some years later, information regarding the materials/components of software was added.&lt;/p&gt;
&lt;h3&gt;
  
  
  CycloneDX
&lt;/h3&gt;

&lt;p&gt;CycloneDX was created by OWASP to combat vulnerability identification, outdated software and license compliance. It includes not only SBOM but also &lt;a href="https://github.com/CycloneDX/bom-examples/blob/master/HBOM"&gt;HBOM&lt;/a&gt; (Hardware), &lt;a href="https://github.com/CycloneDX/bom-examples/blob/master/SBOM"&gt;OBOM&lt;/a&gt; (Operations), &lt;a href="https://github.com/CycloneDX/bom-examples/blob/master/SaaSBOM"&gt;SaaSBOM&lt;/a&gt; (Software as a Service), &lt;a href="https://github.com/CycloneDX/bom-examples/blob/master/VDR"&gt;VDR&lt;/a&gt; (Vulnerability Disclosure Report) and &lt;a href="https://github.com/CycloneDX/bom-examples/blob/master/VEX"&gt;VEX&lt;/a&gt; (Vulnerability Exploitability eXchange). VDR and VEX are most useful when combined with an SBOM: they help software engineers find vulnerabilities, create a report on impact through an advisory, and finally decide whether a vulnerability is exploitable in the current software stack.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6elP01JY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cyclonedx.org/theme/assets/images/CycloneDX-Object-Model-Swimlane.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6elP01JY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cyclonedx.org/theme/assets/images/CycloneDX-Object-Model-Swimlane.svg" alt="CycloneDX Object Model Swimlane" width="800" height="660"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  SBOM Tools
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Anchore/Syft&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Generate SBOMs for your container images, libraries and filesystems.&lt;/p&gt;

&lt;p&gt;a. Generating CycloneDX and SPDX output using &lt;a href="https://github.com/anchore/syft"&gt;Syft&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nC83H2dJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1703964005902/ef98829e-4c13-4a69-86d3-b0b613b13c29.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nC83H2dJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1703964005902/ef98829e-4c13-4a69-86d3-b0b613b13c29.jpeg" alt="" width="800" height="61"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;b. JSON output from SPDX and Cyclone DX&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ll76Eqgi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1703964009522/5d8e734e-be48-45a2-83cc-9445ade086ba.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ll76Eqgi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1703964009522/5d8e734e-be48-45a2-83cc-9445ade086ba.jpeg" alt="" width="800" height="223"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Anchore/Grype: works with Syft and does vulnerability scanning for containers as well as filesystems.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wvc8r6TF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1703964019222/75b7dc54-2ecb-49e2-99a2-fb9a64475ab5.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wvc8r6TF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1703964019222/75b7dc54-2ecb-49e2-99a2-fb9a64475ab5.jpeg" alt="" width="800" height="68"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0bkh0ztH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1703964022658/20ca64ea-14c9-473a-86de-3d47a0ffb977.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0bkh0ztH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1703964022658/20ca64ea-14c9-473a-86de-3d47a0ffb977.jpeg" alt="" width="800" height="548"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://snyk.io"&gt;Synk&lt;/a&gt; provides all tools under 1 hood. It provides tools such as&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Snyk Open Source: Identify vulnerable open-source libraries&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Snyk Container: Securing container images&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Snyk IaC: Catching misconfigurations of IaC secrets and policies&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Snyk Code: SAST offering from Snyk&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;Aqua&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Aqua vShield: Mitigation strategy for vulnerabilities&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Aqua Trivy: OSS Vulnerability scanner, IaC protection&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  VEX (Vulnerability Exploitability eXchange)
&lt;/h2&gt;

&lt;p&gt;The goal of VEX is "to communicate the exploitability of components with known vulnerabilities in the context of the product in which they are used."&lt;/p&gt;

&lt;p&gt;Having a lot of libraries can lead to a lot of noise when a vulnerability appears. The vulnerability may be in an obscure function that is hardly used. To combat such scenarios, we need clarity on the risk of each vulnerability, so that appropriate action can be taken: fix it now, or put it on hold because it poses no risk at this point in time.&lt;/p&gt;

&lt;p&gt;Goals of VEX:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;VEX can be embedded in a BOM (CycloneDX), or live in a dedicated VEX BOM.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Provide clear details on each vulnerability and its exploitability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Bridge the communication gap between software consumer and software producer: it tells the consumer what actions the producer has taken, and what actions the consumer should take to minimize the security impact.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
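&lt;p&gt;To see how VEX reduces noise, here is a minimal sketch of filtering SBOM vulnerability alerts through VEX statements. The alert list and statements are hypothetical, and the state names are only loosely modeled on CycloneDX's analysis states:&lt;/p&gt;

```python
# Sketch: use VEX statements to filter SBOM vulnerability alerts.
# State names loosely follow CycloneDX VEX "analysis.state" values
# (e.g. "exploitable", "not_affected"); treat this as illustrative.
sbom_alerts = ["CVE-2021-44228", "CVE-2017-5638"]

vex_statements = {
    "CVE-2021-44228": {"state": "not_affected",
                       "justification": "vulnerable code not in execute path"},
    "CVE-2017-5638": {"state": "exploitable"},
}

def actionable(alerts, vex):
    # Keep only vulnerabilities that VEX does not mark as benign/resolved.
    benign = {"not_affected", "resolved", "false_positive"}
    return [cve for cve in alerts
            if vex.get(cve, {}).get("state") not in benign]

print(actionable(sbom_alerts, vex_statements))  # ['CVE-2017-5638']
```

&lt;p&gt;The point is the shape of the decision: the SBOM says "this component is present", and VEX says whether that presence actually matters in this product.&lt;/p&gt;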

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--W9djYUqk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://aymanace2049.hashnode.dev/the-quest-to-improve-supply-chain-security" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--W9djYUqk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://aymanace2049.hashnode.dev/the-quest-to-improve-supply-chain-security" alt="[E192622E-215F-445A-978B-4FBF76C1C020.jpeg]" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  CSAF
&lt;/h2&gt;

&lt;p&gt;Common Security Advisory Framework (CSAF) is a language to communicate security advisories. It is the final step of understanding the software components, and providing enough information in order to make the decision of remediating software vulnerabilities.&lt;/p&gt;

&lt;p&gt;It is maintained by OASIS Open, and the CSAF v2.0 standard can be found &lt;a href="https://docs.oasis-open.org/csaf/csaf/v2.0/os/csaf-v2.0-os.html"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Example CSAF JSON&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{ "document": { "category": "csaf_vex", "csaf_version": "2.0", "notes": [{ "category": "summary", "text": "Example Company VEX document. Unofficial content for demonstration purposes only.", "title": "Author comment" }], "publisher": { "category": "vendor", "name": "Example Company ProductCERT", "namespace": "https://psirt.example.com" }, "title": "Example VEX Document Use Case 1 - Fixed", "tracking": { "current_release_date": "2022-03-03T11:00:00.000Z", "generator": { "date": "2022-03-03T11:00:00.000Z", "engine": { "name": "Secvisogram", "version": "1.11.0" } }, "id": "2022-EVD-UC-01-F-001", "initial_release_date": "2022-03-03T11:00:00.000Z", "revision_history": [{ "date": "2022-03-03T11:00:00.000Z", "number": "1", "summary": "Initial version." }], "status": "final", "version": "1" } }, "product_tree": { "branches": [{ "branches": [ { "branches": [ { "category": "product_version", "name": "1.1", "product": { "name": "Example Company DEF 1.1", "product_id": "CSAFPID-0001" } }], "category": "product_name", "name": "DEF" } ], "category": "vendor", "name": "Example Company" } ] }, "vulnerabilities": [{ "cve": "CVE-2021-44228", "notes": [ { "category": "description", "text": "Apache Log4j2 2.0-beta9 through 2.15.0 (excluding security releases 2.12.2, 2.12.3, and 2.3.1) JNDI features used in configuration, log messages, and parameters do not protect against attacker controlled LDAP and other JNDI related endpoints. An attacker who can control log messages or log message parameters can execute arbitrary code loaded from LDAP servers when message lookup substitution is enabled. From log4j 2.15.0, this behavior has been disabled by default. From version 2.16.0 (along with 2.12.2, 2.12.3, and 2.3.1), this functionality has been completely removed. 
Note that this vulnerability is specific to log4j-core and does not affect log4net, log4cxx, or other Apache Logging Services projects.", "title": "CVE description" }], "product_status": { "fixed": ["CSAFPID-0001"] } } ]}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
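&lt;p&gt;A CSAF document like this can be consumed mechanically. The sketch below (a simplification, not an official CSAF parser) extracts the per-CVE product status from a trimmed-down version of the example:&lt;/p&gt;

```python
import json

# Sketch: extract per-CVE product status from a CSAF VEX document.
# The structure mirrors the example above, trimmed to the relevant fields.
csaf = json.loads("""
{
  "vulnerabilities": [
    {"cve": "CVE-2021-44228",
     "product_status": {"fixed": ["CSAFPID-0001"]}}
  ]
}
""")

def fixed_products(doc):
    return {v["cve"]: v.get("product_status", {}).get("fixed", [])
            for v in doc.get("vulnerabilities", [])}

print(fixed_products(csaf))  # {'CVE-2021-44228': ['CSAFPID-0001']}
```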



&lt;h1&gt;
  
  
  Signing your software libraries
&lt;/h1&gt;

&lt;p&gt;There are a couple of attack vectors based on how software is installed. A couple of years ago, a security researcher wrote an article that went viral: &lt;a href="https://medium.com/@alex.birsan/dependency-confusion-4a5d60fec610"&gt;"Dependency Confusion: How I Hacked Into Apple, Microsoft and Dozens of Other Companies"&lt;/a&gt;. He was able to get his own packages pulled into companies' internal builds in place of private ones. Another example is a series of attacks on npm where typosquatting (misspelled package names) led to the installation of malware. These sorts of attacks could have been mitigated by using digital signatures.&lt;/p&gt;
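&lt;p&gt;One way to picture the typosquatting risk: a toy check (with a hypothetical threshold and package list) that flags install names suspiciously close to, but not equal to, well-known packages. Real defenses rely on registry metadata, scoped packages and signatures instead:&lt;/p&gt;

```python
import difflib

# Toy typosquat check: flag install requests whose name is suspiciously
# close to (but not equal to) a well-known package name.
POPULAR = ["requests", "numpy", "lodash", "express"]

def looks_like_typosquat(name, threshold=0.85):
    for known in POPULAR:
        if name != known:
            ratio = difflib.SequenceMatcher(None, name, known).ratio()
            if ratio >= threshold:
                return known  # the package this name is likely imitating
    return None

print(looks_like_typosquat("reqeusts"))  # 'requests'
print(looks_like_typosquat("flask"))     # None
```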

&lt;p&gt;We have cryptography, which gives us integrity, authenticity and non-repudiation. Signing software libraries can help keep malicious packages out of our systems. A number of technologies apply these cryptographic signing principles to software binaries, libraries and containers.&lt;/p&gt;
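&lt;p&gt;A minimal sketch of the verify-before-trust idea, using an HMAC for brevity. Real systems such as Sigstore use asymmetric keys and certificate transparency, not a shared secret, so treat the key and artifact names here as illustrative:&lt;/p&gt;

```python
import hashlib
import hmac

# Sketch of integrity + authenticity for a software artifact.
# HMAC with a shared key stands in for real asymmetric signatures
# (e.g. what cosign does with key pairs and transparency logs).
SIGNING_KEY = b"demo-key-not-for-production"  # hypothetical key

def sign(artifact):
    return hmac.new(SIGNING_KEY, artifact, hashlib.sha256).hexdigest()

def verify(artifact, signature):
    # Constant-time comparison avoids timing side channels.
    return hmac.compare_digest(sign(artifact), signature)

package = b"library-1.2.3.tar.gz contents"
sig = sign(package)

print(verify(package, sig))               # True
print(verify(b"tampered contents", sig))  # False: signature no longer matches
```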

&lt;h2&gt;
  
  
  Sigstore
&lt;/h2&gt;

&lt;p&gt;A standard and a collection of tools for cryptographically signing software components.&lt;/p&gt;

&lt;p&gt;It includes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Cosign: Signing software artifacts&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fulcio: Providing Certificate Authority&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Rekor: Transparency log&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Gitsign: Signing Git commits&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;OpenID integration: Authentication to check identity of requestor&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qj9bSlv3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.sigstore.dev/img/alt_landscapelayout_overview.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qj9bSlv3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.sigstore.dev/img/alt_landscapelayout_overview.svg" alt="Lifecycle of sigstore" width="800" height="313"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see a demo of Sigstore &lt;a href="https://www.youtube.com/watch?v=G7gU3WZTBpk"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Resources
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.cramhacks.com/p/INFINITE-CVES-WITH-SUPPLY-CHAIN"&gt;Infinite CVEs with Supply chain&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Semgrep&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/ossf/wg-vulnerability-disclosures"&gt;OSSF Vulnerability disclosure&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>Infrastructure as Code - A History Primer</title>
      <dc:creator>Ayman Patel</dc:creator>
      <pubDate>Sun, 17 Dec 2023 13:06:21 +0000</pubDate>
      <link>https://dev.to/aymanapatel/infrastructure-as-code-a-history-primer-5688</link>
      <guid>https://dev.to/aymanapatel/infrastructure-as-code-a-history-primer-5688</guid>
      <description>&lt;h1&gt;
  
  
  Phase 0: BASH
&lt;/h1&gt;

&lt;h4&gt;
  
  
  BASH
&lt;/h4&gt;

&lt;p&gt;In the early days, our configuration was written in Bash scripts. But the POSIX-compliance nightmare meant that these Bash (or Shell, or Fish, or whatever shell your OS runs) scripts became a nightmare to debug and extend. &lt;code&gt;#!/usr/bin/sh&lt;/code&gt; should have been the norm, but alas, here we are trying to mitigate this disaster by creating an abstraction on top of the OS.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NStmg_rw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1702816587734/9414e5de-ca12-4157-9df3-98427e0b162d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NStmg_rw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1702816587734/9414e5de-ca12-4157-9df3-98427e0b162d.png" alt="Phase 1: 1990s and late 2000s; Phase 2: Mid to late 2010s" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Phase 1: CFEngine and the rest
&lt;/h1&gt;

&lt;p&gt;The 1990s brought in the abstraction of OS-based IaC tools. It started with &lt;a href="https://en.wikipedia.org/wiki/CFEngine"&gt;CFEngine&lt;/a&gt;, a side project by Mark Burgess (then a post-doctoral fellow) to automate the workstations at his university. This sparked configuration management tools such as &lt;a href="https://en.wikipedia.org/wiki/Progress_Chef"&gt;Chef&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Puppet_(software)"&gt;Puppet&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Ansible_(software)"&gt;Ansible&lt;/a&gt; etc. in the late 2000s to early 2010s. The above just gives a timeline of when these were created. These were fantastic tools at a time when people rented servers rather than getting them from cloud giants such as AWS, Azure and (probably) GCP. When the clouds came in, there was a shift toward IaC tools which:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Interacted with the cloud providers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Avoided vendor lock-in&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That is when tools such as Terraform came which provided an abstraction over the cloud providers.&lt;/p&gt;

&lt;p&gt;Before we go further, we need to understand the pure basics of IaC tools, starting with what IaC tools do &lt;em&gt;not&lt;/em&gt; have in common:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;A programming language: Terraform has a custom DSL, Chef has a Ruby DSL, Ansible is YAML-based, and Puppet has its own IaC DSL called Puppet Code. So there is no common language between them.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Target systems: Chef, Puppet and Ansible were built to configure servers, not cloud configurations, while Terraform, Crossplane and Pulumi were built to provision cloud resources.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So what exactly is IaC's core value? It is a state machine with 2 states (Current and Target) and 2 interactions (Compare and Converge).&lt;/p&gt;
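&lt;p&gt;That state machine can be sketched in a few lines; the resource names below are hypothetical:&lt;/p&gt;

```python
# Minimal sketch of the IaC core loop: compare the current state with
# the target state, then converge by applying only the differences.
def compare(current, target):
    to_create = {k: v for k, v in target.items() if k not in current}
    to_update = {k: v for k, v in target.items()
                 if k in current and current[k] != v}
    to_delete = [k for k in current if k not in target]
    return to_create, to_update, to_delete

def converge(current, target):
    create, update, delete = compare(current, target)
    new_state = dict(current)
    new_state.update(create)   # provision missing resources
    new_state.update(update)   # modify drifted resources
    for k in delete:           # tear down resources no longer declared
        del new_state[k]
    return new_state

current = {"vm-1": {"size": "small"}, "vm-2": {"size": "large"}}
target = {"vm-1": {"size": "medium"}, "vm-3": {"size": "small"}}

print(converge(current, target) == target)  # True: state has converged
```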

&lt;p&gt;We also need to add schema constructs to see what the IaC needs, as well as some sort of type-checking on what values can be passed. We'll discuss other aspects such as this in the &lt;em&gt;Cue Section&lt;/em&gt;&lt;/p&gt;
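&lt;p&gt;A minimal sketch of such type-checking, with a hypothetical two-field schema; tools like Cue express this declaratively rather than in code:&lt;/p&gt;

```python
# Sketch of schema validation for IaC inputs: declare what each field
# must look like, and fail fast on bad values. The schema and field
# names here are hypothetical.
SCHEMA = {
    "instance_count": lambda v: isinstance(v, int) and v > 0,
    "region": lambda v: v in {"us-east-1", "eu-west-1"},
}

def validate(config):
    errors = []
    for field, check in SCHEMA.items():
        if field not in config:
            errors.append(f"missing field: {field}")
        elif not check(config[field]):
            errors.append(f"invalid value for {field}: {config[field]!r}")
    return errors

print(validate({"instance_count": 3, "region": "us-east-1"}))  # []
print(validate({"instance_count": "3", "region": "mars-1"}))   # two errors
```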

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZYTnlV-k--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1702815762157/0771c863-4ef5-4a1f-b17e-307fb95206ff.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZYTnlV-k--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1702815762157/0771c863-4ef5-4a1f-b17e-307fb95206ff.png" alt="" width="800" height="523"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  IaaS, PaaS, FaaS
&lt;/h3&gt;

&lt;p&gt;Apart from the traditional cloud, there was also a burst of different runtimes in the 2010s: OpenShift, Cloud Foundry, Kubernetes, OpenStack. Each of these tools required its own way of being configured, and YAML was mostly the go-to format to describe the metadata for the infrastructure to be deployed.&lt;/p&gt;

&lt;h2&gt;
  
  
  YAML enters the chat
&lt;/h2&gt;

&lt;p&gt;However, due to issues with readability and lack of extensibility, there was a shift in the industry to create an abstraction that would also work in a multi-cloud environment. Whilst the first-age IaC tools (Chef, Ansible etc.) were great for creating server resources, they lagged behind in the world of cloud providers. The abstraction that seemed to work for most folks is YAML. Yes, YAML is far from perfect. Indentation issues are just the tip of the iceberg. Did you know of the &lt;a href="https://www.bram.us/2022/01/11/yaml-the-norway-problem"&gt;Norway problem&lt;/a&gt;?!&lt;/p&gt;

&lt;p&gt;Yeah, writing &lt;code&gt;NO&lt;/code&gt; can break the YAML parser.&lt;/p&gt;
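You can see the effect without any YAML library by mimicking how a YAML 1.1 parser types unquoted scalars. This is a simplified sketch (real parsers resolve far more than booleans), using the YAML 1.1 boolean keyword list:

```python
import re

# The unquoted scalars that YAML 1.1 resolves to booleans -- the source of
# the "Norway problem", since the country code NO is on the list.
YAML11_BOOL = re.compile(
    r"^(?:y|Y|yes|Yes|YES|n|N|no|No|NO"
    r"|true|True|TRUE|false|False|FALSE|on|On|ON|off|Off|OFF)$"
)

def resolve_scalar(token):
    """Mimic how a YAML 1.1 parser types an unquoted scalar."""
    if YAML11_BOOL.match(token):
        return token.lower() in ("y", "yes", "true", "on")
    return token  # everything else stays a string

print(resolve_scalar("NO"))      # False -- Norway silently became a boolean
print(resolve_scalar("Norway"))  # 'Norway' -- full names are safe
```

Quoting the value ("NO") or using a YAML 1.2 parser avoids the problem.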

&lt;p&gt;It is great for readability, but also a very nice way to shoot yourself in the foot with Ctrl-C and Ctrl-V. Not just IaC: your environment variables and &lt;code&gt;properties.yaml&lt;/code&gt; are also config-as-code, usually key-value pairs. If these are copy-pasted without any validation logic, a nice surprise will be waiting for you when you deploy your code to higher environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 2: Terraform and modern IaCs
&lt;/h2&gt;

&lt;p&gt;Terraform, Crossplane and Pulumi are great tools for writing IaC. But the issue is that writing your own &lt;strong&gt;adapter&lt;/strong&gt; for the custom &lt;strong&gt;infrastructure&lt;/strong&gt; of your specific &lt;strong&gt;company&lt;/strong&gt; can be a blocker to adopting these tools.&lt;/p&gt;

&lt;p&gt;Pulumi is a programmatic IaC tool where you can write your IaC code in JavaScript, Go etc.&lt;/p&gt;

&lt;p&gt;Crossplane is a new way of writing IaC. It is not a simple CLI tool like Terraform. It is a control plane that always observes the infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  CUE: A guardrail to write custom configurations
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qHGJdbKi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cuelang.org/images/cue.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qHGJdbKi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cuelang.org/images/cue.svg" alt="CUE lang" width="338" height="338"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Any IaC, deployed anywhere needs some basic principles.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Schemas: Schemas are required to validate the structure of your config file. If you are deploying to Cloud Foundry, you need to validate your &lt;code&gt;manifest.yml&lt;/code&gt; file; in K8s, the manifests you apply with &lt;code&gt;kubectl&lt;/code&gt;. All these tools put the onus of correctness on the developer without providing any structured knowledge within the code on how to write these YAML configurations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Type-checking and constraints: Restricting a config value to a certain range of values guards against very basic human errors. Rules such as incrementing the version for every release, or not setting an absurdly high storage value that inflates costs, should not be a checklist item but a validation step required in order to commit the code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Overlays or Schema Import: Have a way to extend the schema (either externally via schema import or internally within the config file using overlays.)&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Cue: History
&lt;/h2&gt;

&lt;p&gt;Cue was a project started by Marcel van Lohuizen, who worked primarily on Borg (the Kubernetes equivalent at Google) and on the Borg Configuration Language.&lt;/p&gt;

&lt;p&gt;Its goal is to provide a declarative approach to configuration, with rules applied to the data. For example, restricting an IP address to a specific range of subnets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9H0bFKQK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1702817564453/ccc3981a-ccfa-4066-ba3e-0f862aeddf39.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9H0bFKQK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1702817564453/ccc3981a-ccfa-4066-ba3e-0f862aeddf39.png" alt="" width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Example: Cue Vet
&lt;/h3&gt;

&lt;p&gt;Problem Statement: You want to make sure your S3 buckets are deployed to an EU region only (for GDPR reasons), you need to deploy to a specific folder &lt;code&gt;my-compliant-product-folder&lt;/code&gt;, and you have a YAML file which is custom to your infrastructure. What would you do?&lt;br&gt;&lt;br&gt;
Write Bash scripts? Too much headache; and different OSes (macOS, Windows, Linux, Alpine, FreeBSD) will make the script as fragile as an ice sheet.&lt;/p&gt;

&lt;p&gt;Write a Python validator? Maybe, but that is time spent on creating a framework/DSL in which you are not an expert.&lt;/p&gt;

&lt;p&gt;Enter Cue,&lt;/p&gt;

&lt;p&gt;Your &lt;code&gt;bucketSchema.cue&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;id: stringtype: "s3" | "minio"region: "es-east-2" | "eu-west-1"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your YAML:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;id: Rg98GoEHBd4type: s3region: us-east-2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your cue vet command&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  cue vet bucketSchema.cue bucketPolicy.yaml region: 2 errors in empty disjunction:region: conflicting values "eu-east-2" and "us-east-2": ./bucketPolicy.yaml:4:9 ./bucketSchema.cue:6:9region: conflicting values "eu-west-1" and "us-east-2": ./bucketPolicy.yaml:4:9 ./bucketSchema.cue:6:23
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Oh wait, you put a &lt;strong&gt;us&lt;/strong&gt; instead of an &lt;strong&gt;eu&lt;/strong&gt;, and you have deployed to the US, making you non-compliant!&lt;/p&gt;

&lt;p&gt;Good thing we ran that vet command. We can fix it by changing our YAML config.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;id: Rg98GoEHBd4type: s3region: eu-east-2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run this and you will see no more errors. This simple construct can save us engineers from Ctrl-C/Ctrl-V mistakes.&lt;/p&gt;

&lt;p&gt;For more info on Cue, I would suggest their documentation and this video:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://cuelang.org/docs/usecases/"&gt;Documentation&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=fR_yApIf6jU&amp;amp;t=1619s"&gt;Youtube video&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Its many other use cases include&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Data Validation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Policy checking&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cross-language test generation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Exporting schemas to OpenAPI definitions or even Protobufs!&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>Marveling on Metrics &amp; Observability with TSDB</title>
      <dc:creator>Ayman Patel</dc:creator>
      <pubDate>Sun, 03 Dec 2023 12:32:53 +0000</pubDate>
      <link>https://dev.to/aymanapatel/marveling-on-metrics-observibility-with-tsdb-276e</link>
      <guid>https://dev.to/aymanapatel/marveling-on-metrics-observibility-with-tsdb-276e</guid>
      <description>&lt;p&gt;Software engineers deal with 2 things. Code they write and metrics they monitor. Mostly we are ok or know what the code is and what it does. But from an observability point-of-view; we are winging it with our naive assumptions on the data that is emitted by our application. We don't even know how frequently our data is sampled, what a time-series database is, and what the querying engine used.&lt;/p&gt;

&lt;p&gt;We just dump data into these systems without considering what the data represents. We also do not consider what to query. Do I query the average response time (which is inherently misleading)?&lt;/p&gt;

&lt;h1&gt;
  
  
  Time Series 101
&lt;/h1&gt;

&lt;p&gt;So what is a time series anyway? Well from its &lt;a href="https://en.wikipedia.org/wiki/Time_series"&gt;Wikipedia definition&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A &lt;strong&gt;time series&lt;/strong&gt; is a series of &lt;a href="https://en.wikipedia.org/wiki/Data_point"&gt;data points&lt;/a&gt; indexed (or listed or graphed) in time order.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the software world, it is the collection of observable metrics that are taken at a defined regular interval.&lt;/p&gt;

&lt;p&gt;It is a kind of emitted event that has metadata info (usually added as key-value tags) along with a timestamp.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--event_name-|--------[tagKey:tagValue]----------------- |-value-|---timestamp------|cpu_load_usage[location:us-east-one;application:web-sever]-&amp;gt;75%-&amp;gt;2023-02-02 00:00:00
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Gotchas
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Gotcha 1: Average and Standard Deviation Suck!
&lt;/h3&gt;

&lt;p&gt;The first thing we look at when we see a metric is the average (and standard deviation). I have made the same mistake. Averages and standard deviations matter when the data follows a normal distribution (which you see in statistics). However, real-world metric data is anything but normally distributed. Also, an outlier always skews your metric to one side, giving a distorted picture (either a better-than-actual or worse-than-actual view of that metric).&lt;/p&gt;

&lt;p&gt;Helpful alternatives are the median and higher percentiles (the median is just p50 anyway!).&lt;/p&gt;
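A quick illustration with Python's stdlib statistics module of how one outlier wrecks the mean while the median barely moves. The latency numbers here are made up for the example:

```python
import statistics

# Response times in ms: mostly ~13ms, plus one 3-second timeout.
latencies = [12, 14, 13, 15, 11, 14, 13, 12, 14, 3000]

mean = statistics.mean(latencies)                     # dragged up by the outlier
p50 = statistics.median(latencies)                    # robust against it
p95 = sorted(latencies)[int(0.95 * len(latencies))]   # crude percentile pick

print(f"mean={mean:.1f}ms p50={p50}ms p95={p95}ms")
# mean=311.8ms p50=13.5ms p95=3000ms
```

The mean suggests a ~300ms service; the median tells the truth about the typical request, and p95 surfaces the outlier explicitly instead of smearing it.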

&lt;p&gt;That is why it is imperative to use a TSDB. These gather various metrics in various formats while keeping in mind the aggregation and sampling of data that is required to get a real-world picture of the deployed software system.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gotcha 2: Cardinality is where $$$ is to be saved
&lt;/h3&gt;

&lt;p&gt;Once you have a TSDB, development and product teams feel an urge to add custom tags/fields. Well, there ain't no free lunch in terms of storage cost and query execution time. Observability is both an art and a science of deciding which metrics to measure. This &lt;a href="https://www.youtube.com/watch?v=EmZ6wycniGs"&gt;talk&lt;/a&gt; gives a good framework for selecting metrics: sort them into buckets of heavily used, moderately used and least used, along with cardinality data. Removing rarely used metrics with high cardinality (many unique, dissimilar rows) is where the money is saved.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0zx8iCvW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1701605461434/c317d3ba-916c-4066-ae17-d463cf3af000.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0zx8iCvW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1701605461434/c317d3ba-916c-4066-ae17-d463cf3af000.png" alt="" width="605" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So let the person who owns the Monitoring tool sleep with peace!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Uba5r8bF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1701605088662/c5762ce2-8fe5-4ef0-bc78-fbb4d5fe5cda.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Uba5r8bF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1701605088662/c5762ce2-8fe5-4ef0-bc78-fbb4d5fe5cda.png" alt="" width="605" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Tools
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Graphite
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qcoKbqUW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://grafana.com/static/assets/img/v2/graphite.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qcoKbqUW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://grafana.com/static/assets/img/v2/graphite.png" alt="" width="800" height="273"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Carbon
&lt;/h3&gt;

&lt;p&gt;It is a &lt;a href="https://en.wikipedia.org/wiki/Daemon_(computing)"&gt;daemon (aka background process)&lt;/a&gt; that processes time-series data before sending it to Graphite's TSDB (Whisper or Ceres).&lt;/p&gt;

&lt;p&gt;There are 4 components of Carbon&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Carbon-relay&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;Used for replication and sharding of the data.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Grafana has its own implementation of carbon-relay called &lt;a href="https://github.com/grafana/carbon-relay-ng"&gt;carbon-relay-ng&lt;/a&gt;, which is blazingly fast and has built-in aggregation functionality such as cross-series &amp;amp; cross-time aggregation etc. Read more on the &lt;a href="https://grafana.com/docs/grafana-cloud/send-data/metrics/metrics-graphite/data-ingestion/"&gt;Grafana doc site&lt;/a&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Carbon-aggregator&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Used to aggregate metrics. Why aggregate? Too much data leads to a lot of noise, performance degradation, and storage costs; carbon-aggregator reduces the cardinality/granularity of data, which ultimately leads to better I/O performance.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Carbon-cache&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It takes data coming from the &lt;strong&gt;carbon-aggregator&lt;/strong&gt; and dumps it to Whisper(or Ceres) for persistent storage. It also loads some of the data into RAM for faster access.&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;Carbon-aggregator-cache&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It is a combination of both Carbon-cache and Carbon-aggregator in order to reduce the resource utilization of running both as separate daemons.&lt;/p&gt;

&lt;h3&gt;
  
  
  Database and Data Storage
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Whisper&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A fixed-size database which is similar to &lt;a href="https://joojscript.medium.com/what-you-know-about-round-robin-databases-e9a33c34277d"&gt;Round-Robin-database aka RRD&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Unlike RRD, Whisper allows backfilling data, which makes it possible to import historical data. For more differences, &lt;a href="https://graphite.readthedocs.io/en/latest/whisper.html#differences-between-whisper-and-rrd"&gt;read the doc&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;StatsD&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is not a database per se, but it is commonly used as a data collector in Graphite to send additional information to the graphite instance. Information such as &lt;a href="https://github.com/statsd/statsd/blob/master/docs/metric_types.md#statsd-metric-types"&gt;Gauges, Counters, Timing Summary Statistics, and Sets&lt;/a&gt; can be sent to Graphite.&lt;/p&gt;
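As a sketch, the StatsD line protocol is simple enough to emit by hand over UDP. The metric names here are made up, and the daemon address and port 8125 are the conventional defaults (an assumption; adjust for your setup):

```python
import socket

def statsd_packet(name, value, mtype):
    """Format one metric in the StatsD line protocol: name:value|type."""
    return f"{name}:{value}|{mtype}".encode()

# A counter increment, a gauge, and a timing (types from the StatsD docs).
packets = [
    statsd_packet("api.requests", 1, "c"),
    statsd_packet("queue.depth", 42, "g"),
    statsd_packet("db.query", 320, "ms"),
]

# Fire-and-forget over UDP; nothing blocks even if no daemon is listening.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
for p in packets:
    sock.sendto(p, ("127.0.0.1", 8125))
print(packets[0])  # b'api.requests:1|c'
```

UDP is what makes StatsD cheap to call from hot paths: a lost packet costs a data point, never a request.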

&lt;ul&gt;
&lt;li&gt;Query language&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Graphite provides &lt;a href="https://graphite.readthedocs.io/en/latest/functions.html"&gt;&lt;strong&gt;functions&lt;/strong&gt;&lt;/a&gt; to query, manipulate, and transform data from the stored time series data.&lt;/p&gt;

&lt;p&gt;List of &lt;strong&gt;functions&lt;/strong&gt; (Exhaustive list &lt;a href="https://graphite.readthedocs.io/en/latest/functions.html#functions"&gt;here&lt;/a&gt;):&lt;/p&gt;

&lt;p&gt;| Function | What | Example |&lt;br&gt;
| absolute | Apply the mathematical absolute function | &lt;code&gt;absolute(Server.instance01.threads.busy)&lt;/code&gt; |&lt;br&gt;
| add | Add constant to the metric | &lt;code&gt;add(Server.instance01.threads.busy, 10)&lt;/code&gt; |&lt;br&gt;
| aggregate | Aggregate series using a given function (avg, sum, min, max, diff, stddev, count, range, last, multiply) | &lt;code&gt;aggregate(host.cpu-[0-7].cpu-{user,system}.value, "sum")&lt;/code&gt; |&lt;/p&gt;
&lt;h3&gt;
  
  
  Data Ingestion
&lt;/h3&gt;

&lt;p&gt;Graphite supports 3 data ingestion methods&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Plaintext&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Pickle&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AMQP&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;| Method | Format | Usage | &lt;code&gt;carbon.conf&lt;/code&gt; &lt;a href="https://github.com/graphite-project/carbon/blob/master/conf/carbon.conf.example"&gt;Reference&lt;/a&gt; |&lt;br&gt;
| Plaintext | &lt;code&gt;&amp;lt;metric path&amp;gt; &amp;lt;metric value&amp;gt; &amp;lt;metric timestamp&amp;gt;&lt;/code&gt; | Quick and trivial monitoring | |&lt;br&gt;
| Pickle | &lt;code&gt;[(path, (timestamp, value)), ...]&lt;/code&gt; | Allows multi-level tuples | &lt;code&gt;DESTINATION_PROTOCOL&lt;/code&gt; and other &lt;code&gt;PICKLE_RECEIVER_*&lt;/code&gt; |&lt;br&gt;
| AMQP | | Reliable data transfer via AMQP broker. | &lt;code&gt;ENABLE_AMQP&lt;/code&gt; and other &lt;code&gt;AMQP_*&lt;/code&gt; configs |&lt;/p&gt;
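The plaintext method is trivial to script. Below is a minimal Python sketch that formats one line of the plaintext protocol; to actually ship it you would write the line to a TCP socket pointed at your carbon receiver (port 2003 is the conventional default, an assumption here):

```python
import time

def plaintext_metric(path, value, timestamp=None):
    """Build one line of Graphite's plaintext protocol: path value timestamp."""
    ts = int(time.time()) if timestamp is None else timestamp
    return f"{path} {value} {ts}\n"

line = plaintext_metric("stats.api-server.tracks.post.500", 93, 1455320690)
print(repr(line))  # 'stats.api-server.tracks.post.500 93 1455320690\n'

# Shipping it would look like (carbon host/port are deployment-specific):
#   sock = socket.create_connection(("carbon.example.com", 2003))
#   sock.sendall(line.encode())
```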
&lt;h3&gt;
  
  
  Data Model
&lt;/h3&gt;

&lt;p&gt;There are 2 data formats for Graphite&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Simple Graphite message format&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Formatmetric_path value timestamp\n// Examplestats.api-server.tracks.post.500 -&amp;gt; 93 1455320690
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Graphite with Tag Support&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A lot of TSDBs, such as InfluxDB and Prometheus, had tag support from the beginning; hence Graphite added tag support in v1.1 to identify different time series&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Formatmy.series;tag1=value1;tag2=value2timestamp\n// Examplecpu,cpu=cpu-total,dc=us-east-1,host=tars usage_idle=98.09,usage_user=0.89 1455320660004257758=&amp;gt;cpu.usage_user;cpu=cpu-total;dc=us-east-1;host=tars 0.89 1455320690cpu.usage_idle;cpu=cpu-total;dc=us-east-1;host=tars 98.09 1455320690
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  InfluxDB
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--x4GnUiKD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://c8.alamy.com/zooms/9/e6923294632c4ba09b246e959be26d08/2m7rdtd.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--x4GnUiKD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://c8.alamy.com/zooms/9/e6923294632c4ba09b246e959be26d08/2m7rdtd.jpg" alt="" width="640" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Database and Data Storage
&lt;/h3&gt;

&lt;p&gt;For the storage engine, InfluxDB uses a TSM tree (similar to a log-structured merge tree, or LSM) along with a Write-Ahead Log. LSM trees are used in Cassandra and HBase and have good read/write characteristics. InfluxDB takes a database approach, while Prometheus's data storage mostly takes an append-only approach similar to Kafka's.&lt;/p&gt;

&lt;p&gt;For a deeper explanation of the TSM tree, see &lt;a href="https://www.youtube.com/watch?v=J4syKnsqQmg"&gt;this video&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The WAL (similar to Postgres) is a log that ensures data reliability in case of an unexpected failure of the InfluxDB server.&lt;/p&gt;

&lt;p&gt;Unlike Graphite and Prometheus, which store data on disk as files, InfluxDB stores data in its own relational-type storage engine, which can be queried via InfluxQL or Flux (soon to be deprecated).&lt;/p&gt;

&lt;h3&gt;
  
  
  Query language
&lt;/h3&gt;

&lt;p&gt;There are 2 ways to query (one of them is being deprecated)&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://docs.influxdata.com/influxdb/v2/query-data/influxql/"&gt;InfluxQL (SQL-like)&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All InfluxDB functions can be &lt;a href="https://docs.influxdata.com/influxdb/v2/query-data/influxql/functions/"&gt;found here&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://docs.influxdata.com/influxdb/v1/flux/"&gt;Flux (In maintenance mode from InfluxDB v3)&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Example Flux query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   from(bucket:"telegraf/autogen") |&amp;gt; range(start:-1h) |&amp;gt; filter(fn:(r) =&amp;gt; r._measurement == "cpu" and r.cpu == "cpu-total" ) |&amp;gt; aggregateWindow(every: 1m, fn: mean)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Prometheus
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PBVoBwWo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:707/1%2ASW3lqH4V0J0suyzkZ6FcFQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PBVoBwWo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:707/1%2ASW3lqH4V0J0suyzkZ6FcFQ.png" alt="" width="707" height="411"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Database and Data Storage
&lt;/h3&gt;

&lt;p&gt;Unlike Graphite originally, Prometheus has always stored data with tags, in the following format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;api_server_http_requests_total{method="POST",handler="/tracks",status="500",instance="&amp;lt;sample1&amp;gt;"} -&amp;gt; 34api_server_http_requests_total{method="POST",handler="/tracks",status="500",instance="&amp;lt;sample2&amp;gt;"} -&amp;gt; 28api_server_http_requests_total{method="POST",handler="/tracks",status="500",instance="&amp;lt;sample3&amp;gt;"} -&amp;gt; 31
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
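For illustration, labelled samples like the ones above can be picked apart with a small parser. This is a hypothetical sketch following the arrow notation used in this post, not Prometheus's real exposition-format parser:

```python
import re

# metric_name{label="value",...} -> sample_value
LINE = re.compile(r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
                  r'\{(?P<labels>[^}]*)\}\s*->\s*(?P<value>\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')

def parse_sample(line):
    """Split a labelled sample into (metric name, label dict, float value)."""
    m = LINE.match(line)
    if m is None:
        raise ValueError(f"not a labelled sample: {line!r}")
    labels = dict(LABEL.findall(m.group("labels")))
    return m.group("name"), labels, float(m.group("value"))

name, labels, value = parse_sample(
    'api_server_http_requests_total{method="POST",handler="/tracks",status="500"} -> 34'
)
print(name, labels["handler"], value)  # api_server_http_requests_total /tracks 34.0
```

The label set is what makes each series unique: same metric name, different labels, different time series.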



&lt;p&gt;Prometheus lays out its on-disk data in the following structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./data 01BKGV7JBM69T2G1BGBGM6KB12 meta.json 01BKGTZQ1SYQJTR4PB43C8PD98 chunks 000001 tombstones index meta.json 01BKGTZQ1HHWHV8FBJXW1Y3W0K meta.json 01BKGV7JC0RY8A6MACW02A2PJD chunks 000001 tombstones index meta.json chunks_head 000001 wal 000000002 checkpoint.00000001 00000000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Data is backed by a WAL (just like in InfluxDB). As it is an append-only log, it is compacted every 2 hours to save space.&lt;/p&gt;

&lt;h3&gt;
  
  
  Query language
&lt;/h3&gt;

&lt;p&gt;PromQL takes a functional approach (like InfluxDB's Flux), with various functions for &lt;a href="https://prometheus.io/docs/prometheus/latest/querying/basics/#time-series-selectors"&gt;selectors&lt;/a&gt;, &lt;a href="https://prometheus.io/docs/prometheus/latest/querying/functions/"&gt;functions&lt;/a&gt; and &lt;a href="https://prometheus.io/docs/prometheus/latest/querying/operators/"&gt;operators&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;PromQL can also be used inside &lt;a href="https://docs.influxdata.com/flux/v0/stdlib/internal/promql/"&gt;InfluxDB via Flux&lt;/a&gt;. But with Flux not being supported in newer versions, there is a risk of those PromQL queries becoming invalid in the future.&lt;/p&gt;

&lt;h1&gt;
  
  
  Influx DB vs Prometheus
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Scaling
&lt;/h2&gt;

&lt;p&gt;As InfluxDB has a separate storage engine (TSM), it is horizontally scalable and decoupled from its storage. Prometheus, which follows file-based persistence, scales vertically, or via a master Prometheus server that pulls data from slave Prometheus servers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Push(Influx DB) vs Pull (Prometheus)
&lt;/h2&gt;

&lt;p&gt;Even though InfluxDB and Prometheus are both useful observability tools, their approaches to gathering data are opposites.&lt;/p&gt;

&lt;p&gt;InfluxDB uses a push model, wherein the source of data can be pushed to InfluxDB. Prometheus on the other hand uses a pull mechanism that can pull &lt;a href="https://prometheus.io/docs/instrumenting/exporters/"&gt;various sources&lt;/a&gt; including InfluxDB, Mongo, Postgres etc. InfluxDB can also pull from various sources; albeit not natively. You can use &lt;a href="https://www.influxdata.com/time-series-platform/telegraf/"&gt;Telegraf&lt;/a&gt; for this purpose.&lt;/p&gt;

&lt;p&gt;Use Prometheus when:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;You want a powerful query language&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You need high availability for your monitoring platform's uptime&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You want to pull from various sources natively&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Use InfluxDB when:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;You want to scale horizontally with separate database nodes (given you are OK with the eventual consistency that comes with horizontal scaling)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You need long-term data storage&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You have high-cardinality (a higher number of tags) requirements for metrics&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Prometheus is a very powerful tool, but it is neither here nor there. It is a great aggregation tool and should not be looked at as just a TSDB. A great resource on this can be found &lt;a href="https://iximiuz.com/en/posts/prometheus-is-not-a-tsdb"&gt;here, by Ivan Velichko&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>JSON - A rabbit hole of standards, implementations</title>
      <dc:creator>Ayman Patel</dc:creator>
      <pubDate>Sun, 08 Oct 2023 18:20:05 +0000</pubDate>
      <link>https://dev.to/aymanapatel/json-a-rabbit-hole-of-standards-implementations-4pdk</link>
      <guid>https://dev.to/aymanapatel/json-a-rabbit-hole-of-standards-implementations-4pdk</guid>
      <description>&lt;h1&gt;
  
  
  Why I got into this?
&lt;/h1&gt;

&lt;p&gt;When developing an application, we were implementing an API that had a number key type. Easy peasy lemon squeezy (🍋). But then we hit a really weird and frustrating bug. Apparently, the JavaScript implementation of &lt;code&gt;JSON.stringify&lt;/code&gt; is only exact up to 2^53 [&lt;a href="https://stackoverflow.com/a/34989371"&gt;Stackoverflow link&lt;/a&gt;], but we had to support ~20-digit numbers. The fix was to change the field to a string. We looked at the above link, made the fix and called it a day. Sidenote: Twitter also faced this issue 😅&lt;/p&gt;

&lt;p&gt;&lt;a href="https://developer.twitter.com/en/docs/twitter-ids"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Iy9wiUo---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1696758759901/9e1f7cce-06c7-4cdd-aea6-530918ffd54b.png" alt="Twitter API Doc for showing JSON String shenanigans" width="800" height="175"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Classic case of integer overflow, right? But this is weird, considering this is the 2020s and not the 1970s: we are no longer bound by 16-bit (or smaller) CPU architectures, and we have GBs (not KBs) of RAM. That is when I dug into the rabbit hole of JSON. I found a lot of things that surprised me: from different JSON formats, to different RFCs for the JSON specification itself, to whatever the frontend world (JS, looking at you) brings to the table to fix JSON issues.&lt;/p&gt;
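The root cause is IEEE-754, not overflow: a JavaScript number is a 64-bit double with a 53-bit mantissa. Python's floats are the same doubles, so the precision loss is easy to demonstrate (Python's ints, by contrast, are arbitrary precision):

```python
# A double has 53 mantissa bits, so integers above 2**53 lose exactness.
big = 2 ** 53

print(float(big) == float(big + 1))  # True -- two different ints, one double
print(float(big + 1))                # 9007199254740992.0, the +1 is gone

# The fix from the article: carry large IDs as strings end to end.
print(str(big + 1))                  # '9007199254740993'
```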

&lt;h1&gt;
  
  
  JSON Implementations
&lt;/h1&gt;

&lt;p&gt;This is a web of implementations. A very apt blog is &lt;a href="https://seriot.ch/projects/parsing_json.html"&gt;JSON Parsing is a minefield&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;| RFC Number | Date | Status |&lt;br&gt;
| &lt;a href="https://datatracker.ietf.org/doc/html/rfc4627"&gt;RFC 4627&lt;/a&gt; | July 2006 | Obsoleted by &lt;a href="https://datatracker.ietf.org/doc/html/rfc7158"&gt;RFC 7158&lt;/a&gt;, &lt;a href="https://datatracker.ietf.org/doc/html/rfc7159"&gt;RFC 7159&lt;/a&gt; |&lt;br&gt;
| &lt;a href="https://datatracker.ietf.org/doc/html/rfc7159#section-6"&gt;RFC 7159&lt;/a&gt; | March 2014 | Obsoleted by &lt;a href="https://datatracker.ietf.org/doc/html/rfc8259"&gt;RFC 8259&lt;/a&gt; |&lt;br&gt;
| &lt;a href="https://datatracker.ietf.org/doc/html/rfc8259"&gt;RFC 8259&lt;/a&gt; | December 2017 | Internet Standard |&lt;/p&gt;

&lt;p&gt;&lt;a href="https://datatracker.ietf.org/doc/html/rfc4627"&gt;RFC 4627&lt;/a&gt; can be looked at as a legacy RFC which has become obsoleted by #7159.&lt;/p&gt;

&lt;p&gt;The only difference between RFC 8259 and RFC 7159 is that &lt;a href="https://datatracker.ietf.org/doc/html/rfc8259#section-8.1"&gt;RFC 8259 has strict UTF-8 requirements&lt;/a&gt; while RFC 7159 can have UTF-16, UTF-32 in addition to UTF-8.&lt;/p&gt;

&lt;p&gt;My reaction:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VoivA0Gm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1696788999068/68eb33b6-db61-4a9d-9d32-1e8137d907e5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VoivA0Gm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1696788999068/68eb33b6-db61-4a9d-9d32-1e8137d907e5.png" alt="" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  JSON Formats
&lt;/h1&gt;

&lt;p&gt;Many formats augment JSON to create a kind of DSL.&lt;/p&gt;

&lt;p&gt;The most popular format is JWT (JSON Web Token, and the rest of the JSON cryptographic suite). It is popular enough that we can skip it for now, as other blogs explain it in great depth and clarity. There are others that we use in our daily lives without realising, and some formats that exist for niche use cases.&lt;/p&gt;

&lt;p&gt;This is the list I have come up with:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;JSONSchema&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GEOJSON&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;JSON-LD&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Vega&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;NDJSON&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;HAR&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;JWT&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I'll go through the first two as I think these are more important to know as a software engineer.&lt;/p&gt;
&lt;h3&gt;
  
  
  JSON Schema
&lt;/h3&gt;

&lt;p&gt;When the world migrated from XML to JSON, the web was fine with making JSON "Schema-less". But as applications grew, we wanted to bring back schema so that there is some sanity with strong types/schema.&lt;/p&gt;

&lt;p&gt;JSON Schema is for that purpose only. Bring back the schemas!&lt;/p&gt;
&lt;h4&gt;
  
  
  Building Blocks
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VrIIsgzX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1696788163939/95d8e78b-e0ce-49a9-a8ac-3868b3e9bbd0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VrIIsgzX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1696788163939/95d8e78b-e0ce-49a9-a8ac-3868b3e9bbd0.png" alt="" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Schemas&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Lets a JSON document declare which schema it adheres to. JSON Schema has gone through several revisions, such as draft-04, draft-05, and draft-07.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Types&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Bring back the types!&lt;/p&gt;

&lt;p&gt;Types can also be nested inside subschemas.&lt;/p&gt;

&lt;p&gt;List of types supported:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;string&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;number&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;integer&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;array&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;boolean&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;object&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;null&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Validations&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Rules for how JSON input is validated against the given schema. These come in several forms, such as type validations, conditionals, regex patterns, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JSON Schema Example
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{ // Can provide versions: draft-04, draft-05, draft-06, draft-07 "$schema": "http://json-schema.org/draft-04/schema#", "title": "User Profile", // Optional string for presenting to user "type": "object", // Validations "properties": { "userId": { "type": "integer", "description": "The unique identifier for a user" }, "firstName": { "type": "string", "description": "The user's first name" }, "lastName": { "type": "string", "description": "The user's last name" }, "email": { "type": "string", "format": "email", "description": "The user's email address" }, "phone": { "type": "string", "pattern": "^\\+?[0-9\\-\\s]+$", "description": "The user's phone number" }, "dateOfBirth": { "type": "string", "format": "date", "description": "The user's date of birth in YYYY-MM-DD format" } }, "required": ["userId", "firstName", "lastName", "email"]}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;Valid JSON Input
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{ "userId": 12345, "firstName": "John", "lastName": "Doe", "email": "johndoe@example.com", "phone": "+123-456-7890", "dateOfBirth": "1990-01-01"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;NOTE: You can try out the different JSON Schema draft versions &lt;a href="https://www.jsonschemavalidator.net/"&gt;here&lt;/a&gt;&lt;/p&gt;
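&lt;p&gt;To make the validation step concrete, here is a deliberately tiny validator sketch in plain JavaScript. It checks only &lt;code&gt;type&lt;/code&gt;, &lt;code&gt;required&lt;/code&gt; and per-property types; it is illustrative only, and a spec-compliant library such as Ajv implements the actual drafts:&lt;/p&gt;

```javascript
// Toy JSON Schema validator sketch: checks only "type", "required" and
// per-property "type". Illustrative only; use a real library (e.g. Ajv)
// for spec-compliant validation.
function jsType(v) {
  if (Array.isArray(v)) return "array";
  if (v === null) return "null";
  return typeof v;
}

function typeMatches(v, t) {
  // JSON Schema distinguishes "integer" from "number"; JavaScript does not.
  if (t === "integer") return Number.isInteger(v);
  return jsType(v) === t;
}

function validate(schema, data) {
  const errors = [];
  if (schema.type && !typeMatches(data, schema.type)) {
    errors.push("expected type " + schema.type + ", got " + jsType(data));
  }
  (schema.required || []).forEach(function (key) {
    if (!(key in data)) errors.push("missing required property: " + key);
  });
  Object.keys(schema.properties || {}).forEach(function (key) {
    const sub = schema.properties[key];
    if (key in data && sub.type && !typeMatches(data[key], sub.type)) {
      errors.push("property " + key + ": expected type " + sub.type);
    }
  });
  return errors;
}

const userSchema = {
  type: "object",
  properties: { userId: { type: "integer" }, email: { type: "string" } },
  required: ["userId", "email"],
};

console.log(validate(userSchema, { userId: 12345, email: "johndoe@example.com" })); // []
console.log(validate(userSchema, { userId: "oops" })); // two validation errors
```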

&lt;p&gt;&lt;code&gt;$ref&lt;/code&gt; can also be used to reference shared (and even recursive) subschemas. This lets you reuse definitions and follow the DRY (Don't Repeat Yourself) principle&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Example Schema:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{ "$schema": "http://json-schema.org/draft-07/schema#", "title": "Personnel Record", "type": "object", "properties": { "firstName": { "type": "string" }, "lastName": { "type": "string" }, "address": { "$ref": "#/definitions/address" } }, "required": ["firstName", "lastName", "address"], "definitions": { "address": { "type": "object", "properties": { "street": { "type": "string" }, "city": { "type": "string" }, "state": { "type": "string" }, "postalCode": { "type": "string" } }, "required": ["street", "city", "state", "postalCode"] } }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;Valid JSON Input
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{ "firstName": "John", "lastName": "Doe", "address": { "street": "123 Main St", "city": "Springfield", "state": "IL", "postalCode": "12345" }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  Usage
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;API validation; there are validator libraries in most languages that can help&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The OpenAPI (Swagger) specification, where it describes payloads and drives codegen-based API validation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;JSON validation for MongoDB collections (&lt;a href="https://www.mongodb.com/docs/manual/core/schema-validation/specify-json-schema/#std-label-schema-validation-json"&gt;Link&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  GeoJSON
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Wikipedia Link: &lt;a href="https://en.wikipedia.org/wiki/GeoJSON"&gt;https://en.wikipedia.org/wiki/GeoJSON&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;RFC Link: &lt;a href="https://datatracker.ietf.org/doc/html/rfc7946"&gt;https://datatracker.ietf.org/doc/html/rfc7946&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The GeoJSON data format is used in geographical applications, from geospatial analysis to web mapping. It is based on JSON and represents geographic features such as:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Points&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Line Strings&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Polygons&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Example JSON:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{ "type": "FeatureCollection", "features": [{ "type": "Feature", "geometry": { "type": "Point", "coordinates": [102.0, 0.5] }, "properties": { "prop0": "value0" } }, { "type": "Feature", "geometry": { "type": "LineString", "coordinates": [[102.0, 0.0], [103.0, 1.0], [104.0, 0.0], [105.0, 1.0] ] }, "properties": { "prop0": "value0", "prop1": 0.0 } }, { "type": "Feature", "geometry": { "type": "Polygon", "coordinates": [[ [100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [100.0, 1.0], [100.0, 0.0] ] ] }, "properties": { "prop0": "value0", "prop1": { "this": "that" } } } ]}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
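&lt;p&gt;Because GeoJSON is plain JSON, a FeatureCollection can be traversed with ordinary JavaScript; per RFC 7946, each coordinate pair is [longitude, latitude]. A trimmed-down sketch of walking the example above:&lt;/p&gt;

```javascript
// Walking a FeatureCollection in plain JavaScript: collect every geometry type
// and pull out the coordinates of each Point feature. The object below is a
// trimmed-down version of the example FeatureCollection.
const collection = {
  type: "FeatureCollection",
  features: [
    {
      type: "Feature",
      geometry: { type: "Point", coordinates: [102.0, 0.5] },
      properties: { prop0: "value0" },
    },
    {
      type: "Feature",
      geometry: {
        type: "LineString",
        coordinates: [[102.0, 0.0], [103.0, 1.0], [104.0, 0.0], [105.0, 1.0]],
      },
      properties: { prop0: "value0", prop1: 0.0 },
    },
  ],
};

const geometryTypes = collection.features.map(function (f) {
  return f.geometry.type;
});

const pointCoordinates = collection.features
  .filter(function (f) { return f.geometry.type === "Point"; })
  .map(function (f) { return f.geometry.coordinates; });

console.log(geometryTypes);     // [ 'Point', 'LineString' ]
console.log(pointCoordinates);  // [ [ 102, 0.5 ] ]
```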



&lt;h4&gt;
  
  
  Database Support
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Mongo provides &lt;a href="https://www.mongodb.com/docs/manual/geospatial-queries/"&gt;query operations on geospatial data&lt;/a&gt; (stored as GeoJSON objects in collections). They also provide a &lt;a href="https://www.mongodb.com/docs/manual/core/indexes/index-types/index-geospatial/#std-label-geospatial-index"&gt;geospatial index&lt;/a&gt; for better read performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;PostGIS, an extension of Postgres for storing geographic data, has a function to query geometric data as a GeoJSON collection. (&lt;a href="https://www.flother.is/til/postgis-geojson/"&gt;Reference&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Language Support
&lt;/h4&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;Lib Link&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Golang&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/paulmach/go.geojson"&gt;go.geojson&lt;/a&gt;, &lt;a href="https://github.com/paulmach/orb"&gt;orb&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Libraries for parsing GeoJSON and doing 2D geometric calculations, respectively&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Golang&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/tidwall/geojson"&gt;tidwall/geojson&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Used by &lt;a href="https://tile38.com/"&gt;tile38&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;&lt;a href="https://pypi.org/project/geojson/"&gt;geojson&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Python utilities for GeoJSON&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Java&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/opendatalab-de/geojson-jackson"&gt;GeoJSON Jackson&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Serialize and deserialize GeoJSON POJOs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Javascript&lt;/td&gt;
&lt;td&gt;&lt;a href="https://developers.arcgis.com/javascript/latest"&gt;ArcGIS API&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;ArcGIS API for creating web-based interactive workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C#&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/GeoJSON-Net/GeoJSON.Net"&gt;GeoJSON.Net&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;GeoJSON types and deserializers&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;h1&gt;
  
  
  Frontend solutions
&lt;/h1&gt;

&lt;p&gt;Things in the frontend world are always a bit awkward and weird. Maybe the hacking ethos still lives there: trying stuff out, making it work, and happily deviating from the rest of the world.&lt;/p&gt;

&lt;p&gt;There are a couple of solutions (npm libraries) that address some of JSON's shortcomings.&lt;/p&gt;

&lt;h2&gt;
  
  
  SuperJSON
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Drop-in replacement for&lt;/em&gt; &lt;code&gt;JSON.stringify&lt;/code&gt; and &lt;code&gt;JSON.parse&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Created by &lt;a href="https://github.com/blitz-js/blitz"&gt;Blitz.js&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Safely serialize/deserialize unsupported JSON types like Date, BigInts, Map, Set, URL, and Regular Expressions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Support Date and other Serialization for &lt;code&gt;getServerSideProps&lt;/code&gt; and &lt;code&gt;getInitialProps&lt;/code&gt; in Next.js&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://replit.com/@AymanArif1/Javascript-JSON-Enhancers#SuperJSONExample.js"&gt;https://replit.com/@AymanArif1/Javascript-JSON-Enhancers#SuperJSONExample.js&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Big Number Issue
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Default JSON&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The default JavaScript implementation of &lt;code&gt;JSON.parse&lt;/code&gt; cannot safely represent integers larger than 2^53 - 1 (&lt;code&gt;Number.MAX_SAFE_INTEGER&lt;/code&gt;), and &lt;code&gt;JSON.stringify&lt;/code&gt; cannot serialize &lt;code&gt;BigInt&lt;/code&gt; values at all.&lt;/p&gt;

&lt;p&gt;We get an exception when trying to serialize a &lt;code&gt;BigInt&lt;/code&gt;. (Can check Replit's CLI)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WGzJOvkZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1696786676496/94ad87c9-4709-42b2-91f8-90b930e96524.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WGzJOvkZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1696786676496/94ad87c9-4709-42b2-91f8-90b930e96524.png" alt="" width="601" height="58"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;SuperJSON&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;SuperJSON sidesteps this by serializing &lt;code&gt;BigInt&lt;/code&gt; (along with Date, Map, Set, etc.) together with type metadata, so values round-trip safely.&lt;/p&gt;

&lt;h3&gt;
  
  
  Users
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;tRPC: Data transformer when creating proxy client (&lt;a href="https://trpc.io/docs/server/data-transformers"&gt;Link&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Blitz.js (Superjson's creator)&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  JSON5
&lt;/h2&gt;

&lt;p&gt;JSON5 provides features such as comments in JSON. In the frontend world, JSON is commonly used for configuration, the most prominent example being &lt;code&gt;package.json&lt;/code&gt;. Unlike most languages, whose configuration formats support comments, &lt;code&gt;package.json&lt;/code&gt; does not, a design the Node.js creator has (after the fact) said he regrets.&lt;/p&gt;

&lt;p&gt;Other features of JSON5 include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Allowing single-quoted strings&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Strings that can span multiple lines&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Broader number support, including hexadecimal numbers, leading/trailing decimal points, and &lt;code&gt;Infinity&lt;/code&gt;/&lt;code&gt;NaN&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
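&lt;p&gt;A small JSON5 fragment illustrating these relaxations (the keys are made up for illustration):&lt;/p&gt;

```json5
{
  // comments are allowed
  name: 'my-app',        // unquoted keys, single-quoted strings
  threshold: .5,         // leading decimal point
  mask: 0xFF,            // hexadecimal numbers
  motto: 'spans \
multiple lines',         // multi-line strings via a trailing backslash
  tags: ['a', 'b',],     // trailing commas
}
```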

&lt;h3&gt;
  
  
  Users
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Babel&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Next.js&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Apple&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Bun&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;This is already a lot, so a part 2 will probably be required.&lt;/p&gt;

&lt;p&gt;I haven't touched JWT, JSON-LD (RDF), NDJSON, Vega, or Avro. One thing is for sure: these formats are a never-ending rabbit hole that requires digging till the end of time.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Client Hello - Security all the way down</title>
      <dc:creator>Ayman Patel</dc:creator>
      <pubDate>Sun, 01 Oct 2023 13:41:23 +0000</pubDate>
      <link>https://dev.to/aymanapatel/client-hello-security-all-the-way-down-48eo</link>
      <guid>https://dev.to/aymanapatel/client-hello-security-all-the-way-down-48eo</guid>
      <description>&lt;p&gt;Cloudflare has been working on a standard to encrypt the TLS connection at all levels. We will dig as to what stuff is still unencrypted, how this poses the threat and the solution &lt;strong&gt;ClientHello&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Pre-requisites
&lt;/h2&gt;

&lt;p&gt;Before diving into the protocol, it is imperative to understand the historical context of this protocol&lt;/p&gt;

&lt;p&gt;Existing TLS implementations contain a couple of parameters/extensions that are sent unencrypted and are used before the handshake process. These are Server Name Indication (SNI) and Application-Layer Protocol Negotiation (ALPN)&lt;/p&gt;

&lt;p&gt;While establishing a TLS connection between client and server, these parameters (SNI and ALPN) are not encrypted. It is only after these two params are exchanged that the client and server have enough security information (certificates, cryptographic keys, etc.) to initiate a secure connection&lt;/p&gt;

&lt;h3&gt;
  
  
  SNI
&lt;/h3&gt;

&lt;p&gt;So what is SNI?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;It is sent by the client at the very start of the TLS handshake, indicating which hostname it wants to connect to&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Since ISPs can see which hostname the client wants to connect to, it allows them to block websites easily.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  ALPN
&lt;/h3&gt;

&lt;p&gt;This provides information on what application-level protocol (HTTP version) should be used once the TLS connection has been established.&lt;/p&gt;

&lt;p&gt;Having unencrypted ALPN is an attack vector that can be used to downgrade HTTP version and thus negate all security improvements for every HTTP version upgrade.&lt;/p&gt;

&lt;p&gt;Since SNI and ALPN are considered metadata and can contain some sensitive information that is unencrypted, there has been a push to make this also encrypted. But this poses a chicken and egg problem, which is how can the client and server exchange encryption key before the handshaking process when the handshaking process itself is used for the same purpose?&lt;/p&gt;

&lt;p&gt;This concept of encrypting before handshake was not considered in earlier TLS versions. But after the Snowden leak and the uproar on global surveillance with only using metadata information, IETF started considering ways to encrypt this information as well.&lt;/p&gt;

&lt;h2&gt;
  
  
  TLS Prerequisite
&lt;/h2&gt;

&lt;p&gt;A prerequisite for any TLS connection is the TCP handshake; the TLS handshake then follows. As it is a topic in itself, you can read about it &lt;a href="https://www.cloudflare.com/en-in/learning/ssl/what-happens-in-a-tls-handshake/" rel="noopener noreferrer"&gt;here&lt;/a&gt;. In simple terms, the TLS handshake does these things&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Acknowledge both parties participating in TLS&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Verify each other&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Establish cryptographic algorithms they will use to securely connect and exchange information with each other.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Specify which TLS version to use (depending on what is supported by each party)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Authenticate the server via the server's public key and the CA's signature&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Basic TLS
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff4m152ypyjcjj2icp7e2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff4m152ypyjcjj2icp7e2.png" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above is the initial SYN/ACK exchange between client and server&lt;/p&gt;

&lt;p&gt;After that come the ClientHello and ServerHello (refer to the diagram below)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2mxe1msqutsyidy3xxzd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2mxe1msqutsyidy3xxzd.png" width="800" height="671"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ClientHello:&lt;/strong&gt; Cipher suites, the TLS versions the client supports, the client random, and SNI and ALPN info&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ServerHello:&lt;/strong&gt; The server's SSL certificate (with the CA who issued it), the server's chosen cipher suite, and the server random&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The rest of the steps can be viewed in the Cloudflare blog.&lt;/p&gt;


&lt;h2&gt;
  
  
  ESNI
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0aulc9a2ey9k13aa8s65.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0aulc9a2ey9k13aa8s65.png" width="800" height="730"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ESNI was the first attempt at encrypting SNI. It evolved into what we have today: &lt;strong&gt;Encrypted ClientHello (ECH)&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  ESNI issues
&lt;/h3&gt;

&lt;p&gt;For key distribution, ESNI used DNS: the ESNI public key was published as a plain-text, base64-encoded TXT record, which should raise serious flags.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ dig _esni.crypto.dance TXT +short"/wGuNThxACQAHQAgXzyda0XSJRQWzDG7lk/r01r1ZQy+MdNxKg/mAqSnt0EAAhMBAQQAAAAAX67XsAAAAABftsCwAAA="
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This negates the whole security benefit of encrypting SNI, as plain-text DNS is easily traceable by the ISP. One innovation that helped mitigate this was DNS-over-HTTPS (DoH).&lt;/p&gt;

&lt;h1&gt;
  
  
  Final puzzle Piece: ECH
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8mv6pmv0i3cxy1jjgd43.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8mv6pmv0i3cxy1jjgd43.png" width="800" height="730"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What ECH, as the final draft, proposes is to split &lt;strong&gt;ClientHello&lt;/strong&gt; into 2 parts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ClientHelloOuter:&lt;/strong&gt; This contains information that is not sensitive such as what cipher suite is used, TLS version and &lt;strong&gt;outer SNI.&lt;/strong&gt; This outer-SNI can show CDN-type hostnames which would be common for most sites using CDN for improving edge performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ClientHelloInner:&lt;/strong&gt; This includes the inner SNI, which carries the actual server name. It is encrypted with a public key provided by Cloudflare. This could make Cloudflare a point of vulnerability, since they can decrypt it with the private key they possess.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;ECH uses Hybrid Public Key Encryption (HPKE) for exchanging keys.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reference: C Structs used in ECH
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;opaque HpkePublicKey;uint16 HpkeKemId; // Defined in &amp;lt;https://www.ietf.org/archive/id/draft-irtf-cfrg-hpke-05.txt&amp;gt;uint16 HpkeKdfId; // Defined in &amp;lt;https://www.ietf.org/archive/id/draft-irtf-cfrg-hpke-05.txt&amp;gt;uint16 HpkeAeadId; // Defined in &amp;lt;https://www.ietf.org/archive/id/draft-irtf-cfrg-hpke-05.txt&amp;gt;struct { HpkeKdfId kdf_id; HpkeAeadId aead_id;} ECHCipherSuite;struct { opaque public_name; // Entity trusted to update encryption keys HpkePublicKey public_key; // Public key to encrypt `ClientHelloInner` HpkeKemId kem_id; // Identifying public key ECHCipherSuite cipher_suites; // Cipher suite for encrypting `ClientHelloInner` uint16 maximum_name_length; Extension extensions; // } ECHConfigContents;struct { uint16 version; // Version of ECH for which this config is used uint16 length; // Length of next field (in bytes) select (ECHConfig.version) { // ECHConfigContents string case 0xfe08: ECHConfigContents contents; } } ECHConfig;ECHConfig ECHConfigs;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  ECH in the real world
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Setting up a flag in your browser
&lt;/h2&gt;

&lt;p&gt;Check if your browser has ECH enabled using this &lt;a href="https://defo.ie/ech-check.php" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(In Chrome, this can be achieved by enabling the &lt;code&gt;encrypted-client-hello&lt;/code&gt; flag under &lt;code&gt;chrome://flags&lt;/code&gt;)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F85k5q0rwfprdqivbnwrw.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F85k5q0rwfprdqivbnwrw.jpeg" width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Before enabling ECH&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fno98fq685zzq1q1pcwwg.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fno98fq685zzq1q1pcwwg.jpeg" width="800" height="459"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;After enabling ECH&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0qokdbs3h74n9ple763c.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0qokdbs3h74n9ple763c.jpeg" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  No free lunch
&lt;/h1&gt;

&lt;p&gt;While this helps with securing the metadata exchanged before the TLS handshake, there are still some concerns/flaws:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Issues for corporate networks to implement firewall rules&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;An OpenSSL issue has been open for 5 years (&lt;a href="https://github.com/openssl/openssl/issues/7482" rel="noopener noreferrer"&gt;GitHub Link&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;There is still no official RFC number attached (despite Google bringing this to the browser and Cloudflare releasing their &lt;a href="https://blog.cloudflare.com/announcing-encrypted-client-hello/" rel="noopener noreferrer"&gt;last puzzle piece to the privacy link&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;RFCs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cloudflare blogs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;YouTube talks&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
  </channel>
</rss>
