<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: JOOJO DONTOH</title>
    <description>The latest articles on DEV Community by JOOJO DONTOH (@joojodontoh).</description>
    <link>https://dev.to/joojodontoh</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F238457%2Ff673858d-3131-485a-b2c2-f7366fbe00a1.PNG</url>
      <title>DEV Community: JOOJO DONTOH</title>
      <link>https://dev.to/joojodontoh</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/joojodontoh"/>
    <language>en</language>
    <item>
      <title>How My personal Agent Alfred talks to my vacuum</title>
      <dc:creator>JOOJO DONTOH</dc:creator>
      <pubDate>Mon, 23 Mar 2026 08:48:20 +0000</pubDate>
      <link>https://dev.to/joojodontoh/how-my-personal-agent-alfred-talks-to-my-vacuum-4iba</link>
      <guid>https://dev.to/joojodontoh/how-my-personal-agent-alfred-talks-to-my-vacuum-4iba</guid>
      <description>&lt;h2&gt;
  
  
  The Idea Came First
&lt;/h2&gt;

&lt;p&gt;Hi guys, I'm here again. After building Alfred &lt;a href="https://dev.to/joojodontoh/an-autonomous-agentic-ai-assistant-meet-alfred-and-this-is-how-i-built-him-4e7m"&gt;here&lt;/a&gt; I wanted him to be able to control Juliana, my Xiaomi X20+ robot vacuum. I did not know how that was going to work and I did not have a clear path forward, but the goal was clear: ask Alfred something and let him act on the tools available to him. So I started where most network curiosity starts. I ran an nmap scan on my LAN to see what Juliana was actually exposing to the network. All TCP ports were closed, but UDP port 54321 was open and listening. Bingo!&lt;/p&gt;

&lt;p&gt;If you have not read about Alfred yet, I wrote about how I built him &lt;a href="https://dev.to/joojodontoh/an-autonomous-agentic-ai-assistant-meet-alfred-and-this-is-how-i-built-him-4e7m"&gt;here&lt;/a&gt;. This feature is a direct result of what I called the Floodgate Effect in that article. The moment Alfred works in one area of your life, you immediately want to connect everything else. Juliana was next on the list. It's weird that I have names for things in my house, but yeah, that's me.&lt;/p&gt;

&lt;h2&gt;
  
  
  Speaking Juliana's Language
&lt;/h2&gt;

&lt;p&gt;Unfortunately, having a port is not the same as having a conversation. I needed to understand what protocol was running on that port and how to speak it. After some research I discovered that Xiaomi smart home devices communicate using MiIO, a proprietary protocol that runs over UDP port 54321. That was the first real breakthrough.&lt;/p&gt;

&lt;p&gt;MiIO follows a three-step flow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First you do a handshake. You send a 32 byte "hello" packet made entirely of 0xFF bytes and the device responds with its device ID and a timestamp called a stamp. &lt;/li&gt;
&lt;li&gt;Second you send a command. Every command is a 32 byte header combined with an AES-128-CBC encrypted JSON body. The encryption uses an MD5 derived key and IV generated from the device token. The header carries magic bytes, the packet length, the device ID, the stamp incremented by one, and an MD5 checksum. &lt;/li&gt;
&lt;li&gt;Third you receive a response in the exact same format and you decrypt it using the same token to get your JSON back.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The JSON itself follows a JSON-RPC style structure. A request looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"method"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"get_properties"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"params"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And a response comes back like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"result"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is a clean protocol once you understand it, but getting to that understanding took some work.&lt;/p&gt;
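&lt;p&gt;To make the three steps concrete, here is a minimal TypeScript sketch of the hello packet and of wrapping an encrypted command. The key and IV derivation (key = MD5(token), IV = MD5(key + token)) and the exact header layout are assumptions based on common MiIO write-ups, not something Juliana's firmware documents.&lt;/p&gt;

```typescript
// Hedged sketch of MiIO packet construction, assuming the key/IV derivation
// commonly documented for the protocol: key = MD5(token), iv = MD5(key + token).
import { createCipheriv, createHash } from "node:crypto";

const md5 = (data: Buffer): Buffer => createHash("md5").update(data).digest();

// The 32-byte handshake packet: magic bytes 0x21 0x31, a length of 0x0020,
// and the remaining bytes set to 0xFF.
function buildHelloPacket(): Buffer {
  const hello = Buffer.alloc(32, 0xff);
  hello.writeUInt16BE(0x2131, 0); // magic bytes
  hello.writeUInt16BE(0x0020, 2); // total length = 32
  return hello;
}

// Encrypt a JSON command body and prepend the 32-byte MiIO header.
function buildCommandPacket(
  token: Buffer, // 16-byte device token extracted from the Xiaomi cloud
  deviceId: number, // device ID returned by the handshake
  stamp: number, // handshake stamp, incremented for each command
  payload: object
): Buffer {
  const key = md5(token);
  const iv = md5(Buffer.concat([key, token]));
  const cipher = createCipheriv("aes-128-cbc", key, iv);
  const encrypted = Buffer.concat([
    cipher.update(JSON.stringify(payload), "utf8"),
    cipher.final(),
  ]);

  const header = Buffer.alloc(32);
  header.writeUInt16BE(0x2131, 0); // magic bytes
  header.writeUInt16BE(32 + encrypted.length, 2); // total packet length
  header.writeUInt32BE(0x00000000, 4); // reserved
  header.writeUInt32BE(deviceId, 8);
  header.writeUInt32BE(stamp, 12);
  token.copy(header, 16); // token sits in the checksum slot for hashing

  // Checksum: MD5 over the header (token in the checksum slot) plus the body.
  const checksum = md5(Buffer.concat([header, encrypted]));
  checksum.copy(header, 16);

  return Buffer.concat([header, encrypted]);
}
```

&lt;p&gt;Sending these buffers over UDP port 54321 and parsing the device ID and stamp out of the handshake response is the remaining plumbing.&lt;/p&gt;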

&lt;h2&gt;
  
  
  The Roadblocks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Getting the Token
&lt;/h3&gt;

&lt;p&gt;Understanding the protocol was one thing. Actually talking to Juliana required a device token. This is how Xiaomi handles authentication with the device, and getting that token turned out to be its own challenge. My first instinct was to extract it programmatically through the Xiaomi cloud, but that path was immediately blocked by captcha. I had to find another way.&lt;/p&gt;

&lt;p&gt;I ended up using the Xiaomi Cloud Tokens Extractor tool, which supports a QR code based login flow. That got me what I needed: the token, the device ID, and confirmation that Juliana was registered under the MiIO protocol. With the token and device ID in hand, I could finally start sending real commands.&lt;/p&gt;

&lt;h3&gt;
  
  
  The MiOT Property System
&lt;/h3&gt;

&lt;p&gt;Modern Xiaomi devices do not use simple named commands. They use MiOT, the Mi IoT specification, where every property and action is addressed by a service ID called siid and either a property ID called piid or an action ID called aiid. Reading the battery level means sending a get_properties request with siid 3 and piid 1. Setting the suction to strong means sending a set_properties request with siid 4, piid 4, and a value of 2. Starting a cleaning session means triggering an action with siid 2 and aiid 1. Every single thing the vacuum does maps to one of these numeric combinations.&lt;/p&gt;
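&lt;p&gt;As a hedged sketch, the siid/piid/aiid combinations above translate into JSON-RPC payloads roughly like this (the &lt;code&gt;did&lt;/code&gt; field and the exact parameter shapes are my assumption):&lt;/p&gt;

```typescript
// Illustrative MiOT payload builders. The numeric IDs are the ones quoted
// in the text for the X20+; other models map differently.
let nextId = 1;

function getProperties(props: { siid: number; piid: number }[]) {
  return {
    id: nextId++,
    method: "get_properties",
    params: props.map((p) => ({ did: `${p.siid}.${p.piid}`, siid: p.siid, piid: p.piid })),
  };
}

function setProperty(siid: number, piid: number, value: unknown) {
  return {
    id: nextId++,
    method: "set_properties",
    params: [{ did: `${siid}.${piid}`, siid, piid, value }],
  };
}

function callAction(siid: number, aiid: number, args: unknown[] = []) {
  return {
    id: nextId++,
    method: "action",
    params: { did: `${siid}.${aiid}`, siid, aiid, in: args },
  };
}

// Battery level: siid 3, piid 1
const battery = getProperties([{ siid: 3, piid: 1 }]);
// Strong suction: siid 4, piid 4, value 2
const suction = setProperty(4, 4, 2);
// Start cleaning: siid 2, aiid 1
const start = callAction(2, 1);
```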

&lt;h3&gt;
  
  
  Trial and Error All the Way Down
&lt;/h3&gt;

&lt;p&gt;There is no official documentation that maps these IDs for the X20+. I had to probe them manually by iterating through siid values from 1 to 30 and piid values from 1 to 30 and cross referencing what came back against what I could see in the Xiaomi app. It was tedious work. Some of the latest firmware implementations returned values that did not line up with what you would logically expect, which made matching them to real behaviour even harder.&lt;/p&gt;
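&lt;p&gt;The probing itself can be sketched as a simple enumeration. This is illustrative: it only builds the probe batches; the transport that sends them and records which IDs answer is left out:&lt;/p&gt;

```typescript
// Sketch of the brute-force probe described above: enumerate every
// siid/piid pair in a range and chunk them into get_properties batches.
type PropRef = { did: string; siid: number; piid: number };

function buildProbeBatches(maxSiid: number, maxPiid: number, batchSize: number): PropRef[][] {
  const all: PropRef[] = [];
  for (let siid = 1; siid <= maxSiid; siid++) {
    for (let piid = 1; piid <= maxPiid; piid++) {
      all.push({ did: `${siid}.${piid}`, siid, piid });
    }
  }
  // Chunk so each request stays small enough for the device to answer.
  const batches: PropRef[][] = [];
  for (let i = 0; i < all.length; i += batchSize) {
    batches.push(all.slice(i, i + batchSize));
  }
  return batches;
}
```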

&lt;h3&gt;
  
  
  Cleaning History Lives in the Cloud
&lt;/h3&gt;

&lt;p&gt;One limitation I hit fairly early was around maps and cleaning history. Both are stored in the Xiaomi cloud rather than on the device itself. The only thing you can reliably read from Juliana directly is the last cleaning session. I turned this into an advantage by tracking every session result myself and building a local history. That way Alfred always has full context about when Juliana last cleaned, how long it took, and how much area was covered, without depending on the cloud for any of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Value
&lt;/h2&gt;

&lt;h3&gt;
  
  
  It Is Not About the Commands
&lt;/h3&gt;

&lt;p&gt;Sending commands to Juliana via an API is not the interesting part. The Xiaomi app already does all of that. You can start a clean, dock the vacuum, set suction levels, and check the battery from your phone in seconds. Replicating that through code alone would not be worth writing about.&lt;/p&gt;

&lt;p&gt;The interesting part is what happens when those commands become tools that an AI agent can reason about and invoke on your behalf.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tools Alfred Can Use
&lt;/h3&gt;

&lt;p&gt;Every capability I built around Juliana was wrapped into a tool that Alfred can call. There are three of them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;executeVacuumStatus&lt;/strong&gt; reads the current state of the device including battery level, cleaning mode, error codes, and consumable wear levels.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;executeVacuumCommand&lt;/strong&gt; sends operational commands like start, stop, pause, resume, dock, and locate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;executeVacuumHistory&lt;/strong&gt; pulls from the locally tracked session log so Alfred can reason about when and where Juliana has cleaned.&lt;/li&gt;
&lt;/ul&gt;
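&lt;p&gt;As a rough sketch (the tool-definition shape here is illustrative, not Alfred's actual interface), wrapping those capabilities as agent tools can look like this:&lt;/p&gt;

```typescript
// Hypothetical tool registration: each vacuum capability becomes a named,
// described handler an agent can choose to invoke.
type Tool = {
  name: string;
  description: string;
  run: (args: Record<string, unknown>) => Promise<unknown>;
};

function makeVacuumTools(vacuum: {
  status(): Promise<object>;
  command(cmd: string): Promise<object>;
  history(limit: number): Promise<object[]>;
}): Tool[] {
  return [
    {
      name: "executeVacuumStatus",
      description: "Read battery level, cleaning mode, error codes and consumable wear.",
      run: async () => vacuum.status(),
    },
    {
      name: "executeVacuumCommand",
      description: "Send start, stop, pause, resume, dock or locate.",
      run: async (a) => vacuum.command(String(a.command)),
    },
    {
      name: "executeVacuumHistory",
      description: "Read the locally tracked cleaning session log.",
      run: async (a) => vacuum.history(Number(a.limit ?? 10)),
    },
  ];
}
```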

&lt;h3&gt;
  
  
  What This Actually Looks Like
&lt;/h3&gt;

&lt;p&gt;With those tools in place, my conversations with Alfred around the vacuum feel completely natural. I can ask things like:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Status and maintenance&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"What is Juliana's battery level?"&lt;/li&gt;
&lt;li&gt;"Does Juliana need any maintenance?"&lt;/li&gt;
&lt;li&gt;"How are Juliana's consumables holding up?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Control&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Start cleaning the living room"&lt;/li&gt;
&lt;li&gt;"Clean the master bedroom and the office"&lt;/li&gt;
&lt;li&gt;"Send Juliana home"&lt;/li&gt;
&lt;li&gt;"Find Juliana" (this makes her announce her location out loud)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;History&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"When did Juliana last clean?"&lt;/li&gt;
&lt;li&gt;"How much has Juliana cleaned today?"&lt;/li&gt;
&lt;li&gt;"Show me Juliana's cleaning history"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Combined and conversational&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"How is everything at home?"&lt;/li&gt;
&lt;li&gt;"Start cleaning the guest bedroom and let me know when it is done"&lt;/li&gt;
&lt;li&gt;"Is Juliana's mop pad due for replacement?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Alfred does not just relay commands. He reads the context, decides which tools to call, and responds with a full picture. That is the difference between a smart home app and an agent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftv7psx6r9pgbyn0csyaf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftv7psx6r9pgbyn0csyaf.png" alt=" " width="800" height="959"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  From a Single Open Port to a Talking Vacuum
&lt;/h2&gt;

&lt;p&gt;What started as curiosity about controlling my robot vacuum and an open UDP port turned into a fully functional integration between Alfred and Juliana. Along the way there were real obstacles and each one had to be solved before the next step was even possible.&lt;/p&gt;

&lt;p&gt;Getting the device token could not be automated due to captcha blocks on the Xiaomi cloud, so I used the Xiaomi Cloud Tokens Extractor with QR code login instead. The rest of the obstacles, and their fixes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;With no official documentation for the property IDs on the X20+, I probed siid 1 through 30 and piid 1 through 30 manually and matched results against the Xiaomi app.&lt;/li&gt;
&lt;li&gt;UDP has no built-in request-response correlation, so I built a command serialization queue that keeps exactly one command in flight at a time.&lt;/li&gt;
&lt;li&gt;Hello responses were mixing with command responses, so I added an &lt;code&gt;isHelloResponse()&lt;/code&gt; check to skip those 32 byte packets.&lt;/li&gt;
&lt;li&gt;Timeouts were killing subsequent commands, so I reset the stamp to zero on timeout to force a fresh handshake.&lt;/li&gt;
&lt;li&gt;Requesting too many properties at once caused failures, so I batched them into groups of ten.&lt;/li&gt;
&lt;li&gt;The locate command was not triggering a beep until I found the right combination: siid 7, aiid 1, and piid 1 set to 1.&lt;/li&gt;
&lt;li&gt;Consumable values were coming back wrong because the correct properties live at siid 9, 10, 11, and 18 rather than siid 4, where I originally looked.&lt;/li&gt;
&lt;/ul&gt;
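&lt;p&gt;The command serialization queue is worth a closer look, since UDP gives you no request-response correlation. A minimal sketch, assuming each command is an async task:&lt;/p&gt;

```typescript
// Sketch of a serialization queue: each command chains onto the previous
// one, so only a single UDP request/response exchange is ever in flight.
class CommandQueue {
  private tail: Promise<unknown> = Promise.resolve();

  enqueue<T>(task: () => Promise<T>): Promise<T> {
    const next = this.tail.then(() => task());
    // A failed command must not wedge the queue for later commands.
    this.tail = next.then(() => undefined, () => undefined);
    return next;
  }
}
```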

&lt;p&gt;The one limitation I could not fully solve is map data. The X20+ stores its maps in the Xiaomi cloud and not on the device itself. Valetudo would solve this, but it requires flashing the robot's firmware entirely, which voids my warranty.&lt;/p&gt;

&lt;p&gt;Every solution in this list is a direct result of building things the right way rather than the fast way. The command queue, the constants file, the local history store, all of it exists because the goal was never just to control a vacuum. The goal was to give Alfred enough context and capability to reason about the home the same way he reasons about a calendar or an inbox. Juliana is now part of that picture.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>iot</category>
      <category>showdev</category>
    </item>
    <item>
      <title>An Autonomous, Agentic, AI Assistant, Meet Alfred and this is how I built him.</title>
      <dc:creator>JOOJO DONTOH</dc:creator>
      <pubDate>Mon, 16 Mar 2026 15:39:44 +0000</pubDate>
      <link>https://dev.to/joojodontoh/an-autonomous-agentic-ai-assistant-meet-alfred-and-this-is-how-i-built-him-4e7m</link>
      <guid>https://dev.to/joojodontoh/an-autonomous-agentic-ai-assistant-meet-alfred-and-this-is-how-i-built-him-4e7m</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;My people, it's me again. This time I have built something fun but mostly useful. I gave building an autonomous agent a chance and it's turning out well. I know it's a cliché but his name is Alfred. The thing is, AI agents are no longer a novelty. What started out as simple chatbots chaining a few prompts together has evolved into something far more capable: systems that can "reason" (I know it's just a lot of math and not actual reasoning), plan, use tools, and execute multi-step workflows with minimal human intervention. Agentic flows, where an AI iteratively breaks down a goal, takes actions, evaluates results, and course-corrects, are quickly becoming the backbone of serious productivity tooling.&lt;/p&gt;

&lt;p&gt;But not all models are created equal. The market is crowded. GPT-4o, Gemini, Mistral, Llama, and DeepSeek all have their own strengths, trade-offs, and devoted user bases. Picking the right model for a given task has become something of an art form in itself, especially because the benchmarks keep getting blurrier.&lt;/p&gt;

&lt;p&gt;For me, that choice keeps coming back to Anthropic's Claude and specifically to Opus. As an engineer, I spend a significant portion of my day thinking in systems: abstractions, edge cases, failure modes and architecture trade-offs. Opus is the only model that consistently feels like it's doing the same while cleverly grabbing my immediate system context. Where other models can produce code that technically compiles but misses the intent entirely, Opus tends to understand the why behind what I'm building, not just the what. That distinction, subtle as it sounds, makes an enormous practical difference when you're deep in a complex codebase. Opus has downsides, especially because sometimes it takes shortcuts without adhering to the principles you intended.&lt;/p&gt;

&lt;p&gt;What sealed it for me, though, was the CLI experience. Claude's command-line interface is genuinely pleasant to use: fast, composable, and unobtrusive in a way that fits naturally into my existing workflow. It doesn't feel like a detour. It feels like a tool that belongs in my terminal alongside the rest of my stack.&lt;/p&gt;

&lt;p&gt;In this article I'm going to talk about why I needed Alfred, the problem he solves for me, how I built him, and how I keep improving him in this ever-changing landscape where engineering meets productivity.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Monday Morning Problem Every Developer Knows
&lt;/h2&gt;

&lt;p&gt;It is Monday, 8:30 AM. Before I have written a single line of code, I already have a full-time job just figuring out where to start.&lt;/p&gt;

&lt;p&gt;Over the weekend, 47 new Gmail messages came in. Some are spam. Some are newsletters I never unsubscribed from. But buried somewhere in that pile is an escalation that needs urgent attention and a teammate asking for a code review. I do not know which email it is yet. I have to dig for it.&lt;/p&gt;

&lt;p&gt;That is just Gmail. I also have 12 Outlook emails from work: meeting updates, an HR policy change, and my manager asking about feature progress. Then there are 8 Teams messages spread across 3 different channels covering a production incident from Saturday, a design review thread, and standup notes. On top of that, 3 pull requests were opened against repos I review, and 2 calendar conflicts appeared for Tuesday that I need to sort out before the day gets going.&lt;/p&gt;

&lt;p&gt;None of these systems talk to each other. So my morning routine becomes a manual context-switching exercise. I open Gmail, scan subject lines, try to mentally rank urgency. Then I switch to Outlook and do the same. Then Teams. Then Azure DevOps. By the time I have a rough picture of what actually needs my attention, 45 to 60 minutes have passed. And that client escalation? Still buried under newsletters when I finally find it.&lt;/p&gt;

&lt;p&gt;The frustrating part is that most of that time is not real work. It is just triage. It is the overhead that comes before the actual job even starts. The other option is to close everything and wait for someone to walk to my table. Lmao I do this all the time.&lt;/p&gt;

&lt;p&gt;But well, this is the problem I built Alfred to solve.&lt;/p&gt;

&lt;h2&gt;
  
  
  What do I want from Alfred?
&lt;/h2&gt;

&lt;p&gt;Unification! Alfred is a personal AI agent built around a single idea: collapsing the chaos of my digital workday into one intelligent, unified system. It continuously polls Gmail at configurable intervals and receives Outlook emails and Microsoft Teams messages via Power Automate webhooks, storing everything locally in SQLite so that regardless of the source, nothing slips through the cracks.&lt;br&gt;
Every incoming email is then put through an AI classification pipeline that assigns it one of six categories (Urgent, Personal, Work, Newsletter, Transactional, or Spam), gives it a priority level from 1 to 5, generates a human-readable summary, extracts action items with optional due dates, and flags whether a follow-up is needed.&lt;br&gt;
From there, a configurable rules engine evaluates each classified email and proposes an appropriate action: archive it, delete it, forward it, draft a reply, or surface it for attention via a notify action with quick-action buttons.&lt;br&gt;
Destructive actions like deletions, sends, and PR approvals wait behind an explicit approval gate in the dashboard, while non-destructive ones like classification and drafting execute automatically.&lt;br&gt;
Every action is tracked through a full lifecycle from proposed to executed, with timestamps, rollback data, and execution results all stored in an append-only audit log.&lt;/p&gt;
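&lt;p&gt;A sketch of what a classification record might look like in TypeScript. The six categories and the 1 to 5 priority come straight from the pipeline above; the field names and the validation are my assumptions:&lt;/p&gt;

```typescript
// Illustrative shape for the classifier's output, plus a guard that
// validates an untyped LLM response before trusting it downstream.
type EmailCategory = "Urgent" | "Personal" | "Work" | "Newsletter" | "Transactional" | "Spam";

interface Classification {
  category: EmailCategory;
  priority: 1 | 2 | 3 | 4 | 5;
  summary: string;
  actionItems: { text: string; dueDate?: string }[];
  needsFollowUp: boolean;
}

const CATEGORIES: EmailCategory[] = ["Urgent", "Personal", "Work", "Newsletter", "Transactional", "Spam"];

function parseClassification(raw: any): Classification {
  if (!CATEGORIES.includes(raw.category)) {
    throw new Error(`unknown category: ${raw.category}`);
  }
  const priority = Number(raw.priority);
  if (!Number.isInteger(priority) || priority < 1 || priority > 5) {
    throw new Error("priority out of range");
  }
  return {
    category: raw.category,
    priority: priority as Classification["priority"],
    summary: String(raw.summary ?? ""),
    actionItems: Array.isArray(raw.actionItems) ? raw.actionItems : [],
    needsFollowUp: Boolean(raw.needsFollowUp),
  };
}
```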

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F95yuzxjs049aes4in1j0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F95yuzxjs049aes4in1j0.png" alt="Email flow" width="800" height="865"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Beyond email, Alfred integrates deeply with the rest of my work toolchain. It connects to Google Calendar and Outlook Calendar for listing, creating, updating, and searching events, and handles Azure DevOps for querying and managing work items, approving pull requests, tracking pipeline runs, and browsing repositories. When a pull request is opened, a dedicated webhook handler automatically fetches the PR details, checks pipeline status, attempts to link related work items from branch name patterns, generates an LLM summary, and proposes approval or work item creation actions accordingly. Microsoft Teams is covered too, with channel message search and webhook-based ingestion keeping Alfred aware of conversations happening outside of email. Tying everything together is a conversational chat interface powered by an agentic loop that extracts intents from natural language, executes them across services, and returns structured, context-aware responses.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vd5d31qjizviyka1ty3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vd5d31qjizviyka1ty3.png" alt="devops" width="800" height="708"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Let's look at some of Alfred's core flows in detail
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Email Polling and Synchronization
&lt;/h3&gt;

&lt;p&gt;Alfred's background worker is built around an &lt;code&gt;AgentLoop&lt;/code&gt; flow. When the server starts, the &lt;code&gt;AgentLoop&lt;/code&gt; runs an initial poll immediately, then sets a repeating &lt;code&gt;setInterval&lt;/code&gt; timer at a configurable cadence. Each tick calls &lt;code&gt;emailPort.listMessages("in:inbox", 50)&lt;/code&gt; to fetch up to 50 messages from Gmail via the Gmail API. 50 is a reasonable number for my personal workflow.&lt;/p&gt;

&lt;p&gt;To avoid reprocessing emails Alfred has already seen, the loop maintains an in-memory string set of message IDs. Every polled message is checked against this set, and only genuinely new messages pass through:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;newMessages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;seenIds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;has&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;msg&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;newMessages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;seenIds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;New messages are immediately persisted to SQLite through &lt;code&gt;EmailRepo.upsert()&lt;/code&gt;. The upsert uses SQLite's &lt;code&gt;INSERT ... ON CONFLICT(id) DO UPDATE&lt;/code&gt; pattern, which means if Alfred encounters the same email ID twice (for example after a server restart), it updates the existing row rather than creating a duplicate. The repository stores the full email body, sender, recipients, labels, attachments as serialized JSON, and a &lt;code&gt;source&lt;/code&gt; field that distinguishes Gmail emails from Outlook emails. I cover the exact upsert schema in the Data Integrity section.&lt;/p&gt;

&lt;p&gt;Before sending any email to the classifier, the loop applies a set of skip rules. Social media notifications from Facebook, Instagram, Twitter, TikTok, Reddit, Discord, and similar platforms are matched by regex against the sender address. Emails carrying Gmail's &lt;code&gt;CATEGORY_PROMOTIONS&lt;/code&gt; or &lt;code&gt;CATEGORY_SOCIAL&lt;/code&gt; labels are also skipped. LinkedIn is explicitly exempted from this filter because its emails often contain actionable professional content. This pre-filtering avoids burning LLM API calls on emails that would reliably classify as low priority anyway.&lt;/p&gt;
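&lt;p&gt;A minimal sketch of those skip rules, with the sender regex and label set as illustrative stand-ins for the real lists:&lt;/p&gt;

```typescript
// Pre-filter sketch: social senders skipped by regex, promo/social Gmail
// labels skipped, and LinkedIn explicitly exempted.
const SOCIAL_SENDER = /(facebook|instagram|twitter|tiktok|reddit|discord)\./i;
const SKIP_LABELS = new Set(["CATEGORY_PROMOTIONS", "CATEGORY_SOCIAL"]);

function shouldSkipClassification(email: { from: string; labels: string[] }): boolean {
  // LinkedIn mail often carries actionable professional content.
  if (/linkedin\./i.test(email.from)) return false;
  if (SOCIAL_SENDER.test(email.from)) return true;
  return email.labels.some((l) => SKIP_LABELS.has(l));
}
```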

&lt;p&gt;The loop also checks whether each email already has a classification in the database before sending it to the classifier. If a record exists, the email is skipped entirely. This means restarting the server does not trigger re-classification of previously processed emails. I wrote it this way to ensure minimum cost and idempotency.&lt;/p&gt;

&lt;p&gt;When the classifier encounters a fatal error such as an expired API key, exhausted credit balance, or a 429 rate limit response, the loop enters a paused state rather than crashing or retrying in a tight loop. It sets &lt;code&gt;classifierPaused = true&lt;/code&gt; and stops classifying. This is sort of a circuit breaker. On subsequent polls, it still persists new emails to the database so no mail is lost, but it attempts a single test classification to check whether the service has recovered. Once the test succeeds, classification resumes automatically. Error messages are also deduplicated so the same error is only logged once regardless of how many polls occur while paused.&lt;/p&gt;
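&lt;p&gt;The circuit-breaker behaviour can be sketched like this, with &lt;code&gt;classify&lt;/code&gt; standing in for the real LLM call:&lt;/p&gt;

```typescript
// Sketch of the classifier circuit breaker: on a fatal error the loop
// pauses classification (mail is still persisted elsewhere), probes with
// one call per poll, and resumes once a probe succeeds. Errors are
// deduplicated so each distinct message is logged once.
class ClassifierBreaker {
  private paused = false;
  private loggedErrors = new Set<string>();

  constructor(private classify: (text: string) => Promise<string>) {}

  async classifyOrSkip(text: string): Promise<string | null> {
    if (this.paused) {
      // Single probe per poll: success flips the breaker closed again.
      try {
        const result = await this.classify(text);
        this.paused = false;
        return result;
      } catch {
        return null; // still down; email stays persisted but unclassified
      }
    }
    try {
      return await this.classify(text);
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      if (!this.loggedErrors.has(message)) {
        this.loggedErrors.add(message); // log each distinct error once
        console.error(`classifier paused: ${message}`);
      }
      this.paused = true;
      return null;
    }
  }
}
```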

&lt;p&gt;For Outlook, Alfred does not poll directly. Instead, an adapter calls a Power Automate flow that returns Outlook messages. A dedicated payload mapper normalizes Microsoft field names, timestamp formats, and nested structures into the same &lt;code&gt;EmailMessage&lt;/code&gt; domain object that Gmail produces. This means the rest of the pipeline, including classification, action rules, and chat, works identically regardless of whether an email originated from Gmail or Outlook. I wrote it this way so that I can later extend email providers by just adding a normalization mapper and then it should be plug and play.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fafkk78ad1zi7gyavlitq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fafkk78ad1zi7gyavlitq.png" alt=" " width="800" height="1175"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Action Proposal, Approval, and Execution
&lt;/h3&gt;

&lt;p&gt;Actions in Alfred follow an event-sourced lifecycle. Every state transition is recorded as an append-only entry in an action-log SQLite table. No rows are ever updated in place or deleted. The lifecycle flows through a fixed set of &lt;code&gt;ActionStatus&lt;/code&gt; states: &lt;code&gt;Proposed&lt;/code&gt; → &lt;code&gt;Approved&lt;/code&gt; → &lt;code&gt;Executed&lt;/code&gt;, or alternatively &lt;code&gt;Rejected&lt;/code&gt; or &lt;code&gt;RolledBack&lt;/code&gt;. This is purely for auditing, so that I can track the agent's autonomous actions.&lt;/p&gt;

&lt;h4&gt;
  
  
  Proposal
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;ProposeAction&lt;/code&gt; use case starts with an idempotency check. It queries the action log for any existing entry with the same &lt;code&gt;resourceId&lt;/code&gt; and &lt;code&gt;type&lt;/code&gt;. If one already exists, it returns &lt;code&gt;null&lt;/code&gt; and stops. Otherwise, it appends a new entry with &lt;code&gt;status: Proposed&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;From there, the action's &lt;code&gt;RiskLevel&lt;/code&gt; determines what happens next. Low-risk actions like &lt;code&gt;Classify&lt;/code&gt;, &lt;code&gt;Draft&lt;/code&gt;, and &lt;code&gt;Notify&lt;/code&gt; carry &lt;code&gt;RiskLevel.Auto&lt;/code&gt; and execute immediately without my input. High-risk actions like &lt;code&gt;Archive&lt;/code&gt;, &lt;code&gt;Delete&lt;/code&gt;, &lt;code&gt;Send&lt;/code&gt;, and &lt;code&gt;Forward&lt;/code&gt; carry &lt;code&gt;RiskLevel.ApprovalRequired&lt;/code&gt; and sit in the proposed state until I act on them from the dashboard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;risk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ACTION_RISK_LEVELS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;risk&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;RiskLevel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Auto&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;strategy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;strategies&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;source&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;strategy&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;canExecute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;resultData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;strategy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="kd"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;resourceId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;payload&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;actionLog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;updateStatus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;actionId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ActionStatus&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Executed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the action produces result data such as a created draft ID or classification details, that data is stored alongside the log entry via &lt;code&gt;updateResultData()&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Approval and Execution
&lt;/h4&gt;

&lt;p&gt;When I click Approve in the dashboard, the &lt;code&gt;ApproveAction&lt;/code&gt; use case first updates the log entry's status to &lt;code&gt;Approved&lt;/code&gt; with a timestamp, then immediately attempts execution. It finds the correct &lt;code&gt;ActionExecutionStrategy&lt;/code&gt; by matching the action's &lt;code&gt;source&lt;/code&gt; field. Three strategies exist: &lt;code&gt;GmailActionStrategy&lt;/code&gt; handles archive, delete, send, and draft operations via the Gmail API; &lt;code&gt;OutlookActionStrategy&lt;/code&gt; handles equivalent operations through Power Automate; and &lt;code&gt;DevOpsActionStrategy&lt;/code&gt; handles work item creation and PR approval via the Azure DevOps REST API. This design follows the open-closed principle: new strategies can be registered without modifying the existing dispatch logic.&lt;/p&gt;

&lt;p&gt;Each strategy declares which action types it supports through a &lt;code&gt;canExecute()&lt;/code&gt; method. If a strategy exists but cannot execute the specific action type, the action is marked as executed without performing any real mutation. If execution succeeds, the status moves to &lt;code&gt;Executed&lt;/code&gt;. If it fails, the error is returned to the caller but the action remains in &lt;code&gt;Approved&lt;/code&gt; state so the user can retry without losing the approval.&lt;/p&gt;
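&lt;p&gt;As a rough sketch of that contract (the interface shape and field names beyond &lt;code&gt;source&lt;/code&gt;, &lt;code&gt;canExecute()&lt;/code&gt;, and &lt;code&gt;execute()&lt;/code&gt; are my assumptions here, not Alfred's actual code), a strategy might look like this:&lt;/p&gt;

```typescript
// Illustrative sketch of the strategy contract described above. Only the
// names ActionExecutionStrategy, source, canExecute, and execute come from
// the article; everything else is an assumption.
type ActionType = "archive" | "delete" | "send" | "draft" | "notify";

interface ActionExecutionStrategy {
  source: string; // e.g. "gmail" | "outlook" | "devops"
  canExecute(type: ActionType): boolean;
  execute(action: {
    type: ActionType;
    resourceId: string;
    payload?: unknown;
  }): Promise<unknown>;
}

// A Gmail-flavoured strategy that only claims the types it supports.
const gmailStrategy: ActionExecutionStrategy = {
  source: "gmail",
  canExecute: (type) => ["archive", "delete", "send", "draft"].includes(type),
  async execute(action) {
    // The real implementation would call the Gmail API here.
    return { ok: true, type: action.type };
  },
};

// Open-closed dispatch: adding a provider means registering a new strategy,
// not editing this lookup.
const strategies: ActionExecutionStrategy[] = [gmailStrategy];
const findStrategy = (source: string) =>
  strategies.find((s) => s.source === source);
```

&lt;p&gt;Adding an Outlook or DevOps provider then means pushing another strategy object into the list, which is the open-closed property the dispatch code above relies on.&lt;/p&gt;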

&lt;p&gt;The &lt;code&gt;Notify&lt;/code&gt; action type is intentionally a no-op at the execution level. It exists so the rules engine can propose surfacing an email to the user without triggering any mutation on the mailbox. The notification itself is handled by the push notification system, not the action executor.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi0hlsxet1ww864ovtyvi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi0hlsxet1ww864ovtyvi.png" alt=" " width="474" height="586"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Chat Interface (Intent and Tool Use Modes)
&lt;/h3&gt;

&lt;p&gt;Alfred's chat is the primary way I interact with my workspace data through natural language. I designed it to support two distinct modes of operation: an intent extraction mode (the default) and a &lt;code&gt;tool_use&lt;/code&gt; mode powered by Claude's native tool-use API. Both implement a &lt;code&gt;ChatStrategy&lt;/code&gt; interface defined in a &lt;code&gt;chat-strategy&lt;/code&gt; file, which standardises the input (message, history, context, system prompt, dependencies) and output (response text, result strings, action steps).&lt;/p&gt;
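&lt;p&gt;A minimal sketch of what such a &lt;code&gt;ChatStrategy&lt;/code&gt; contract could look like (the field names beyond the inputs and outputs listed above are illustrative):&lt;/p&gt;

```typescript
// Illustrative shape of the ChatStrategy contract described above;
// exact field names beyond those mentioned in the article are assumptions.
interface ChatInput {
  message: string;
  history: { role: "user" | "assistant"; content: string }[];
  context: string;
  systemPrompt: string;
  deps: unknown; // tool executors, LLM adapters, etc.
}

interface ChatOutput {
  response: string;   // final user-facing text
  results: string[];  // raw tool results gathered along the way
  actions: string[];  // human-readable action steps
}

interface ChatStrategy {
  chat(input: ChatInput): Promise<ChatOutput>;
}

// A trivial strategy that echoes the message, just to show the contract.
const echoStrategy: ChatStrategy = {
  async chat(input) {
    return { response: `You said: ${input.message}`, results: [], actions: [] };
  },
};
```

&lt;p&gt;Standardising on one contract is what lets the two modes below be swapped without touching the rest of the chat service.&lt;/p&gt;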

&lt;h4&gt;
  
  
  Intent Extraction Mode
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;IntentExtractionStrategy&lt;/code&gt; uses a two-LLM architecture. A fast, cheap model (Claude Haiku) handles intent extraction, while the main model (Claude Sonnet) composes the final user-facing response.&lt;/p&gt;

&lt;p&gt;The strategy runs an agentic loop of up to 5 rounds. In each round, it sends the user's message, the last 20 conversation history entries (each truncated to 2000 characters), and any results from prior rounds to the fast LLM. The system prompt includes detailed routing rules that map natural language patterns to intent types: "check my Outlook" routes to &lt;code&gt;search_emails&lt;/code&gt; with &lt;code&gt;source: "outlook"&lt;/code&gt;, "calendar" without a provider routes to &lt;code&gt;list_calendar_events&lt;/code&gt; without a source, and "work items" routes to &lt;code&gt;query_work_items&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The LLM returns a JSON object with an &lt;code&gt;intents&lt;/code&gt; array. Each intent specifies a type matching a registered tool name, along with type-specific fields like &lt;code&gt;query&lt;/code&gt;, &lt;code&gt;source&lt;/code&gt;, and &lt;code&gt;timeMin&lt;/code&gt;. Invalid tool names are filtered out against the &lt;code&gt;ToolRegistry&lt;/code&gt;. The strategy then executes each intent by calling the corresponding tool's &lt;code&gt;execute()&lt;/code&gt; function, which delegates to the appropriate &lt;code&gt;IntentExecutorDeps&lt;/code&gt; method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;round&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;round&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;MAX_ROUNDS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;round&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;intents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extractIntents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;extractionLlm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;recentHistory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;priorResults&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;deps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;validToolNames&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;intents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;executeTools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;deps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;intents&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;allResults&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`--- Round &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;round&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; ---\n&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Multi-round execution is what makes complex queries possible. A request like "invite Sabrina to my 3pm meeting tomorrow" requires two rounds: round 1 searches for tomorrow's calendar events, and round 2 uses the event ID from that result to update the event with a new attendee. The LLM receives prior results in an &lt;code&gt;ACTIONS ALREADY EXECUTED THIS TURN&lt;/code&gt; block and can return &lt;code&gt;{"intents": [{"type": "none"}]}&lt;/code&gt; to signal that all needed data has been gathered and the loop should stop.&lt;/p&gt;
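&lt;p&gt;The validation and stop-signal handling can be sketched like this (a hypothetical helper; field names beyond &lt;code&gt;intents&lt;/code&gt; and &lt;code&gt;type&lt;/code&gt; are assumptions):&lt;/p&gt;

```typescript
// Sketch of the intent-validation step described above: intents whose type
// is not a registered tool are dropped, and a turn consisting only of
// "none" intents signals the loop should stop. Illustrative, not Alfred's
// actual code.
type Intent = { type: string; [key: string]: unknown };

const validToolNames = new Set([
  "search_emails",
  "list_calendar_events",
  "query_work_items",
]);

function filterIntents(raw: Intent[]): { intents: Intent[]; done: boolean } {
  // "done" when the LLM explicitly signals it has gathered everything.
  const done = raw.length > 0 && raw.every((i) => i.type === "none");
  // Invalid tool names are silently filtered against the registry.
  const intents = raw.filter((i) => validToolNames.has(i.type));
  return { intents, done };
}
```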

&lt;p&gt;After the loop completes, the &lt;code&gt;ChatService&lt;/code&gt; combines all gathered results with local context (email stats, pending actions, and follow-ups from the database) and sends everything to the main LLM for final response composition, with extended thinking enabled.&lt;/p&gt;

&lt;h4&gt;
  
  
  Tool Use Mode
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;ToolUseStrategy&lt;/code&gt; takes a fundamentally different approach. Rather than extracting intents and executing them as a separate step, it gives the LLM direct access to tools via &lt;code&gt;completeWithTools()&lt;/code&gt;. The LLM decides which tools to call, receives structured results, and continues the conversation until it produces a final text response.&lt;/p&gt;

&lt;p&gt;This mode requires the LLM adapter to support the Claude tool-use API. The strategy converts all registered tools into Claude tool definitions (name, description, input schema) and passes them alongside the message. The loop runs for up to 5 rounds, checking the &lt;code&gt;stopReason&lt;/code&gt; after each response. When the model returns &lt;code&gt;end_turn&lt;/code&gt;, the final text becomes the response. When it returns tool calls, the strategy executes each tool, packages the results as &lt;code&gt;ToolResultBlock&lt;/code&gt; objects with matching &lt;code&gt;tool_use_id&lt;/code&gt;, and sends them back as a user message for the next round:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;deps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;completeWithTools&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;system&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;systemPrompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;maxTokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4096&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stopReason&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;end_turn&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;allResults&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;actions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;allActions&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the model exhausts all 5 rounds without reaching &lt;code&gt;end_turn&lt;/code&gt;, the strategy returns a graceful fallback message in Alfred's butler voice rather than surfacing a raw error to the user.&lt;/p&gt;

&lt;h4&gt;
  
  
  Tool Registry
&lt;/h4&gt;

&lt;p&gt;Both modes share the &lt;code&gt;ToolRegistry&lt;/code&gt; class in a &lt;code&gt;tool-registry&lt;/code&gt; file, which acts as a central catalogue of all available tools. Each tool is registered with a name, description, JSON input schema, an &lt;code&gt;execute&lt;/code&gt; function, and a &lt;code&gt;summarize&lt;/code&gt; function that produces human-readable action steps such as "Searched Gmail for 'invoice'". The registry can export its tools in two formats: &lt;code&gt;toToolDefinitions()&lt;/code&gt; for Claude's native tool-use API, and &lt;code&gt;toIntentPrompt()&lt;/code&gt; for building the intent extraction system prompt.&lt;/p&gt;
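&lt;p&gt;A stripped-down sketch of such a registry (only the method names &lt;code&gt;toToolDefinitions()&lt;/code&gt; and &lt;code&gt;toIntentPrompt()&lt;/code&gt; come from the description above; the rest is illustrative):&lt;/p&gt;

```typescript
// Minimal sketch of the ToolRegistry idea: one catalogue, two export
// formats. Not Alfred's actual implementation.
interface Tool {
  name: string;
  description: string;
  inputSchema: object;
  execute: (input: unknown) => Promise<string>;
  summarize: (input: unknown) => string;
}

class ToolRegistry {
  private tools = new Map<string, Tool>();

  register(tool: Tool): void {
    this.tools.set(tool.name, tool);
  }

  has(name: string): boolean {
    return this.tools.has(name);
  }

  // Claude tool-use API format: { name, description, input_schema }.
  toToolDefinitions(): object[] {
    return [...this.tools.values()].map((t) => ({
      name: t.name,
      description: t.description,
      input_schema: t.inputSchema,
    }));
  }

  // Plain-text catalogue for the intent-extraction system prompt.
  toIntentPrompt(): string {
    return [...this.tools.values()]
      .map((t) => `- ${t.name}: ${t.description}`)
      .join("\n");
  }
}
```

&lt;p&gt;Because both modes read from the same catalogue, adding a tool once makes it available to intent extraction and tool use simultaneously.&lt;/p&gt;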

&lt;h4&gt;
  
  
  System Prompts
&lt;/h4&gt;

&lt;p&gt;All persona and mode-specific instructions are centralised in a &lt;code&gt;system-prompts&lt;/code&gt; file. The &lt;code&gt;BASE_PERSONA&lt;/code&gt; establishes Alfred's character as a refined English butler who addresses the user as "Master Jo" and has access to Google Workspace, Microsoft 365, and Azure DevOps. (Jeremy Irons is my favorite Alfred, btw.) Mode-specific instructions are appended on top: intent mode tells Alfred that actions have already been executed and results are in context so it should not pretend to be searching, while tool-use mode tells Alfred to actively call tools to fetch fresh data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Authentication and Security
&lt;/h3&gt;

&lt;p&gt;Alfred enforces security at multiple levels across both the dashboard and the agent server.&lt;/p&gt;

&lt;h4&gt;
  
  
  Dashboard Authentication
&lt;/h4&gt;

&lt;p&gt;The dashboard uses NextAuth.js v5 configured in &lt;code&gt;auth.ts&lt;/code&gt; with Google OAuth as the sole provider. Sessions use a JWT strategy with a 7-day maximum age. Access is restricted to a single authorised user through an email allowlist: the &lt;code&gt;signIn&lt;/code&gt; callback compares the Google profile's email against the &lt;code&gt;ALLOWED_EMAIL&lt;/code&gt; environment variable and rejects any mismatch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;callbacks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;signIn&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;profile&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;email&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;toLowerCase&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;allowedEmail&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The auth system uses a custom sign-in page at &lt;code&gt;/auth/login&lt;/code&gt; and redirects errors back to the same page for a clean user experience. Since Alfred is a personal, single-user tool, the allowlist approach is both simpler and more appropriate than a full role-based access system.&lt;/p&gt;

&lt;h4&gt;
  
  
  Server-Side Credentials
&lt;/h4&gt;

&lt;p&gt;The agent server stores sensitive credentials in the macOS Keychain. These credentials are fetched lazily on first use and cached in memory for the lifetime of the process. This means they never appear in environment variables, config files, or logs.&lt;/p&gt;
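&lt;p&gt;The lazy fetch-and-cache pattern can be sketched like this. On macOS, &lt;code&gt;security find-generic-password -s &amp;lt;service&amp;gt; -w&lt;/code&gt; prints a stored secret to stdout; everything else here (names, wiring) is an assumption, and the lookup function is injectable so the caching behaviour can be exercised off-macOS:&lt;/p&gt;

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// A secret lookup reads one credential by service name.
type SecretLookup = (service: string) => Promise<string>;

// macOS Keychain lookup via the `security` CLI (hypothetical service names).
const keychainLookup: SecretLookup = async (service) => {
  const { stdout } = await execFileAsync("security", [
    "find-generic-password", "-s", service, "-w",
  ]);
  return stdout.trim();
};

// Lazy fetch + in-memory cache: the Keychain is read at most once per
// service for the lifetime of the process.
function makeSecretCache(lookup: SecretLookup = keychainLookup) {
  const cache = new Map<string, string>();
  return async function getSecret(service: string): Promise<string> {
    const hit = cache.get(service);
    if (hit !== undefined) return hit;
    const value = await lookup(service);
    cache.set(service, value);
    return value;
  };
}
```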

&lt;h4&gt;
  
  
  Architectural Isolation
&lt;/h4&gt;

&lt;p&gt;The dashboard is a pure client-rendered application. It contains no provider SDK imports, no direct database access, and no secret values. All data access flows through the agent server's HTTP API, and I made sure that no credentials are ever bundled into the dashboard. This means that even if the dashboard source code were fully exposed, it would not leak any credentials or grant any access to the underlying data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resilience and Caching
&lt;/h3&gt;

&lt;p&gt;Alfred applies several resilience patterns across the system to handle network failures, API rate limits, and performance constraints.&lt;/p&gt;

&lt;h4&gt;
  
  
  In-Memory TTL Cache
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;TtlCache&lt;/code&gt; class in &lt;code&gt;cache.ts&lt;/code&gt; provides a simple time-to-live cache backed by a JavaScript &lt;code&gt;Map&lt;/code&gt;. Each entry stores its data alongside an &lt;code&gt;expiresAt&lt;/code&gt; timestamp. The &lt;code&gt;get()&lt;/code&gt; method checks expiration on every access and automatically evicts stale entries. The &lt;code&gt;getOrFetch()&lt;/code&gt; method combines cache lookup with lazy population:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nx"&gt;getOrFetch&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ttlMs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;fetcher&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;get&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetcher&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ttlMs&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is used for calendar events and DevOps data, both cached with a 3-minute TTL. During a multi-round chat conversation where Alfred might query the calendar several times, only the first call hits the API and subsequent calls return the cached result. The 3-minute window balances data freshness with meaningful API call reduction.&lt;/p&gt;

&lt;h4&gt;
  
  
  Agent Loop Resilience
&lt;/h4&gt;

&lt;p&gt;The classifier pause behavior is covered in the Email Polling section above. Beyond that, the polling loop is designed so that a failure in any single stage (classification, action proposal, or action execution) does not crash or block the rest of the loop. Each stage fails independently and logs the error without taking down the whole cycle.&lt;/p&gt;
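&lt;p&gt;The isolation pattern itself is simple: wrap each stage in its own &lt;code&gt;try/catch&lt;/code&gt;. A hypothetical sketch (stage names from above, wiring assumed):&lt;/p&gt;

```typescript
// Sketch of per-stage isolation in a polling cycle: each stage runs in its
// own try/catch, so one failure is logged and the cycle continues.
// Illustrative, not Alfred's actual loop.
type Stage = { name: string; run: () => Promise<void> };

async function runCycle(
  stages: Stage[],
  log: (msg: string) => void,
): Promise<void> {
  for (const stage of stages) {
    try {
      await stage.run();
    } catch (err) {
      // A classification failure must not block action execution.
      log(`stage "${stage.name}" failed: ${(err as Error).message}`);
    }
  }
}
```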

&lt;h4&gt;
  
  
  Power Automate Retries
&lt;/h4&gt;

&lt;p&gt;The Power Automate client implements a 3-attempt retry with linear backoff (1s, 2s, 3s) for transient HTTP errors and timeouts. Non-retryable errors such as 4xx client errors (excluding 429) fail immediately without retrying. Each request uses &lt;code&gt;AbortController&lt;/code&gt; with a 30-second timeout to prevent indefinite hangs.&lt;/p&gt;
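&lt;p&gt;A sketch of that retry policy (the helper below is illustrative, not the actual client; the sleep function is injectable so tests don't have to wait):&lt;/p&gt;

```typescript
// Sketch of the retry policy described above: up to 3 attempts with linear
// backoff (1s, 2s, 3s), retrying only transient failures. 4xx errors other
// than 429 fail immediately. Illustrative, not the real Power Automate client.
class HttpError extends Error {
  constructor(public status: number) {
    super(`HTTP ${status}`);
  }
}

// Timeouts/network errors (non-HttpError), 429, and 5xx are retryable.
const isRetryable = (err: unknown): boolean =>
  !(err instanceof HttpError) || err.status === 429 || err.status >= 500;

async function withRetry<T>(
  fn: () => Promise<T>,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms)),
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= 3 || !isRetryable(err)) throw err;
      await sleep(attempt * 1000); // 1s after attempt 1, 2s after attempt 2
    }
  }
}

// In the real client each attempt would also carry a 30s timeout, e.g.:
//   const ctrl = new AbortController();
//   const timer = setTimeout(() => ctrl.abort(), 30_000);
//   fetch(url, { signal: ctrl.signal }).finally(() => clearTimeout(timer));
```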

&lt;h4&gt;
  
  
  Push Notification Delivery
&lt;/h4&gt;

&lt;p&gt;The web push delivery mechanics, including concurrent sends, &lt;code&gt;Promise.allSettled()&lt;/code&gt;, and automatic cleanup of expired subscriptions, are covered in the Push Notifications section under Discoveries, where the full implementation is explained in context.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deployment and Operations
&lt;/h3&gt;

&lt;p&gt;Alfred runs as three persistent background services on macOS, managed by launchd, Apple's native process manager. The deployment system is entirely script-based with no containers, no cloud infrastructure, and no external process managers. Everything runs on a single Mac.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Three Services
&lt;/h4&gt;

&lt;p&gt;The agent server is the core process. It runs the Node.js HTTP API, the background email polling loop, the action execution pipeline, and the finance statement processor. It owns all external API calls to Gmail, Google Calendar, Anthropic, Azure DevOps, and Power Automate, along with all OAuth credentials stored in macOS Keychain and the SQLite database.&lt;/p&gt;

&lt;p&gt;The dashboard is a Next.js application serving the client-rendered UI. In production it runs against a pre-built output directory and makes no direct calls to any external service. All data comes through the agent server's HTTP API. It receives a bearer token as an environment variable so it can authenticate its requests to the agent server.&lt;/p&gt;

&lt;p&gt;The Cloudflare tunnel creates an encrypted outbound connection from the Mac to Cloudflare's edge network, making the dashboard publicly accessible without opening any inbound ports or touching the router. It routes HTTPS traffic from the public domain down to the local Next.js server on a local port.&lt;/p&gt;

&lt;h4&gt;
  
  
  launchd Service Configuration
&lt;/h4&gt;

&lt;p&gt;Each service is defined as a &lt;code&gt;.plist&lt;/code&gt; property list file. The plist files use placeholder tokens that are replaced with real values at deploy time using &lt;code&gt;sed&lt;/code&gt;. The key properties are &lt;code&gt;RunAtLoad: true&lt;/code&gt; to start on login, &lt;code&gt;KeepAlive: true&lt;/code&gt; to auto-restart on crash, and &lt;code&gt;ThrottleInterval: 10&lt;/code&gt; to wait at least 10 seconds between restart attempts and prevent tight crash loops:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;key&amp;gt;&lt;/span&gt;ProgramArguments&lt;span class="nt"&gt;&amp;lt;/key&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;array&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;string&amp;gt;&lt;/span&gt;PROJECT_ROOT/node_modules/.bin/tsx&lt;span class="nt"&gt;&amp;lt;/string&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;string&amp;gt;&lt;/span&gt;apps/agent-server/src/index.ts&lt;span class="nt"&gt;&amp;lt;/string&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/array&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;key&amp;gt;&lt;/span&gt;KeepAlive&lt;span class="nt"&gt;&amp;lt;/key&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;true/&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;key&amp;gt;&lt;/span&gt;ThrottleInterval&lt;span class="nt"&gt;&amp;lt;/key&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;integer&amp;gt;&lt;/span&gt;10&lt;span class="nt"&gt;&amp;lt;/integer&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each service logs stdout and stderr to separate files that can be tailed in real time for debugging.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Deploy Script
&lt;/h4&gt;

&lt;p&gt;Deployment runs through a single script that orchestrates six steps in order: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;creating the log directory&lt;/li&gt;
&lt;li&gt;sourcing the &lt;code&gt;.env&lt;/code&gt; file to load environment variables&lt;/li&gt;
&lt;li&gt;running &lt;code&gt;npm install&lt;/code&gt; at the monorepo root to install all workspace dependencies&lt;/li&gt;
&lt;li&gt;running &lt;code&gt;npm run build&lt;/code&gt; to compile all TypeScript packages in dependency order (domain → application → infrastructure → contracts → agent server, then the Next.js dashboard)&lt;/li&gt;
&lt;li&gt;copying each plist template into &lt;code&gt;~/Library/LaunchAgents/&lt;/code&gt; with placeholders replaced by real paths&lt;/li&gt;
&lt;li&gt;finally loading all three services with &lt;code&gt;launchctl load&lt;/code&gt; to start them immediately&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before installing each plist, the script unloads any previously running version to prevent conflicts, resulting in a brief restart with minimal downtime:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;for &lt;/span&gt;plist &lt;span class="k"&gt;in &lt;/span&gt;com.alfred.agent.plist com.alfred.dashboard.plist com.alfred.cloudflared.plist&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;launchctl unload &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$LAUNCH_AGENTS_DIR&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;$plist&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; 2&amp;gt;/dev/null &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true
  sed&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"s|PROJECT_ROOT|&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ROOT&lt;/span&gt;&lt;span class="s2"&gt;|g"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"s|USER_HOME|&lt;/span&gt;&lt;span class="nv"&gt;$USER_HOME&lt;/span&gt;&lt;span class="s2"&gt;|g"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"s|CLOUDFLARED_BIN|&lt;/span&gt;&lt;span class="nv"&gt;$CLOUDFLARED_BIN&lt;/span&gt;&lt;span class="s2"&gt;|g"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"s|NODE_BIN_PATH|&lt;/span&gt;&lt;span class="nv"&gt;$NODE_BIN_PATH&lt;/span&gt;&lt;span class="s2"&gt;|g"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"s|BEARER_TOKEN_VALUE|&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;BEARER_TOKEN&lt;/span&gt;&lt;span class="k"&gt;:-}&lt;/span&gt;&lt;span class="s2"&gt;|g"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$DEPLOY_DIR&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;$plist&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$LAUNCH_AGENTS_DIR&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;$plist&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The script automatically detects the Node.js binary path across nvm, Homebrew, and system installs, and locates the &lt;code&gt;cloudflared&lt;/code&gt; binary for both Apple Silicon and Intel Homebrew paths. At the end it prints a macOS settings checklist reminding me to enable auto-login, prevent sleep, and configure startup after power failure, since the Mac effectively acts as a persistent home server.&lt;/p&gt;

&lt;h4&gt;
  
  
  First-Time Setup
&lt;/h4&gt;

&lt;p&gt;Initial installation is handled by a setup script. It checks prerequisites (Homebrew and Node.js 20 or above), installs &lt;code&gt;cloudflared&lt;/code&gt;, creates the &lt;code&gt;.env&lt;/code&gt; file interactively, runs the Google OAuth flow by opening a browser for consent and storing the resulting refresh token in Keychain, authenticates with Cloudflare, creates the tunnel, configures DNS routes, and then kicks off the deploy script to bring everything up.&lt;/p&gt;

&lt;h4&gt;
  
  
  Operational Commands
&lt;/h4&gt;

&lt;p&gt;I have scripts for the full operational lifecycle. A status command shows whether each service is running, its PID, and the last 5 log lines. A teardown command unloads all services and removes the plist files from LaunchAgents while preserving logs. A universal launcher supports multiple modes: &lt;code&gt;all&lt;/code&gt; for full production, &lt;code&gt;dev&lt;/code&gt; for hot-reload development, &lt;code&gt;agent&lt;/code&gt; or &lt;code&gt;dashboard&lt;/code&gt; individually, &lt;code&gt;status&lt;/code&gt; for health checks, and &lt;code&gt;doctor&lt;/code&gt; for preflight validation.&lt;/p&gt;

&lt;h4&gt;
  
  
  Configuration
&lt;/h4&gt;

&lt;p&gt;All configuration flows through environment variables loaded from a &lt;code&gt;.env&lt;/code&gt; file at the project root. A &lt;code&gt;config.ts&lt;/code&gt; module reads these and returns a typed &lt;code&gt;AppConfig&lt;/code&gt; object. Three variables are required: &lt;code&gt;GOOGLE_CLIENT_ID&lt;/code&gt;, &lt;code&gt;GOOGLE_CLIENT_SECRET&lt;/code&gt;, and &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt;. Everything else is optional and enables features progressively. Setting &lt;code&gt;AZURE_DEVOPS_ORG&lt;/code&gt; enables DevOps integration. Setting &lt;code&gt;PA_FLOW_MAIL_SEARCH&lt;/code&gt; enables Outlook. Setting &lt;code&gt;VAPID_PUBLIC_KEY&lt;/code&gt; enables push notifications, and so on. If an optional config block is absent, the composition root simply skips registering those adapters and use cases, so the system degrades gracefully rather than failing to start.&lt;/p&gt;
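&lt;p&gt;A sketch of how such progressive enablement might read (the &lt;code&gt;AppConfig&lt;/code&gt; shape beyond the environment variables named above is an assumption):&lt;/p&gt;

```typescript
// Sketch of progressive feature enablement: required keys are asserted,
// optional blocks become undefined and the composition root skips them.
// Only the environment variable names come from the article.
interface AppConfig {
  googleClientId: string;
  googleClientSecret: string;
  anthropicApiKey: string;
  devOps?: { org: string };
  outlook?: { mailSearchFlowUrl: string };
  push?: { vapidPublicKey: string };
}

function loadConfig(env: Record<string, string | undefined>): AppConfig {
  const required = (key: string): string => {
    const value = env[key];
    if (!value) throw new Error(`Missing required env var: ${key}`);
    return value;
  };
  return {
    googleClientId: required("GOOGLE_CLIENT_ID"),
    googleClientSecret: required("GOOGLE_CLIENT_SECRET"),
    anthropicApiKey: required("ANTHROPIC_API_KEY"),
    // Each optional block is only present when its key is set.
    devOps: env.AZURE_DEVOPS_ORG ? { org: env.AZURE_DEVOPS_ORG } : undefined,
    outlook: env.PA_FLOW_MAIL_SEARCH
      ? { mailSearchFlowUrl: env.PA_FLOW_MAIL_SEARCH }
      : undefined,
    push: env.VAPID_PUBLIC_KEY
      ? { vapidPublicKey: env.VAPID_PUBLIC_KEY }
      : undefined,
  };
}
```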

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0cl18v9vs4hn72k132oh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0cl18v9vs4hn72k132oh.png" alt=" " width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Integrity
&lt;/h3&gt;

&lt;p&gt;Ensuring that Alfred handles data meticulously was very important to me. It does not make sense to build an assistant that is sloppy with the information it presents. So I built Alfred to prevent duplicate and inconsistent data through idempotency checks, upsert semantics, and schema separation at every data boundary.&lt;/p&gt;

&lt;h4&gt;
  
  
  Idempotent Action Proposals
&lt;/h4&gt;

&lt;p&gt;Before creating a new entry in the action log, the proposal system queries for any existing entry with the same &lt;code&gt;resourceId&lt;/code&gt; and &lt;code&gt;type&lt;/code&gt;. If a match is found, the new proposal is silently skipped and the call returns &lt;code&gt;null&lt;/code&gt;. This means the polling loop can encounter the same email multiple times, such as after a server restart, without generating duplicate action proposals:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;existing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;actionLog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByResourceIdAndType&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resourceId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;existing&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Email Upsert Semantics
&lt;/h4&gt;

&lt;p&gt;Whether an email arrives via polling, a webhook, or is encountered again after a restart, the upsert guarantees exactly one row per email ID. All fields including subject, body, labels, and read status are updated to their latest values, and an &lt;code&gt;updated_at&lt;/code&gt; timestamp records when the last refresh occurred:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;emails&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;thread_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;from_address&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...,&lt;/span&gt; &lt;span class="n"&gt;updated_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;threadId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...,&lt;/span&gt; &lt;span class="nb"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'now'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;CONFLICT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DO&lt;/span&gt; &lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt;
  &lt;span class="n"&gt;thread_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;excluded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;thread_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;from_address&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;excluded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_address&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;...&lt;/span&gt;
  &lt;span class="n"&gt;updated_at&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'now'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Conversation Ordering
&lt;/h4&gt;

&lt;p&gt;Chat messages are stored with a &lt;code&gt;created_at&lt;/code&gt; timestamp and always queried in chronological order using &lt;code&gt;ORDER BY created_at ASC&lt;/code&gt;. Messages are never reordered, edited, or deleted after creation. This ensures the conversation history Alfred sees when composing a response exactly matches what the user experienced.&lt;/p&gt;
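&lt;p&gt;The append-only property above can be sketched with an in-memory stand-in for the SQLite table (the class and field names are illustrative, not the real code): the store exposes insert and chronological read, and nothing else:&lt;/p&gt;

```typescript
// Sketch of an append-only conversation log: messages get a created_at
// value on insert and are only ever read back in chronological order.
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
  createdAt: number; // epoch millis, stands in for the created_at column
}

class ConversationLog {
  private rows: ChatMessage[] = [];

  // Insert only; no update or delete path exists on this store.
  append(role: ChatMessage["role"], content: string, createdAt: number): void {
    this.rows.push({ role, content, createdAt });
  }

  // Equivalent of SELECT ... ORDER BY created_at ASC
  history(): ChatMessage[] {
    return [...this.rows].sort((a, b) => a.createdAt - b.createdAt);
  }
}
```

Because mutation paths simply do not exist on the store, the invariant holds by construction rather than by discipline.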

&lt;h4&gt;
  
  
  Normalised Schema Design
&lt;/h4&gt;

&lt;p&gt;Classifications are stored in a separate &lt;code&gt;classifications&lt;/code&gt; table linked to emails by &lt;code&gt;email_id&lt;/code&gt;. This separation means re-classifying an email, whether due to a model update or a rule change, only touches the classification row without affecting the underlying email data. The email's original content, headers, labels, and metadata remain untouched. Follow-ups and action log entries follow the same pattern. Each table has a single source of truth for its own data, and no operation on one table can corrupt another.&lt;/p&gt;
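&lt;p&gt;A toy sketch of that separation, using plain &lt;code&gt;Map&lt;/code&gt;s as stand-ins for the two tables (the row shapes are assumptions): re-classifying upserts one row in &lt;code&gt;classifications&lt;/code&gt; and never writes to &lt;code&gt;emails&lt;/code&gt;:&lt;/p&gt;

```typescript
// Sketch of normalised schema design: classifications live in their own
// table keyed by email_id, so re-classifying replaces only that row.
interface EmailRow { id: string; subject: string; body: string }
interface ClassificationRow { emailId: string; label: string; classifiedAt: number }

const emails = new Map<string, EmailRow>();
const classifications = new Map<string, ClassificationRow>(); // keyed by email_id

function classify(emailId: string, label: string, now: number): void {
  if (!emails.has(emailId)) throw new Error(`Unknown email: ${emailId}`);
  // Upsert the classification row only; the emails table is never written here.
  classifications.set(emailId, { emailId, label, classifiedAt: now });
}
```

With the foreign key as the map key, there is also exactly one classification per email, so a model update can never leave stale duplicates behind.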

&lt;h2&gt;
  
  
  Pitfalls: From Intent Extraction to Tool Use
&lt;/h2&gt;

&lt;p&gt;I started Alfred's chat system with a pure intent extraction approach. The idea was straightforward: send my message to a fast LLM, ask it to return structured JSON with an intent type and parameters, then map that intent to an executor function. A message like "show me today's calendar" would produce &lt;code&gt;{"type": "list_calendar_events", "timeMin": "2026-03-16", "timeMax": "2026-03-16"}&lt;/code&gt;, and the system would call the calendar adapter directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;intents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extractIntents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;extractionLlm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;recentHistory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;priorResults&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;deps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;validToolNames&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;intent&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;intents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;entry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;deps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;toolRegistry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;deps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;intentExecutor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I built this following the Open/Closed Principle. Each intent type was a self-contained &lt;code&gt;ToolEntry&lt;/code&gt; registered in a &lt;code&gt;ToolRegistry&lt;/code&gt;. Adding a new capability meant registering a new entry with a name, schema, executor function, and summariser. No existing code needed modification:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;toolRegistry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;search_emails&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Search emails by query, category, or sender&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;inputSchema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;deps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;summarize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;`Searched emails: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In theory this was clean and extensible. In practice, the cost of adding intents started to compound. Every new capability required writing a system prompt fragment describing the intent format, adding routing rules so the LLM knew when to select it, writing the executor function, and testing that the LLM reliably produced the right JSON structure. At 5 intent types it was manageable. By the time I had 15 (email search, calendar list, calendar create, calendar update, calendar search, work item query, work item create, PR query, pipeline list, Teams messages, follow-ups, actions, repo list, commits, branch list), the intent extraction system prompt had ballooned. The LLM was juggling too many format rules and frequently produced malformed JSON or selected the wrong intent type.&lt;/p&gt;

&lt;p&gt;The extraction prompt had grown to include detailed routing rules, source-specific provider logic, multi-intent support, and follow-up round awareness:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;INTENT_RULES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`
ROUTING RULES:
- "check my Outlook" → search_emails with source: "outlook"
- "search Gmail" → search_emails with source: "gmail"
- "Outlook calendar" → list_calendar_events with source: "outlook-calendar"
- "work items" / "tickets" → query_work_items
- "pull requests" / "PRs" → query_source_control with subtype: "pull_requests"
...
`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every new intent meant updating these routing rules, testing edge cases, and hoping the model did not confuse the new intent with existing ones. The Open/Closed architecture was holding up at the code level: I was not modifying existing executors. But the prompt was a single growing artifact shared by every intent, so adding one intent risked degrading the reliability of all the others.&lt;/p&gt;

&lt;p&gt;This led me to Claude's native tool use API. Instead of asking the LLM to produce JSON matching my custom schema, I could give it proper tool definitions and let Claude's built-in tool calling handle the routing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;deps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;toolRegistry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toToolDefinitions&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;deps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;completeWithTools&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;system&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;systemPrompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;maxTokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude's tool use was noticeably more reliable. It natively understands tool schemas, validates parameters against the input schema, and handles multi-tool calls cleanly. The model picks the right tool more consistently than my intent extraction prompt ever did, because tool selection is a first-class capability of the model rather than something I was trying to engineer through prompt instructions.&lt;/p&gt;

&lt;p&gt;But tool use burned through API credits quickly. Each round of the conversation becomes a full API call carrying the entire tool catalogue, conversation history, and system prompt. A simple question like "what meetings do I have today?" that previously cost one cheap Haiku call for intent extraction plus one Sonnet call for response composition now cost one or more full Sonnet calls with tool definitions attached, adding significant token overhead to every request.&lt;/p&gt;

&lt;p&gt;I balanced models to keep costs sustainable. Intent extraction uses Haiku because it only needs to produce structured JSON, not reason deeply. Final response composition uses Sonnet with extended thinking enabled because that is where quality matters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;strategyDeps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;deps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;         &lt;span class="c1"&gt;// Sonnet — reasoning and response&lt;/span&gt;
  &lt;span class="na"&gt;fastLlm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;deps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fastLlm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// Haiku — intent extraction&lt;/span&gt;
  &lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rather than committing to one approach, I gave the chat system the ability to switch between both modes. The &lt;code&gt;mode&lt;/code&gt; parameter on each request selects the active strategy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;strategy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;mode&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;tool_use&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;toolUseStrategy&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;intentStrategy&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;strategyResult&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;strategy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;history&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;localContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;systemPrompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;deps&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Intent mode is cheaper and faster for straightforward queries where the routing rules work well. Tool use mode is more reliable for complex, ambiguous, or multi-step requests where maintaining routing rules would be impractical. Both strategies implement the same &lt;code&gt;ChatStrategy&lt;/code&gt; interface and share the same &lt;code&gt;ToolRegistry&lt;/code&gt;, so all capabilities are available in both modes without any duplication.&lt;/p&gt;
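&lt;p&gt;To make the shared contract concrete, here is a stripped-down sketch of what that &lt;code&gt;ChatStrategy&lt;/code&gt; interface could look like. The field names are inferred from the snippets in this post and the handler bodies are stubs, not the real strategies:&lt;/p&gt;

```typescript
// Sketch of a shared strategy interface: both modes accept the same input
// shape, so the caller can swap them with a single ternary.
interface ChatStrategyInput {
  message: string;
  history: string[];
  systemPrompt: string;
}

interface ChatStrategyResult { response: string }

interface ChatStrategy {
  run(input: ChatStrategyInput): Promise<ChatStrategyResult>;
}

// Stub implementations standing in for the real intent and tool-use strategies.
const intentStrategy: ChatStrategy = {
  async run(input) {
    return { response: `intent mode handled: ${input.message}` };
  },
};

const toolUseStrategy: ChatStrategy = {
  async run(input) {
    return { response: `tool_use mode handled: ${input.message}` };
  },
};

// The mode parameter on each request picks the active strategy.
function selectStrategy(mode: string): ChatStrategy {
  return mode === "tool_use" ? toolUseStrategy : intentStrategy;
}
```

Because both implementations satisfy the same interface, everything upstream of the ternary (persistence, history loading, response rendering) is identical in both modes.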

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4bdmq4xlnm4jmyclpcsh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4bdmq4xlnm4jmyclpcsh.png" alt=" " width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  From Single Request-Response to Reasoning Loops
&lt;/h2&gt;

&lt;p&gt;Early on, the chat used a single request-response pattern. I ask a question, Alfred gathers context from the database, sends everything to the LLM in one shot, and returns the response. The quality was poor. With 15+ tools and a rich system prompt, the model would frequently miss details, give shallow answers, or fail to connect information across multiple data sources. A question like "what's my schedule like tomorrow and do I have any overdue follow-ups?" would produce a partial answer because the model was trying to handle everything in a single pass.&lt;/p&gt;

&lt;p&gt;My first instinct was to use a better model. I switched from Sonnet to Opus for the response composition step and the quality jumped immediately. Opus reasons more carefully, connects dots across context, and produces noticeably more nuanced responses. But it was expensive. Opus costs significantly more per token than Sonnet, and every chat message was a full context window call carrying email stats, action history, follow-up data, and conversation history.&lt;/p&gt;

&lt;p&gt;This led me to implement reasoning loops. Instead of asking the model to do everything in one pass, I let it work iteratively. In intent mode, the strategy runs up to 5 rounds. Each round extracts intents, executes them, and feeds the results back into the next round's context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;round&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;round&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;MAX_ROUNDS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;round&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;intents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extractIntents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;extractionLlm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;recentHistory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;priorResults&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;deps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;validToolNames&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;intents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;executeTools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;deps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;intents&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;allResults&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`--- Round &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;round&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; ---\n&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In tool use mode, the loop is similar but driven by Claude's stop reason. The model keeps calling tools until it decides it has enough information and returns a final text response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;round&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;round&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;MAX_ROUNDS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;round&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;deps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;completeWithTools&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;system&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;systemPrompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;maxTokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4096&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stopReason&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;end_turn&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;allResults&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;actions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;allActions&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="c1"&gt;// ... execute tool calls, feed results back&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This multi-round approach means a request like "invite Sarah to my 3pm meeting tomorrow" works naturally.&lt;br&gt;
Round 1 searches tomorrow's calendar events.&lt;br&gt;
Round 2 uses the event ID from that result to update the event with a new attendee. The LLM sees prior results in an &lt;code&gt;ACTIONS ALREADY EXECUTED THIS TURN&lt;/code&gt; block and returns &lt;code&gt;{"intents": [{"type": "none"}]}&lt;/code&gt; when everything is resolved and the loop should stop.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"2026-03-16T07:11:03.210Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"level"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"info"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"msg"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;chat:start"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"component"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"chat"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"What does my outlook calendar look like ?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"historyLength"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"tool_use"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"2026-03-16T07:11:07.854Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"level"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"info"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"msg"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"llm:completeWithTools"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"component"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"llm"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"claude-opus-4-6"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"inputTokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;8168&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"outputTokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;131&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"durationMs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;4644&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"stopReason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"tool_use"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"2026-03-16T07:11:07.854Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"level"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"info"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"msg"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"chat:tool-use-round"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"component"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"chat"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"round"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"stopReason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"tool_use"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"toolCallCount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"hasText"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"durationMs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;4644&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"2026-03-16T07:11:07.855Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"level"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"info"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"msg"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"chat:tool-result"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"component"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"chat"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"list_calendar_events"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"resultLength"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;33&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"resultPreview"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Calendar Events: No events found."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"2026-03-16T07:11:13.314Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"level"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"info"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"msg"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"llm:completeWithTools"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"component"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"llm"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"claude-opus-4-6"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"inputTokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;8318&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"outputTokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"durationMs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;5458&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"stopReason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"end_turn"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"2026-03-16T07:11:13.315Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"level"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"info"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"msg"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"chat:tool-use-round"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"component"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"chat"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"round"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"stopReason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"end_turn"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"toolCallCount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"hasText"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"durationMs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;5459&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"2026-03-16T07:11:13.315Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"level"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"info"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"msg"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"chat:complete"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"component"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"chat"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"totalDurationMs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;10106&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"tool_use"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"actionCount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The reasoning happens where it counts. Mechanical work like deciding which tools to call uses the cheapest model that can do it reliably, and the expensive synthesis step only fires once at the end. A 3-round conversation costs 3 Haiku calls plus 1 Sonnet call rather than 3 Opus calls.&lt;/p&gt;
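&lt;p&gt;To make the trade-off concrete, here is a minimal sketch of that tiering, assuming a two-phase split and hypothetical model identifiers (the real routing in Alfred is richer than this):&lt;/p&gt;

```typescript
// Sketch of per-phase model tiering. The model IDs and phase names here
// are illustrative placeholders, not Alfred's actual configuration.
type Phase = "tool_selection" | "synthesis";

const MODEL_FOR_PHASE: Record<Phase, string> = {
  tool_selection: "claude-haiku", // cheap, reliable enough to pick tools
  synthesis: "claude-sonnet",     // expensive, fires once at the end
};

function estimateCalls(rounds: number): Record<string, number> {
  // Each tool-use round gets a cheap call; one synthesis call closes the turn.
  return {
    [MODEL_FOR_PHASE.tool_selection]: rounds,
    [MODEL_FOR_PHASE.synthesis]: 1,
  };
}
```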

&lt;h2&gt;
  
  
  Prompt Refinement
&lt;/h2&gt;

&lt;p&gt;Prompt refinement turned out to be significantly harder with intent extraction than with tool use. With intent extraction, I was responsible for the entire instruction surface: routing rules, format specifications, edge case handling, multi-intent support, source disambiguation, date inference, and conversational context awareness. Every ambiguous user message required a new rule or clarification in the prompt. The prompt became a fragile, growing document where changing one section could silently break another.&lt;/p&gt;

&lt;p&gt;With tool use, Claude does most of the heavy lifting. I define each tool's name, description, and input schema. Claude figures out when to call it, what parameters to pass, and how to combine results across multiple tools. The refinement effort shifted from "teach the model my custom intent format" to "write clear tool descriptions and let the model's built-in tool selection do its job." This was a dramatically smaller surface area to maintain.&lt;/p&gt;
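&lt;p&gt;As a rough illustration, a tool definition in this style carries the whole instruction surface in its description and input schema. The shape below follows the standard Anthropic tools format; the exact fields of Alfred's registry entries are my assumption:&lt;/p&gt;

```typescript
// Hypothetical tool definition in the Anthropic tools format. The
// description text and schema fields are illustrative.
const listCalendarEventsTool = {
  name: "list_calendar_events",
  description:
    "List the user's calendar events. Use when the user asks about their schedule.",
  input_schema: {
    type: "object" as const,
    properties: {
      start: { type: "string", description: "ISO 8601 start of the window" },
      end: { type: "string", description: "ISO 8601 end of the window" },
      source: { type: "string", enum: ["outlook", "google"] },
    },
    required: ["start", "end"],
  },
};
```

&lt;p&gt;Refining this is a matter of editing one description string, which is the smaller surface area the paragraph above describes.&lt;/p&gt;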

&lt;p&gt;The persona prompt is where I spent the most deliberate effort, and I structured it to follow the Open/Closed Principle. The &lt;code&gt;BASE_PERSONA&lt;/code&gt; defines Alfred's character, his access to workspace systems, and the critical behavioural rules that apply regardless of which mode is active:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;BASE_PERSONA&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`You are Alfred, a distinguished personal workspace assistant. 
You are an old English gentleman — impeccably dressed in a three-piece suit at all times, 
refined in manner, and utterly devoted to your employer. You always address the user as 
"Master Jo". Your speech carries the quiet authority and warmth of a seasoned butler...

CRITICAL RULES:
- ALWAYS address the user as "Master Jo"
- ONLY use the data provided to you. Do not make up emails, events, or results.
- When calendar events were CREATED, confirm this to the user with details and calendar links.
...`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Mode-specific instructions are appended on top without touching the base. Intent mode tells Alfred that actions have already been executed and results are already in context, so he should not pretend to be searching. Tool use mode tells Alfred to actively call tools to fetch fresh data. The &lt;code&gt;buildSystemPrompt()&lt;/code&gt; function composes these cleanly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;buildSystemPrompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;intent&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;tool_use&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;modeInstructions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;mode&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;tool_use&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;TOOL_USE_MODE_INSTRUCTIONS&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;INTENT_MODE_INSTRUCTIONS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;BASE_PERSONA&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;modeInstructions&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This separation means I can refine Alfred's personality, add new behavioural rules, or adjust mode-specific instructions entirely independently. Adding a new mode in the future means writing a new instruction block and adding a case to &lt;code&gt;buildSystemPrompt()&lt;/code&gt;, without touching the persona or any existing mode instructions.&lt;/p&gt;
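&lt;p&gt;A sketch of that extension path, with placeholder instruction texts and a hypothetical third mode, might look like:&lt;/p&gt;

```typescript
// Sketch of how a third mode could slot in without touching the existing
// instruction blocks. Mode names and texts here are placeholders.
const MODE_INSTRUCTIONS: Record<string, string> = {
  intent: "Actions were already executed; results are in your context.",
  tool_use: "Actively call tools to fetch fresh data.",
  // A future mode is one new entry, not an edit to the others:
  briefing: "Summarise overnight activity without calling tools.",
};

function buildPrompt(basePersona: string, mode: string): string {
  const instructions = MODE_INSTRUCTIONS[mode];
  if (instructions === undefined) throw new Error(`unknown mode: ${mode}`);
  return basePersona + "\n" + instructions;
}
```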

&lt;p&gt;The persona itself evolved through iteration. Early versions were too stiff and formal. Later versions overcorrected and became too casual. The current version balances warmth with efficiency, giving Alfred permission to be dry-witted and occasionally opinionated while staying concise and never fabricating data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Discoveries
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Floodgate Effect
&lt;/h3&gt;

&lt;p&gt;Once I had the first working version of Alfred deployed, something unexpected happened: my mind would not stop generating ideas. The initial version could poll Gmail, classify emails, propose actions, and let me approve them from a dashboard. It was functional, but using it every day exposed gaps and opportunities I had not anticipated during planning. Every morning I would open the dashboard, see how Alfred handled my overnight inbox, and think "what if he could also do this?" The backlog grew faster than I could build.&lt;/p&gt;

&lt;p&gt;This is something I did not expect about building a personal tool. When you are the only user, the feedback loop is immediate. There is no product manager filtering requests, no sprint planning, no prioritisation meetings. You feel the friction directly, and the fix is always within reach. That immediacy is both a gift and a trap. I had to learn to be disciplined about scope, because every "quick addition" carries a maintenance cost that compounds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Financial Statement Processing
&lt;/h3&gt;

&lt;p&gt;The first major expansion came from a personal pain point. I bank with two banks in Malaysia, and both send monthly e-statements as password-protected PDF attachments to my Gmail. Every month I would download the PDFs, unlock them, manually scan through the transactions, and try to categorise my spending in a spreadsheet. It was tedious and error-prone, and I eventually stopped doing it altogether. Then I realised Alfred already had the infrastructure to solve this: he polls Gmail, he can download attachments, and he has an LLM for classification.&lt;/p&gt;

&lt;p&gt;I built a six-stage pipeline that runs automatically during each polling cycle. Alfred searches Gmail for emails from the configured bank sender addresses, filters for emails with PDF attachments, and checks each against the &lt;code&gt;bank_statements&lt;/code&gt; table to skip already-processed ones. The idempotency check matters because the polling loop runs every 60 seconds and the same bank emails will appear in search results repeatedly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;findUnprocessedIds&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;bank&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;BankConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;filters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;EmailSearchFilters&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;deps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;emailRead&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;searchFilteredIds&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;filters&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="na"&gt;unprocessed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;deps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;statementRepo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isStatementProcessed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;unprocessed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;unprocessed&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For each unprocessed email, Alfred downloads the PDF attachment and decrypts it using the bank-specific password from environment config. This is where I hit the first real bug. The &lt;code&gt;pdf-parse&lt;/code&gt; library accepts a &lt;code&gt;password&lt;/code&gt; option, but its internal implementation completely ignores it. It passes the raw buffer directly to PDF.js's &lt;code&gt;getDocument()&lt;/code&gt; instead of wrapping it in &lt;code&gt;{ data, password }&lt;/code&gt;. Every statement was failing with a cryptic "No password given" error. The fix was a workaround that tricks &lt;code&gt;pdf-parse&lt;/code&gt; by passing a PDF.js parameter object in place of the buffer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;pdfInput&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Uint8Array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pdfBuffer&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nx"&gt;password&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;unknown&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;pdf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pdfInput&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After decryption, the raw text goes to a bank-specific parser. Each bank formats its statements differently, so I built a &lt;code&gt;StatementParserRegistry&lt;/code&gt; that routes to the correct parser based on the &lt;code&gt;BankProvider&lt;/code&gt; enum.&lt;/p&gt;
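&lt;p&gt;The registry itself can be very small. This is a sketch under my own assumptions about its shape; the enum values and parser interface are illustrative, not the real &lt;code&gt;StatementParserRegistry&lt;/code&gt; API:&lt;/p&gt;

```typescript
// Minimal registry keyed by bank. Enum members and the parser interface
// are placeholders for illustration.
enum BankProvider { BankA = "bank_a", BankB = "bank_b" }

interface StatementParser {
  parse(rawText: string): string[]; // returns raw transaction lines
}

class StatementParserRegistry {
  private parsers = new Map<BankProvider, StatementParser>();

  register(bank: BankProvider, parser: StatementParser): void {
    this.parsers.set(bank, parser);
  }

  get(bank: BankProvider): StatementParser {
    const parser = this.parsers.get(bank);
    // Failing loudly beats silently mis-parsing another bank's format.
    if (!parser) throw new Error(`no parser registered for ${bank}`);
    return parser;
  }
}
```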

&lt;p&gt;The parser also strips page noise (headers, footers, and the Chinese and Malay translations that some banks print on every page) and collects multi-line transaction details such as merchant names and reference numbers.&lt;/p&gt;

&lt;p&gt;Once parsed, transactions go through a hybrid classification stage. The &lt;code&gt;HybridTransactionClassifier&lt;/code&gt; first attempts rule-based categorisation using keyword matching (merchant names like "GRAB" map to transport, "MCDONALD'S" maps to food), and falls back to Claude Haiku for ambiguous transactions. This hybrid approach keeps costs low because most transactions have recognisable merchant names that do not need LLM inference.&lt;/p&gt;
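&lt;p&gt;A minimal sketch of that rule-first, LLM-fallback flow (the keyword table and the fallback signature are illustrative, not the real &lt;code&gt;HybridTransactionClassifier&lt;/code&gt;):&lt;/p&gt;

```typescript
// Rule-based categorisation first; only ambiguous descriptions fall
// through to the LLM. Keyword pairs here are illustrative.
const KEYWORD_CATEGORIES: Array<[string, string]> = [
  ["GRAB", "transport"],
  ["MCDONALD", "food"],
];

async function classify(
  description: string,
  llmFallback: (desc: string) => Promise<string>,
): Promise<string> {
  const upper = description.toUpperCase();
  for (const [keyword, category] of KEYWORD_CATEGORIES) {
    if (upper.includes(keyword)) return category; // free, no LLM call
  }
  return llmFallback(description); // only ambiguous rows pay for Haiku
}
```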

&lt;p&gt;The pipeline also handles historical backfill. On first run, it does not just process recent statements. It walks backward through the inbox month by month, processing older statements until it reaches a configurable cutoff, defaulting to 12 months. A &lt;code&gt;backfill_state&lt;/code&gt; table tracks the cursor position per bank so the backfill can resume across server restarts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;processBackfill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;bank&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;BankConfig&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;isComplete&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;deps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;backfillStateRepo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isComplete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;bank&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;bankProvider&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;isComplete&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cursor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;deps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;backfillStateRepo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getCursor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;bank&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;bankProvider&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cutoff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="nx"&gt;cutoff&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setMonth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cutoff&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getMonth&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;deps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;backfillMonths&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// ... fetch historical emails before cursor, process, advance cursor&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All of this produces a normalised &lt;code&gt;finance_transactions&lt;/code&gt; table where every transaction from every bank shares the same schema: date, description, amount, type (credit or debit), balance, category, merchant name, and statement period. Two banks, different formats, one unified table.&lt;/p&gt;
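&lt;p&gt;As I read the columns described above, the unified row shape would look roughly like this; the field names are my guesses, not the actual &lt;code&gt;finance_transactions&lt;/code&gt; schema:&lt;/p&gt;

```typescript
// Assumed shape of a normalised transaction row. Field names are
// illustrative guesses at the real schema.
interface FinanceTransaction {
  date: string;                // ISO date of the transaction
  description: string;         // raw statement line
  amount: number;              // always positive; `type` carries the sign
  type: "credit" | "debit";
  balance: number;             // running balance after the transaction
  category: string;            // e.g. "food", "transport"
  merchantName: string | null; // null when no merchant could be extracted
  statementPeriod: string;     // e.g. "2026-02"
  bank: string;                // which provider the row came from
}

const sample: FinanceTransaction = {
  date: "2026-02-14",
  description: "MCDONALD'S KUALA LUMPUR",
  amount: 24.9,
  type: "debit",
  balance: 1875.1,
  category: "food",
  merchantName: "MCDONALD'S",
  statementPeriod: "2026-02",
  bank: "bank_a",
};
```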

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmxmook3xe0gle0qehxs1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmxmook3xe0gle0qehxs1.png" alt=" " width="800" height="970"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3n2h9f2986ahp7wnpvwn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3n2h9f2986ahp7wnpvwn.png" alt=" " width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Making Financial Data Conversational
&lt;/h3&gt;

&lt;p&gt;Having the data in SQLite was useful on its own (the dashboard has a Finance page with tables and charts), but the real power came from wiring it into Alfred's chat. I registered finance-specific tools in the &lt;code&gt;ToolRegistry&lt;/code&gt; so that both chat modes can query transaction data naturally.&lt;/p&gt;

&lt;p&gt;The chat can now answer questions like "how much did I spend on food last month?", "what were my biggest transactions in February?", or "show me all Grab transactions this year." Alfred queries the &lt;code&gt;finance_transactions&lt;/code&gt; table, aggregates the results, and presents them in his butler persona.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo927mb5hlfnhs4xsogf5.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo927mb5hlfnhs4xsogf5.JPG" alt=" " width="800" height="484"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhybx150xqnstxvwgkt75.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhybx150xqnstxvwgkt75.png" alt=" " width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What I did not anticipate is that this naturally enabled budgeting. Once Alfred could tell me "you spent RM 2,400 on dining in February, Master Jo," I started asking follow-up questions like "is that more than January?" and "set a reminder if I go over RM 2,000 next month." The transaction data combined with the follow-up system and push notifications created a lightweight budget monitoring capability that I never explicitly designed. It emerged from the intersection of features that already existed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Progressive Web App
&lt;/h3&gt;

&lt;p&gt;The dashboard started as a standard Next.js web app accessed through a browser tab. It worked, but it felt disposable. I would forget to check it, or close the tab and lose my place. Making Alfred a Progressive Web App changed that relationship. With a PWA manifest, a service worker, and the right meta tags, Alfred became an app I could install on my phone and in my Mac's dock. It has its own window, its own icon, and it persists across reboots.&lt;/p&gt;
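&lt;p&gt;The client-side wiring for install-ability is small. A hedged sketch of the registration step, assuming the service worker is served at &lt;code&gt;/sw.js&lt;/code&gt; (the path and function name are mine, not the project's):&lt;/p&gt;

```typescript
// Registers the service worker that makes the dashboard installable.
// The /sw.js path is an assumption for illustration.
function registerServiceWorker(): void {
  if (typeof navigator === "undefined" || !("serviceWorker" in navigator)) {
    return; // not in a browser, or service workers unsupported
  }
  window.addEventListener("load", () => {
    navigator.serviceWorker
      .register("/sw.js")
      .catch((err) => console.error("service worker registration failed", err));
  });
}
```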

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1lzgswgpxmm7wrlipxb9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1lzgswgpxmm7wrlipxb9.png" alt=" " width="468" height="786"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The practical difference is small since it is still the same Next.js app behind the scenes. But the psychological difference is significant. An app in the dock feels like a tool. A browser tab feels temporary. I open Alfred every morning now the way I open Slack or my email client. It has presence.&lt;/p&gt;

&lt;h3&gt;
  
  
  Push Notifications with Service Workers
&lt;/h3&gt;

&lt;p&gt;The feature I am most proud of is the push notification system. Before I built it, Alfred was purely pull-based. I had to open the dashboard to see if anything needed attention. Proposed actions would sit in the approval queue for hours because I simply forgot to check. Follow-ups would go overdue silently.&lt;/p&gt;

&lt;p&gt;Push notifications made Alfred proactive. When the classification pipeline proposes a new action for approval, Alfred sends a push notification to my browser. When a high-priority email arrives, he notifies me immediately. When a DevOps PR webhook fires, I get a notification with a deep link straight to the approvals page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgt30tn8f85f2njr0p4up.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgt30tn8f85f2njr0p4up.png" alt=" " width="800" height="558"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The implementation uses the Web Push protocol with VAPID keys for authentication. The &lt;code&gt;SendNotification&lt;/code&gt; use case checks user preferences before sending. I can toggle notifications per event type from the Settings page, and for high-priority emails I can set a minimum priority threshold:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;pref&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;preferenceRepo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pref&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;pref&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;enabled&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;NotificationEventType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;HighPriorityEmail&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;emailPriority&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;PRIORITY_THRESHOLDS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;minPriority&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="nx"&gt;PRIORITY_THRESHOLDS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;high&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;emailPriority&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;WebPushAdapter&lt;/code&gt; sends to all registered browser subscriptions concurrently using &lt;code&gt;Promise.allSettled()&lt;/code&gt;, so a failed delivery to one device does not block others. It automatically cleans up expired subscriptions when the push service returns HTTP 410 or 404, which happens when a user clears browser data or uninstalls the PWA.&lt;/p&gt;
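&lt;p&gt;A rough sketch of that fan-out pattern looks something like this (the function and field names here are illustrative, not Alfred's actual adapter code):&lt;/p&gt;

```typescript
// Sketch of concurrent push delivery with per-subscription failure handling.
// `PushSubscriptionRecord` and `fanOut` are illustrative names, not the
// real Alfred implementation.
interface PushSubscriptionRecord {
  id: string;
  endpoint: string;
}

interface PushResult {
  id: string;
  status: "delivered" | "expired" | "failed";
}

async function fanOut(
  subscriptions: PushSubscriptionRecord[],
  send: (sub: PushSubscriptionRecord) => Promise<void>,
): Promise<PushResult[]> {
  // Promise.allSettled means one failed delivery never blocks the others.
  const settled = await Promise.allSettled(subscriptions.map((sub) => send(sub)));
  return settled.map((result, i) => {
    if (result.status === "fulfilled") return { id: subscriptions[i].id, status: "delivered" };
    // Push services signal an expired subscription with HTTP 410 Gone
    // (some return 404); those should be deleted rather than retried.
    const statusCode = (result.reason as { statusCode?: number } | undefined)?.statusCode;
    return {
      id: subscriptions[i].id,
      status: statusCode === 410 || statusCode === 404 ? "expired" : "failed",
    };
  });
}
```

&lt;p&gt;The caller can then delete every subscription that came back &lt;code&gt;expired&lt;/code&gt; in one pass, which is exactly the cleanup behaviour described above.&lt;/p&gt;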

&lt;p&gt;On the client side, a service worker listens for push events and displays native OS notifications with the app icon, a body preview, and a deep link URL. The &lt;code&gt;notificationclick&lt;/code&gt; handler is smart about reusing existing windows: if the dashboard is already open, it focuses that tab instead of opening a new one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nb"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addEventListener&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;notificationclick&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;notification&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;notification&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;waitUntil&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;clients&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;matchAll&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;window&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;includeUncontrolled&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;clients&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;clients&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;focus&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;focus&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;clients&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;openWindow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;usePushNotifications&lt;/code&gt; React hook manages the entire subscription lifecycle from the UI: checking browser support, requesting notification permission, fetching the VAPID public key from the server, subscribing via the Push API, and sending the subscription details to the server for storage. Unsubscribing reverses the process, removing the subscription from both the browser and the server database.&lt;/p&gt;
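&lt;p&gt;One fiddly detail in any subscribe flow like this: the Push API wants the VAPID public key as a &lt;code&gt;Uint8Array&lt;/code&gt;, while servers usually hand it out as a base64url string. A conversion helper along these lines (a generic sketch, not necessarily Alfred's exact code) bridges the two:&lt;/p&gt;

```typescript
// Convert a base64url-encoded VAPID public key into the Uint8Array that
// pushManager.subscribe() expects as `applicationServerKey`.
function urlBase64ToUint8Array(base64url: string): Uint8Array {
  // base64url drops padding and swaps two characters relative to base64.
  const padding = "=".repeat((4 - (base64url.length % 4)) % 4);
  const base64 = (base64url + padding).replace(/-/g, "+").replace(/_/g, "/");
  const raw = atob(base64); // atob exists in browsers and in Node 16+
  return Uint8Array.from(raw, (ch) => ch.charCodeAt(0));
}
```

&lt;p&gt;The hook would pass the result to &lt;code&gt;registration.pushManager.subscribe({ userVisibleOnly: true, applicationServerKey })&lt;/code&gt;.&lt;/p&gt;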

&lt;p&gt;What made this feel like a real discovery is how it changed my workflow. Before push notifications, Alfred was a dashboard I checked. After push notifications, Alfred is an assistant who taps me on the shoulder. The difference between pull and push is the difference between a tool and a colleague. When my phone buzzes with "Action: archive. Proposed archive for 'Your NIKE order has shipped', Master Jo," I smile every time. It feels like Alfred is actually there, running the household.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Implementations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Retrieval-Augmented Generation for Personal Knowledge
&lt;/h3&gt;

&lt;p&gt;The next frontier I want to explore is giving Alfred deep knowledge of everything I have written. I publish articles, write tweets, draft technical documentation, and take notes across multiple platforms. Right now Alfred knows my emails, my calendar, and my finances, but he does not know my voice. If someone asks me to write a thread about Clean Architecture, I start from scratch every time. If I need to reference a point I made in an article six months ago, I have to search manually.&lt;/p&gt;

&lt;p&gt;I plan to build a RAG pipeline that indexes my published content, tweets, notes, and drafts into a vector store. A good friend of mine, Edem Kumodzi, already does this; read his article &lt;a href="https://edemkumodzi.com/posts/building-a-chatbot-from-15-years-of-my-own-writing/" rel="noopener noreferrer"&gt;here&lt;/a&gt;. When I ask Alfred to help me write something, he would retrieve relevant passages from my own prior work and use them as context for generation. The goal is not for Alfred to write as me, but to write with full awareness of what I have already said, how I say it, and what positions I have taken. He should be able to say: "Master Jo, you wrote about this exact topic in your March article. Shall I pull the relevant points as a starting foundation?"&lt;/p&gt;
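&lt;p&gt;The retrieval step at the heart of such a pipeline is conceptually simple: score stored chunks against a query embedding and keep the best matches. A minimal sketch (with plain arrays standing in for a real embedding model and vector store):&lt;/p&gt;

```typescript
// Illustrative top-k retrieval by cosine similarity. In a real pipeline the
// embeddings would come from an embedding model and live in a vector store.
interface Chunk {
  text: string;
  embedding: number[];
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function retrieve(query: number[], chunks: Chunk[], k: number): Chunk[] {
  // Sort a copy by descending similarity and keep the top k chunks.
  return [...chunks]
    .sort((x, y) => cosine(query, y.embedding) - cosine(query, x.embedding))
    .slice(0, k);
}
```

&lt;p&gt;The retrieved passages would then be injected into the generation prompt as context, which is the "write with full awareness of what I have already said" part.&lt;/p&gt;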

&lt;p&gt;This is a step toward something larger. I want Alfred to have a total embodiment of who I am — not a shallow personality clone, but a deep contextual understanding of my thinking, my writing style, my professional opinions, and my personal preferences. He should know that I care about Clean Architecture and SOLID principles, that I have strong opinions about over-engineering, and that I prefer concise explanations with concrete examples. At the same time, he should remain his own person: a distinct entity with his butler persona who assists me rather than pretending to be me. The line between "knows me well" and "impersonates me" is one I want to walk carefully.&lt;/p&gt;

&lt;h3&gt;
  
  
  Expanding Service Integrations
&lt;/h3&gt;

&lt;p&gt;Alfred currently connects to Google Workspace, Microsoft 365, and Azure DevOps. I want to push further into the services that shape my daily life.&lt;/p&gt;

&lt;p&gt;WhatsApp is where most of my personal communication happens. The ability to search messages, get summaries of group conversations I have missed, or draft replies through Alfred would close a major gap. The challenge is that WhatsApp's API is designed for businesses rather than personal use, so I will likely need to explore the WhatsApp Business API with creative workarounds.&lt;/p&gt;

&lt;p&gt;LinkedIn is the integration I am most excited about. I got the idea from a podcast about the discipline of maintaining professional relationships, and it resonated because I am genuinely terrible at it. I connect with people at conferences, have great conversations, and then never follow up. Alfred could do something far more personal than LinkedIn's built-in "keep in touch" feature: track my connections, identify people I have not interacted with in a while, cross-reference them with my calendar and email history, and nudge me with context. Not just "you haven't talked to Sarah in 3 months" but "you haven't talked to Sarah in 3 months. You last discussed the migration project at her company. She posted about a promotion last week. Shall I draft a congratulations message, Master Jo?" That level of contextual nudging is what turns a contact list into actual relationships.&lt;/p&gt;

&lt;p&gt;Spotify might seem like an odd fit for a workspace assistant, but I spend a significant amount of my commute and focus time listening to engineering podcasts. I want Alfred to suggest relevant episodes based on what I am currently working on. If I am deep in a week of building a notification system, Alfred could recommend episodes about push notification architecture, service workers, or PWA best practices. The Spotify API is well-documented with solid search and recommendation endpoints, so this should be one of the more straightforward integrations to build.&lt;/p&gt;

&lt;h3&gt;
  
  
  Smart Home Integration
&lt;/h3&gt;

&lt;p&gt;I have been thinking about extending Alfred beyond the digital workspace and into my physical space. Apple Shortcuts provides a bridge between software and home devices. If I can trigger Shortcuts programmatically, Alfred could control lights, check device status, set scenes, and interact with HomeKit accessories through natural language.&lt;/p&gt;

&lt;p&gt;The most entertaining use case involves Juliana, my robot vacuum. She runs on a schedule, but I never actually know if she has finished cleaning or got stuck under the couch again. If I can query her status through a Shortcut or her manufacturer's API, Alfred could include in my morning briefing: "Juliana completed her cleaning cycle at 3 AM, Master Jo. All rooms covered, no incidents to report." Or more usefully: "Juliana appears to be stuck in the bedroom. She has not moved in 40 minutes. Shall I send a rescue party?"&lt;/p&gt;

&lt;p&gt;The broader vision is for Alfred to be aware of my home the same way he is aware of my inbox. When I ask "is everything in order?", he should be able to answer with a status report covering emails, calendar, pending approvals, financial alerts, and whether the house has been cleaned. A proper butler would never limit his awareness to just the mail.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Second Persona
&lt;/h3&gt;

&lt;p&gt;My girlfriend has watched me use Alfred. This sparked an idea I had not considered: cloning Alfred's architecture for a second persona. The entire system is built on Clean Architecture with dependency injection, which means the persona, the rules, and the connected accounts are all configurable. The core infrastructure covering polling, classification, the action lifecycle, push notifications, and chat strategies is entirely provider-agnostic and user-agnostic.&lt;/p&gt;

&lt;p&gt;In theory, creating a second instance means standing up another agent server pointed at different OAuth credentials, a different SQLite database, a different set of action rules, and a different system prompt. The persona would not be Alfred. She would get her own character, her own name, and her own way of speaking. But underneath, the same &lt;code&gt;ChatService&lt;/code&gt;, the same &lt;code&gt;ToolRegistry&lt;/code&gt;, the same &lt;code&gt;AgentLoop&lt;/code&gt;, and the same strategy pattern would power everything.&lt;/p&gt;

&lt;p&gt;The part that interests me most is how the persona shapes the experience. Alfred's butler character is not just flavour text. It affects how he delivers bad news ("I regret to inform you, Master Jo, that your credit card statement shows a rather generous dining budget this month"), how he prioritises information, and how he handles ambiguity. A different persona for a different person would need to match their communication style and preferences entirely. This is where the &lt;code&gt;buildSystemPrompt()&lt;/code&gt; architecture pays off. The base capabilities and mode-specific instructions stay constant, while the persona layer is a separate, swappable block. Building a second agent is less about rewriting code and more about crafting a new character who happens to run on the same engine.&lt;/p&gt;
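&lt;p&gt;The shape of that swappable persona layer can be sketched like this (the signature and field names are assumptions for illustration, not Alfred's actual &lt;code&gt;buildSystemPrompt()&lt;/code&gt;):&lt;/p&gt;

```typescript
// Base capabilities and mode instructions stay constant; the persona is a
// separate, swappable block composed in at prompt-build time.
interface Persona {
  name: string;
  voice: string; // e.g. "formal English butler", or anything else entirely
}

function buildSystemPrompt(base: string, modeInstructions: string, persona: Persona): string {
  return [
    base,
    modeInstructions,
    `You are ${persona.name}. Speak in the style of a ${persona.voice}.`,
  ].join("\n\n");
}
```

&lt;p&gt;With this shape, standing up a second agent with her own character is just a different &lt;code&gt;Persona&lt;/code&gt; object in the composition root; the engine underneath does not change.&lt;/p&gt;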

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building Alfred started as a weekend experiment: a polling loop that checked Gmail and labelled anything that looked important. What it became, over months of iteration, is something I did not fully anticipate: a personal operating system that sits between me and the noise of digital life.&lt;/p&gt;

&lt;p&gt;The biggest lesson was not technical. It was architectural. Clean Architecture is not just an academic exercise you draw on whiteboards. It is the reason I was able to bolt on Microsoft Teams notifications, bank statement processing, and a full chat interface without rewriting the core. When your domain layer knows nothing about Gmail, adding Outlook is just another adapter. When your use cases speak in ports, swapping Claude Haiku for Sonnet is a one-line change in the composition root. The upfront cost of drawing those boundaries paid for itself ten times over.&lt;/p&gt;

&lt;p&gt;That said, the path was not smooth. The jump from intent extraction to native tool use humbled me. Prompt engineering is not engineering in the traditional sense. There is no compiler to catch your mistakes, no type system to lean on. You ship a prompt, watch it hallucinate a tool name that does not exist, and go back to the drawing board. The multi-round reasoning loop took more iterations than any other feature, not because the code was complex, but because coaxing an LLM into reliable, structured behaviour across multiple turns is genuinely hard. Every fix revealed a new edge case. Every edge case demanded a new constraint in the system prompt. I have a much deeper respect now for anyone building production agentic systems.&lt;/p&gt;

&lt;p&gt;The discovery that surprised me most was how naturally financial data fit into the system. I built Alfred to manage emails. The fact that bank statements arrive as email attachments meant the entire PDF extraction and transaction classification pipeline was, architecturally, just another use case plugged into the same ports. The backfill system, the hybrid classifier, the per-bank parser registry: none of it required changes to the core domain. That is Clean Architecture doing exactly what it promises.&lt;/p&gt;

&lt;p&gt;Running everything on a Mac on my desk with a Cloudflare Tunnel was a deliberate choice. There is no monthly cloud bill. There is no cold start. My data never leaves my network unless I am the one requesting it through an encrypted tunnel. For a personal assistant that reads your email, knows your calendar, and processes your bank statements, that is not a nice-to-have. It is a requirement.&lt;/p&gt;

&lt;p&gt;Alfred is far from finished. RAG-powered memory, WhatsApp integration, smart home control: the roadmap is long. But the foundation is solid. Every new capability I have added has reinforced the same pattern: define a port, write the use case, build the adapter, wire it in the composition root. The system grows without becoming fragile because each piece knows only what it needs to know.&lt;/p&gt;

&lt;p&gt;If there is one thing I would tell someone starting a similar project, it is this: invest in the boundaries early. Not the features, not the UI, not the clever LLM tricks. The boundaries. Get the dependency direction right. Make your domain layer boring. Let your infrastructure layer be the only place that knows about the outside world. Everything else follows from that discipline. Alfred taught me that the most powerful personal software is not the one with the most features. It is the one you can keep evolving without fear of breaking what already works.&lt;/p&gt;

&lt;p&gt;See you in the next one 😁&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>productivity</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Increasing Technical Onboarding Velocity for Your Engineering Team</title>
      <dc:creator>JOOJO DONTOH</dc:creator>
      <pubDate>Fri, 05 Dec 2025 06:16:26 +0000</pubDate>
      <link>https://dev.to/joojodontoh/increasing-technical-onboarding-velocity-for-your-engineering-team-1lmo</link>
      <guid>https://dev.to/joojodontoh/increasing-technical-onboarding-velocity-for-your-engineering-team-1lmo</guid>
      <description>&lt;h2&gt;
  
  
  TLDR
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Engineers change teams frequently, and slow onboarding wastes everyone's time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Goal:&lt;/strong&gt; Minimise the time between &lt;code&gt;git clone&lt;/code&gt; and first meaningful pull request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key practices:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Setup scripts&lt;/strong&gt; that interactively guide new engineers through environment configuration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code formatting tools&lt;/strong&gt; (Prettier, ESLint) committed to the repo so standards are automatic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Brief READMEs&lt;/strong&gt; focused on "how to run this" rather than business context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Descriptive file naming&lt;/strong&gt; (&lt;code&gt;transaction.service.ts&lt;/code&gt;, &lt;code&gt;pos.client.ts&lt;/code&gt;) so the codebase is navigable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comprehensive tests&lt;/strong&gt; that serve as living documentation and give new engineers confidence to make changes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hooks&lt;/strong&gt; to catch issues before they reach code review&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Protected branches&lt;/strong&gt; and required approvals to prevent accidental mistakes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Common libraries and pipeline templates&lt;/strong&gt; for multi-service teams&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; New engineers get productive in days, reviews are shorter, and service owners spend less time hand-holding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trade-off:&lt;/strong&gt; Requires upfront investment, but pays dividends with every new team member.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Hello my people, it's me again 😄. Today I want to talk about engineering onboarding. So what is that? 🤔 In very simple terms, it is the journey between an engineer's first introduction to a codebase and the moment they can confidently open a meaningful, safe pull request. It's that critical window where confusion transforms into contribution.&lt;/p&gt;

&lt;p&gt;The global job landscape for software engineers is highly dynamic and volatile. According to &lt;a href="https://www.zippia.com/software-engineer-jobs/demographics/" rel="noopener noreferrer"&gt;Zippia's analysis of over 100,000 software developer profiles&lt;/a&gt;, 69% of software engineers have a tenure of less than 2 years at their current job. At large tech companies, this number skews even shorter, with &lt;a href="https://www.centumsearch.com/employee-tenure-and-retention-for-tech-leaders-in-2024" rel="noopener noreferrer"&gt;average tenures ranging from 1 to 3 years&lt;/a&gt;. The tech industry also carries one of the highest turnover rates across all industries, estimated at 13.2% according to LinkedIn workforce data. This reality means that more engineers than ever will find themselves in onboarding situations throughout their careers.&lt;/p&gt;

&lt;p&gt;Onboarding isn't limited to new hires either. It happens when engineers switch teams internally, when a service gets transferred from one squad to another, or when engineering resources are borrowed temporarily for critical projects. Each of these scenarios demands the same thing: getting someone productive in an unfamiliar codebase as quickly as possible.&lt;/p&gt;

&lt;p&gt;As an engineering lead, I've seen firsthand how a rough onboarding experience can slow down delivery, frustrate talented people, and introduce risk into production systems. This article aims to share practical strategies for making onboarding smooth and fast, while minimising the fear of new team members accidentally breaking things.&lt;/p&gt;

&lt;p&gt;A quick note: onboarding involves non-technical aspects as well, such as team rituals, communication norms, and stakeholder relationships. Those matter deeply, but this article will focus specifically on the technical side of getting engineers productive and confident in your codebase.&lt;/p&gt;

&lt;h2&gt;
  
  
  Aspects of Knowledge a New Engineer Should Be Aware Of
&lt;/h2&gt;

&lt;p&gt;Before an engineer can contribute meaningfully to a codebase, there are several knowledge areas they need to get up to speed on. Some of these are explicit and documented, others are tribal knowledge passed down through code reviews and hallway conversations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Domain Knowledge
&lt;/h3&gt;

&lt;p&gt;Understanding the business domain of the service you're working on isn't always a prerequisite for making changes. You can fix a bug in a fuel pricing service without fully understanding the intricacies of how pump prices are calculated. However, when it comes to adding features or making architectural decisions, domain knowledge becomes crucial: it leads to higher-quality contributions and fewer review round trips.&lt;/p&gt;

&lt;p&gt;Consider this example: an engineer is tasked with adding a "price override" feature to a convenience retail POS system. Without understanding the domain, they might implement it as a simple field that replaces the scanned price. But someone with domain knowledge would know that price overrides in retail need to account for manager approval workflows, audit trails for loss prevention, tax recalculations, loyalty point adjustments, and integration with the back-office reporting system. They'd also know that certain items like fuel, tobacco, and alcohol often have regulatory restrictions on price modifications. The engineer lacking this context might go through three or four review cycles before landing on the right approach, while someone with domain understanding gets it right the first time.&lt;/p&gt;

&lt;p&gt;This knowledge transfer is typically handled through a buddy system where an assigned team member walks the new joiner through the current architecture at a high level. One important note here: keeping architecture diagrams up to date can feel like thankless work with no short-term rewards, but it pays dividends every time someone new joins the team.&lt;/p&gt;

&lt;h3&gt;
  
  
  Team Rituals
&lt;/h3&gt;

&lt;p&gt;Stand-ups, sprint ceremonies, retrospectives, RCAs (Root Cause Analyses), and other team rituals are also part of onboarding. These won't be covered in this article since we're focusing on technical aspects, but they're worth mentioning for completeness.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tech-Stack Familiarity
&lt;/h3&gt;

&lt;p&gt;Tech-stack familiarity is usually filtered for during hiring or internal transfers. If you're hiring a backend engineer for a Java-based integration team, you're likely looking for candidates with Java or similar JVM experience. Knowledge of the stack naturally makes onboarding smoother.&lt;/p&gt;

&lt;p&gt;That said, smooth onboarding practices become even more critical when tech-stack familiarity is low. If you've hired a strong engineer from a Python background into your Apache Camel and Spring Boot codebase, your onboarding process needs to carry more of the load.&lt;/p&gt;

&lt;h3&gt;
  
  
  Coding Standards
&lt;/h3&gt;

&lt;p&gt;Every team develops conventions around how code should be written and organised. These include file naming standards, variable naming conventions, indentation preferences, and file structure patterns.&lt;/p&gt;

&lt;p&gt;Some teams prefer their folder structure to mirror API endpoints. For example, in a retail integration service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;src/
├── api/
│   ├── v1/
│   │   ├── transactions/
│   │   │   ├── transactions.controller.ts
│   │   │   ├── transactions.service.ts
│   │   │   └── transactions.routes.ts
│   │   ├── inventory/
│   │   │   ├── inventory.controller.ts
│   │   │   ├── inventory.service.ts
│   │   │   └── inventory.routes.ts
│   │   └── fuel-prices/
│   │       ├── fuel-prices.controller.ts
│   │       ├── fuel-prices.service.ts
│   │       └── fuel-prices.routes.ts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this structure, if a new engineer needs to work on the &lt;code&gt;GET /api/v1/inventory&lt;/code&gt; endpoint that returns current tank dip readings, they immediately know to look in &lt;code&gt;src/api/v1/inventory/&lt;/code&gt;. The cognitive load of navigating the codebase drops significantly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Authentication, Environment Variables, and Secrets
&lt;/h3&gt;

&lt;p&gt;This is often where onboarding gets frustrating. Different companies handle secrets and environment configuration in vastly different ways, and the friction here can make or break someone's first few days.&lt;/p&gt;

&lt;p&gt;More mature organisations orchestrate access at an enterprise level using tools like ServiceNow, HashiCorp Vault, or AWS Secrets Manager, where permissions are tied to identity and granted automatically based on team membership. The less manual this process is, the better.&lt;/p&gt;

&lt;p&gt;For teams without enterprise-grade tooling, here are some common approaches:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Symmetric encryption within the codebase:&lt;/strong&gt; Some teams encrypt their &lt;code&gt;.env&lt;/code&gt; files using a tool like &lt;a href="https://github.com/AGWA/git-crypt" rel="noopener noreferrer"&gt;git-crypt&lt;/a&gt; or &lt;a href="https://github.com/getsops/sops" rel="noopener noreferrer"&gt;sops&lt;/a&gt; and store them directly in the repository. New engineers just need the decryption password to access everything. This approach is convenient but carries risk since the password becomes a single point of compromise. A sensible mitigation is to only encrypt secrets for lower environments like &lt;code&gt;dev&lt;/code&gt; and &lt;code&gt;staging&lt;/code&gt;, keeping production secrets in a more secure system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Encrypted files outside the codebase:&lt;/strong&gt; Secrets are stored in a shared location (like an S3 bucket or internal file store) with company-wide access controls. Engineers with the right permissions can download what they need.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Manual sharing:&lt;/strong&gt; The most primitive approach. Someone on the team carefully shares env files via secure channels. It works, but it doesn't scale and is prone to human error.&lt;/p&gt;

&lt;p&gt;Whichever approach your team uses, the goal should be minimising the time between "I've cloned the repo" and "I have everything I need to run this locally."&lt;/p&gt;

&lt;h2&gt;
  
  
  Things Needed to Ensure Fast and Clean Onboarding
&lt;/h2&gt;

&lt;p&gt;This section covers the practical tooling and processes that make onboarding frictionless. I've split it into two parts: getting engineers set up quickly (code pickup), and enabling them to make changes safely (change integration).&lt;/p&gt;




&lt;h3&gt;
  
  
  Part 1: Smooth Code Pickup
&lt;/h3&gt;

&lt;p&gt;The goal here is simple: minimise the time between &lt;code&gt;git clone&lt;/code&gt; and "I have a working local environment."&lt;/p&gt;

&lt;h4&gt;
  
  
  Code Formatting Standardisation
&lt;/h4&gt;

&lt;p&gt;Inconsistent code formatting creates unnecessary noise in pull requests and wastes mental energy. A new engineer shouldn't have to guess whether to use tabs or spaces, or whether to add trailing commas.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://prettier.io/" rel="noopener noreferrer"&gt;Prettier&lt;/a&gt; is one of the most popular tools for solving this. Commit a &lt;code&gt;.prettierrc&lt;/code&gt; file to your repository and every engineer's code gets formatted the same way:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"semi"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"singleQuote"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tabWidth"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"trailingComma"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"es5"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"printWidth"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Don't forget a &lt;code&gt;.prettierignore&lt;/code&gt; file to prevent formatting generated files or dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node_modules/
dist/
coverage/
*.generated.ts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a new engineer opens the codebase, these config files immediately communicate the team's standards without anyone needing to explain them.&lt;/p&gt;
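&lt;p&gt;It also helps to expose the formatter through npm scripts so nobody has to remember CLI flags. The script names below are my own convention, not a Prettier requirement:&lt;/p&gt;

```json
// package.json (scripts section only)
{
  "scripts": {
    "format": "prettier --write .",
    "format:check": "prettier --check ."
  }
}
```

&lt;p&gt;&lt;code&gt;format:check&lt;/code&gt; fails without modifying files, which makes it suitable for CI and git hooks.&lt;/p&gt;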

&lt;h4&gt;
  
  
  Handy Scripts
&lt;/h4&gt;

&lt;p&gt;Instead of a README that says "run &lt;code&gt;npm install&lt;/code&gt;, then set up your &lt;code&gt;.env&lt;/code&gt; file, then run &lt;code&gt;docker-compose up&lt;/code&gt;, then..." wrap all of this in scripts. Scripts are executable documentation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup Script&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A setup script handles dependency installation and environment preparation. Here's an example for a retail POS integration service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# setup.sh - Interactive setup script for POS Integration Service&lt;/span&gt;

&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"🚀 Setting up POS Integration Service..."&lt;/span&gt;

&lt;span class="c"&gt;# Check Node version&lt;/span&gt;
&lt;span class="nv"&gt;REQUIRED_NODE_VERSION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"18"&lt;/span&gt;
&lt;span class="nv"&gt;CURRENT_NODE_VERSION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;node &lt;span class="nt"&gt;-v&lt;/span&gt; | &lt;span class="nb"&gt;cut&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;&lt;span class="s1"&gt;'v'&lt;/span&gt; &lt;span class="nt"&gt;-f2&lt;/span&gt; | &lt;span class="nb"&gt;cut&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;&lt;span class="s1"&gt;'.'&lt;/span&gt; &lt;span class="nt"&gt;-f1&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CURRENT_NODE_VERSION&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-lt&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$REQUIRED_NODE_VERSION&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"❌ Node.js version &lt;/span&gt;&lt;span class="nv"&gt;$REQUIRED_NODE_VERSION&lt;/span&gt;&lt;span class="s2"&gt; or higher is required."&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"   Current version: &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;node &lt;span class="nt"&gt;-v&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"   Install via: https://nodejs.org/ or use nvm"&lt;/span&gt;
    &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi
&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"✅ Node.js version: &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;node &lt;span class="nt"&gt;-v&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# Check for .env file&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; .env &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"📋 No .env file found."&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"   Would you like to:"&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"   1) Copy from .env.example (recommended for new setup)"&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"   2) Decrypt from .env.encrypted (requires team password)"&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"   3) Skip (I'll set it up manually)"&lt;/span&gt;
    &lt;span class="nb"&gt;read&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"   Enter choice [1-3]: "&lt;/span&gt; env_choice

    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="nv"&gt;$env_choice&lt;/span&gt; &lt;span class="k"&gt;in
        &lt;/span&gt;1&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
            &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"✅ Created .env from .env.example"&lt;/span&gt;
            &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"   ⚠️  Remember to update placeholder values"&lt;/span&gt;
            &lt;span class="p"&gt;;;&lt;/span&gt;
        2&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nb"&gt;read&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"   Enter decryption password: "&lt;/span&gt; password
            &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt;
            openssl aes-256-cbc &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;-in&lt;/span&gt; .env.encrypted &lt;span class="nt"&gt;-out&lt;/span&gt; .env &lt;span class="nt"&gt;-pass&lt;/span&gt; pass:&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$password&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
            &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"✅ Decrypted .env file"&lt;/span&gt;
            &lt;span class="p"&gt;;;&lt;/span&gt;
        3&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"⏭️  Skipping .env setup"&lt;/span&gt;
            &lt;span class="p"&gt;;;&lt;/span&gt;
    &lt;span class="k"&gt;esac&lt;/span&gt;
&lt;span class="k"&gt;else
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"✅ .env file exists"&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# Validate critical env vars&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; .env &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;source&lt;/span&gt; .env
    &lt;span class="nv"&gt;MISSING_VARS&lt;/span&gt;&lt;span class="o"&gt;=()&lt;/span&gt;

    &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-z&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$POS_API_BASE_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; MISSING_VARS+&lt;span class="o"&gt;=(&lt;/span&gt;&lt;span class="s2"&gt;"POS_API_BASE_URL"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-z&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$AZURE_SERVICE_BUS_CONNECTION&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; MISSING_VARS+&lt;span class="o"&gt;=(&lt;/span&gt;&lt;span class="s2"&gt;"AZURE_SERVICE_BUS_CONNECTION"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-z&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$S3_BUCKET_NAME&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; MISSING_VARS+&lt;span class="o"&gt;=(&lt;/span&gt;&lt;span class="s2"&gt;"S3_BUCKET_NAME"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="k"&gt;${#&lt;/span&gt;&lt;span class="nv"&gt;MISSING_VARS&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="nt"&gt;-gt&lt;/span&gt; 0 &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
        &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"⚠️  Missing required environment variables:"&lt;/span&gt;
        &lt;span class="k"&gt;for &lt;/span&gt;var &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;MISSING_VARS&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
            &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"   - &lt;/span&gt;&lt;span class="nv"&gt;$var&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;done
    else
        &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"✅ All required environment variables are set"&lt;/span&gt;
    &lt;span class="k"&gt;fi
fi&lt;/span&gt;

&lt;span class="c"&gt;# Install dependencies&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"📦 Installing dependencies..."&lt;/span&gt;
npm ci
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"✅ Dependencies installed"&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"🎉 Setup complete! Run './start.sh' to start the service."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice how the script is interactive. It doesn't just fail silently when something is missing. It guides the engineer through decisions and helps them understand what the system needs. This is far more educational than a wall of README text.&lt;/p&gt;
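&lt;p&gt;Option 2 above assumes someone produced &lt;code&gt;.env.encrypted&lt;/code&gt; in the first place. A minimal sketch of that counterpart, assuming OpenSSL is available (the function names are mine, not part of the article's repo), must use the same cipher flags as the setup script or the round trip fails:&lt;/p&gt;

```shell
# Hypothetical helpers mirroring option 2 in setup.sh. encrypt_env produces
# the .env.encrypted that setup.sh later decrypts.
encrypt_env() {
  openssl aes-256-cbc -salt -in .env -out .env.encrypted -pass pass:"$1"
}

decrypt_env() {
  openssl aes-256-cbc -d -in .env.encrypted -out .env -pass pass:"$1"
}
```

&lt;p&gt;Commit &lt;code&gt;.env.encrypted&lt;/code&gt;, never &lt;code&gt;.env&lt;/code&gt; itself, and share the password out of band.&lt;/p&gt;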

&lt;p&gt;&lt;strong&gt;Startup Script&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A startup script gets the application running locally. It should aim to be system-agnostic by leveraging containers where possible:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# start.sh - Start the POS Integration Service locally&lt;/span&gt;

&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"🚀 Starting POS Integration Service..."&lt;/span&gt;

&lt;span class="c"&gt;# Check if setup has been run&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"node_modules"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"❌ Dependencies not installed. Run './setup.sh' first."&lt;/span&gt;
    &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# Start local dependencies (mocked external services)&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"📦 Starting local dependencies..."&lt;/span&gt;
docker-compose up &lt;span class="nt"&gt;-d&lt;/span&gt; localstack mockpos

&lt;span class="c"&gt;# Wait for dependencies to be healthy&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"⏳ Waiting for dependencies..."&lt;/span&gt;
&lt;span class="nb"&gt;sleep &lt;/span&gt;5

&lt;span class="c"&gt;# Start the service&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"🏃 Starting service in development mode..."&lt;/span&gt;
npm run dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Some teams combine setup and startup into a single script. Others keep them separate so you don't re-run setup every time you want to start the service. Either approach works as long as it's consistent.&lt;/p&gt;
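&lt;p&gt;One refinement worth considering: the &lt;code&gt;sleep 5&lt;/code&gt; in the startup script is the simplest thing that can work, but it's a guess. A sturdier sketch polls a health endpoint until it answers (the URL and retry budget below are illustrative):&lt;/p&gt;

```shell
# Poll a URL until it responds or the retry budget runs out. Replaces the
# fixed sleep with an explicit readiness check.
wait_for() {
  url="$1"
  retries="${2:-30}"
  i=0
  while [ "$i" -lt "$retries" ]; do
    if curl -sf -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "Timed out waiting for $url"
  return 1
}
```

&lt;p&gt;For example, &lt;code&gt;wait_for "http://localhost:4566/_localstack/health"&lt;/code&gt; before starting the service, assuming a recent LocalStack.&lt;/p&gt;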

&lt;p&gt;Example output from another script:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9qwkxs803v76yrw59qu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9qwkxs803v76yrw59qu.png" alt="example output" width="800" height="962"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Functionality Test Script (Optional)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For extra confidence, you can provide a script that runs a quick smoke test against the locally running service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# smoke-test.sh - Verify the service is running correctly&lt;/span&gt;

&lt;span class="nv"&gt;BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:3000"&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"🧪 Running smoke tests..."&lt;/span&gt;

&lt;span class="c"&gt;# Health check&lt;/span&gt;
&lt;span class="nv"&gt;HEALTH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; /dev/null &lt;span class="nt"&gt;-w&lt;/span&gt; &lt;span class="s2"&gt;"%{http_code}"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$BASE_URL&lt;/span&gt;&lt;span class="s2"&gt;/health"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HEALTH&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-eq&lt;/span&gt; 200 &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"✅ Health endpoint responding"&lt;/span&gt;
&lt;span class="k"&gt;else
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"❌ Health check failed (HTTP &lt;/span&gt;&lt;span class="nv"&gt;$HEALTH&lt;/span&gt;&lt;span class="s2"&gt;)"&lt;/span&gt;
    &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# Test transaction endpoint with mock data&lt;/span&gt;
&lt;span class="nv"&gt;RESPONSE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$BASE_URL&lt;/span&gt;&lt;span class="s2"&gt;/api/v1/transactions"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"storeId": "TEST001", "items": [{"sku": "MOCK123", "quantity": 1}]}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$RESPONSE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="s2"&gt;"transactionId"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"✅ Transaction endpoint working"&lt;/span&gt;
&lt;span class="k"&gt;else
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"❌ Transaction endpoint failed"&lt;/span&gt;
    &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi

&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"🎉 All smoke tests passed!"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Brief but Useful README
&lt;/h4&gt;

&lt;p&gt;People don't read long READMEs. Keep yours focused on operability rather than explaining the business problem the service solves. Save that for Confluence or your internal docs.&lt;/p&gt;

&lt;p&gt;A good README structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# POS Integration Service&lt;/span&gt;

Handles transaction processing between store POS systems and central data lake.

&lt;span class="gu"&gt;## Quick Start&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
bash&lt;br&gt;
./setup.sh    # First time only&lt;br&gt;
./start.sh    # Start the service&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Service runs at `http://localhost:3000`

## Useful Commands

- `npm run test` - Run unit tests
- `npm run test:integration` - Run integration tests
- `./smoke-test.sh` - Verify local setup works

## Documentation

- [Architecture Diagram](https://confluence.internal/pos-integration/architecture)
- [API Specification](https://confluence.internal/pos-integration/api-spec)
- [Runbook](https://confluence.internal/pos-integration/runbook)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. A new engineer can get running in under a minute and knows where to find deeper documentation when needed.&lt;/p&gt;




&lt;h3&gt;
  
  
  Part 2: Smooth Change Integration
&lt;/h3&gt;

&lt;p&gt;Once an engineer is set up locally, the next challenge is enabling them to make changes confidently without breaking things.&lt;/p&gt;

&lt;h4&gt;
  
  
  Descriptive File and Function Naming
&lt;/h4&gt;

&lt;p&gt;Clear naming conventions reduce the learning curve dramatically. When files are named descriptively, new engineers can navigate the codebase intuitively.&lt;/p&gt;

&lt;p&gt;Consider a retail integration service with these common file patterns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;src/
├── clients/
│   ├── pos.client.ts           # Handles POS API communication
│   ├── serviceBus.client.ts    # Azure Service Bus operations
│   └── s3.client.ts            # S3 storage operations
├── services/
│   ├── transaction.service.ts  # Transaction business logic
│   └── inventory.service.ts    # Inventory business logic
├── utils/
│   ├── date.utils.ts           # Date formatting helpers
│   └── validation.utils.ts     # Input validation helpers
└── builders/
    └── transaction.builder.ts  # Builds transaction payloads
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a new engineer needs to modify how transactions are sent to S3, they know to look in &lt;code&gt;s3.client.ts&lt;/code&gt;. If they need to change business logic, they check the services folder. The naming convention acts as a map.&lt;/p&gt;

&lt;p&gt;Treat these as principles rather than rigid rules. The goal is descriptive, predictable naming that helps people find what they need.&lt;/p&gt;

&lt;h4&gt;
  
  
  Unit Tests
&lt;/h4&gt;

&lt;p&gt;All those clients, utils, helpers, and services should have accompanying tests. When a new team member modifies &lt;code&gt;transaction.service.ts&lt;/code&gt;, they can run the tests to verify they haven't broken existing functionality:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// transaction.service.test.ts&lt;/span&gt;
&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;TransactionService&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;processTransaction&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;it&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;should calculate correct total for multiple items&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;sku&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;FUEL001&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;45.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;unitPrice&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.89&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;sku&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SNACK001&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;unitPrice&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;3.50&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;];&lt;/span&gt;

      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;transactionService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;processTransaction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

      &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;total&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;92.995&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="nf"&gt;it&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;should apply fuel discount for loyalty members&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;sku&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;FUEL001&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;unitPrice&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.89&lt;/span&gt; &lt;span class="p"&gt;}];&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;loyaltyId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;LOYALTY123&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;transactionService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;processTransaction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;loyaltyId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

      &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fuelDiscount&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;total&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;71.60&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tests serve as living documentation. A new engineer can read the test file to understand what a function is supposed to do without digging through implementation details.&lt;/p&gt;

&lt;h4&gt;
  
  
  Pre-commit and Pre-push Hooks
&lt;/h4&gt;

&lt;p&gt;Git hooks catch issues before they reach the remote repository. Tools like &lt;a href="https://typicode.github.io/husky/" rel="noopener noreferrer"&gt;Husky&lt;/a&gt; make this easy to set up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;package.json&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"husky"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"hooks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"pre-commit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npm run lint &amp;amp;&amp;amp; npm run format:check"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"pre-push"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npm run test"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A typical setup runs linting and format checks on commit (fast feedback), and runs tests before push (thorough validation).&lt;/p&gt;

&lt;p&gt;One word of caution: keep these hooks fast. If your pre-commit takes 30 seconds, engineers will start bypassing it with &lt;code&gt;--no-verify&lt;/code&gt;. Aim for under 5 seconds on pre-commit.&lt;/p&gt;
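&lt;p&gt;One way to stay under that budget is to check only the files staged for the current commit rather than the whole repository. This is the idea tools like &lt;code&gt;lint-staged&lt;/code&gt; automate; a hand-rolled sketch:&lt;/p&gt;

```shell
# List staged (added/copied/modified) source files; the extension filter is
# illustrative. grep exits non-zero when nothing matches, hence the || true.
staged_source_files() {
  git diff --cached --name-only --diff-filter=ACM | grep -E '\.(ts|js|json)$' || true
}

# In the pre-commit hook (illustrative):
#   files=$(staged_source_files)
#   if [ -n "$files" ]; then npx prettier --check $files; fi
```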

&lt;h4&gt;
  
  
  Common Libraries (For Multi-Service Teams)
&lt;/h4&gt;

&lt;p&gt;When your team owns multiple services, you'll notice patterns emerging. The same S3 client code, the same transaction builder, the same logging setup. Instead of copy-pasting across repositories, extract these into a shared library.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// @my-org/retail-common&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;S3Client&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@my-org/retail-common&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;TransactionBuilder&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@my-org/retail-common&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;s3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;S3Client&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;S3_BUCKET&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;transaction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;TransactionBuilder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withStore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;STORE001&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withItems&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few guidelines for common libraries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use semantic versioning with alpha/beta releases so teams can test changes before they go stable&lt;/li&gt;
&lt;li&gt;Write rigorous tests. A bug in a common library affects every consuming service&lt;/li&gt;
&lt;li&gt;Document breaking changes clearly in your changelog&lt;/li&gt;
&lt;/ul&gt;
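
&lt;p&gt;To make the versioning point concrete, here's one way to cut an alpha release with npm. The package name and version are just for illustration, and your registry setup may differ:&lt;/p&gt;

```json
{
  "name": "@my-org/retail-common",
  "version": "2.4.0-alpha.1",
  "publishConfig": {
    "tag": "alpha"
  }
}
```

&lt;p&gt;Publishing under the &lt;code&gt;alpha&lt;/code&gt; dist-tag means teams opt in with &lt;code&gt;npm install @my-org/retail-common@alpha&lt;/code&gt;, while everyone installing normally stays on the last stable release.&lt;/p&gt;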

&lt;h4&gt;
  
  
  Pipeline Repositories (For Multi-Service Teams)
&lt;/h4&gt;

&lt;p&gt;GitHub Actions, GitLab CI, and Azure Pipelines all support reusable workflow definitions. Instead of duplicating deployment logic across repositories, centralise it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# In your service repository&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-org/pipeline-templates/.github/workflows/deploy-to-aws.yml@v2&lt;/span&gt;
    &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;staging&lt;/span&gt;
      &lt;span class="na"&gt;service-name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pos-integration&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When deployment processes change, you update one repository instead of twenty.&lt;/p&gt;
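
&lt;p&gt;For completeness, the other half lives in the template repository. A reusable GitHub Actions workflow declares its inputs under &lt;code&gt;workflow_call&lt;/code&gt;. This is a sketch, so the deploy step itself is a placeholder:&lt;/p&gt;

```yaml
# my-org/pipeline-templates/.github/workflows/deploy-to-aws.yml
on:
  workflow_call:
    inputs:
      environment:
        required: true
        type: string
      service-name:
        required: true
        type: string

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Placeholder for the real deployment logic
      - run: ./deploy.sh "${{ inputs.environment }}" "${{ inputs.service-name }}"
```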

&lt;h4&gt;
  
  
  Branching Strategies and Policies
&lt;/h4&gt;

&lt;p&gt;Protect your main branch from direct commits. This is non-negotiable for team safety. Configure your repository to require:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pull request reviews (at least one approver)&lt;/li&gt;
&lt;li&gt;Passing CI checks before merge&lt;/li&gt;
&lt;li&gt;No force pushes to main&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This protects new engineers from accidentally pushing directly to production. The guardrails are there before they even make their first commit.&lt;/p&gt;
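
&lt;p&gt;These rules are worth applying through your platform's API rather than clicking through the UI, so they're reproducible across repositories. On GitHub, a payload along these lines (sent with something like &lt;code&gt;gh api -X PUT repos/{owner}/{repo}/branches/main/protection --input -&lt;/code&gt;) covers all three requirements:&lt;/p&gt;

```json
{
  "required_pull_request_reviews": {
    "required_approving_review_count": 1
  },
  "required_status_checks": {
    "strict": true,
    "contexts": ["ci"]
  },
  "allow_force_pushes": false,
  "enforce_admins": true,
  "restrictions": null
}
```

&lt;p&gt;The &lt;code&gt;"ci"&lt;/code&gt; context here is a stand-in for whatever your pipeline's status check is actually called.&lt;/p&gt;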

&lt;h4&gt;
  
  
  Environment Strategies and Policies
&lt;/h4&gt;

&lt;p&gt;Development environments should be open for experimentation. Engineers need a place to break things safely.&lt;/p&gt;

&lt;p&gt;Staging and UAT environments should mirror production as closely as possible, with stricter deployment controls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# azure-pipelines.yml&lt;/span&gt;
&lt;span class="na"&gt;stages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DeployDev&lt;/span&gt;
    &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;eq(variables['Build.SourceBranch'], 'refs/heads/develop')&lt;/span&gt;
    &lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;deployment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DeployToDev&lt;/span&gt;
        &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;development&lt;/span&gt;  &lt;span class="c1"&gt;# Auto-deploys, no approval needed&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DeployStaging&lt;/span&gt;
    &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;eq(variables['Build.SourceBranch'], 'refs/heads/main')&lt;/span&gt;
    &lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;deployment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DeployToStaging&lt;/span&gt;
        &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;staging&lt;/span&gt;  &lt;span class="c1"&gt;# Requires manual approval&lt;/span&gt;
        &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;runOnce&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./deploy.sh staging&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures that code can flow freely to dev, but staging deployments require explicit approval.&lt;/p&gt;

&lt;h4&gt;
  
  
  Integration Tests
&lt;/h4&gt;

&lt;p&gt;Unit tests verify individual components. Integration tests verify that services work together correctly.&lt;/p&gt;

&lt;p&gt;For a retail integration service, an integration test might verify that a transaction flows correctly from the POS mock through your service and into S3:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Transaction Flow Integration&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;it&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;should process POS transaction and store in S3&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Arrange&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;mockTransaction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;createMockPOSTransaction&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="c1"&gt;// Act&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;posClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sendTransaction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;mockTransaction&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;waitForProcessing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Assert&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;storedTransaction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;s3Client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getTransaction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;mockTransaction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;storedTransaction&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBeDefined&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;storedTransaction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;PROCESSED&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Integration tests give new engineers confidence that their changes haven't broken contracts with other systems.&lt;/p&gt;
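
&lt;p&gt;One detail worth getting right: &lt;code&gt;waitForProcessing(5000)&lt;/code&gt; in the test above is better implemented as a poll than a fixed sleep, so the happy path doesn't always pay the full five seconds. A minimal sketch (the helper and its defaults are my own, not from any library):&lt;/p&gt;

```typescript
// Hypothetical polling helper: resolves once `predicate` returns true,
// throws if the timeout elapses first.
export async function waitFor(
  predicate: () => boolean | Promise<boolean>,
  timeoutMs = 5000,
  intervalMs = 250,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await predicate()) return;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Condition not met within ${timeoutMs}ms`);
}
```

&lt;p&gt;The fixed wait then becomes a &lt;code&gt;waitFor&lt;/code&gt; call whose predicate checks whether the transaction has landed in S3, and the suite finishes as soon as it has.&lt;/p&gt;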

&lt;h4&gt;
  
  
  Culture of Maintenance
&lt;/h4&gt;

&lt;p&gt;Finally, build a culture that keeps quality high as new engineers join:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code coverage thresholds:&lt;/strong&gt; Configure your pipeline to fail if coverage drops below a threshold. This ensures new code comes with tests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# jest.config.js&lt;/span&gt;
&lt;span class="na"&gt;coverageThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;
  &lt;span class="nv"&gt;global&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;
    &lt;span class="nv"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;80&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;functions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;80&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;lines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;80&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;statements&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;80&lt;/span&gt;
  &lt;span class="pi"&gt;}&lt;/span&gt;
&lt;span class="pi"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Integration tests as part of feature work:&lt;/strong&gt; A feature isn't done until its integration tests are written. Make this explicit in your definition of done.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strict but fair reviews:&lt;/strong&gt; Code reviews should enforce standards consistently, but reviewers should also be helpful and educational. A review that just says "wrong" teaches nothing. A review that explains &lt;em&gt;why&lt;/em&gt; something should change helps engineers grow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Advantages and How This Helps My Teams
&lt;/h2&gt;

&lt;p&gt;All the upfront investment in scripts, tests, and automation pays dividends quickly. Here's what I've seen in practice.&lt;/p&gt;

&lt;h3&gt;
  
  
  Saves Time for Onboarded Members
&lt;/h3&gt;

&lt;p&gt;New engineers on my team don't spend their first day wrestling with environment setup or hunting down secrets. They clone the repo, run &lt;code&gt;./setup.sh&lt;/code&gt;, and follow the interactive prompts. Within an hour, they have a working local environment and can start exploring the codebase.&lt;/p&gt;

&lt;p&gt;Compare this to the alternative: a new joiner pinging five different people on Slack asking where to find the database credentials, discovering their Node version is wrong after hitting a cryptic error, and spending half a day just getting the service to start. That frustration compounds and sets a negative tone for the entire onboarding experience.&lt;/p&gt;

&lt;h3&gt;
  
  
  Saves Time for Service Owners
&lt;/h3&gt;

&lt;p&gt;Before I invested in these practices, onboarding a new engineer meant hours of hand-holding. "Where's the config for X?" "How does Y work?" "Why is Z failing?"&lt;/p&gt;

&lt;p&gt;Now, when someone asks me a question, I can often point them to a specific file or test. "Check &lt;code&gt;transaction.service.test.ts&lt;/code&gt;, the third test case covers exactly that scenario." The tests become documentation. The scripts become guides. I'm not the bottleneck anymore.&lt;/p&gt;

&lt;p&gt;This is especially valuable when you're leading a team and your time is split across architecture decisions, stakeholder meetings, and code reviews. Every hour saved on repetitive explanations is an hour you can spend on higher-leverage work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reduces Time in Change Management and Knowledge Transfer
&lt;/h3&gt;

&lt;p&gt;When an engineer leaves the team or moves to another project, the knowledge transfer burden is significantly lighter. The important patterns are encoded in common libraries. The deployment process is captured in pipeline templates. The business logic is documented through tests.&lt;/p&gt;

&lt;p&gt;New engineers inheriting a service don't need a week of shadowing sessions. The codebase is largely self-explanatory.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reviews Are Short and Sweet
&lt;/h3&gt;

&lt;p&gt;This one might be my favourite. When automated checks handle formatting, linting, test coverage, and integration verification, code reviews can focus on what actually matters: logic, architecture, and edge cases.&lt;/p&gt;

&lt;p&gt;I no longer leave comments like "missing semicolon" or "incorrect indentation." Prettier handles that. I don't have to verify that tests exist. The coverage threshold enforces it. The review becomes a conversation about the change itself rather than a checklist of mechanical issues.&lt;/p&gt;

&lt;p&gt;Pull requests that used to require three rounds of back-and-forth now get approved on the first or second pass.&lt;/p&gt;

&lt;h2&gt;
  
  
  Disadvantages
&lt;/h2&gt;

&lt;p&gt;No approach is without trade-offs. Here are the downsides I've encountered and some honest reflections on them.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Lot of Initial Work
&lt;/h3&gt;

&lt;p&gt;Setting up robust scripts, configuring pipelines, writing comprehensive tests, and building common libraries takes time. Time that could otherwise go toward feature delivery.&lt;/p&gt;

&lt;p&gt;I personally don't find this burdensome because I've seen the compounding benefits across multiple teams. But I understand the hesitation. When you're under pressure to ship a fuel pricing integration before the end of the quarter, spending two days writing a setup script feels like a luxury you can't afford.&lt;/p&gt;

&lt;p&gt;The reality is that this investment is easier to justify on greenfield projects or during quieter periods. Retrofitting these practices onto a legacy codebase with looming deadlines is genuinely difficult. Sometimes you have to be pragmatic and introduce improvements incrementally rather than all at once.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can Cause Friction in Delivery If Not Managed Well
&lt;/h3&gt;

&lt;p&gt;Standards are helpful until they become obstacles. If your automated checks are too strict or too slow, they start blocking legitimate work.&lt;/p&gt;

&lt;p&gt;Consider this scenario: your team has a rule that every pull request must have 90% code coverage. An engineer is fixing a critical bug in the loyalty points calculation that's causing customers to lose discounts at checkout. The fix is two lines, but to satisfy the coverage requirement, they'd need to write fifteen new tests for an untested legacy function they happened to touch. The bug sits in production for an extra day while they write tests for unrelated code.&lt;/p&gt;

&lt;p&gt;Another example: you've established a convention that all API responses must follow a specific format. But the convention lives only in a Confluence page that nobody reads. Without automated schema validation, engineers keep forgetting. Reviews become tedious nitpicking sessions, and resentment builds. "Why did my PR get blocked for a formatting issue when the last three PRs got merged without it?"&lt;/p&gt;

&lt;p&gt;The lesson here is that standards need automated enforcement to be sustainable. If it can't be checked by a machine, it will eventually be ignored by humans.&lt;/p&gt;
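
&lt;p&gt;The check doesn't have to be elaborate to be effective. Here's a sketch of a machine-checkable version of that response-format convention. The envelope shape is invented for illustration, and a real project would more likely use JSON Schema with an off-the-shelf validator:&lt;/p&gt;

```typescript
// Hypothetical response envelope the team has agreed on.
interface ApiResponse {
  status: "ok" | "error";
  data: unknown;
  errors: string[];
}

// Returns a list of violations; an empty list means the payload conforms.
// In CI, a sample response from each endpoint gets run through this.
export function checkEnvelope(payload: unknown): string[] {
  const violations: string[] = [];
  if (typeof payload !== "object" || payload === null) {
    return ["payload is not an object"];
  }
  const p = payload as Record<string, unknown>;
  if (p.status !== "ok" && p.status !== "error") {
    violations.push("status must be 'ok' or 'error'");
  }
  if (!("data" in p)) violations.push("missing 'data' field");
  if (!Array.isArray(p.errors)) violations.push("'errors' must be an array");
  return violations;
}
```

&lt;p&gt;Wire it into the pipeline and the convention stops depending on anyone's memory or on a Confluence page nobody reads.&lt;/p&gt;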

&lt;h3&gt;
  
  
  OS Compatibility Issues
&lt;/h3&gt;

&lt;p&gt;Scripts written on macOS often break on Windows, and vice versa. This is a constant source of friction for teams with mixed development environments.&lt;/p&gt;

&lt;p&gt;A simple example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# Works on macOS and Linux, fails on Windows&lt;/span&gt;

&lt;span class="c"&gt;# macOS sed syntax&lt;/span&gt;
&lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt; &lt;span class="s1"&gt;'s/old/new/g'&lt;/span&gt; config.json

&lt;span class="c"&gt;# Linux sed syntax (different from macOS)&lt;/span&gt;
&lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s1"&gt;'s/old/new/g'&lt;/span&gt; config.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or path handling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Unix-style paths&lt;/span&gt;
&lt;span class="nv"&gt;CONFIG_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"./config/local/settings.json"&lt;/span&gt;

&lt;span class="c"&gt;# Windows needs backslashes (or Git Bash to translate)&lt;/span&gt;
&lt;span class="nv"&gt;CONFIG_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;".&lt;/span&gt;&lt;span class="se"&gt;\c&lt;/span&gt;&lt;span class="s2"&gt;onfig&lt;/span&gt;&lt;span class="se"&gt;\l&lt;/span&gt;&lt;span class="s2"&gt;ocal&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="s2"&gt;ettings.json"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Mitigation strategies include using cross-platform tools like Node.js scripts instead of bash, containerising your development environment with Docker, or maintaining separate scripts for different platforms. None of these are perfect, but they reduce the pain.&lt;/p&gt;
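
&lt;p&gt;The path problem, for example, disappears entirely once the script is a Node script, because &lt;code&gt;path.join&lt;/code&gt; emits whichever separator the host OS uses:&lt;/p&gt;

```typescript
import * as path from "node:path";

// path.join emits the separator for the host OS, so the same script
// produces config/local/settings.json on macOS and Linux and
// config\local\settings.json on Windows. No per-OS script variants needed.
export function buildConfigPath(...segments: string[]): string {
  return path.join(...segments);
}

const configPath = buildConfigPath("config", "local", "settings.json");
console.log(configPath);
```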

&lt;h3&gt;
  
  
  Script Sprawl
&lt;/h3&gt;

&lt;p&gt;When you start automating everything, you can end up with a dozen scripts scattered across your repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scripts/
├── setup.sh
├── setup-windows.ps1
├── start.sh
├── start-docker.sh
├── run-tests.sh
├── run-integration-tests.sh
├── deploy-dev.sh
├── deploy-staging.sh
├── generate-mocks.sh
├── update-snapshots.sh
├── clean.sh
└── seed-database.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A new engineer clones the repo and has no idea which script to run first. "Do I run &lt;code&gt;setup.sh&lt;/code&gt; or &lt;code&gt;start.sh&lt;/code&gt;? What's the difference between &lt;code&gt;start.sh&lt;/code&gt; and &lt;code&gt;start-docker.sh&lt;/code&gt;?"&lt;/p&gt;

&lt;p&gt;The solution is consolidation and documentation. Consider a single entry point script with subcommands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./run.sh setup      &lt;span class="c"&gt;# First-time setup&lt;/span&gt;
./run.sh start      &lt;span class="c"&gt;# Start the service&lt;/span&gt;
./run.sh &lt;span class="nb"&gt;test&lt;/span&gt;       &lt;span class="c"&gt;# Run unit tests&lt;/span&gt;
./run.sh &lt;span class="nb"&gt;test&lt;/span&gt;:int   &lt;span class="c"&gt;# Run integration tests&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or use a Makefile, which is language-agnostic and self-documenting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight make"&gt;&lt;code&gt;&lt;span class="nl"&gt;.PHONY&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;help setup start test&lt;/span&gt;

&lt;span class="nl"&gt;help&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
    &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Available commands:"&lt;/span&gt;
    &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"  make setup    - First-time environment setup"&lt;/span&gt;
    &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"  make start    - Start the service locally"&lt;/span&gt;
    &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"  make test     - Run unit tests"&lt;/span&gt;
    &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"  make test-int - Run integration tests"&lt;/span&gt;

&lt;span class="nl"&gt;setup&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
    ./scripts/setup.sh

&lt;span class="nl"&gt;start&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
    docker-compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
    npm run dev

&lt;span class="nl"&gt;test&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
    npm run &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running &lt;code&gt;make help&lt;/code&gt; gives engineers a clear menu of options. No more guessing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Engineering onboarding isn't a one-time event. With average tenures shrinking and teams constantly evolving, it's a recurring challenge that deserves intentional investment.&lt;/p&gt;

&lt;p&gt;The practices outlined in this article aren't revolutionary. Setup scripts, automated formatting, comprehensive tests, and protected branches are all well-established ideas. The difference lies in treating them as a cohesive system rather than isolated improvements. Each piece reinforces the others. Scripts reduce setup friction. Tests enable confident changes. Automation shortens reviews. Together, they create an environment where a new engineer can go from &lt;code&gt;git clone&lt;/code&gt; to meaningful pull request in days rather than weeks.&lt;/p&gt;

&lt;p&gt;The upfront cost is real. Writing that first setup script takes time you could spend on features. Configuring pipeline templates isn't glamorous work. But every engineer who joins your team after that benefits. The investment amortises quickly, and the compound returns are substantial.&lt;/p&gt;

&lt;p&gt;Start where you are. If your team has none of these practices, don't try to implement everything at once. Pick one pain point. Maybe it's the two hours new joiners spend setting up their environment. Write a setup script. Maybe it's the endless formatting debates in code reviews. Add Prettier. Small improvements stack up.&lt;/p&gt;

&lt;p&gt;Your future team members will thank you. And honestly, so will your future self the next time you have to onboard someone new.&lt;/p&gt;

&lt;p&gt;This article was formatted and grammatically enhanced with AI.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>productivity</category>
      <category>programming</category>
      <category>leadership</category>
    </item>
    <item>
      <title>Your Integration Layer is Probably Over-Engineered (Let's Fix It with Camel DSL)</title>
      <dc:creator>JOOJO DONTOH</dc:creator>
      <pubDate>Tue, 28 Oct 2025 15:01:17 +0000</pubDate>
      <link>https://dev.to/joojodontoh/your-integration-layer-is-probably-over-engineered-lets-fix-it-with-camel-dsl-35m2</link>
      <guid>https://dev.to/joojodontoh/your-integration-layer-is-probably-over-engineered-lets-fix-it-with-camel-dsl-35m2</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Hi guys, it's me again 😁. If you've ever wished you could describe complex integration flows in plain, readable language rather than wrestling with boilerplate code, you're going to love DSLs. A &lt;strong&gt;Domain-Specific Language (DSL)&lt;/strong&gt; is essentially a specialized mini-language designed for a particular task. Think of it as a shorthand that lets you express what you want to do without getting lost in the how. Easy peasy.&lt;/p&gt;

&lt;p&gt;Apache Camel embraces this philosophy wholeheartedly and offers developers multiple flavors of DSLs to choose from. Whether you're a fan of the &lt;strong&gt;Java DSL&lt;/strong&gt; with its fluent builder style, prefer the structured clarity of &lt;strong&gt;XML DSL&lt;/strong&gt; in Camel XML files, or lean toward &lt;strong&gt;Spring XML&lt;/strong&gt; for classic Spring configurations, there's literally something for everyone. You can define routes using &lt;strong&gt;YAML DSL&lt;/strong&gt; for a clean, human-readable format, build RESTful services with &lt;strong&gt;Rest DSL&lt;/strong&gt; (including contract-first approaches with OpenAPI specs), or even keep things annotation-based with the &lt;strong&gt;Annotation DSL&lt;/strong&gt; right in your Java beans.&lt;/p&gt;

&lt;p&gt;In this article, we're narrowing down on the &lt;strong&gt;YAML DSL&lt;/strong&gt;. It's a lightweight, intuitive way to define integration routes that feels more like writing a configuration file than coding. It follows the declarative way of engineering. We'll explore how to define routes, configure endpoints, and wire up beans, all while keeping our setup refreshingly simple. If you want to dive deeper into the technical details, the &lt;a href="https://camel.apache.org/components/4.14.x/others/yaml-dsl.html" rel="noopener noreferrer"&gt;official Camel YAML DSL documentation&lt;/a&gt; is a great resource. But for now, let's keep things practical and see what makes YAML DSL such a joy to work with.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Are Enterprise Integration Patterns?
&lt;/h2&gt;

&lt;p&gt;Before we dive into Camel DSL, let's talk about the "why" behind it all. Enterprise Integration Patterns (EIPs) are tried-and-true solutions to common problems that pop up when you're connecting different systems. These are the plumbing of most of the systems you use. Think of them as design patterns, but specifically for the messy world of enterprise integration where you're constantly moving data between APIs, databases, message queues, and legacy systems that were never meant to talk to each other.&lt;/p&gt;

&lt;p&gt;The concept was popularized by Gregor Hohpe and Bobby Woolf's seminal book, &lt;em&gt;Enterprise Integration Patterns: Designing, Building, and Deploying Messaging Solutions&lt;/em&gt;. Published in 2003, this book catalogued around 65 patterns that have since become the lingua franca for integration architects. Apache Camel was built from the ground up to implement these patterns, making them accessible through its DSL rather than forcing you to reinvent the wheel every time. &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsbd15z08y9bcccxty20f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsbd15z08y9bcccxty20f.png" alt="Enterprise Integration Patterns" width="800" height="1112"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are five of the most common EIPs you'll encounter:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Content-Based Router&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Routes messages to different destinations based on their content. For example, you might route orders to different processing queues based on their total amount—high-value orders go to manual review, while smaller ones get auto-processed. &lt;a href="https://www.enterpriseintegrationpatterns.com/patterns/messaging/ContentBasedRouter.html" rel="noopener noreferrer"&gt;Read more here&lt;/a&gt; &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fysdzgp3xagt2jafrck00.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fysdzgp3xagt2jafrck00.png" alt="Content-Based Router" width="800" height="236"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Message Filter&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Acts as a gatekeeper, allowing only messages that meet certain criteria to pass through. Think of it as a bouncer for your data—only messages with valid formats or specific properties get through. &lt;a href="https://www.enterpriseintegrationpatterns.com/patterns/messaging/Filter.html" rel="noopener noreferrer"&gt;Read more here&lt;/a&gt; &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9666fezpihc4xzo6cxib.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9666fezpihc4xzo6cxib.png" alt="Message Filter" width="800" height="192"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Splitter&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Takes a single message containing multiple items (like a batch of orders) and breaks it into individual messages. This is handy when you need to process each item independently or send them to different endpoints. &lt;a href="https://www.enterpriseintegrationpatterns.com/patterns/messaging/Sequencer.html" rel="noopener noreferrer"&gt;Read more here&lt;/a&gt; &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgkn6eu0mhpowu4dwemei.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgkn6eu0mhpowu4dwemei.png" alt="Splitter" width="800" height="266"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Aggregator&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The opposite of a splitter—it combines multiple related messages into a single cohesive message. Perfect for scenarios where you're collecting responses from multiple services before sending a unified reply.&lt;br&gt;
&lt;a href="https://www.enterpriseintegrationpatterns.com/patterns/messaging/Aggregator.html" rel="noopener noreferrer"&gt;Read more here&lt;/a&gt; &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvq92rgg4qtto9j4jr246.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvq92rgg4qtto9j4jr246.png" alt="Aggregator" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Dead Letter Channel&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Your safety net for when things go wrong. Messages that fail processing get routed to a special "dead letter" queue where you can inspect them, fix issues, and potentially reprocess them later.&lt;br&gt;
&lt;a href="https://www.enterpriseintegrationpatterns.com/patterns/messaging/DeadLetterChannel.html" rel="noopener noreferrer"&gt;Read more here&lt;/a&gt; &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8akr16ubim50jyrwka56.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8akr16ubim50jyrwka56.png" alt="Dead Letter Channel" width="800" height="496"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can explore the full catalog of patterns in the &lt;a href="https://camel.apache.org/components/4.14.x/eips/enterprise-integration-patterns.html" rel="noopener noreferrer"&gt;Apache Camel EIP documentation&lt;/a&gt;, but these five alone will cover a huge portion of your integration needs.&lt;/p&gt;
&lt;h2&gt;
  
  
  What is Apache Camel DSL?
&lt;/h2&gt;
&lt;h3&gt;
  
  
  A Brief History
&lt;/h3&gt;

&lt;p&gt;Apache Camel didn't appear out of nowhere. It was born from a real need to implement those Enterprise Integration Patterns we just discussed. Here's how it evolved:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2003&lt;/strong&gt; marked the beginning when Gregor Hohpe and Bobby Woolf's &lt;em&gt;Enterprise Integration Patterns&lt;/em&gt; book was published, laying the groundwork for what would become Apache Camel's DNA.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;June 27, 2007&lt;/strong&gt; saw Camel's initial release. From day one, the DSL was baked into its core design philosophy. The goal was simple: make it ridiculously easy to describe integration flows without drowning in boilerplate code.&lt;/p&gt;

&lt;p&gt;In those &lt;strong&gt;early years&lt;/strong&gt;, Camel established itself as the go-to framework for implementing EIPs in Java. The DSL wasn't just an afterthought, it was &lt;em&gt;the&lt;/em&gt; way you worked with Camel, turning complex integration logic into readable, maintainable code.&lt;/p&gt;

&lt;p&gt;As adoption grew, so did the DSL. It &lt;strong&gt;evolved beyond Java&lt;/strong&gt;, expanding to support XML, YAML, Groovy, and other syntaxes. This gave developers the freedom to pick the language that fit their team's style and existing infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2014 (Camel 2.14)&lt;/strong&gt; brought a game-changer: extensions to the routing DSL specifically for REST endpoints. Suddenly, building RESTful integrations became as straightforward as defining any other route.&lt;/p&gt;

&lt;p&gt;Today, &lt;strong&gt;ongoing development&lt;/strong&gt; keeps the DSL at the forefront of Camel's priorities. The community continuously refines it, making routes easier to write, test, and maintain. New features and improvements roll out regularly, keeping pace with modern integration challenges.&lt;/p&gt;
&lt;h2&gt;
  
  
  How Does Camel YAML DSL Work?
&lt;/h2&gt;
&lt;h3&gt;
  
  
  From YAML to Running Integration
&lt;/h3&gt;

&lt;p&gt;Here's where things get really interesting. You write a simple YAML file declaring your integration patterns, and somehow it becomes a running Java application. But here's the key distinction: &lt;strong&gt;this isn't compilation—it's interpretation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When you run a Camel YAML route, you're not generating Java source code, compiling it, and then executing bytecode. Instead, Camel reads your YAML file at runtime, parses it, and dynamically constructs the integration flow in memory. Think of it like the difference between translating a book (compilation) versus having a real-time interpreter translate as you speak (interpretation). The YAML DSL is interpreted into Camel's internal routing model on the fly.&lt;/p&gt;
&lt;h3&gt;
  
  
  Enter JBang
&lt;/h3&gt;

&lt;p&gt;So how do we actually run these YAML files? Meet &lt;strong&gt;JBang&lt;/strong&gt;—a tool that lets you run Java applications with zero ceremony. No project setup, no build files, no IDE required. Just a simple command and you're off to the races.&lt;/p&gt;

&lt;p&gt;JBang is essentially a launcher and script runner for Java. It can download dependencies, manage classpaths, and execute Java code—all from a single command. For Camel, this means you can run integration routes as easily as running a Python script.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Interpretation Process
&lt;/h3&gt;

&lt;p&gt;When you execute &lt;code&gt;jbang camel@apache/camel run route.yaml&lt;/code&gt;, here's what happens under the hood:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;JBang Downloads &amp;amp; Starts&lt;/strong&gt;: JBang fetches the Camel catalog if it's not already cached and initializes the runtime environment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;File Detection&lt;/strong&gt;: JBang detects that you've provided a YAML file and determines which parser to use.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;YAML Parsing&lt;/strong&gt;: The YAML content is parsed into a structured format that can be processed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Convert to Camel Model&lt;/strong&gt;: The parsed YAML is transformed into Camel's internal route model—essentially Java objects that represent your integration flow.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Component Resolution&lt;/strong&gt;: Camel identifies which components you're using (HTTP, Kafka, databases, etc.) and loads them if needed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Endpoint Creation&lt;/strong&gt;: Based on your route definitions, Camel creates endpoint instances that represent the actual connections to external systems.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Processor Pipeline Creation&lt;/strong&gt;: Any transformations, filters, or logic in your route are assembled into a processing pipeline.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Consumer Creation&lt;/strong&gt;: Camel sets up consumers that will trigger your route (like an HTTP listener or a file watcher).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Route Registration &amp;amp; Start&lt;/strong&gt;: Finally, the complete route is registered with Camel's context and started, ready to process messages.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All of this happens in seconds, and you're left with a fully functional integration running in a JVM.&lt;/p&gt;
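
&lt;p&gt;To make those steps concrete, here is roughly the smallest YAML file they can chew through: a timer that fires every five seconds and logs a message. Save it as, say, &lt;code&gt;hello.yaml&lt;/code&gt; and run it with the same &lt;code&gt;jbang camel@apache/camel run&lt;/code&gt; command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;- route:
    id: hello-route
    from:
      # Consumer creation: a timer triggers the route every 5 seconds
      uri: "timer:tick"
      parameters:
        period: "5000"
      steps:
        - setBody:
            constant: "Hello from Camel"
        - to: "log:hello"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;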

&lt;p&gt;For those curious about the inner workings, the &lt;a href="https://github.com/apache/camel" rel="noopener noreferrer"&gt;Apache Camel GitHub repository&lt;/a&gt; is a treasure trove of information.&lt;/p&gt;
&lt;h3&gt;
  
  
  Hands-On: Building a Weather Service
&lt;/h3&gt;

&lt;p&gt;Let's get practical. We'll build a simple REST service that fetches weather information for a given country. This example will demonstrate route definition, endpoint definition, and how Camel handles external API calls.&lt;/p&gt;
&lt;h4&gt;
  
  
  Step 1: Install JBang
&lt;/h4&gt;

&lt;p&gt;First, install JBang. On macOS or Linux:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-Ls&lt;/span&gt; https://sh.jbang.dev | bash &lt;span class="nt"&gt;-s&lt;/span&gt; - app setup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On Windows (using PowerShell):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;iex&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;amp; { &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="n"&gt;iwr&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-useb&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;https://ps.jbang.dev&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="s2"&gt; } app setup"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify the installation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;jbang version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqtopj24aa40xbi8wfgev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqtopj24aa40xbi8wfgev.png" alt="jbang version" width="562" height="140"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 2: Create Your Route
&lt;/h4&gt;

&lt;p&gt;Create a file called &lt;code&gt;weather-service.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;route&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;weather-api-route&lt;/span&gt;
    &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;uri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;platform-http:/weather"&lt;/span&gt;
      &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;httpMethodRestrict&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GET"&lt;/span&gt;
      &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;setProperty&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;country&lt;/span&gt;
            &lt;span class="na"&gt;simple&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${header.country}"&lt;/span&gt;

        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;choice&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;when&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;simple&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${header.country}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;==&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;null"&lt;/span&gt;
                &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;setHeader&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Content-Type&lt;/span&gt;
                      &lt;span class="na"&gt;constant&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;application/json&lt;/span&gt;
                  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;setBody&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                      &lt;span class="na"&gt;constant&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{"error":&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"Country&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;parameter&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;required"}'&lt;/span&gt;
                  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;setHeader&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CamelHttpResponseCode&lt;/span&gt;
                      &lt;span class="na"&gt;constant&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;400"&lt;/span&gt;
                  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;stop&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;

        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;removeHeaders&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*"&lt;/span&gt;

        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;toD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;uri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://wttr.in/${exchangeProperty.country}"&lt;/span&gt;
            &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;j1"&lt;/span&gt;
              &lt;span class="na"&gt;bridgeEndpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;

        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;unmarshal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;json&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;library&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Jackson&lt;/span&gt;

        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;setBody&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;simple&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;weather&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;${exchangeProperty.country}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;${body[current_condition][0][temp_C]}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;degrees&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Celsius&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;${body[current_condition][0][weatherDesc][0][value]}"&lt;/span&gt;

        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;setHeader&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Content-Type&lt;/span&gt;
            &lt;span class="na"&gt;constant&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;text/plain&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let me break down what's happening here:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Route Definition&lt;/strong&gt;: We define a route with ID &lt;code&gt;weather-api-route&lt;/code&gt; that starts from an HTTP endpoint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Endpoint Definition&lt;/strong&gt;: The &lt;code&gt;platform-http:/weather&lt;/code&gt; endpoint creates an HTTP server listening on the &lt;code&gt;/weather&lt;/code&gt; path, restricted to GET requests only.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Processing Steps&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We extract the &lt;code&gt;country&lt;/code&gt; parameter from the request header and store it as an exchange property&lt;/li&gt;
&lt;li&gt;We validate that the country parameter exists, returning a 400 error if it doesn't&lt;/li&gt;
&lt;li&gt;We remove all incoming HTTP headers with &lt;code&gt;removeHeaders&lt;/code&gt; to prevent them from interfering with our external API call&lt;/li&gt;
&lt;li&gt;We use &lt;code&gt;toD&lt;/code&gt; (dynamic to) to call the wttr.in weather API with the country variable. The &lt;code&gt;D&lt;/code&gt; in &lt;code&gt;toD&lt;/code&gt; means it evaluates expressions at runtime&lt;/li&gt;
&lt;li&gt;We set &lt;code&gt;bridgeEndpoint: "true"&lt;/code&gt; to properly bridge the HTTP connection between our endpoint and the external API&lt;/li&gt;
&lt;li&gt;We unmarshal the JSON response using Jackson&lt;/li&gt;
&lt;li&gt;We transform it into a readable sentence using Camel's Simple language&lt;/li&gt;
&lt;li&gt;We set the response content type to plain text&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important Notes&lt;/strong&gt;: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We use &lt;code&gt;toD&lt;/code&gt; instead of &lt;code&gt;to&lt;/code&gt; because we need to evaluate the &lt;code&gt;${exchangeProperty.country}&lt;/code&gt; expression dynamically at runtime&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;removeHeaders&lt;/code&gt; step is crucial: without it, headers from the incoming request would be passed to the external API, causing errors&lt;/li&gt;
&lt;li&gt;In YAML DSL, query parameters must be in the &lt;code&gt;parameters&lt;/code&gt; section, not embedded in the URI&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
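
&lt;p&gt;The difference between &lt;code&gt;to&lt;/code&gt; and &lt;code&gt;toD&lt;/code&gt; in fragment form (these are route steps, not a complete route):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Static endpoint: the URI is fixed when the route is created
- to:
    uri: "https://wttr.in/London"

# Dynamic endpoint: the expression is re-evaluated for every message
- toD:
    uri: "https://wttr.in/${exchangeProperty.country}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;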

&lt;h4&gt;
  
  
  Step 3: Run Your Route
&lt;/h4&gt;

&lt;p&gt;Execute your route with JBang:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;jbang camel@apache/camel run weather-service.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll see Camel start up, and it will tell you which port it's listening on (typically 8080). The first startup might take a bit longer as JBang downloads dependencies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjjj2of2dd7p1jyo644vs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjjj2of2dd7p1jyo644vs.png" alt="jbang run" width="800" height="136"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 4: Test It Out
&lt;/h4&gt;

&lt;p&gt;Open another terminal and try it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="s2"&gt;"http://localhost:8080/weather"&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"country: London"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should get a response like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The weather in London is 15 degrees Celsius with Partly cloudy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffqqlntoa78elybqee7eu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffqqlntoa78elybqee7eu.png" alt="Result" width="800" height="42"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Try different cities:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="s2"&gt;"http://localhost:8080/weather"&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"country: Tokyo"&lt;/span&gt;
curl &lt;span class="s2"&gt;"http://localhost:8080/weather"&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"country: NewYork"&lt;/span&gt;
curl &lt;span class="s2"&gt;"http://localhost:8080/weather"&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"country: Paris"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Test the error handling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="s2"&gt;"http://localhost:8080/weather"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This should return:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Country parameter is required"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkckyxh98endp66j4a3qh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkckyxh98endp66j4a3qh.png" alt="Error" width="800" height="59"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note on Response Times&lt;/strong&gt;: You might notice requests take 5-15 seconds to complete. This isn't your Camel route being slow—it's waiting on wttr.in, a free weather service that aggregates data in real-time. If you test the API directly (&lt;code&gt;curl "https://wttr.in/London?format=j1"&lt;/code&gt;), you'll see it takes about the same time. Your Camel route itself is fast; it's just waiting on the external dependency. In production, you'd typically use a faster commercial weather API or implement caching to improve response times.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Step 5: Visualize with Karavan
&lt;/h4&gt;

&lt;p&gt;Want to see your route visually? Install the &lt;strong&gt;Karavan&lt;/strong&gt; extension in VS Code:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open VS Code&lt;/li&gt;
&lt;li&gt;Go to Extensions (Ctrl+Shift+X or Cmd+Shift+X on Mac)&lt;/li&gt;
&lt;li&gt;Search for "Karavan"&lt;/li&gt;
&lt;li&gt;Install the "Karavan" extension by Apache Software Foundation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once installed:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Right-click on your &lt;code&gt;weather-service.yaml&lt;/code&gt; file in VS Code&lt;/li&gt;
&lt;li&gt;Select &lt;strong&gt;"Karavan: Open"&lt;/strong&gt; from the context menu&lt;/li&gt;
&lt;li&gt;You'll see a visual representation of your route with all the steps laid out graphically&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Karavan designer shows your route as a flowchart—from the HTTP endpoint through validation, header removal, the external API call, JSON unmarshalling, transformation, and finally the response. It's incredibly helpful for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understanding complex routes at a glance&lt;/li&gt;
&lt;li&gt;Debugging integration flows&lt;/li&gt;
&lt;li&gt;Communicating integration logic to non-technical stakeholders&lt;/li&gt;
&lt;li&gt;Editing routes visually instead of writing YAML by hand&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can even use Karavan to build routes from scratch by dragging and dropping components, and it will generate the YAML for you!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh3yka39dwwsv00vd1xih.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh3yka39dwwsv00vd1xih.png" alt="Karavan" width="800" height="738"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Use Camel YAML DSL? Real-World Advantages
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Rapid Project Spin-Up with Proven Patterns
&lt;/h3&gt;

&lt;p&gt;Camel YAML DSL gives you instant access to 65+ pre-built Enterprise Integration Patterns. Need a content-based router, message filter, or splitter? It's just a few lines of YAML. What typically takes weeks to build from scratch—complete with error handling and retry logic—becomes an afternoon of configuration. You're describing what you want, and Camel handles the how.&lt;/p&gt;
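
&lt;p&gt;For instance, a content-based router really is only a few lines. The &lt;code&gt;region&lt;/code&gt; header and endpoint names here are illustrative assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;- route:
    id: region-router
    from:
      uri: "direct:incoming"
      steps:
        # Route each message based on its content (the region header)
        - choice:
            when:
              - simple: "${header.region} == 'EU'"
                steps:
                  - to: "direct:eu-orders"
            otherwise:
              steps:
                - to: "direct:global-orders"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;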

&lt;h3&gt;
  
  
  MVPs and Proof of Concepts at Lightning Speed
&lt;/h3&gt;

&lt;p&gt;For startups and innovation teams, speed matters. Build a functioning integration layer in hours, not weeks. Connect to Stripe, send Twilio notifications, sync to your CRM—all achievable in a single afternoon. When business requirements change (and they will), your routes change with them. No massive refactoring, just edit YAML and redeploy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Democratizing Integration Development
&lt;/h3&gt;

&lt;p&gt;You don't need to be a Java expert to build integrations anymore. JBang and YAML DSL lower the barrier dramatically. Business analysts, product managers, or junior engineers with basic YAML knowledge can create and modify routes. This shifts senior engineers from implementing routine integrations to building reusable components and solving genuinely complex problems.&lt;/p&gt;

&lt;h3&gt;
  
  
  Centralizing Integration Logic Across Teams
&lt;/h3&gt;

&lt;p&gt;Companies struggle with integration sprawl: every team building its own connectors and reinventing the same patterns. Camel DSL solves this. A platform team maintains common components and patterns, while product teams compose them as needed. Want all APIs to use exponential backoff? Configure it once. Need centralized logging? Build it into your base template. One update benefits every route. This is a personal favorite of mine.&lt;/p&gt;

&lt;h3&gt;
  
  
  Building Integration Platforms on Kubernetes
&lt;/h3&gt;

&lt;p&gt;Here's where Camel truly shines at scale. Companies build entire integration platforms using this pattern:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Setup&lt;/strong&gt;: Multiple Kubernetes clusters divided into namespaces (workspaces), each owned by a team. Each integration runs as a pod within these namespaces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Smart Routing&lt;/strong&gt;: Route traffic between namespaces or clusters using Kubernetes networking. A payment request flows through payment → order → notification namespaces seamlessly. Add content-based routing to direct European orders to EU clusters for data residency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security&lt;/strong&gt;: Kubernetes secrets hold API keys, passwords, and tokens. Each pod only accesses its required secrets—payment pods can't see shipping credentials, notification pods can't access payment gateways.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Team Autonomy&lt;/strong&gt;: Teams deploy their integrations independently. They write YAML routes, push to Git, and CI/CD handles the rest. Teams set their own resource limits, autoscaling rules, and health checks while following platform-level guardrails.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Payoff&lt;/strong&gt;: Self-service integration platform. Fast team velocity. Enforced security and governance. Kubernetes handles scaling. Clean, readable YAML that stakeholders understand. Enterprises already use this pattern to manage thousands of integrations across distributed teams.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Other Side: When DSLs Fall Short
&lt;/h2&gt;

&lt;h3&gt;
  
  
  High Abstraction Hides Complexity
&lt;/h3&gt;

&lt;p&gt;YAML DSL sits at a very high level of abstraction. You declare &lt;em&gt;what&lt;/em&gt; you want, and Camel figures out &lt;em&gt;how&lt;/em&gt; to make it happen. This is powerful—until something breaks or behaves unexpectedly.&lt;/p&gt;

&lt;p&gt;Debugging becomes detective work. Your YAML looks correct, but the route isn't working. Is it a timing issue? A header not being set? A component default you didn't know about? Suddenly you need to understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How Camel interprets and executes routes&lt;/li&gt;
&lt;li&gt;Underlying component implementations&lt;/li&gt;
&lt;li&gt;Message exchange patterns and contexts&lt;/li&gt;
&lt;li&gt;How properties and headers flow through pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The abstraction that made development fast now makes troubleshooting slow. You end up diving into Camel documentation, examining verbose logs, and sometimes reading Java source code to understand what's really happening under your simple YAML declarations.&lt;/p&gt;

&lt;h3&gt;
  
  
  It's Java All the Way Down
&lt;/h3&gt;

&lt;p&gt;Here's the uncomfortable truth: Camel is a Java framework. When things get complex or custom, you're writing Java code.&lt;/p&gt;

&lt;p&gt;Need a transformation more complex than Simple language supports? Write Java. Want a custom component or specialized error handling? Java. Need to debug a route deadlocking under load? Better understand Java concurrency, exception handling, and threading models.&lt;/p&gt;

&lt;p&gt;This creates a skills gap. You might have team members comfortable with YAML who hit a wall when they need to drop into Java. Or you have Java experts who view the DSL as unnecessary abstraction. Either way, the promise of "just write YAML" only holds for straightforward integrations. Anything non-trivial requires Java expertise.&lt;/p&gt;

&lt;h3&gt;
  
  
  Organizational Startup Costs
&lt;/h3&gt;

&lt;p&gt;Building an integration platform around Camel DSL isn't trivial. There's real investment required:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learning Curve&lt;/strong&gt;: Teams need to learn YAML syntax, Camel concepts, EIP patterns, component configurations, and debugging techniques. This takes time—expect weeks or months before teams are productive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure Setup&lt;/strong&gt;: You need CI/CD pipelines, container registries, Kubernetes clusters (if going that route), monitoring systems, and logging infrastructure. Setting this up properly takes effort.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Governance and Standards&lt;/strong&gt;: Someone needs to establish coding standards, route templates, security policies, and deployment procedures. Without this, you'll end up with inconsistent routes that are hard to maintain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cultural Change&lt;/strong&gt;: Non-engineers writing integrations sounds great in theory, but requires buy-in. Engineers might resist sharing control. Business analysts might not want the responsibility. Establishing this new way of working takes organizational effort.&lt;/p&gt;

&lt;p&gt;The payoff is significant, but don't underestimate the upfront investment needed to make it work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configuration Has Its Limits
&lt;/h3&gt;

&lt;p&gt;YAML DSL is fantastic for standard patterns, but sometimes you need something unique. Maybe you're integrating with a legacy system that has quirky authentication. Or you need custom data transformation logic that's too complex for Simple language. Perhaps you're dealing with binary protocols that don't fit Camel's HTTP-centric model.&lt;/p&gt;

&lt;p&gt;When configuration isn't enough, you have to write custom Java components or processors. Now you're maintaining both YAML routes &lt;em&gt;and&lt;/em&gt; Java code, which defeats some of the simplicity. You also create a two-tier system: simple integrations anyone can handle, and complex ones that require Java developers.&lt;/p&gt;

&lt;p&gt;The line between "configurable" and "needs code" isn't always clear until you're deep into implementation. What starts as "just a simple route" can evolve into something requiring custom Java, at which point you might wonder if starting with code would have been cleaner.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;Apache Camel YAML DSL represents a pragmatic approach to enterprise integration. It takes decades of integration patterns, wraps them in readable YAML, and makes them accessible to a broader audience than traditional Java frameworks ever could.&lt;/p&gt;

&lt;p&gt;Is it perfect? No. You're trading explicit control for declarative simplicity, and that tradeoff won't suit every team or every project. But for organizations drowning in integration complexity, or startups needing to move fast, or platform teams trying to democratize development—it's a compelling option.&lt;/p&gt;

&lt;p&gt;The key is knowing when to reach for it. Building a weather API demo? Perfect. Connecting your e-commerce platform to payment gateways, inventory systems, and shipping providers? Absolutely. Implementing something so custom that you're fighting the framework every step of the way? Maybe reconsider.&lt;/p&gt;

&lt;p&gt;Start small. Spin up JBang, write a simple route, see how it feels. You might find that what once took your team weeks now takes hours, and that's worth paying attention to. The patterns are proven, the tooling is mature, and the community is active. Give it a shot: your future self (and your integration backlog) might thank you. See you in the next one!&lt;/p&gt;

</description>
      <category>camel</category>
      <category>apache</category>
      <category>java</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Skip the Database: Building Analytics Dashboards Directly from S3 Files</title>
      <dc:creator>JOOJO DONTOH</dc:creator>
      <pubDate>Sat, 27 Sep 2025 07:57:42 +0000</pubDate>
      <link>https://dev.to/joojodontoh/skip-the-database-building-analytics-dashboards-directly-from-s3-files-34an</link>
      <guid>https://dev.to/joojodontoh/skip-the-database-building-analytics-dashboards-directly-from-s3-files-34an</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;Hello guys, my name is Jo and welcome back to my engineering blog!&lt;br&gt;
I'm gonna talk about something cool today, let's go! 😎&lt;br&gt;
In today's data-driven landscape, organizations collect vast amounts of information from many sources: structured databases, APIs, IoT sensors, and application logs. Much of this data lands in formats like CSV and JSON files that need to be stored, processed, and analyzed efficiently. While Amazon S3 serves as an excellent scalable storage solution for these files, it's a significant engineering challenge 🤕 to extract meaningful insights from data sitting in S3 buckets, especially when stakeholders need regular access to visualizations and reports without the overhead of complex data pipeline management. &lt;br&gt;
Traditional approaches often involve provisioning expensive database instances, writing ETL jobs to move data around, and maintaining multiple data stores just so business intelligence tools like Power BI can access the information they need. In this article, I'll walk you through a cost-effective, serverless architecture that uses AWS Glue for data cataloging, Amazon Athena for SQL querying, and Power BI for visualization. The result is a streamlined solution that keeps S3 as your single source of truth while providing the SQL interface Power BI expects, all without the complexity and cost of traditional database provisioning. Let's dive into it!&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;While Amazon S3 serves as an excellent scalable storage solution for these files, it's a significant engineering challenge 🤕 to extract meaningful insights from data sitting in S3 buckets&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h1&gt;
  
  
  Situation
&lt;/h1&gt;

&lt;p&gt;It's critical to understand that the solution presented in this article is designed for a specific set of circumstances and constraints. While it may apply to many similar use cases, &lt;strong&gt;every engineering decision should start with clearly defined functional and non-functional requirements that match your particular context&lt;/strong&gt;. In our scenario, CSV files are deposited into an S3 bucket at regular intervals, creating a steady stream of data that needs to be accessible for analysis and reporting. The files come pre-structured with a consistent, well-defined schema, and the data has already been cleaned and sanitized upstream, so we don't need to handle data quality issues, type conversions, or missing fields within our solution. Most importantly, the data represents a single business entity without complex relationships or foreign key dependencies that would require maintaining referential integrity across multiple tables, allowing us to treat each file as a self-contained dataset that can be processed and queried independently.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CSV Files (Regular Intervals) → S3 Bucket Storage
     ↓
Well-structured + Pre-sanitized Data
     ↓
Single Entity (No Relational Dependencies)
     ↓
Ready for Processing (SQL)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Problem/Requirements
&lt;/h1&gt;

&lt;p&gt;The core challenge lies in making the data dumped into S3 buckets accessible and actionable for stakeholders who need regular visibility into operational insights through dashboards and reports. Currently, the team handles similar data requirements by loading CSV contents into traditional SQL databases, which then serve as the data source for Power BI through standard database connections. This solution exists primarily because Power BI has robust, well-established SQL connectivity and most teams are familiar with this approach. However, this conventional method introduces several significant problems that compound over time and create unnecessary operational overhead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Issues with the Current Approach:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Infrastructure Costs&lt;/strong&gt; - Provisioning and maintaining database instances (RDS, SQL Server, etc.) for what is essentially file storage creates ongoing monthly expenses that scale with data volume and performance requirements&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unnecessary Complexity&lt;/strong&gt; - Data gets written into SQL tables purely for Power BI access, not because relational database features like ACID transactions, foreign keys, or complex joins are actually needed for this use case&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Access Management Overhead&lt;/strong&gt; - Every new database introduces another set of credentials, connection strings, and security policies that need to be managed, rotated, and monitored&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Increased System Dependencies&lt;/strong&gt; - Adding database services means more moving parts that can fail, require updates, need backup strategies, and demand monitoring - each new service multiplies your potential points of failure&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Refresh Constraints&lt;/strong&gt; - Power BI's periodic refresh cycles from SQL databases create timing dependencies where dashboard data freshness becomes a configuration trade-off rather than being driven by actual data availability&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Fragmentation&lt;/strong&gt; - Managing multiple data types across different storage solutions (files in S3, structured data in databases) creates inconsistencies in backup strategies, access patterns, and operational procedures&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  Solution
&lt;/h1&gt;

&lt;h2&gt;
  
  
  The Proposed Architecture
&lt;/h2&gt;

&lt;p&gt;The solution replaces the traditional database-centric approach with a serverless, event-driven architecture that treats S3 as both storage and the single source of truth for all data operations. This approach maintains data in its original CSV format while providing the SQL query interface that Power BI requires, eliminating the need for intermediate database layers entirely.&lt;/p&gt;

&lt;h3&gt;
  
  
  Core Requirements
&lt;/h3&gt;

&lt;p&gt;Our replacement solution addresses four fundamental requirements:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Single Source of Truth&lt;/strong&gt; - All data remains in S3 in CSV format, eliminating data duplication and synchronization issues&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intelligent Cataloging&lt;/strong&gt; - Files are automatically cataloged with schema information and logical partitioning, similar to database table structures but without the database overhead&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Event-Driven Processing&lt;/strong&gt; - New file arrivals trigger automatic catalog updates, ensuring data is immediately available for querying&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SQL Query Layer&lt;/strong&gt; - Athena provides the SQL interface that Power BI expects, querying directly against cataloged S3 data&lt;/li&gt;
&lt;/ol&gt;
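&lt;p&gt;To make the cataloging requirement concrete, here's a minimal sketch of the partition metadata an event-driven update would register. The database, table, and bucket names are made up, and the object mirrors the shape accepted by &lt;code&gt;CreatePartitionCommand&lt;/code&gt; in &lt;code&gt;@aws-sdk/client-glue&lt;/code&gt;:&lt;/p&gt;

```typescript
// Sketch: build the Glue CreatePartition input for a newly arrived file.
// Names (analytics_db, sales_data, bucket) are hypothetical; the object
// below has the shape expected by CreatePartitionCommand in
// @aws-sdk/client-glue.

interface PartitionValues {
  year: string;
  month: string;
  day: string;
}

function buildPartitionInput(
  database: string,
  table: string,
  bucket: string,
  p: PartitionValues
) {
  const location = `s3://${bucket}/${table}/year=${p.year}/month=${p.month}/day=${p.day}/`;
  return {
    DatabaseName: database,
    TableName: table,
    PartitionInput: {
      Values: [p.year, p.month, p.day], // must match the table's partition key order
      StorageDescriptor: {
        Location: location,
        InputFormat: "org.apache.hadoop.mapred.TextInputFormat",
        OutputFormat: "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
        SerdeInfo: {
          SerializationLibrary: "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
          Parameters: { "field.delim": "," },
        },
      },
    },
  };
}

// Example: a file for 2024-03-15 lands and its partition gets registered.
const input = buildPartitionInput("analytics_db", "sales_data", "my-company-analytics", {
  year: "2024", month: "03", day: "15",
});
console.log(input.PartitionInput.StorageDescriptor.Location);
// s3://my-company-analytics/sales_data/year=2024/month=03/day=15/
```

&lt;p&gt;Building the input as a pure function keeps the Lambda easy to unit test without mocking any AWS calls.&lt;/p&gt;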

&lt;h2&gt;
  
  
  Architecture Components
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   CSV File  │───▶│  S3 Bucket  │───▶│ S3 Event    │───▶│ SQS Queue   │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
                                                                   │
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────┴───┐
│  Power BI   │◀───│   Athena    │◀───│ Glue Catalog│◀───│   Lambda    │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Tool Breakdown
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Role in Solution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Primary data storage&lt;/td&gt;
&lt;td&gt;Houses original CSV files and partitioned data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SQS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Event queue&lt;/td&gt;
&lt;td&gt;Buffers S3 file creation events for processing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lambda&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Event processor&lt;/td&gt;
&lt;td&gt;Reads files, creates partitions, updates Glue catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Glue&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Data catalog&lt;/td&gt;
&lt;td&gt;Maintains schema and partition metadata for Athena&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Athena&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Query engine&lt;/td&gt;
&lt;td&gt;Provides SQL interface for Power BI connectivity&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdod7bp04lc2rp186r32z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdod7bp04lc2rp186r32z.png" alt=" " width="800" height="255"&gt;&lt;/a&gt;&lt;/p&gt;
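&lt;p&gt;The S3 → SQS leg of the diagram is just bucket notification configuration. Here's a minimal sketch, assuming SDK v3: the queue ARN is a placeholder, and the object has the shape accepted by &lt;code&gt;PutBucketNotificationConfigurationCommand&lt;/code&gt; in &lt;code&gt;@aws-sdk/client-s3&lt;/code&gt;:&lt;/p&gt;

```typescript
// Sketch: S3 event notification that pushes ObjectCreated events for CSV
// files into SQS. The queue ARN is hypothetical; pass this object as
// NotificationConfiguration to PutBucketNotificationConfigurationCommand
// (@aws-sdk/client-s3).

function buildCsvNotification(queueArn: string) {
  return {
    QueueConfigurations: [
      {
        QueueArn: queueArn,
        Events: ["s3:ObjectCreated:*"], // fire on every newly created object
        Filter: {
          Key: {
            FilterRules: [{ Name: "suffix", Value: ".csv" }], // CSV files only
          },
        },
      },
    ],
  };
}

const notification = buildCsvNotification(
  "arn:aws:sqs:us-east-1:123456789012:file_processing_queue"
);
console.log(notification.QueueConfigurations[0].Events[0]); // s3:ObjectCreated:*
```

&lt;p&gt;The suffix filter matters: without it, Athena query results or other objects written back to the bucket could retrigger the pipeline.&lt;/p&gt;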

&lt;h2&gt;
  
  
  Why This Solution Works
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Cost Efficiency
&lt;/h3&gt;

&lt;p&gt;Athena operates on a pay-per-query model, charging approximately &lt;strong&gt;$5 per TB of data scanned&lt;/strong&gt; (&lt;a href="https://aws.amazon.com/athena/pricing/" rel="noopener noreferrer"&gt;AWS Athena Pricing&lt;/a&gt;), which is dramatically cheaper than maintaining always-on database instances that can cost hundreds of dollars monthly regardless of usage patterns.&lt;/p&gt;
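&lt;p&gt;A quick back-of-the-envelope sketch shows why this matters (the scan volumes below are illustrative, not measured):&lt;/p&gt;

```typescript
// Rough Athena cost model: you pay only for data scanned, ~$5 per TB.
// Scan volumes here are illustrative examples, not benchmarks.

const PRICE_PER_TB_USD = 5;

function monthlyAthenaCost(gbScannedPerDay: number, daysPerMonth = 30): number {
  const tbPerMonth = (gbScannedPerDay * daysPerMonth) / 1024;
  return tbPerMonth * PRICE_PER_TB_USD;
}

// e.g. dashboards scanning 10 GB of CSV per day:
console.log(monthlyAthenaCost(10).toFixed(2)); // 1.46 (USD per month)
```

&lt;p&gt;Even at 10 GB scanned per day you're looking at under $2 a month, and partition pruning (covered below) pushes the scanned bytes down further.&lt;/p&gt;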

&lt;h3&gt;
  
  
  Operational Simplicity
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single Data Format&lt;/strong&gt; - Everything stays in CSV, eliminating format conversion complexity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified Access Control&lt;/strong&gt; - S3 IAM policies manage all data access, no separate database credentials&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimal Dependencies&lt;/strong&gt; - The entire pipeline uses managed AWS services with no servers to maintain&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Direct Integration&lt;/strong&gt; - Athena provides native Power BI connectivity through ODBC/JDBC drivers&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Performance Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Query Caching&lt;/strong&gt; - Athena automatically caches results for repeated queries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partition Pruning&lt;/strong&gt; - Only scans relevant data partitions, dramatically reducing query costs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Columnar Optimization&lt;/strong&gt; - Can easily migrate to Parquet format later for even better performance&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Trade-offs and Considerations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Performance Limitations
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Traditional DB: Sub-second response for cached data&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-15'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Athena: 2-10 second response depending on data size&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;year&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'2024'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'01'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'15'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Athena queries have higher latency than traditional databases since they scan files rather than using pre-built indexes, but this trade-off is acceptable when query frequency is moderate and cost savings are significant.&lt;/p&gt;

&lt;h3&gt;
  
  
  Initial Setup Complexity
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Power BI + Athena Integration Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Driver Installation&lt;/strong&gt; - ODBC/JDBC drivers on dashboard servers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network Configuration&lt;/strong&gt; - VPC endpoints and security group rules for federated AWS accounts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authentication Setup&lt;/strong&gt; - IAM roles and cross-account access policies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gateway Configuration&lt;/strong&gt; - On-premises data gateway driver installation
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Example driver setup&lt;/span&gt;
wget https://s3.amazonaws.com/athena-downloads/drivers/ODBC/SimbaAthenaODBC-1.1.17.1000-Linux64.zip
&lt;span class="c"&gt;# Configure connection string in Power BI&lt;/span&gt;
&lt;span class="c"&gt;# Server: athena.us-east-1.amazonaws.com&lt;/span&gt;
&lt;span class="c"&gt;# Database: your_glue_database&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While the initial connectivity setup between Power BI and Athena can be complex, especially in enterprise environments with federated AWS accounts, these are one-time configuration challenges that become valuable learning experiences for teams expanding their cloud-native analytics capabilities.&lt;/p&gt;

&lt;h1&gt;
  
  
  Suggested Implementation
&lt;/h1&gt;

&lt;p&gt;This implementation guide uses TypeScript, but the concepts translate to any language. The key is building a maintainable, extensible solution that follows the &lt;strong&gt;Open/Closed Principle&lt;/strong&gt; - open for extension, closed for modification.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Core Schema Design (Critical Prerequisite)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Schema Analysis and Configuration
&lt;/h3&gt;

&lt;p&gt;Start by examining your S3 files to understand the data structure and extract partition information from filenames:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Example filename: sales_data_2024_03_15.csv&lt;/span&gt;
&lt;span class="c1"&gt;// Partition extraction: year=2024, month=03, day=15&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;filename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sales_data_2024_03_15.csv&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;partitionPattern&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sr"&gt;/sales_data_&lt;/span&gt;&lt;span class="se"&gt;(\d{4})&lt;/span&gt;&lt;span class="sr"&gt;_&lt;/span&gt;&lt;span class="se"&gt;(\d{2})&lt;/span&gt;&lt;span class="sr"&gt;_&lt;/span&gt;&lt;span class="se"&gt;(\d{2})\.&lt;/span&gt;&lt;span class="sr"&gt;csv/&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[,&lt;/span&gt; &lt;span class="nx"&gt;year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;day&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;partitionPattern&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
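&lt;p&gt;Those extracted values map directly onto the Hive-style key layout that Glue and Athena use for partition pruning. A small illustrative helper (plus a guard for filenames that don't match the pattern):&lt;/p&gt;

```typescript
// Sketch: turn extracted partition values into the Hive-style S3 prefix
// that Glue/Athena partitioning expects (year=YYYY/month=MM/day=DD/).

function toPartitionPrefix(table: string, year: string, month: string, day: string): string {
  return `${table}/year=${year}/month=${month}/day=${day}/`;
}

// Guard against filenames that don't match the expected pattern.
const filename = "sales_data_2024_03_15.csv";
const match = filename.match(/sales_data_(\d{4})_(\d{2})_(\d{2})\.csv/);
if (!match) throw new Error(`unexpected filename: ${filename}`);

const [, year, month, day] = match;
console.log(toPartitionPrefix("sales_data", year, month, day));
// sales_data/year=2024/month=03/day=15/
```

&lt;p&gt;Writing objects under these prefixes means a query filtered on &lt;code&gt;year&lt;/code&gt;, &lt;code&gt;month&lt;/code&gt;, and &lt;code&gt;day&lt;/code&gt; only scans the matching folder instead of the whole bucket.&lt;/p&gt;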



&lt;h3&gt;
  
  
  Shared Configuration Structure
&lt;/h3&gt;

&lt;p&gt;Create a centralized configuration that serves both your code and Infrastructure as Code (IAC):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;shared-config.json&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"aws"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"region"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"account_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${AWS_ACCOUNT_ID}"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"resources"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"source_bucket"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my-company-data-source"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"analytics_bucket"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my-company-analytics"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"glue_database"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"analytics_db"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"workgroup_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"analytics_workgroup"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"sqs_queue"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"file_processing_queue"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"entities"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"sales"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"table_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sales_data"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"filename_pattern"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sales_data_(&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;d{4})_(&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;d{2})_(&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;d{2})&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;.csv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"partition_keys"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"year"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"month"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"day"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"schema"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"transaction_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"customer_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"decimal(10,2)"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"transaction_date"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2. Infrastructure as Code
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Account-Agnostic Infrastructure
&lt;/h3&gt;

&lt;p&gt;Structure your IAC to work across any AWS account with minimal changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# terraform/main.tf or cloudformation template&lt;/span&gt;
&lt;span class="c1"&gt;# All names reference shared config&lt;/span&gt;

&lt;span class="s"&gt;resource "aws_s3_bucket" "analytics_bucket" {&lt;/span&gt;
  &lt;span class="s"&gt;bucket = var.shared_config.resources.analytics_bucket&lt;/span&gt;

  &lt;span class="s"&gt;versioning {&lt;/span&gt;
    &lt;span class="s"&gt;enabled = &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="s"&gt;}&lt;/span&gt;
&lt;span class="err"&gt;}&lt;/span&gt;

&lt;span class="s"&gt;resource "aws_glue_catalog_database" "analytics_db" {&lt;/span&gt;
  &lt;span class="s"&gt;name = var.shared_config.resources.glue_database&lt;/span&gt;
&lt;span class="err"&gt;}&lt;/span&gt;

&lt;span class="s"&gt;resource "aws_athena_workgroup" "analytics" {&lt;/span&gt;
  &lt;span class="s"&gt;name = var.shared_config.resources.workgroup_name&lt;/span&gt;

  &lt;span class="s"&gt;configuration {&lt;/span&gt;
    &lt;span class="s"&gt;result_configuration {&lt;/span&gt;
      &lt;span class="s"&gt;output_location = "s3://${aws_s3_bucket.analytics_bucket.bucket}/query-results/"&lt;/span&gt;
    &lt;span class="s"&gt;}&lt;/span&gt;

    &lt;span class="s"&gt;enforce_workgroup_configuration = &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="s"&gt;publish_cloudwatch_metrics = &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="s"&gt;}&lt;/span&gt;
&lt;span class="err"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Complete Resource Setup
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Infrastructure components from shared config&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;infraConfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;sourceBucket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resources&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;source_bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;analyticsBucket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resources&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;analytics_bucket&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;glue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;database&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resources&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;glue_database&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;tables&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;entities&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;sqs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;queueName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resources&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sqs_queue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;// Note: Standard queue only - S3 events don't support FIFO&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Standard&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;functionName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;file-processor&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;runtime&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;nodejs22.x&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;sqsTrigger&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
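&lt;p&gt;To make the shared-config idea concrete, here is a minimal sketch of the shape such a config object might take. The entity names and &lt;code&gt;filename_pattern&lt;/code&gt; values here are illustrative assumptions, not the real file; the point is that every key under &lt;code&gt;entities&lt;/code&gt; becomes a Glue table, so onboarding a new data source is a config change rather than new infrastructure code.&lt;/p&gt;

```typescript
// Hypothetical shape of the shared config consumed above; the real file
// lives with the IaC templates and may differ.
const config = {
  resources: {
    glue_database: "analytics_db",
    analytics_bucket: "my-company-analytics",
  },
  entities: {
    sales_data: { filename_pattern: "sales_data_{year}_{month}_{day}.csv" },
    inventory: { filename_pattern: "inventory_{year}_{month}.csv" },
  },
};

// Every entity key becomes a Glue table, mirroring the
// `tables: Object.keys(config.entities)` line in the IaC config above.
const glueTables = Object.keys(config.entities);
console.log(glueTables); // ["sales_data", "inventory"]
```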



&lt;h2&gt;
  
  
  3. Core Lambda Logic
&lt;/h2&gt;

&lt;h3&gt;
  
  
  File Processing Flow
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// lambda/fileProcessor.ts&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;SQSEvent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;SQSRecord&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-lambda&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;S3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;Glue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;Athena&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-sdk&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;SQSEvent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;record&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Records&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;processS3File&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;record&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;processS3File&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sqsRecord&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;SQSRecord&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// 1. Extract S3 event from SQS message&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;s3Event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sqsRecord&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;s3Event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Records&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;s3Event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Records&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// 2. Get entity configuration&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;entityConfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getEntityFromFilename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// 3. Extract partition information&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;partitions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extractPartitions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;entityConfig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;filename_pattern&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// 4. Read and validate file&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fileContent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getObject&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;Bucket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;Key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;key&lt;/span&gt; &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;promise&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="c1"&gt;// 5. Create partitioned path&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;partitionedPath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;buildPartitionPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;entityConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;partitions&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// 6. Write to analytics bucket&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;putObject&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;Bucket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resources&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;analytics_bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;Key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;partitionedPath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;Body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;fileContent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Body&lt;/span&gt;
  &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;promise&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="c1"&gt;// 7. Update Glue catalog&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;updateGlueCatalog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;entityConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;partitions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;partitionedPath&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// 8. Optional: Validate with Athena query&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;validateCatalogUpdate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;entityConfig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;partitions&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
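&lt;p&gt;The flow above leans on an &lt;code&gt;extractPartitions&lt;/code&gt; helper that isn't shown. Here is a minimal sketch, assuming &lt;code&gt;filename_pattern&lt;/code&gt; marks partition values with &lt;code&gt;{placeholder}&lt;/code&gt; tokens; that token syntax is an assumption for illustration, not the actual implementation.&lt;/p&gt;

```typescript
// Minimal sketch of the extractPartitions helper referenced above. Assumes
// filename_pattern uses {placeholder} tokens for numeric partition values,
// e.g. "sales_data_{year}_{month}_{day}.csv" (illustrative, not the real code).
function extractPartitions(key: string, pattern: string): Record<string, string> {
  const names: string[] = [];
  const regexSrc = pattern
    .replace(/[.*+?^$()|[\]\\]/g, "\\$&") // escape regex metacharacters
    .replace(/\{(\w+)\}/g, (_: string, name: string) => {
      names.push(name);                   // remember placeholder order
      return "(\\d+)";                    // {year} -> a digit capture group
    });
  // Match against the basename so prefixes like "incoming/" don't interfere.
  const match = key.split("/").pop()!.match(new RegExp(`^${regexSrc}$`));
  if (!match) throw new Error(`key ${key} does not match pattern ${pattern}`);
  return Object.fromEntries(names.map((n, i) => [n, match[i + 1]]));
}

console.log(extractPartitions(
  "incoming/sales_data_2024_03_15.csv",
  "sales_data_{year}_{month}_{day}.csv"
)); // { year: "2024", month: "03", day: "15" }
```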



&lt;h3&gt;
  
  
  Partition Management
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;buildPartitionPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;entityConfig&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;partitions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Example: sales_data/year=2024/month=03/day=15/sales_data_2024_03_15.csv&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;partitionPath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;entityConfig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;partition_keys&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;=&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;partitions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;entityConfig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;partitionPath&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;originalFilename&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;updateGlueCatalog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;entityConfig&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;partitions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;s3Path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;glue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;AWS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Glue&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;glue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createPartition&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;DatabaseName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resources&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;glue_database&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;TableName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;entityConfig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;PartitionInput&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;Values&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;entityConfig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;partition_keys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;partitions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
      &lt;span class="na"&gt;StorageDescriptor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;Location&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`s3://&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resources&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;analytics_bucket&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;s3Path&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;InputFormat&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;org.apache.hadoop.mapred.TextInputFormat&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;OutputFormat&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;SerdeInfo&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;SerializationLibrary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;promise&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
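&lt;p&gt;One operational note on &lt;code&gt;createPartition&lt;/code&gt;: Glue rejects a partition that is already registered with an &lt;code&gt;AlreadyExistsException&lt;/code&gt;, which will happen whenever a file is re-delivered for an existing day (SQS delivery is at-least-once). A small sketch of a guard that keeps the catalog update idempotent; the &lt;code&gt;ensurePartition&lt;/code&gt; wrapper is illustrative, not part of the original code, and relies on the v2 SDK surfacing the error name on &lt;code&gt;err.code&lt;/code&gt;.&lt;/p&gt;

```typescript
// AlreadyExistsException means the partition is already in the catalog,
// which is fine for our purposes: the data landed and the table knows
// about the location. Anything else is a real failure and should propagate
// so the SQS message is retried.
function isAlreadyExists(err: unknown): boolean {
  return typeof err === "object" && err !== null &&
    (err as { code?: string }).code === "AlreadyExistsException";
}

async function ensurePartition(create: () => Promise<unknown>): Promise<void> {
  try {
    await create();
  } catch (err) {
    if (!isAlreadyExists(err)) throw err; // real failures still surface
  }
}
```

Wrapping the `glue.createPartition(...).promise()` call above in `ensurePartition(() => ...)` makes re-deliveries a no-op instead of a Lambda error.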



&lt;h2&gt;
  
  
  4. Power BI Connection Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prerequisites Checklist
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Local Development Setup&lt;/span&gt;
&lt;span class="c"&gt;# Download Athena ODBC driver&lt;/span&gt;
wget https://s3.amazonaws.com/athena-downloads/drivers/ODBC/SimbaAthenaODBC-1.1.17.1000-Windows.msi

&lt;span class="c"&gt;# 2. Gateway Installation (for enterprise distribution)&lt;/span&gt;
&lt;span class="c"&gt;# Install driver on Power BI Gateway machine&lt;/span&gt;
&lt;span class="c"&gt;# Configure data source in gateway admin console&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Connection Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="c"&gt;# ODBC Connection String
&lt;/span&gt;&lt;span class="py"&gt;Driver&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;{Amazon Athena ODBC Driver};&lt;/span&gt;
&lt;span class="py"&gt;Server&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;athena.us-east-1.amazonaws.com;&lt;/span&gt;
&lt;span class="py"&gt;Port&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;443;&lt;/span&gt;
&lt;span class="py"&gt;Database&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;analytics_db;&lt;/span&gt;
&lt;span class="py"&gt;Workgroup&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;analytics_workgroup;&lt;/span&gt;
&lt;span class="py"&gt;AuthenticationType&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;IAM Credentials;&lt;/span&gt;
&lt;span class="py"&gt;UID&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;AKIA...;&lt;/span&gt;
&lt;span class="py"&gt;PWD&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;secret_key;&lt;/span&gt;
&lt;span class="py"&gt;S3OutputLocation&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;s3://my-company-analytics/query-results/;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Network Requirements
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# VPC Endpoint Configuration (if using private connectivity)&lt;/span&gt;
&lt;span class="na"&gt;vpc_endpoints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;com.amazonaws.us-east-1.athena&lt;/span&gt;
    &lt;span class="na"&gt;route_table_ids&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rtb-xxx"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;com.amazonaws.us-east-1.s3&lt;/span&gt;
    &lt;span class="na"&gt;route_table_ids&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rtb-xxx"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Security Group Rules&lt;/span&gt;
&lt;span class="na"&gt;security_groups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;athena_access&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;ingress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tcp&lt;/span&gt;
        &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;443&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;power-bi-gateway-sg"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  IAM Service Account
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"athena:*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"glue:GetDatabase"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"glue:GetTable"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"glue:GetPartitions"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"s3:GetObject"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"s3:ListBucket"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"s3:PutObject"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:athena:*:*:workgroup/analytics_workgroup"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:glue:*:*:catalog"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:glue:*:*:database/analytics_db"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:s3:::my-company-analytics/*"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Security Best Practices
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Workspace Binding&lt;/strong&gt; - Restrict report access to specific distribution lists&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Row-Level Security&lt;/strong&gt; - Implement in Power BI if data contains sensitive information&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Credential Rotation&lt;/strong&gt; - Set up automatic rotation for IAM access keys&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VPC Isolation&lt;/strong&gt; - Use VPC endpoints for private connectivity when possible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This implementation provides a solid foundation that's extensible for additional entities and maintainable across different AWS environments.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;This serverless analytics architecture demonstrates how modern cloud services can dramatically simplify data pipelines while reducing costs and operational overhead. By treating S3 as both storage and source of truth, we've eliminated the traditional database layer that often exists purely to satisfy business intelligence tool requirements, not actual business logic needs.&lt;/p&gt;

&lt;p&gt;The solution delivers several key advantages: &lt;strong&gt;astronomical cost savings&lt;/strong&gt; through Athena's pay-per-query model versus always-on database instances, &lt;strong&gt;operational simplicity&lt;/strong&gt; with fewer moving parts to manage and monitor, and &lt;strong&gt;architectural flexibility&lt;/strong&gt; that scales effortlessly with data volume and query frequency. The event-driven design ensures data is immediately available for analysis without complex ETL scheduling, while the shared configuration approach makes the entire solution extensible for additional data sources and entities.&lt;/p&gt;

&lt;p&gt;While the initial Power BI connectivity setup requires some networking and driver configuration effort, these are one-time investments that unlock significant long-term value. The slight query latency trade-off compared to traditional databases is typically negligible for most business intelligence use cases, especially when weighed against the dramatic reduction in infrastructure costs and complexity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Takeaway&lt;/strong&gt;: Before defaulting to database solutions for analytics workloads, consider whether your use case actually requires relational database features like transactions, foreign keys, or complex joins. If you're simply storing structured data for reporting and visualization, this S3-native approach can deliver the same business outcomes at a fraction of the cost and complexity.&lt;/p&gt;

&lt;p&gt;The complete implementation, including Infrastructure as Code templates and Lambda functions, provides a foundation that teams can fork, customize, and extend for their specific data analytics needs. As cloud-native architectures continue to mature, patterns like this represent the future of cost-effective, scalable data engineering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ready to implement this solution?&lt;/strong&gt; Start with the shared configuration design, analyze your existing data patterns, and begin building your serverless analytics pipeline today.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>architecture</category>
      <category>analytics</category>
      <category>aws</category>
    </item>
    <item>
      <title>Building a Secure SFTP Server on a Linode Public Subnet</title>
      <dc:creator>JOOJO DONTOH</dc:creator>
      <pubDate>Fri, 05 Sep 2025 07:04:52 +0000</pubDate>
      <link>https://dev.to/joojodontoh/building-a-secure-sftp-server-on-a-linode-public-subnet-3b0j</link>
      <guid>https://dev.to/joojodontoh/building-a-secure-sftp-server-on-a-linode-public-subnet-3b0j</guid>
      <description>&lt;p&gt;In the previous post, I walked through setting up a bare-ish-metal cloud environment with Linode, partitioning public and private subnets, and wiring up your own proxies, and firewalls — without handing everything off to someone else. If you haven't read it, &lt;a href="https://dev.to/joojodontoh/reclaiming-engineering-ownership-a-hands-on-guide-to-bare-metal-cloud-1b8f"&gt;I suggest you do!&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Today, I intend to go one level deeper:&lt;br&gt;
For basic learning purposes, let’s build a &lt;strong&gt;secure SFTP server&lt;/strong&gt; from scratch using the node in our &lt;strong&gt;public subnet&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Why should you do this, you ask?&lt;br&gt;
Because file transfer is a foundational primitive in ops, and there’s no reason to let that knowledge slip through the cracks.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why Build Your Own SFTP Server?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;You &lt;strong&gt;don’t need to rely on a SaaS&lt;/strong&gt; or managed service (lots of manual ops btw)&lt;/li&gt;
&lt;li&gt;You &lt;strong&gt;control access, retention, and isolation&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;You &lt;strong&gt;understand how file access and security&lt;/strong&gt; actually work under the hood&lt;/li&gt;
&lt;li&gt;You can &lt;strong&gt;build automations and workflows&lt;/strong&gt; around it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is especially useful when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You’re collaborating with partners who need to send you files securely&lt;/li&gt;
&lt;li&gt;You want to ship logs, reports, or ETL inputs into your infra&lt;/li&gt;
&lt;li&gt;You’re learning how SSH, chroot jails, and Linux permissions actually work&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;A provisioned &lt;strong&gt;Linode instance in your public subnet&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;public IP address&lt;/strong&gt; and port &lt;code&gt;22&lt;/code&gt; open to trusted IPs&lt;/li&gt;
&lt;li&gt;Basic Linux CLI knowledge&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I handled the first 2 points in &lt;a href="https://dev.to/joojodontoh/reclaiming-engineering-ownership-a-hands-on-guide-to-bare-metal-cloud-1b8f"&gt;this article&lt;/a&gt; &lt;br&gt;
We'll be using &lt;strong&gt;Ubuntu 22.04 LTS&lt;/strong&gt;, but this works on most distros with &lt;code&gt;openssh-server&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step-by-Step Setup
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. Install OpenSSH
&lt;/h3&gt;

&lt;p&gt;Make sure SSH is installed and running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;openssh-server &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Create an SFTP-Only User
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;groupadd sftpusers

&lt;span class="nb"&gt;sudo &lt;/span&gt;useradd &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="nt"&gt;-G&lt;/span&gt; sftpusers &lt;span class="nt"&gt;-s&lt;/span&gt; /sbin/nologin sftpuser1
&lt;span class="nb"&gt;sudo &lt;/span&gt;passwd sftpuser1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This prevents shell access and groups users logically.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Create a Secure Directory Structure
&lt;/h3&gt;

&lt;p&gt;OpenSSH’s &lt;code&gt;ChrootDirectory&lt;/code&gt; &lt;strong&gt;requires that the parent dir is owned by root and not writable&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /sftp/sftpuser1/upload
&lt;span class="nb"&gt;sudo chown &lt;/span&gt;root:root /sftp/sftpuser1
&lt;span class="nb"&gt;sudo chmod &lt;/span&gt;755 /sftp/sftpuser1

&lt;span class="nb"&gt;sudo chown &lt;/span&gt;sftpuser1:sftpusers /sftp/sftpuser1/upload
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a writable &lt;code&gt;/upload&lt;/code&gt; directory while keeping the jail secure.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Configure &lt;code&gt;sshd_config&lt;/code&gt; for SFTP Jail
&lt;/h3&gt;

&lt;p&gt;Append this block to the bottom of &lt;code&gt;/etc/ssh/sshd_config&lt;/code&gt; (&lt;code&gt;sudo nano /etc/ssh/sshd_config&lt;/code&gt; to open):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="n"&gt;Match&lt;/span&gt; &lt;span class="n"&gt;Group&lt;/span&gt; &lt;span class="n"&gt;sftpusers&lt;/span&gt;
  &lt;span class="n"&gt;ChrootDirectory&lt;/span&gt; /&lt;span class="n"&gt;sftp&lt;/span&gt;/%&lt;span class="n"&gt;u&lt;/span&gt;
  &lt;span class="n"&gt;ForceCommand&lt;/span&gt; &lt;span class="n"&gt;internal&lt;/span&gt;-&lt;span class="n"&gt;sftp&lt;/span&gt;
  &lt;span class="n"&gt;X11Forwarding&lt;/span&gt; &lt;span class="n"&gt;no&lt;/span&gt;
  &lt;span class="n"&gt;AllowTcpForwarding&lt;/span&gt; &lt;span class="n"&gt;no&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then restart SSH to apply the changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart ssh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. Use SSH Key Authentication
&lt;/h3&gt;

&lt;p&gt;On your &lt;strong&gt;local machine&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ssh-keygen &lt;span class="nt"&gt;-t&lt;/span&gt; rsa &lt;span class="nt"&gt;-b&lt;/span&gt; 4096 &lt;span class="nt"&gt;-f&lt;/span&gt; ~/.ssh/sftpuser1_key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On your &lt;strong&gt;server&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /home/sftpuser1/.ssh
&lt;span class="nb"&gt;sudo &lt;/span&gt;nano /home/sftpuser1/.ssh/authorized_keys
&lt;span class="c"&gt;# Paste public key here&lt;/span&gt;

&lt;span class="nb"&gt;sudo chown&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; sftpuser1:sftpusers /home/sftpuser1/.ssh
&lt;span class="nb"&gt;sudo chmod &lt;/span&gt;700 /home/sftpuser1/.ssh
&lt;span class="nb"&gt;sudo chmod &lt;/span&gt;600 /home/sftpuser1/.ssh/authorized_keys
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can now disable password login if you wish.&lt;/p&gt;
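<p>If you do, a minimal sketch of what that looks like (added inside the same <code>Match</code> block from step 4, followed by an SSH restart):<br>
</p>

<div class="highlight js-code-highlight">
<pre class="highlight conf"><code>Match Group sftpusers
  # Keys only: reject password attempts for this group
  PasswordAuthentication no
</code></pre>

</div>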

&lt;h3&gt;
  
  
  6. Secure the Server
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Open port &lt;strong&gt;22&lt;/strong&gt; to only &lt;strong&gt;your office or VPN IP&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Install &lt;strong&gt;fail2ban&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;fail2ban
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Consider using &lt;strong&gt;logrotate&lt;/strong&gt; and basic audit logging&lt;/li&gt;
&lt;/ul&gt;
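<p>As a starting point, a minimal <code>fail2ban</code> jail for SSH might look like this in <code>/etc/fail2ban/jail.local</code> (the thresholds here are suggestions, not gospel):<br>
</p>

<div class="highlight js-code-highlight">
<pre class="highlight conf"><code>[sshd]
enabled  = true
port     = 22
maxretry = 5
bantime  = 1h
</code></pre>

</div>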

&lt;h3&gt;
  
  
  7. Test the Setup
&lt;/h3&gt;

&lt;p&gt;From your local terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;sftp &lt;span class="nt"&gt;-i&lt;/span&gt; ~/.ssh/sftpuser1_key sftpuser1@&amp;lt;your-linode-ip&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd /upload
put testfile.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Tip: SFTP isn’t a shell. You can’t run &lt;code&gt;cat&lt;/code&gt; or &lt;code&gt;echo&lt;/code&gt; — just &lt;code&gt;put&lt;/code&gt;, &lt;code&gt;get&lt;/code&gt;, &lt;code&gt;ls&lt;/code&gt;, etc.&lt;/p&gt;
&lt;/blockquote&gt;
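<p>Those interactive commands can also be scripted with SFTP’s batch mode, which is useful once you start automating transfers. A hypothetical <code>upload.batch</code> file:<br>
</p>

<div class="highlight js-code-highlight">
<pre class="highlight plaintext"><code>cd /upload
put testfile.txt
bye
</code></pre>

</div>

<p>Run it with <code>sftp -b upload.batch -i ~/.ssh/sftpuser1_key sftpuser1@&lt;your-linode-ip&gt;</code>. With <code>-b</code>, sftp aborts on the first failed command and exits non-zero, which makes it cron-friendly.</p>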

&lt;h2&gt;
  
  
  Why Public Subnet?
&lt;/h2&gt;

&lt;p&gt;Because this server needs to be accessed &lt;strong&gt;from the internet&lt;/strong&gt;. If it were in a private subnet, you’d need a bastion or VPN to reach it — useful for internal automation, but not external sharing.&lt;/p&gt;

&lt;p&gt;Just like with the previous setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;public subnet gives controlled external access&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Security is enforced via &lt;strong&gt;firewall + SSH key access&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Lessons Reinforced
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Chroot directories &lt;strong&gt;must be owned by root&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;SFTP can be &lt;strong&gt;a secure alternative to email attachments or third-party tools&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;You can still &lt;strong&gt;own your file flows&lt;/strong&gt; in a modern, cloud-native way&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;Some ideas for future improvements and continued learning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automate uploads from other services or cron jobs&lt;/li&gt;
&lt;li&gt;Pipe incoming files into a processing queue (e.g., via inotify or systemd)&lt;/li&gt;
&lt;li&gt;Back up uploaded files to S3&lt;/li&gt;
&lt;li&gt;Add a DNS record if you want: &lt;code&gt;sftp.yourdomain.com&lt;/code&gt; → your Linode IP&lt;/li&gt;
&lt;/ul&gt;
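<p>For the processing-queue idea, one low-dependency sketch uses systemd path units (the unit names and script path below are hypothetical):<br>
</p>

<div class="highlight js-code-highlight">
<pre class="highlight conf"><code># /etc/systemd/system/sftp-uploads.path
[Path]
# Fires whenever the upload jail contains files
DirectoryNotEmpty=/sftp/sftpuser1/upload

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/sftp-uploads.service
[Service]
Type=oneshot
ExecStart=/usr/local/bin/process-uploads.sh
</code></pre>

</div>

<p>Enable it with <code>sudo systemctl enable --now sftp-uploads.path</code>; systemd then activates the matching <code>.service</code> each time the directory becomes non-empty.</p>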

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Owning your infra doesn't mean reinventing everything — it means understanding the tradeoffs and being able to build what you need, when you need it. This is one more building block toward that confidence.&lt;/p&gt;

&lt;p&gt;You’ve got this. Nothing is impossible.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Reclaiming Engineering Ownership: A Hands-On Guide to Bare-ish-Metal Cloud</title>
      <dc:creator>JOOJO DONTOH</dc:creator>
      <pubDate>Fri, 25 Jul 2025 06:13:05 +0000</pubDate>
      <link>https://dev.to/joojodontoh/reclaiming-engineering-ownership-a-hands-on-guide-to-bare-metal-cloud-1b8f</link>
      <guid>https://dev.to/joojodontoh/reclaiming-engineering-ownership-a-hands-on-guide-to-bare-metal-cloud-1b8f</guid>
      <description>&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;I decided to write this after diving into the ongoing conversation around &lt;em&gt;learned helplessness&lt;/em&gt; in software engineering. This is something David Heinemeier Hansson (creator of Ruby on Rails) has been very vocal about, especially on his Twitter. His points may resonate with you depending on your business needs and the layers of abstraction you are willing to take on. I've seen so many companies rack up huge cloud bills, and that can easily convince smaller teams that they need to do the same to be “serious.” But a lot of this complexity is sold to us by vendors whose business depends on making things look harder than they really are: the so-called “merchants of complexity.”&lt;/p&gt;

&lt;p&gt;Learned helplessness, in this context, happens when engineering teams slowly lose the ability, or even the confidence to manage and understand their own infrastructure. Over time, everything becomes someone else’s service: databases, hosting, even cron jobs. And when that happens, teams risk losing technical depth, the ability to troubleshoot under pressure, and even the curiosity that drives real innovation.&lt;/p&gt;

&lt;p&gt;The truth is, setting up your own infrastructure isn’t always as hard or as costly as it seems. Sometimes, going hands-on—provisioning your own servers, configuring your own network—can teach you more and cost you less. This article walks through a high-level, hands-on setup of a simple app using IaaS-level tools (like VPSs, subnets, and Apache proxies), not because it’s the “right” way for every project, but because understanding the layers &lt;em&gt;beneath&lt;/em&gt; the abstraction gives you real control—and that’s a power every engineer should have.&lt;/p&gt;

&lt;p&gt;All of this has led me to revisit the foundational layers of cloud infrastructure, not to throw shade at modern abstractions, but to get a clearer picture of what they’re built on. The app we’ll build is simple and deliberately non-production-ready; the focus is purely on learning and technical understanding. It’s a hands-on journey that starts at the Infrastructure-as-a-Service (IaaS) layer, the lowest abstraction tier in cloud computing (the others being PaaS and SaaS).&lt;/p&gt;

&lt;p&gt;Most of us have used cloud platforms in some form, whether it’s deploying serverless functions like AWS Lambda or Firebase Cloud Functions, or using tools like Heroku or Vercel that abstract away orchestration entirely. But beneath all that convenience lies real, raw infrastructure: virtual machines, subnets, proxies, and firewalls. This article is a small tutorial that dives into exactly that. I’ll also drop “nuggets” throughout: pointers to deepen your understanding if you’re curious to dig further at any point in the process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources Used for This Exploration
&lt;/h2&gt;

&lt;p&gt;To keep things practical and grounded, I built a small full-stack notes application that serves as the foundation for the tutorial. The frontend is a &lt;a href="https://github.com/Joojo7/notes-app-frontend" rel="noopener noreferrer"&gt;Next.js&lt;/a&gt; application, responsible for rendering UI and communicating with the backend via API calls. The backend is a lightweight &lt;a href="https://github.com/Joojo7/notes-app-backend" rel="noopener noreferrer"&gt;Node.js app built with Koa&lt;/a&gt;, handling user authentication and CRUD operations for notes. For data storage, I used a PostgreSQL database containerized and hosted within the private subnet. Everything runs on Akamai’s cloud infrastructure—specifically their VPC and Linode offerings—which provide just enough control to explore low-level networking, subnetting, and proxy setups, without overwhelming complexity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1562zsti44t57d2upklb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1562zsti44t57d2upklb.png" alt="backend" width="800" height="714"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj573uam88hvjxogwm2s9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj573uam88hvjxogwm2s9.png" alt="frontend" width="800" height="692"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Provisioning a VPC and Partitioning Your Network
&lt;/h2&gt;

&lt;p&gt;The first step is to rent a VPS from a provider that gives you fine-grained control over networking—options include DigitalOcean, Linode, and AWS EC2. For this project, I chose &lt;strong&gt;Akamai’s Linode platform&lt;/strong&gt;, which allowed me to create a Virtual Private Cloud (VPC) and define custom subnets. I partitioned the network into two subnets: a &lt;strong&gt;public subnet&lt;/strong&gt; that can access the internet (ideal for hosting the frontend), and a &lt;strong&gt;private subnet&lt;/strong&gt; that has no direct internet access (reserved for backend services and the database). When creating your subnets, you’ll need to allocate CIDR blocks to define the IP ranges. For example, the public subnet could use &lt;code&gt;10.0.1.0/24&lt;/code&gt;, while the private subnet could use &lt;code&gt;10.0.2.0/24&lt;/code&gt;. These ranges should be chosen with future growth and IP efficiency in mind. A good practice is to size your subnets based on how many nodes or services you expect to scale into each zone.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjmgnpu9jmkwgg19p6inn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjmgnpu9jmkwgg19p6inn.png" alt=" " width="800" height="727"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;em&gt;Nugget: Take some time to explore how CIDR blocks work, how IP addresses are distributed, and why certain ranges are considered private. It’s a foundational concept for understanding modern networking.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
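<p>As a quick illustration of what the <code>/24</code> suffix buys you: the prefix length fixes how many host addresses a subnet can hold. A small shell sketch:</p>

```shell
# Usable host addresses in a subnet: 2^(32 - prefix), minus the
# reserved network and broadcast addresses.
prefix=24
hosts=$(( (1 << (32 - prefix)) - 2 ))
echo "A /$prefix subnet such as 10.0.1.0/$prefix has $hosts usable hosts"
```

<p>So <code>10.0.1.0/24</code> and <code>10.0.2.0/24</code> give you 254 addresses each, plenty for this project and easy to resize later.</p>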

&lt;h2&gt;
  
  
  Creating Firewalls to Enforce Subnet Isolation
&lt;/h2&gt;

&lt;p&gt;With your subnets in place, the next step is to &lt;strong&gt;enforce network boundaries&lt;/strong&gt; using firewall rules. Firewalls allow you to control which traffic is allowed to enter or leave a node based on IP ranges, ports, and protocols. For this tutorial, we’ll design our firewall to &lt;strong&gt;completely isolate the private subnet from the internet&lt;/strong&gt;, while exposing only the necessary ports in the public subnet. Let’s break this down into &lt;strong&gt;inbound&lt;/strong&gt; and &lt;strong&gt;outbound&lt;/strong&gt; rules.&lt;/p&gt;

&lt;h4&gt;
  
  
  Inbound Rules
&lt;/h4&gt;

&lt;p&gt;Inbound rules govern what kind of traffic is allowed &lt;em&gt;into&lt;/em&gt; your nodes.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Default Deny:&lt;/strong&gt;&lt;br&gt;
By default, deny all inbound traffic. Only allow what’s explicitly needed. This is the safest baseline.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ICMP for Testing (Optional):&lt;/strong&gt;&lt;br&gt;
You may want to temporarily allow ICMP (ping) traffic to help debug connectivity.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Protocol: ICMP&lt;/li&gt;
&lt;li&gt;Source: &lt;code&gt;0.0.0.0/0&lt;/code&gt; (or your own IP)&lt;/li&gt;
&lt;li&gt;Action: Accept&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;Public Subnet — Web Access (Frontend App):&lt;/strong&gt;
Your public-facing frontend must be reachable from the internet.&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Ports: &lt;code&gt;80&lt;/code&gt; (HTTP), &lt;code&gt;443&lt;/code&gt; (HTTPS)&lt;/li&gt;
&lt;li&gt;Protocol: TCP&lt;/li&gt;
&lt;li&gt;Source: &lt;code&gt;0.0.0.0/0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Action: Accept&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="4"&gt;
&lt;li&gt;
&lt;strong&gt;Private to Public — Forward Proxy Access:&lt;/strong&gt;
To allow your private nodes (backend/DB) to make &lt;strong&gt;outbound&lt;/strong&gt; requests via the &lt;strong&gt;public proxy&lt;/strong&gt;, you must enable &lt;em&gt;inbound&lt;/em&gt; access on the proxy port in the public node.&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Port: &lt;code&gt;8080&lt;/code&gt; (or your chosen proxy port)&lt;/li&gt;
&lt;li&gt;Protocol: TCP&lt;/li&gt;
&lt;li&gt;Source: &lt;code&gt;10.0.2.0/24&lt;/code&gt; (Private subnet IP range)&lt;/li&gt;
&lt;li&gt;Action: Accept&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="5"&gt;
&lt;li&gt;
&lt;strong&gt;Private Subnet — Internal Communication (Backend ↔ DB):&lt;/strong&gt;
Backend services in the private subnet need to talk to each other, especially to your database node.&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Port: e.g., &lt;code&gt;5432&lt;/code&gt; (PostgreSQL)&lt;/li&gt;
&lt;li&gt;Protocol: TCP&lt;/li&gt;
&lt;li&gt;Source: &lt;code&gt;10.0.2.0/24&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Action: Accept&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="6"&gt;
&lt;li&gt;
&lt;strong&gt;SSH Access for Maintenance:&lt;/strong&gt;
You’ll want to be able to SSH into your Linodes for debugging or setup.&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Port: &lt;code&gt;22&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Protocol: TCP&lt;/li&gt;
&lt;li&gt;Source: Your public IP (or &lt;code&gt;0.0.0.0/0&lt;/code&gt; if unrestricted, though this is not recommended)&lt;/li&gt;
&lt;li&gt;Action: Accept&lt;/li&gt;
&lt;li&gt;💡 &lt;em&gt;Tip: For security, restrict this to your personal IP only.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;em&gt;Nugget: SSH uses asymmetric cryptography—your private key remains on your machine, while the public key is added to the server’s &lt;code&gt;~/.ssh/authorized_keys&lt;/code&gt;. Understanding this is essential when managing key-based access securely.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Outbound Rules
&lt;/h4&gt;

&lt;p&gt;Outbound rules control what kind of traffic your nodes are allowed to initiate.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;ICMP for Testing (Optional):&lt;/strong&gt;
Allow outbound ping (ICMP) for basic connectivity tests.&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Protocol: ICMP&lt;/li&gt;
&lt;li&gt;Destination: &lt;code&gt;0.0.0.0/0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Action: Accept&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;Internet Access (Public Subnet Only):&lt;/strong&gt;
Allow HTTP and HTTPS requests from the public subnet.&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Ports: &lt;code&gt;80&lt;/code&gt;, &lt;code&gt;443&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Protocol: TCP&lt;/li&gt;
&lt;li&gt;Destination: &lt;code&gt;0.0.0.0/0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Action: Accept&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;Private Subnet via Proxy (Handled Later):&lt;/strong&gt;
The private subnet won’t have direct internet access. Instead, outbound requests will go through the forward proxy configured on the public node. We’ll configure this in a later section.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This firewall setup ensures that your &lt;strong&gt;private services are protected&lt;/strong&gt;, your &lt;strong&gt;frontend is accessible&lt;/strong&gt;, and your &lt;strong&gt;infrastructure remains tightly controlled&lt;/strong&gt;. Always test your rules incrementally—misconfigurations are common but easily fixed if introduced step-by-step.&lt;/p&gt;

&lt;p&gt;An example of inbound rules:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;th&gt;Source (IPv4)&lt;/th&gt;
&lt;th&gt;Port&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ACCEPT&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.0.0.0/0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;80&lt;/td&gt;
&lt;td&gt;Public HTTP access (frontend)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ACCEPT&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.0.0.0/0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3000&lt;/td&gt;
&lt;td&gt;(Optional) direct access to frontend’s internal port (e.g. for testing or bypassing Apache)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ACCEPT&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.0.0.0/0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;443&lt;/td&gt;
&lt;td&gt;Public HTTPS access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ACCEPT&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.0.0.0/0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;ICMP for ping/testing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ACCEPT&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.0.0.0/0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;SSH access (can be restricted to your personal IP)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ACCEPT&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10.0.1.2/32&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;8000&lt;/td&gt;
&lt;td&gt;Backend service from public proxy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ACCEPT&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10.0.2.0/24&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;8080&lt;/td&gt;
&lt;td&gt;Proxy communication from private subnet to public proxy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ACCEPT&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10.0.2.0/24&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;5432&lt;/td&gt;
&lt;td&gt;PostgreSQL access for backend services in the same subnet&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Example of Outbound rules:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;th&gt;Destination (IPv4)&lt;/th&gt;
&lt;th&gt;Port&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ACCEPT&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.0.0.0/0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;443&lt;/td&gt;
&lt;td&gt;HTTPS (package installs, certs, APIs)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ACCEPT&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.0.0.0/0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;80&lt;/td&gt;
&lt;td&gt;HTTP (package installs, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ACCEPT&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.0.0.0/0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;ICMP (ping)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Setting Up Your First Linode (Public Subnet)
&lt;/h2&gt;

&lt;p&gt;With your VPC and firewall rules in place, it's time to spin up your actual infrastructure nodes—starting with a public-facing Linode. This Linode will act as the gateway to your application, serving your frontend (and optionally proxying to your backend), and it’s where we’ll verify that your network and firewall setup is working correctly.&lt;/p&gt;

&lt;p&gt;To begin, provision a low-cost Linode (around $5/month at the time of writing) and &lt;strong&gt;assign it to your public subnet&lt;/strong&gt;. Make sure to also &lt;strong&gt;attach the firewall&lt;/strong&gt; you previously configured, so that all the carefully crafted rules now apply to this instance.&lt;/p&gt;

&lt;p&gt;Once deployed, you’ll need to access the machine. You can do this in two ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SSH into the Linode&lt;/strong&gt; using your terminal and the public IP.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use LISH (Linode Shell)&lt;/strong&gt; from the Akamai console if you don’t have SSH access yet.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Inside your Linode, perform a few critical connectivity tests to ensure your networking is correctly configured:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Test Internet Access&lt;/strong&gt;
Run a simple ping to Google’s DNS to verify outbound access is allowed by your firewall:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   ping 8.8.8.8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;Update the Package Index&lt;/strong&gt;
This verifies that outbound HTTPS is working and you can install packages:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="nb"&gt;sudo &lt;/span&gt;apt update
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If both commands succeed, your firewall and public subnet are configured correctly. You now have a functioning public node, fully capable of installing software, serving applications, and acting as a forward or reverse proxy for your private subnet. This will serve as the entry point to your application infrastructure as we move forward.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up a Private Linode (Backend Node)
&lt;/h2&gt;

&lt;p&gt;Next, provision your &lt;strong&gt;private Linode&lt;/strong&gt;, which will host your backend application. Just like the public node, this Linode is also very affordable (around $5/month), but unlike the public node, this one will &lt;strong&gt;not&lt;/strong&gt; be assigned a public IP address—&lt;strong&gt;ensuring it has no direct access to or from the internet&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When creating this Linode:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Assign it to the private subnet&lt;/strong&gt; you created earlier.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Do not assign a public IP address&lt;/strong&gt;. This isolation is intentional: your backend should only talk to the frontend and the database, not the public internet.&lt;/li&gt;
&lt;li&gt;Ensure your &lt;strong&gt;firewall rules&lt;/strong&gt; allow this node to communicate with:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;The public node (for outbound access via proxy)&lt;/li&gt;
&lt;li&gt;Other private nodes like the DB server&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since the private node will not have internet access, you’ll configure a &lt;strong&gt;forward proxy&lt;/strong&gt; later in the article—hosted on the public Linode—to help it install packages or make outbound HTTP requests securely and indirectly.&lt;/p&gt;

&lt;p&gt;In addition to forward proxying, you’ll also need a &lt;strong&gt;reverse proxy&lt;/strong&gt;, typically using Apache, to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Route external requests to the appropriate internal services (e.g., &lt;code&gt;/api&lt;/code&gt; to the backend)&lt;/li&gt;
&lt;li&gt;Handle SSL termination and clean URL routing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To configure Apache as a reverse proxy, you’ll later modify the default site config file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;nano /etc/apache2/sites-available/000-default.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or better yet, create your own virtual host file to keep things modular and clear.&lt;/p&gt;
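<p>A minimal sketch of such a virtual host (the domain, backend IP, and ports are placeholders; Apache needs <code>mod_proxy</code> and <code>mod_proxy_http</code> enabled via <code>a2enmod</code> first). Note that the more specific <code>/api</code> rule must come before the catch-all <code>/</code> rule, since Apache matches <code>ProxyPass</code> directives in order:<br>
</p>

<div class="highlight js-code-highlight">
<pre class="highlight conf"><code>&lt;VirtualHost *:80&gt;
  ServerName notes.example.com

  # Route API calls to the backend node in the private subnet
  ProxyPass        /api http://10.0.2.2:8000/api
  ProxyPassReverse /api http://10.0.2.2:8000/api

  # Everything else goes to the Next.js frontend on this node
  ProxyPass        / http://127.0.0.1:3000/
  ProxyPassReverse / http://127.0.0.1:3000/
&lt;/VirtualHost&gt;
</code></pre>

</div>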

&lt;h4&gt;
  
  
  Buy and Configure a Domain
&lt;/h4&gt;

&lt;p&gt;To make your app feel more “real” and not just accessible by an IP, you should purchase a cheap domain (e.g., from Namecheap). Once you own a domain:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Create an A Record&lt;/strong&gt; that maps your domain (e.g., &lt;code&gt;notes.online&lt;/code&gt;) to the &lt;strong&gt;public IP of your frontend Linode&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Once DNS propagation completes, install a free HTTPS certificate via Let’s Encrypt by running:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="nb"&gt;sudo &lt;/span&gt;certbot &lt;span class="nt"&gt;--apache&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; yourdomain.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;If your DNS records are correctly set up, this command will:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Validate domain ownership&lt;/li&gt;
&lt;li&gt;Automatically install an HTTPS cert&lt;/li&gt;
&lt;li&gt;Store it in &lt;code&gt;/etc/letsencrypt/live/yourdomain.com&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This setup ensures that your frontend can be accessed securely via your custom domain, and that traffic can be reverse-proxied to your backend securely—all while maintaining strict network isolation between tiers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up a Private Linode for the Database (PostgreSQL)
&lt;/h2&gt;

&lt;p&gt;The final component of your infrastructure setup is the &lt;strong&gt;database node&lt;/strong&gt;, which will also reside entirely in your &lt;strong&gt;private subnet&lt;/strong&gt;. This ensures that your data is not exposed to the public internet and can only be accessed by other internal services—specifically, your backend application.&lt;/p&gt;

&lt;p&gt;Here’s how to set it up:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Provision a new Linode&lt;/strong&gt; (again, a $5/month plan will suffice).&lt;/li&gt;
&lt;li&gt;Assign this Linode to the &lt;strong&gt;private subnet&lt;/strong&gt;, just like your backend node.&lt;/li&gt;
&lt;li&gt;Apply the same &lt;strong&gt;firewall&lt;/strong&gt; to this node so that only traffic from the &lt;strong&gt;private subnet&lt;/strong&gt;—especially your backend—can reach it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Do not assign a public IP&lt;/strong&gt;. Your DB should never be exposed to the internet.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once your Linode is up and running, access it via LISH (since it has no public IP to SSH into directly) or hop through your public Linode as a jump host, and install PostgreSQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;postgresql postgresql-contrib &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After installation, configure PostgreSQL to &lt;strong&gt;accept connections over the private subnet&lt;/strong&gt;:&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 1: Update &lt;code&gt;postgresql.conf&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;This file controls PostgreSQL’s runtime behavior. You need to allow it to listen for connections beyond &lt;code&gt;localhost&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;nano /etc/postgresql/14/main/postgresql.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Find the line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="c"&gt;#listen_addresses = 'localhost'
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace it with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="n"&gt;listen_addresses&lt;/span&gt; = &lt;span class="s1"&gt;'*'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tells PostgreSQL to listen on all network interfaces—including the private IP assigned by the subnet. If you want a tighter scope, you can list just the node’s private IP instead of &lt;code&gt;'*'&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 2: Update &lt;code&gt;pg_hba.conf&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;This file defines &lt;strong&gt;who&lt;/strong&gt; can connect, &lt;strong&gt;from where&lt;/strong&gt;, and &lt;strong&gt;how&lt;/strong&gt; they authenticate.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;nano /etc/postgresql/14/main/pg_hba.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At the bottom, add a rule that allows incoming connections &lt;strong&gt;from the private subnet&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;host    all             all             10.0.2.0/24            md5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means: &lt;em&gt;allow all users to connect to all databases from any machine in the private subnet using password (md5) authentication&lt;/em&gt;. On PostgreSQL 14+, &lt;code&gt;scram-sha-256&lt;/code&gt; is the preferred password method; &lt;code&gt;md5&lt;/code&gt; still works but is weaker.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Nugget:&lt;/strong&gt; Spend time reading about how &lt;code&gt;postgresql.conf&lt;/code&gt; and &lt;code&gt;pg_hba.conf&lt;/code&gt; interact. They are the gatekeepers of your DB’s network exposure and authentication model.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Step 3: Restart PostgreSQL
&lt;/h4&gt;

&lt;p&gt;Apply your configuration changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart postgresql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your PostgreSQL instance is now fully isolated within the private subnet and only reachable by other nodes in the same subnet—specifically your backend node. You’ve effectively recreated a secure, cloud-style VPC networking setup, but on your own terms, for a fraction of the cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enhance Communication Between Nodes and the Internet
&lt;/h2&gt;

&lt;p&gt;In previous sections, we intentionally designed the infrastructure so that &lt;strong&gt;nodes in the private subnet do not have direct access to the internet&lt;/strong&gt;. This is a common and recommended security posture—but it introduces a challenge: how do backend services fetch updates, install packages, or interact with external APIs?&lt;/p&gt;

&lt;p&gt;The solution is to introduce a &lt;strong&gt;forward proxy&lt;/strong&gt; in the public subnet. This allows private nodes to &lt;strong&gt;send outbound traffic via a trusted middleman&lt;/strong&gt; (the public node), without exposing themselves directly to the internet.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Not Use Basic NAT?
&lt;/h3&gt;

&lt;p&gt;While it's tempting to set up a &lt;strong&gt;1:1 NAT (Basic NAT)&lt;/strong&gt; for simplicity, this approach bypasses the layered security model we’re trying to build. It essentially grants your private nodes direct exposure, undermining the purpose of subnet separation.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🔗 This is the official documentation of the &lt;a href="https://techdocs.akamai.com/cloud-computing/docs/forward-proxy-for-vpc#forward-proxy" rel="noopener noreferrer"&gt;Akamai Forward Proxy&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Set Up a Forward Proxy with Apache (on the Public Node)
&lt;/h3&gt;

&lt;p&gt;Let’s walk through setting up a proper &lt;strong&gt;forward proxy&lt;/strong&gt; using Apache.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 1: Access the Public Linode
&lt;/h4&gt;

&lt;p&gt;SSH into your public-facing Linode (in the public subnet) or use the LISH console via Akamai’s dashboard.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ssh root@&amp;lt;your-public-linode-ip&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Step 2: Install &amp;amp; Prepare Apache
&lt;/h4&gt;

&lt;p&gt;Update your packages and ensure Apache is installed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;apache2 &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Enable necessary Apache modules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;a2enmod proxy proxy_http proxy_connect
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Step 3: Create a Forward Proxy Configuration
&lt;/h4&gt;

&lt;p&gt;Open a new config file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;nano /etc/apache2/sites-available/fwd-proxy.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Paste the following configuration (adjust IPs as needed):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight apache"&gt;&lt;code&gt;&lt;span class="c"&gt;# Listen on the internal IP (public node) at port 8080.&lt;/span&gt;
&lt;span class="c"&gt;# This sets up the Apache server to accept proxy requests from the private subnet via port 8080.&lt;/span&gt;
&lt;span class="nc"&gt;Listen&lt;/span&gt; 10.0.2.2:8080

&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nl"&gt;VirtualHost&lt;/span&gt;&lt;span class="sr"&gt; *:8080&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;
&lt;/span&gt;    &lt;span class="c"&gt;# Admin email for server issues (not strictly required unless you're sending error reports).&lt;/span&gt;
    &lt;span class="nc"&gt;ServerAdmin&lt;/span&gt; webmaster@localhost

    &lt;span class="c"&gt;# Root directory for served files (not used in proxying but required syntactically).&lt;/span&gt;
    &lt;span class="nc"&gt;DocumentRoot&lt;/span&gt; /var/www/html

    &lt;span class="c"&gt;# Log errors from proxy activity here (useful for debugging)&lt;/span&gt;
    &lt;span class="nc"&gt;ErrorLog&lt;/span&gt; ${APACHE_LOG_DIR}/fwd-proxy-error.log

    &lt;span class="c"&gt;# Log all access through the proxy&lt;/span&gt;
    &lt;span class="nc"&gt;CustomLog&lt;/span&gt; ${APACHE_LOG_DIR}/fwd-proxy-access.log combined

    &lt;span class="c"&gt;# Enable forward proxy mode (Apache acts as a middleman for outbound traffic)&lt;/span&gt;
    &lt;span class="nc"&gt;ProxyRequests&lt;/span&gt; &lt;span class="ss"&gt;On&lt;/span&gt;

    &lt;span class="c"&gt;# Adds headers like Via: to show request went through a proxy (useful for tracing)&lt;/span&gt;
    &lt;span class="nc"&gt;ProxyVia&lt;/span&gt; &lt;span class="ss"&gt;On&lt;/span&gt;

    &lt;span class="c"&gt;# Restrict proxy access to only IPs from the private subnet&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nl"&gt;Proxy&lt;/span&gt;&lt;span class="sr"&gt; "*"&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;
&lt;/span&gt;        &lt;span class="nc"&gt;Require&lt;/span&gt; ip 10.0.2.0/24
    &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nl"&gt;Proxy&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;
&amp;lt;/&lt;/span&gt;&lt;span class="nl"&gt;VirtualHost&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Nugget:&lt;/strong&gt; The &lt;code&gt;ProxyRequests On&lt;/code&gt; directive enables forward proxying. The &lt;code&gt;&amp;lt;Proxy "*"&amp;gt;&lt;/code&gt; block restricts usage of this proxy to requests originating from your private subnet only.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Save and close the file.&lt;/p&gt;
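&lt;p&gt;To build intuition for what a &lt;code&gt;/24&lt;/code&gt; rule like &lt;code&gt;Require ip 10.0.2.0/24&lt;/code&gt; actually matches, here is a small helper. It is purely illustrative and not part of the proxy setup; for a &lt;code&gt;/24&lt;/code&gt;, membership is simply "the first three octets agree".&lt;/p&gt;

```shell
# Hypothetical helper, not part of the Apache config: shows which addresses
# a /24 rule such as "Require ip 10.0.2.0/24" would match.
in_subnet_24() {
  local ip="$1" net="$2"        # e.g. in_subnet_24 10.0.2.7 10.0.2.0
  [ "${ip%.*}" = "${net%.*}" ]  # compare everything before the last dot
}

in_subnet_24 10.0.2.7 10.0.2.0 && echo "10.0.2.7 allowed"
in_subnet_24 10.0.1.7 10.0.2.0 || echo "10.0.1.7 denied"
```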

&lt;h4&gt;
  
  
  Step 4: Enable the Proxy Site and Restart Apache
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo chown &lt;/span&gt;root:root /etc/apache2/sites-available/fwd-proxy.conf
&lt;span class="nb"&gt;sudo chmod &lt;/span&gt;0644 /etc/apache2/sites-available/fwd-proxy.conf
&lt;span class="nb"&gt;sudo &lt;/span&gt;a2ensite fwd-proxy
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart apache2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Test the Proxy
&lt;/h3&gt;

&lt;p&gt;From a &lt;strong&gt;private Linode&lt;/strong&gt;, you can now route outbound traffic through the proxy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-x&lt;/span&gt; http://10.0.2.2:8080 https://example.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If successful, your private node is now securely communicating with the internet &lt;strong&gt;through your public node&lt;/strong&gt;—without needing a public IP of its own.&lt;/p&gt;

&lt;p&gt;This setup retains your network isolation while still enabling secure, auditable internet access.&lt;/p&gt;

&lt;h3&gt;
  
  
  Enhance Communication Between Nodes and the Internet (Part 2: Private Nodes)
&lt;/h3&gt;

&lt;p&gt;Now that your forward proxy is configured and running on the public Linode, it’s time to set up your private nodes—specifically the backend and database Linodes—to route their outbound internet traffic through this proxy.&lt;/p&gt;

&lt;h4&gt;
  
  
  Backend Private Node Configuration
&lt;/h4&gt;

&lt;p&gt;On your &lt;strong&gt;backend node&lt;/strong&gt; in the private subnet (which has no direct access to the internet), you’ll need to explicitly configure it to use the forward proxy you previously set up on the public Linode (&lt;code&gt;10.0.2.2:8080&lt;/code&gt;, the public node’s private-subnet address).&lt;/p&gt;

&lt;p&gt;Start by configuring the proxy settings for &lt;code&gt;apt&lt;/code&gt;, so you can perform package installations via the proxy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'Acquire::http::proxy "http://10.0.2.2:8080";'&lt;/span&gt; | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; /etc/apt/apt.conf.d/proxy.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
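&lt;p&gt;That rule covers &lt;code&gt;http://&lt;/code&gt; sources. Recent apt releases fall back to the &lt;code&gt;http&lt;/code&gt; proxy setting for &lt;code&gt;https&lt;/code&gt; sources, but you can set the https variant explicitly as belt-and-braces (the IP assumes the public node's private address is &lt;code&gt;10.0.2.2&lt;/code&gt;, as in the proxy's &lt;code&gt;Listen&lt;/code&gt; directive):&lt;/p&gt;

```shell
# Optional: explicitly route https:// apt sources through the same proxy.
echo 'Acquire::https::proxy "http://10.0.2.2:8080";' | sudo tee -a /etc/apt/apt.conf.d/proxy.conf
```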



&lt;p&gt;Once done, test it using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also test general HTTP traffic routing through the proxy with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;--proxy&lt;/span&gt; 10.0.2.2:8080 http://example.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If both work as expected, proceed to &lt;strong&gt;route all HTTP/HTTPS traffic system-wide&lt;/strong&gt; through the proxy. This ensures that any application or system utility that needs external access will use the proxy automatically.&lt;/p&gt;

&lt;p&gt;Edit the &lt;code&gt;/etc/environment&lt;/code&gt; file to define the proxy variables globally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;nano /etc/environment
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then add:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;http_proxy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"http://10.0.1.2:8080"&lt;/span&gt;
&lt;span class="nv"&gt;https_proxy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"http://10.0.1.2:8080"&lt;/span&gt;
&lt;span class="nv"&gt;no_proxy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"localhost,127.0.0.1,::1"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These environment variables will persist across sessions and reboots. For them to take full effect, reboot the node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;reboot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
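&lt;p&gt;After logging back in, you can confirm the variables are visible to your session. The sketch below exports them manually so the check is self-contained; on the node itself they come from &lt;code&gt;/etc/environment&lt;/code&gt;, and the values assume the proxy listens on &lt;code&gt;10.0.2.2:8080&lt;/code&gt;.&lt;/p&gt;

```shell
# Confirm the proxy variables are present in the environment.
# (Exported here only to make the check self-contained.)
export http_proxy="http://10.0.2.2:8080"
export https_proxy="http://10.0.2.2:8080"
export no_proxy="localhost,127.0.0.1,::1"
env | grep -i '_proxy'
```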



&lt;h3&gt;
  
  
  Setup Applications and Storage
&lt;/h3&gt;

&lt;p&gt;With your networking, firewall, and proxy configuration complete, the next step is to deploy your applications on the respective nodes. This section guides you through cloning, configuring, and starting both the frontend and backend apps used in this tutorial, along with their storage setup.&lt;/p&gt;

&lt;h4&gt;
  
  
  Public Linode – Frontend Application
&lt;/h4&gt;

&lt;p&gt;Your &lt;strong&gt;public Linode&lt;/strong&gt; is where the frontend (Next.js) application will live, and it’s accessible to the outside world via your domain and Apache reverse proxy.&lt;/p&gt;

&lt;p&gt;Follow these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Run an update on your packages:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="nb"&gt;sudo &lt;/span&gt;apt update
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Install Git:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;Clone the frontend repository made specifically for this tutorial:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   git clone https://github.com/Joojo7/notes-app-frontend
   &lt;span class="nb"&gt;cd &lt;/span&gt;notes-app-frontend
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;Install the required Node.js dependencies:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   npm &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There’s no need to over-engineer this with a CI/CD pipeline, since it’s a one-off learning project. You can start the app directly or with a tool like &lt;code&gt;pm2&lt;/code&gt; if you want to keep it alive in the background.&lt;/p&gt;

&lt;h4&gt;
  
  
  Private Linode – Backend Application
&lt;/h4&gt;

&lt;p&gt;On your &lt;strong&gt;private backend Linode&lt;/strong&gt;, follow similar steps, with additional backend-specific setup:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Update the system and install Git:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="nb"&gt;sudo &lt;/span&gt;apt update
   &lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Clone the backend repository:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   git clone https://github.com/Joojo7/notes-app-backend
   &lt;span class="nb"&gt;cd &lt;/span&gt;notes-app-backend
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;Ensure you have the following installed:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Node.js (v18 or higher)&lt;/li&gt;
&lt;li&gt;npm&lt;/li&gt;
&lt;li&gt;Docker + Docker Compose&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="4"&gt;
&lt;li&gt;Create a &lt;code&gt;.env&lt;/code&gt; file at the root of the project. You can copy from &lt;code&gt;.env.example&lt;/code&gt; and customize as needed:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   DB_USER=your_db_user
   DB_PASSWORD=your_db_password
   DB_HOST=localhost
   DB_PORT=5432
   DB_NAME=notes_db
   JWT_SECRET=your_jwt_secret
   JWT_EXPIRATION=15m
   PORT=8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="5"&gt;
&lt;li&gt;To simplify the startup process, a &lt;code&gt;startup.sh&lt;/code&gt; script has been provided. Make it executable and run it:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="nb"&gt;chmod&lt;/span&gt; +x startup.sh
   ./startup.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This script will handle the Docker Compose setup and the backend service bootstrapping.&lt;/p&gt;

&lt;p&gt;More details are available in the backend repo’s README:&lt;br&gt;
🔗 &lt;a href="https://github.com/Joojo7/notes-app-backend?tab=readme-ov-file" rel="noopener noreferrer"&gt;notes-app-backend GitHub&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Setup Applications and Storage (continued)
&lt;/h3&gt;
&lt;h4&gt;
  
  
  Serve and Test the Application
&lt;/h4&gt;

&lt;p&gt;Once both the frontend and backend are installed and configured, it’s time to serve the application to the internet and test its full flow. The public Linode will act as a reverse proxy, routing requests to the appropriate services via Apache.&lt;/p&gt;
&lt;h5&gt;
  
  
  Reverse Proxy from Domain to Frontend (and Backend API)
&lt;/h5&gt;

&lt;p&gt;To serve your frontend from &lt;code&gt;https://yourdomain.com&lt;/code&gt; without needing to append a &lt;code&gt;:3000&lt;/code&gt; port, we’ll configure &lt;strong&gt;Apache as a reverse proxy&lt;/strong&gt;.&lt;/p&gt;
&lt;h5&gt;
  
  
  Steps to Enable Apache Modules
&lt;/h5&gt;

&lt;p&gt;SSH into your public Linode and enable the necessary Apache modules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;a2enmod proxy
&lt;span class="nb"&gt;sudo &lt;/span&gt;a2enmod proxy_http
&lt;span class="nb"&gt;sudo &lt;/span&gt;a2enmod headers
&lt;span class="nb"&gt;sudo &lt;/span&gt;a2enmod rewrite
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then restart Apache to apply the changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart apache2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h5&gt;
  
  
  Configure Apache Reverse Proxy for HTTP and HTTPS
&lt;/h5&gt;

&lt;ol&gt;
&lt;li&gt;Create or modify a site configuration file:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;nano /etc/apache2/sites-available/yourdomain.com.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Paste the following configuration:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;HTTP (Port 80)&lt;/strong&gt; – used for initial redirect or Certbot challenge:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight apache"&gt;&lt;code&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nl"&gt;VirtualHost&lt;/span&gt;&lt;span class="sr"&gt; *:80&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;
&lt;/span&gt;    &lt;span class="nc"&gt;ServerAdmin&lt;/span&gt; webmaster@localhost
    &lt;span class="nc"&gt;DocumentRoot&lt;/span&gt; /var/www/html

    &lt;span class="nc"&gt;ErrorLog&lt;/span&gt; ${APACHE_LOG_DIR}/reverse-proxy-error.log
    &lt;span class="nc"&gt;CustomLog&lt;/span&gt; ${APACHE_LOG_DIR}/reverse-proxy-access.log combined

    &lt;span class="nc"&gt;ProxyRequests&lt;/span&gt; &lt;span class="ss"&gt;Off&lt;/span&gt;

    &lt;span class="c"&gt;# Reverse proxy to backend API&lt;/span&gt;
    &lt;span class="nc"&gt;ProxyPass&lt;/span&gt; /api/notes/ http://10.0.2.2:8000/
    &lt;span class="nc"&gt;ProxyPassReverse&lt;/span&gt; /api/notes/ http://10.0.2.2:8000/

    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nl"&gt;Proxy&lt;/span&gt;&lt;span class="sr"&gt; *&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;
&lt;/span&gt;        &lt;span class="nc"&gt;Require&lt;/span&gt; ip 10.0.2.0/24
    &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nl"&gt;Proxy&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;
&amp;lt;/&lt;/span&gt;&lt;span class="nl"&gt;VirtualHost&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;HTTPS (Port 443)&lt;/strong&gt; – full site access and secure reverse proxy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight apache"&gt;&lt;code&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nl"&gt;IfModule&lt;/span&gt;&lt;span class="sr"&gt; mod_ssl.c&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;
&amp;lt;&lt;/span&gt;&lt;span class="nl"&gt;VirtualHost&lt;/span&gt;&lt;span class="sr"&gt; *:443&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;
&lt;/span&gt;    &lt;span class="nc"&gt;ServerAdmin&lt;/span&gt; webmaster@localhost
    &lt;span class="nc"&gt;ServerName&lt;/span&gt; yourdomain.com
    &lt;span class="nc"&gt;DocumentRoot&lt;/span&gt; /var/www/html

    &lt;span class="nc"&gt;ProxyRequests&lt;/span&gt; &lt;span class="ss"&gt;Off&lt;/span&gt;

    &lt;span class="c"&gt;# Reverse proxy to backend API (must come before frontend proxy)&lt;/span&gt;
    &lt;span class="nc"&gt;ProxyPass&lt;/span&gt; /api/ http://10.0.2.2:8000/
    &lt;span class="nc"&gt;ProxyPassReverse&lt;/span&gt; /api/ http://10.0.2.2:8000/

    &lt;span class="c"&gt;# Reverse proxy to Next.js frontend&lt;/span&gt;
    &lt;span class="nc"&gt;ProxyPass&lt;/span&gt; / http://localhost:3000/
    &lt;span class="nc"&gt;ProxyPassReverse&lt;/span&gt; / http://localhost:3000/

    &lt;span class="c"&gt;# Allow Certbot HTTP challenge&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nl"&gt;Location&lt;/span&gt;&lt;span class="sr"&gt; /.well-known/acme-challenge&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;
&lt;/span&gt;        &lt;span class="nc"&gt;Require&lt;/span&gt; &lt;span class="ss"&gt;all&lt;/span&gt; granted
    &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nl"&gt;Location&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;
&lt;/span&gt;
    &lt;span class="c"&gt;# SSL configuration (provided by Certbot)&lt;/span&gt;
    &lt;span class="nc"&gt;SSLEngine&lt;/span&gt; &lt;span class="ss"&gt;on&lt;/span&gt;
    &lt;span class="nc"&gt;SSLCertificateFile&lt;/span&gt; /etc/letsencrypt/live/yourdomain.com/fullchain.pem
    &lt;span class="nc"&gt;SSLCertificateKeyFile&lt;/span&gt; /etc/letsencrypt/live/yourdomain.com/privkey.pem
    &lt;span class="nc"&gt;Include&lt;/span&gt; /etc/letsencrypt/options-ssl-apache.conf

    &lt;span class="c"&gt;# Logging&lt;/span&gt;
    &lt;span class="nc"&gt;ErrorLog&lt;/span&gt; ${APACHE_LOG_DIR}/reverse-proxy-error.log
    &lt;span class="nc"&gt;CustomLog&lt;/span&gt; ${APACHE_LOG_DIR}/reverse-proxy-access.log combined
&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nl"&gt;VirtualHost&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;
&amp;lt;/&lt;/span&gt;&lt;span class="nl"&gt;IfModule&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
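&lt;p&gt;It helps to see how &lt;code&gt;ProxyPass&lt;/code&gt; rewrites paths: the matched prefix is swapped for the backend URL, so &lt;code&gt;/api/notes/123&lt;/code&gt; reaches the backend as &lt;code&gt;/notes/123&lt;/code&gt;. The helper below is a hypothetical illustration of that mapping, not part of the Apache config:&lt;/p&gt;

```shell
# Hypothetical illustration of "ProxyPass /api/ http://10.0.2.2:8000/":
# the /api/ prefix of the incoming path is replaced with the backend root.
map_path() {
  local path="$1"
  echo "http://10.0.2.2:8000/${path#/api/}"
}

map_path /api/notes/123   # -> http://10.0.2.2:8000/notes/123
```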



&lt;h5&gt;
  
  
  Final Steps
&lt;/h5&gt;

&lt;ol&gt;
&lt;li&gt;Enable the new site config:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="nb"&gt;sudo &lt;/span&gt;a2ensite yourdomain.com.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Disable the default config (optional but recommended):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="nb"&gt;sudo &lt;/span&gt;a2dissite 000-default.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;Reload or restart Apache:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl reload apache2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h5&gt;
  
  
  DNS Setup and Request Flow
&lt;/h5&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Go to your &lt;strong&gt;domain provider&lt;/strong&gt; (e.g., Namecheap, GoDaddy, etc.):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Add an &lt;strong&gt;A record&lt;/strong&gt; that points your domain (e.g., &lt;code&gt;yourdomain.com&lt;/code&gt;) to the &lt;strong&gt;public IP&lt;/strong&gt; of your public Linode.&lt;/li&gt;
&lt;li&gt;Wait for DNS propagation — this may take anywhere from a few minutes to a few hours depending on TTL settings.&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;After DNS is live, the &lt;strong&gt;request flow&lt;/strong&gt; will look like this:&lt;br&gt;&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;

&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client Browser
    ↓
DNS Lookup (yourdomain.com resolves to public IP)
    ↓
Firewall (allows ports 80/443 to Apache)
    ↓
Apache Web Server (reverse proxy)
    ↓
    - /api/ → Private backend service via 10.0.2.2:8000
    - /     → Local frontend served from port 3000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;Now visit &lt;code&gt;https://yourdomain.com&lt;/code&gt; — your frontend should load without the port, and API requests to &lt;code&gt;/api/notes&lt;/code&gt; should proxy correctly to the backend on the private node.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5reoypdv44fgibh51jr5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5reoypdv44fgibh51jr5.png" alt=" " width="744" height="517"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ymn44esf3p0ptljmnq1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ymn44esf3p0ptljmnq1.png" alt=" " width="798" height="882"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Future Improvements and Deep Dives to Sharpen Your Understanding
&lt;/h3&gt;

&lt;p&gt;First off—if you’ve made it this far, take a moment to acknowledge what you’ve accomplished. You’ve not only provisioned infrastructure at the IaaS level, but also configured firewalls, private/public subnets, secure proxies, reverse routing, and a full-stack deployment—all from scratch. That’s huge.&lt;/p&gt;

&lt;p&gt;But this journey doesn’t end here. There’s so much more to explore, and none of it is out of reach.&lt;/p&gt;

&lt;h4&gt;
  
  
  Add More Services for Realistic Environments
&lt;/h4&gt;

&lt;p&gt;Now that you’ve successfully deployed a frontend, backend, and database, try introducing &lt;strong&gt;additional Linodes&lt;/strong&gt; to simulate multi-service architectures. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add a &lt;strong&gt;Redis node&lt;/strong&gt;, a &lt;strong&gt;message queue&lt;/strong&gt;, or a &lt;strong&gt;monitoring service&lt;/strong&gt; like Prometheus or Grafana.&lt;/li&gt;
&lt;li&gt;Observe how service-to-service communication happens over private IPs.&lt;/li&gt;
&lt;li&gt;Practice managing security and performance as your ecosystem grows.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Learn About Load Balancing
&lt;/h4&gt;

&lt;p&gt;Load balancing is a cornerstone of high-availability systems.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Study how &lt;strong&gt;Apache&lt;/strong&gt; or &lt;strong&gt;Nginx&lt;/strong&gt; can distribute requests across multiple backend servers.&lt;/li&gt;
&lt;li&gt;Try simulating stress or high traffic to watch your load balancing strategy in action.&lt;/li&gt;
&lt;li&gt;Experiment with sticky sessions, round-robin, and IP-hash load balancing techniques.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is how production-scale infrastructure starts.&lt;/p&gt;

&lt;h4&gt;
  
  
  Rebuild the Proxy Setup in Nginx
&lt;/h4&gt;

&lt;p&gt;Everything you’ve configured with Apache can be reimagined in &lt;strong&gt;Nginx&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Learn how to configure reverse proxies and forward proxies in Nginx.&lt;/li&gt;
&lt;li&gt;Explore advanced modules like &lt;code&gt;ngx_http_proxy_connect_module&lt;/code&gt; for forward proxying.&lt;/li&gt;
&lt;li&gt;Compare the verbosity, performance, and control between Apache and Nginx.&lt;/li&gt;
&lt;li&gt;You’ll appreciate Apache more—and gain a new appreciation for Nginx too.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You’re not starting from zero anymore. You now have mental models to guide you.&lt;/p&gt;

&lt;h4&gt;
  
  
  Prepare for Production-Like Workflows
&lt;/h4&gt;

&lt;p&gt;Imagine if this app had users. Or stakeholders. Or deadlines. What would you automate?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Practice triggering deployments from GitHub via &lt;strong&gt;webhooks&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Explore &lt;strong&gt;CI/CD pipelines&lt;/strong&gt; with tools like &lt;strong&gt;GitHub Actions&lt;/strong&gt;, &lt;strong&gt;ArgoCD&lt;/strong&gt;, or &lt;strong&gt;Terraform&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Think about container orchestration and start reading up on &lt;strong&gt;Kubernetes&lt;/strong&gt; or &lt;strong&gt;Nomad&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Look into secrets management, versioned configuration, or observability tooling.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are things you’ll naturally grow into—and now you know where to start.&lt;/p&gt;

&lt;h3&gt;
  
  
  Final Thoughts
&lt;/h3&gt;

&lt;p&gt;This wasn’t just a tutorial. It was a walk down the forgotten path of &lt;strong&gt;technical self-reliance&lt;/strong&gt;—the kind that builds confidence, clarity, and curiosity.&lt;/p&gt;

&lt;p&gt;Yes, modern SaaS and PaaS platforms are convenient—but they abstract away the very systems we’re responsible for. Sometimes, by getting your hands dirty and walking a little closer to the metal, you reclaim something powerful: your understanding.&lt;/p&gt;

&lt;p&gt;So, keep tinkering. Keep asking questions. Keep exploring.&lt;/p&gt;

&lt;p&gt;You don’t need a million-dollar cloud budget to learn this stuff.&lt;br&gt;
You just need a $5 Linode, some grit, and a healthy dose of curiosity.&lt;/p&gt;

&lt;p&gt;You’ve got this.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Building practical workflows: data observability, AI trend analysis &amp; proactivity.</title>
      <dc:creator>JOOJO DONTOH</dc:creator>
      <pubDate>Sun, 29 Jun 2025 10:15:55 +0000</pubDate>
      <link>https://dev.to/joojodontoh/building-practical-workflows-data-observability-ai-trend-analysis-proactivity-3odi</link>
      <guid>https://dev.to/joojodontoh/building-practical-workflows-data-observability-ai-trend-analysis-proactivity-3odi</guid>
      <description>&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;Today I want to talk about data observability and a few related ideas. My team and I have continuously worked on this part of our workflow, and I would like to share some of what we have learned. To understand data, one must first recognize that it is not merely an output, but a reflection of the logic that has been executed across systems. Every user action, system response, or triggered event leaves behind a residue in the form of recorded information. Rather than existing in abstraction, data carries the imprint of the logic that shaped it, effectively turning each row, record, or object into a timestamped decision made by code.&lt;/p&gt;

&lt;p&gt;But working with data isn’t always smooth. Problems often arise when different systems store the same data differently, leading to confusion about which version is correct. Logic applied inconsistently across services creates more gaps. Sometimes data is left without clear ownership, making it hard to maintain. As systems grow, understanding what the data means — and how to work with it — becomes harder. And when teams rely on external tools or manual steps to combine or process data, the risk of mistakes increases.&lt;/p&gt;

&lt;p&gt;These issues highlight why data observability matters. Without it, teams can’t easily tell where problems come from or whether their data can be trusted. Observability gives clarity. It helps teams understand how data flows, where it breaks, and how to fix it before it becomes a bigger issue.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Is Data Observability?
&lt;/h3&gt;

&lt;p&gt;Data observability is the ability to monitor the health and behavior of data as it moves through a system. At its core, it ensures that data is accurate, consistent, and reliable. This means spotting when data is missing, outdated, duplicated, or corrupted — and knowing where and why it happened.&lt;/p&gt;

&lt;p&gt;With strong observability in place, teams can quickly detect issues and trace them back to the root cause. For example, if a report shows incorrect numbers, observability makes it easier to see whether the issue came from a failed data load, a logic error, or a stale source. Instead of guessing, teams can investigate with confidence and resolve problems faster.&lt;/p&gt;

&lt;p&gt;Beyond fixing errors, data observability plays a key role in decision-making. When teams trust the data, they can make faster and more informed choices. From refining product strategies to debugging subscription flows or interpreting performance metrics, good data leads to better outcomes and observability is what makes that trust possible.&lt;/p&gt;

&lt;h3&gt;
  
  
  Understanding Data Flow
&lt;/h3&gt;

&lt;p&gt;To practice data observability effectively, a team must first understand how data flows through their system. This means tracking how data is created, how it changes over time, and where it ends up. Without this awareness, it’s difficult to catch issues or explain unexpected results. &lt;/p&gt;

&lt;p&gt;Every piece of data goes through different &lt;strong&gt;states&lt;/strong&gt;. For example, a subscription might start in a &lt;code&gt;PENDING_ACTIVATION&lt;/code&gt; state, move to &lt;code&gt;ACTIVE&lt;/code&gt;, and eventually become &lt;code&gt;EXPIRED&lt;/code&gt; or &lt;code&gt;CANCELLED&lt;/code&gt;. Each of these states has a meaning tied to business logic. &lt;code&gt;PENDING_ACTIVATION&lt;/code&gt; might signal that a user has initiated a subscription but hasn’t yet activated it. &lt;code&gt;EXPIRED&lt;/code&gt; could mean the subscription ended naturally, while &lt;code&gt;CANCELLED&lt;/code&gt; might indicate a user-initiated termination or a system-triggered rollback.&lt;/p&gt;

&lt;p&gt;It’s also important to define how long data is expected to stay in each state. A record stuck in &lt;code&gt;PENDING_ACTIVATION&lt;/code&gt; for more than 24 hours might be a red flag. Without defined time windows, teams won’t know when data is stale or whether something has failed silently.&lt;/p&gt;
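<p>As a rough sketch (the field names and the 24-hour window here are illustrative, not our production schema), a staleness check over subscription records could look like this:</p>

```python
# Timestamps are epoch milliseconds, matching the change-history format used below.
MAX_PENDING_MS = 24 * 60 * 60 * 1000  # illustrative 24-hour window

def find_stale(records, now_ms, max_age_ms=MAX_PENDING_MS):
    """Return records stuck in PENDING_ACTIVATION longer than the allowed window."""
    return [
        r for r in records
        if r["subscriptionStatus"] == "PENDING_ACTIVATION"
        and now_ms - r["updatedAt"] > max_age_ms
    ]

records = [
    {"subscriptionId": "a", "subscriptionStatus": "PENDING_ACTIVATION", "updatedAt": 0},
    {"subscriptionId": "b", "subscriptionStatus": "ACTIVE", "updatedAt": 0},
]
stale = find_stale(records, now_ms=MAX_PENDING_MS + 1)
```

A check like this only works once the time window per state is explicitly defined; without that definition, "stale" has no meaning.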

&lt;p&gt;Equally critical is understanding the &lt;strong&gt;transition routes&lt;/strong&gt; — how data moves from one state to another. Tracking these transitions creates transparency and accountability. The best way to do this is through &lt;strong&gt;change history&lt;/strong&gt;. A well-structured change history logs not just what changed, but also metadata around the change: who or what triggered it, when it happened, and why. The ideal structure includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Metadata&lt;/strong&gt; (timestamp, source, actor),&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Old Data&lt;/strong&gt; (previous state or value),&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;New Data&lt;/strong&gt; (the updated state or value).
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[
  {
    "metadata": {
      "partnerSubscriptionId": "XXXXXXXXXXXXXX",
      "subscriptionEndDate": 1749200000000,
      "smc": "*********",
      "hhid": "*********",
      "salesChannel": "CHANNEL_X",
      "packId": "generic-pack-id",
      "createdAt": 1746000000000,
      "partner": "PARTNER_X",
      "assetId": "GenericAsset",
      "client": "INTERNAL_SYSTEM",
      "subscriptionId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
      "subscriptionStartDate": 1746000000000,
      "userProductSubscriptionId": "XXXXXXXXXXXXXX"
    },
    "hhid": "XXXXXXXX",
    "operation": "MODIFY",
    "puid": "anonymous",
    "loggedAt": 1749200000000,
    "accountId": "anonymous",
    "subscriptionId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "table": "subscription_table",
    "smc": "XXXXXXXXXXXX",
    "apId": "anonymous",
    "oldData": {
      "subscriptionStatus": "ACTIVE",
      "autoRenew": true,
      "updatedAt": 1746000000000
    },
    "subscriptionStatus": "EXPIRED",
    "assetId": "GenericAsset",
    "partner": "PARTNER_X",
    "SK": "CHANGE#1749200000000",
    "newData": {
      "subscriptionStatus": "EXPIRED",
      "autoRenew": false,
      "updatedAt": 1749200000000
    },
    "id": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "packId": "generic-pack-monthly"
  }
]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdm81mx8rwg9z6m1d2rot.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdm81mx8rwg9z6m1d2rot.JPG" alt="Change history" width="800" height="318"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With this level of tracking, teams gain a clear view into the life cycle of any data point, making root cause analysis, debugging, and auditing significantly easier.&lt;/p&gt;
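<p>One way to make transition routes enforceable is to encode them as an allowed-transition map and scan the change history's <code>oldData</code>/<code>newData</code> pairs against it. The states below match the lifecycle described earlier; the map itself is a hypothetical example:</p>

```python
# Hypothetical allowed-transition map for the subscription states described above.
ALLOWED = {
    "PENDING_ACTIVATION": {"ACTIVE", "CANCELLED"},
    "ACTIVE": {"EXPIRED", "CANCELLED"},
    "EXPIRED": set(),
    "CANCELLED": set(),
}

def invalid_transitions(history):
    """Scan change-history entries for transitions the state machine does not permit."""
    bad = []
    for change in history:
        old = change["oldData"]["subscriptionStatus"]
        new = change["newData"]["subscriptionStatus"]
        if new not in ALLOWED.get(old, set()):
            bad.append((old, new))
    return bad

history = [
    {"oldData": {"subscriptionStatus": "ACTIVE"},
     "newData": {"subscriptionStatus": "EXPIRED"}},   # permitted
    {"oldData": {"subscriptionStatus": "EXPIRED"},
     "newData": {"subscriptionStatus": "ACTIVE"}},    # not permitted
]
```

Running this against a well-structured change history turns "transparency and accountability" into a concrete, automatable audit.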

&lt;h3&gt;
  
  
  Consolidation and Aggregation
&lt;/h3&gt;

&lt;p&gt;Healthy data starts with the ability to see the full picture — not just fragments scattered across systems. In modern architectures, it's common for information about a single entity to live in multiple datastores, maintained by different services. Without aggregation, each piece of data remains incomplete, and insights drawn from them are at best limited, at worst misleading.&lt;/p&gt;

&lt;p&gt;To make data useful, teams must &lt;strong&gt;consolidate it across both internal and external sources&lt;/strong&gt;. This requires a clear understanding of actor profiles, which are, in effect, a logical grouping of all relevant data tied to a single subject, such as a user, account, or device. Without this profile view, systems remain reactive and siloed. Nobody wants silos.&lt;/p&gt;

&lt;p&gt;In our case, aggregation is performed &lt;strong&gt;on demand&lt;/strong&gt; in specific contexts. For example, when we retrieve information tied to a user's smart card, we gather data from several sources at once:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Subscription data&lt;/strong&gt;, stored internally and reflecting the user's current and historical subscriptions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Entitlement data&lt;/strong&gt;, calculated dynamically through CRM logic that applies partner-specific rules, eligibility criteria, and service configurations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partner-related data&lt;/strong&gt;, which may be sourced externally and used to contextualize how the user interacts with third-party services.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2dob2ztun5wuwycwluc2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2dob2ztun5wuwycwluc2.png" alt=" " width="800" height="526"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each of these datasets plays a role in shaping the full state of the user. Without aggregation, teams would have to manually stitch these pieces together, which is too slow and fragile to support modern operations.&lt;/p&gt;
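<p>The on-demand aggregation described above can be sketched as concurrent fetches joined into one actor profile. The fetchers here are stubs standing in for the real subscription store, CRM entitlement logic, and partner source:</p>

```python
import asyncio

# Stub fetchers standing in for real datastore / CRM / partner calls.
async def fetch_subscriptions(smc):
    return [{"packId": "generic-pack-id", "status": "ACTIVE"}]

async def fetch_entitlements(smc):
    return ["ENTITLEMENT_A"]

async def fetch_partner_data(smc):
    return {"partner": "PARTNER_X"}

async def build_actor_profile(smc):
    """Aggregate everything tied to one smart card concurrently, on demand."""
    subs, ents, partner = await asyncio.gather(
        fetch_subscriptions(smc), fetch_entitlements(smc), fetch_partner_data(smc)
    )
    return {"smc": smc, "subscriptions": subs, "entitlements": ents, "partner": partner}

profile = asyncio.run(build_actor_profile("XXXX"))
```

Fetching the three sources concurrently keeps the aggregated view responsive even when one upstream is slow.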

&lt;p&gt;Additionally, &lt;strong&gt;system configurational data&lt;/strong&gt; — feature flags, environment-specific settings, or service-level parameters — must be easy to access and interpret. When teams have quick visibility into the configuration that shaped a data point, debugging becomes faster and business behavior easier to explain. It’s not enough to track the data alone; we must also track the rules that govern how that data behaves.&lt;/p&gt;

&lt;h3&gt;
  
  
  Understanding Data Freshness and Completeness
&lt;/h3&gt;

&lt;p&gt;Healthy data must not only be accurate — it must also be timely and complete. This is especially true in environments where data is sourced from multiple partners and systems. For our team, &lt;strong&gt;freshness&lt;/strong&gt; refers to how recently the data was updated and how reliably it reflects the current state of CRM activity (our source of truth). When working with external integrations, it’s important to recognize that not all systems operate in real time, so decisions must be made about how often data is fetched, transformed, and loaded.&lt;/p&gt;

&lt;p&gt;These decisions directly tie into &lt;strong&gt;ETL pipeline design&lt;/strong&gt;. Striking the right balance between &lt;strong&gt;data consistency and performance&lt;/strong&gt; is key. Trying to always stay perfectly in sync with every upstream system can create unnecessary load or latency. On the other hand, overly infrequent updates can make the data stale and unusable.&lt;/p&gt;

&lt;p&gt;To address this, our ETL pipelines are designed to scale both &lt;strong&gt;logically and operationally&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Extraction&lt;/strong&gt; is often handled asynchronously by standalone services or serverless functions with dedicated resources. Since this stage can be resource-intensive, especially when pulling from partner APIs or scanning large internal datasets, it’s decoupled from real-time workflows. This decoupling is important for maintaining the availability and durability of your real-time workflows. The &lt;strong&gt;frequency of extraction&lt;/strong&gt; is tuned to the freshness requirement of each data source. Some may be pulled hourly, others daily, depending on how critical and volatile the data is.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Transformation&lt;/strong&gt; covers computations ranging from simple mapping to statistical aggregations like totals, averages, and distributions. In our system, partner-specific subscription data is transformed concurrently, using context-aware processing that segments workloads by partner to avoid bottlenecks. Depending on complexity and resource cost, these transformations either happen within the extraction step or are delegated to separate transformation functions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Load&lt;/strong&gt; is the final stage, where the processed data is stored. For most of our needs, a single structured &lt;strong&gt;JSON file stored in S3&lt;/strong&gt; suffices, given that the data is precalculated and intended for read-heavy use cases. To improve performance, we place a &lt;strong&gt;read-through cache&lt;/strong&gt; in front of this storage, allowing downstream consumers to access the latest data quickly. Whenever the load process completes, the cache is also cleared and refreshed, ensuring consistency between stored data and what consumers read.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
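<p>To illustrate the freshness trade-off in the extraction stage, scheduling can be reduced to a per-source interval check. The sources and intervals below are hypothetical stand-ins for real configuration:</p>

```python
from datetime import datetime, timedelta

# Hypothetical per-source freshness requirements, not our real config.
EXTRACTION_INTERVALS = {
    "partner_api": timedelta(hours=1),   # volatile data, pulled hourly
    "internal_scan": timedelta(days=1),  # stable data, pulled daily
}

def is_due(source, last_run, now):
    """Decide whether an extraction job should run again,
    based on that source's freshness requirement."""
    return now - last_run >= EXTRACTION_INTERVALS[source]

now = datetime(2025, 1, 2, 12, 0)
```

Tuning these intervals per source is how the pipeline avoids both unnecessary upstream load and stale data.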

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F38gfl8rip0cp9jsdpcj4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F38gfl8rip0cp9jsdpcj4.png" alt=" " width="800" height="473"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Completeness is another pillar of data health. Often, the focus is on sanitizing data at entry points — validating input, enforcing schemas, ensuring type safety. But &lt;strong&gt;sanitization during retrieval is just as important&lt;/strong&gt;, especially in systems where manual edits, migrations, or external syncs might have bypassed initial validation. We treat both entry and exit as critical points for enforcing data standards, catching missing attributes, and preserving structural integrity.&lt;/p&gt;
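<p>A minimal sketch of sanitization at the retrieval (exit) point, assuming a small illustrative set of required fields:</p>

```python
# Illustrative required attributes; a real schema would be richer.
REQUIRED_FIELDS = {"subscriptionId", "subscriptionStatus", "updatedAt"}

def sanitize_on_read(record):
    """Validate a record at the exit point: reject incomplete data
    instead of passing it silently downstream."""
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"incomplete record, missing: {sorted(missing)}")
    return record
```

Running the same check on read as on write catches records that manual edits, migrations, or external syncs slipped past entry validation.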

&lt;p&gt;Without freshness, data becomes misleading. Without completeness, it becomes fragile. Observability into both helps teams ensure that what they’re seeing is both recent and whole.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating Avenues to View Data Health
&lt;/h3&gt;

&lt;p&gt;Observability is only useful when data health can be inspected both broadly and in context. For a team to react quickly to issues, understand root causes, or maintain confidence in their systems, there must be clear and accessible ways to monitor how data is behaving over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Viewing data health can happen at two levels&lt;/strong&gt;: system-wide or context-specific. A broad system view might highlight trends, such as an increase in failed data loads or a drop in expected event volumes. But often, the most meaningful insights come from zooming into a specific &lt;strong&gt;actor&lt;/strong&gt; or &lt;strong&gt;data entity&lt;/strong&gt; — seeing what happened, when, and why.&lt;/p&gt;

&lt;p&gt;In our systems, this context-based observability takes on many forms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;User subscriptions&lt;/strong&gt; are a core entity we track. Each subscription carries a lifecycle — from activation to renewal to expiration — and understanding the health of this data involves checking if transitions occurred as expected, if timestamps align, and if associated metadata (like auto-renew flags or entitlement links) are correct and intact. If a subscription appears stuck or missing key attributes, it may indicate a broader system issue.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scheduled actions&lt;/strong&gt; are another important context. These are time-driven operations like renewals, cancellations, or retries. To understand their health, we support queries across time windows and statuses — such as identifying actions that were &lt;code&gt;QUEUED&lt;/code&gt; but never &lt;code&gt;EXECUTED&lt;/code&gt;, or that failed unexpectedly. Being able to slice this by partner, product, or status allows teams to quickly isolate patterns and respond.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
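<p>The time-window and status queries described for scheduled actions can be sketched as follows (field names are illustrative):</p>

```python
def stuck_actions(actions, window_end):
    """Find actions that entered the queue before the window end
    but were never executed."""
    return [
        a for a in actions
        if a["status"] == "QUEUED" and a["queuedAt"] <= window_end
    ]

def slice_by(actions, key):
    """Group actions by partner, product, or status to isolate patterns."""
    groups = {}
    for a in actions:
        groups.setdefault(a[key], []).append(a)
    return groups

actions = [
    {"id": 1, "status": "QUEUED", "queuedAt": 100, "partner": "PARTNER_X"},
    {"id": 2, "status": "EXECUTED", "queuedAt": 100, "partner": "PARTNER_X"},
    {"id": 3, "status": "QUEUED", "queuedAt": 500, "partner": "PARTNER_Y"},
]
```

The same two primitives, a window filter and a group-by, cover most of the "what got stuck, and where" questions teams ask of scheduled actions.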

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgprn316nype3sncbd5x9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgprn316nype3sncbd5x9.png" alt=" " width="800" height="107"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq9ofcvnuwiia3zxs90vl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq9ofcvnuwiia3zxs90vl.png" alt=" " width="800" height="177"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fymgpc90i4svfw2tqu1xo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fymgpc90i4svfw2tqu1xo.png" alt=" " width="800" height="183"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Partner events&lt;/strong&gt;, which are signals from external systems, form another layer of contextual health. These events might indicate that a user has activated a service, consumed content, or encountered an error. We monitor if these events are received, verify that they’re parsed accurately, and ensure downstream systems respond as intended. When expected events go missing or arrive malformed, it becomes a signal that something upstream may be broken.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F07zbiddd7egqwcj3cefr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F07zbiddd7egqwcj3cefr.png" alt=" " width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By building these contextual views, teams gain the ability not just to observe data but to &lt;strong&gt;understand&lt;/strong&gt; it. Whether investigating a single issue or analyzing long-term trends, these views into data health are what separate reactive problem-solving from proactive improvement.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating Avenues for Meaningful Data Transition, Extraction, and Storytelling
&lt;/h3&gt;

&lt;p&gt;Raw data, no matter how accurate or complete, becomes significantly more valuable when teams can &lt;strong&gt;visualize, interpret, and communicate&lt;/strong&gt; its meaning. Data storytelling transforms numbers and transitions into narratives that drive understanding, alignment, and action — especially for non-technical stakeholders.&lt;/p&gt;

&lt;p&gt;We start with &lt;strong&gt;visualization&lt;/strong&gt;, which is often the most immediate way to surface meaning. &lt;strong&gt;Charts&lt;/strong&gt; help display trends, distributions, and anomalies in a digestible format — whether it's a spike in subscription failures or a dip in partner event delivery. When paired with &lt;strong&gt;color-coded statuses&lt;/strong&gt;, these visualizations can immediately highlight the state of a dataset or flow — for instance, using green for COMPLETED, yellow for PENDING, and red for FAILED — without requiring users to parse detailed logs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbtc6ki0smqwkv4ly13e2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbtc6ki0smqwkv4ly13e2.png" alt=" " width="800" height="115"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4056dyyrhgrzi2jhu9n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4056dyyrhgrzi2jhu9n.png" alt=" " width="800" height="243"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ahiis1f81fksokg58gm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ahiis1f81fksokg58gm.png" alt=" " width="800" height="229"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Beyond visuals, we invest heavily in &lt;strong&gt;AI-generated summaries&lt;/strong&gt; to bridge the gap between raw data and human decision-making. Our team uses &lt;strong&gt;in-house agents&lt;/strong&gt; to generate summaries at different levels of granularity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For &lt;strong&gt;user actors&lt;/strong&gt;, the agent produces insights such as subscription health, recent failures, entitlement mismatches, or eligibility violations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9o4aj84s8165k3o8ligt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9o4aj84s8165k3o8ligt.png" alt=" " width="800" height="687"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For &lt;strong&gt;partners&lt;/strong&gt;, another agent compiles metrics and patterns into &lt;strong&gt;periodic reports and strategic recommendations&lt;/strong&gt;, covering usage, errors, and integration health.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo89biwx7tppv48yds3p8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo89biwx7tppv48yds3p8.png" alt=" " width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We're actively exploring ways to &lt;strong&gt;enhance these summaries with memory and context&lt;/strong&gt;. One improvement involves converting generated summaries into &lt;strong&gt;embeddings&lt;/strong&gt; using NLP (Natural Language Processing) techniques and storing them as &lt;strong&gt;vectors&lt;/strong&gt;. Then, on the next analysis request, the agent could convert the new prompt into an embedding, retrieve the &lt;strong&gt;five most similar historical summaries&lt;/strong&gt;, and enrich the prompt with this context. This approach helps produce better, more informed summaries that evolve over time.&lt;/p&gt;
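<p>A minimal sketch of that retrieval step, using toy two-dimensional vectors and plain cosine similarity in place of a real embedding model:</p>

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, stored, k=5):
    """Return the k stored summaries most similar to the query embedding."""
    ranked = sorted(stored, key=lambda s: cosine(query_vec, s["vector"]), reverse=True)
    return [s["summary"] for s in ranked[:k]]

# Toy stored summaries with hand-made "embeddings".
stored = [
    {"summary": "renewals spiked", "vector": [1.0, 0.0]},
    {"summary": "errors dropped", "vector": [0.0, 1.0]},
    {"summary": "renewals steady", "vector": [0.9, 0.1]},
]
```

In practice the vectors would come from an embedding model and live in a vector store, but the retrieve-then-enrich loop is exactly this ranking step.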

&lt;p&gt;These generated insights are often used in &lt;strong&gt;non-technical decision-making&lt;/strong&gt;, from partner relationship discussions to strategic roadmap planning. For this reason, we also support &lt;strong&gt;easy export options&lt;/strong&gt; — allowing summaries to be copied directly or downloaded as &lt;strong&gt;PDFs&lt;/strong&gt; for distribution in reports and presentations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5tp0pntr4bhel4bkd2ki.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5tp0pntr4bhel4bkd2ki.png" alt=" " width="800" height="162"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To maintain performance and availability, especially under repeated or automated usage, these agents are backed by &lt;strong&gt;read-through caches&lt;/strong&gt;. This prevents overloading the AI systems, reduces latency for frequent queries, and ensures consistency in outputs for the same context.&lt;/p&gt;
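<p>A read-through cache of this kind can be sketched in a few lines; the real one sits in front of the AI agents rather than an in-memory dict, but the access pattern is the same:</p>

```python
class ReadThroughCache:
    """Minimal read-through cache: serve hits locally, call the loader on a
    miss, and clear on refresh so consumers always see the latest data."""

    def __init__(self, loader):
        self.loader = loader  # fallback called only on a cache miss
        self.store = {}
        self.misses = 0

    def get(self, key):
        if key not in self.store:
            self.misses += 1
            self.store[key] = self.loader(key)
        return self.store[key]

    def invalidate(self):
        """Clear cached entries, e.g. after a load or regeneration completes."""
        self.store.clear()
```

Repeated queries for the same context then hit the cache instead of the model, which is what keeps latency low and outputs consistent.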

&lt;p&gt;Ultimately, storytelling is what allows technical data to influence real-world outcomes. By creating tools that present, explain, and share data meaningfully, we ensure it has the power to inform and guide decisions at every level of the organization.&lt;/p&gt;

&lt;h3&gt;
  
  
  Proactive Data Issue Resolution
&lt;/h3&gt;

&lt;p&gt;While observability helps teams monitor and understand data, the real advantage comes when those insights are used to resolve issues &lt;strong&gt;before&lt;/strong&gt; they escalate. Proactive data resolution means building systems that not only detect anomalies but also guide, automate, or trigger corrective actions across the stack.&lt;/p&gt;

&lt;p&gt;The first step involves &lt;strong&gt;static logical analysis&lt;/strong&gt;, which scans data against clearly defined rules to identify &lt;strong&gt;structured violations&lt;/strong&gt;. These are issues that can be caught with deterministic checks — for example, a subscription marked &lt;code&gt;ACTIVE&lt;/code&gt; but missing a &lt;code&gt;startDate&lt;/code&gt;, or an entitlement with invalid configuration. These checks are currently run on demand during subscription data aggregation. In the future we will automate them to run regularly and help catch data that’s in a broken but detectable state.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fet6b2cwew0r0206ownls.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fet6b2cwew0r0206ownls.png" alt=" " width="800" height="264"&gt;&lt;/a&gt;&lt;/p&gt;
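<p>A sketch of how such deterministic checks might be expressed as a rule table (the rule names and fields are illustrative, not our actual rules):</p>

```python
# Hypothetical deterministic checks; the real rules live in our aggregation service.
RULES = [
    ("active_missing_start", lambda s: s.get("subscriptionStatus") == "ACTIVE"
        and not s.get("startDate")),
    ("expired_autorenew_on", lambda s: s.get("subscriptionStatus") == "EXPIRED"
        and s.get("autoRenew") is True),
]

def structured_violations(subscription):
    """Run every static rule against a record and return the names of those that fire."""
    return [name for name, rule in RULES if rule(subscription)]
```

Keeping the rules in a table like this makes them easy to extend, and easy to schedule once the checks move from on-demand to regular runs.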

&lt;p&gt;More complex problems — especially those involving pattern recognition or inconsistent data relationships — require &lt;strong&gt;AI-driven suggestions&lt;/strong&gt;. These AI agents help identify &lt;strong&gt;unstructured violations&lt;/strong&gt;, such as unexpected spikes in cancellations, or subtle mismatches between entitlements and partner rules. These suggestions are governed by &lt;strong&gt;configurable guardrails&lt;/strong&gt; to ensure they stay within bounds that are understandable and controllable by the team. On the backend, we track &lt;strong&gt;prompt consumption&lt;/strong&gt;, not just for logging and debugging, but to safeguard against misuse, hallucination, or context drift that could compromise model accuracy or security.&lt;/p&gt;

&lt;p&gt;Once a violation is detected, either through a rule or an AI-generated suggestion, resolution must be actionable. That’s why we’ve built &lt;strong&gt;agents that integrate directly with task management tools like Jira&lt;/strong&gt;. When an issue is confirmed, these agents can suggest Jira tickets with full context: the data in question, violation type, metadata, and a recommended fix path. This shortens the cycle between detection and accountability, allowing issues to be tracked and resolved in standard engineering workflows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm1rejong2v19z2ayke8l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm1rejong2v19z2ayke8l.png" alt=" " width="800" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy24e1cldegesxq93xzi6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy24e1cldegesxq93xzi6.png" alt=" " width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another key pillar of proactive resolution is maintaining &lt;strong&gt;synchronization between related datastores&lt;/strong&gt;. In systems where multiple services maintain different views of the same data, desyncs are inevitable. To address this, we intend to use both &lt;strong&gt;manual and automated sync pipelines&lt;/strong&gt;. Some pipelines would run on a schedule to reconcile mismatches, while others can be triggered ad hoc when a drift is manually detected. These processes ensure consistency without requiring constant developer intervention.&lt;/p&gt;
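<p>A reconciliation pass of this sort boils down to a keyed diff between the two stores. This sketch compares only a status field; a real pipeline would compare full records and feed the drifted ids into a sync job:</p>

```python
def find_drift(store_a, store_b):
    """Compare two views of the same records (keyed by id) and report ids
    that are missing on one side or disagree on status."""
    drift = []
    for key in store_a.keys() | store_b.keys():
        a, b = store_a.get(key), store_b.get(key)
        if a is None or b is None or a["status"] != b["status"]:
            drift.append(key)
    return sorted(drift)

store_a = {"s1": {"status": "ACTIVE"}, "s2": {"status": "EXPIRED"}}
store_b = {"s1": {"status": "ACTIVE"}, "s2": {"status": "ACTIVE"},
           "s3": {"status": "ACTIVE"}}
```

The same diff works whether it runs on a schedule or is triggered ad hoc when someone spots a drift.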

&lt;p&gt;Proactivity in data management isn’t just about building alerts — it’s about designing flows that detect, explain, and repair issues at the right level of automation. The result is a system that doesn’t just observe its state, but works to maintain its integrity in real time.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Data Observability Has Helped My Team
&lt;/h3&gt;

&lt;p&gt;Adopting data observability has had a transformative effect on how our team operates. What once required manual digging, cross-referencing, and tribal knowledge can now be done quickly, visually, and with far more confidence. The biggest shift has been in &lt;strong&gt;data surveillance&lt;/strong&gt; — we now have a clear, consistent view of how data moves and behaves throughout our systems.&lt;/p&gt;

&lt;p&gt;One of the most immediate benefits is how &lt;strong&gt;easily team members can understand user data profiles&lt;/strong&gt;. A developer debugging a flow, a QA tester verifying a fix, or a product owner validating a new rule can all inspect the full data picture for any given user without jumping across dashboards or databases. This has made &lt;strong&gt;behavioral patterns more traceable&lt;/strong&gt;, allowing us to detect anomalies like missing activations, frequent subscription failures, or inconsistent entitlement states.&lt;/p&gt;

&lt;p&gt;Data gaps such as missing fields, incomplete transitions, or failed triggers now surface visibly, making them easy to &lt;strong&gt;flag and investigate early&lt;/strong&gt;. This has greatly improved our &lt;strong&gt;QA workflow&lt;/strong&gt;, as testers no longer need to manually reconstruct test cases from fragmented logs. Instead, they can validate entire flows from a central point of visibility. Even during &lt;strong&gt;UAT&lt;/strong&gt;, stakeholders can observe how data responds across environments with a &lt;strong&gt;bird’s eye view&lt;/strong&gt;, reducing ambiguity and speeding up feedback cycles. My team is also in the process of extending dashboard access to first responders and customer service, which will greatly boost their ability to help customers.&lt;/p&gt;

&lt;p&gt;Beyond day-to-day operations, observability has helped with &lt;strong&gt;tracking one-time activities&lt;/strong&gt;, such as &lt;strong&gt;bulk email campaigns&lt;/strong&gt; or &lt;strong&gt;pre-provisioning subscription entitlements&lt;/strong&gt;. These kinds of scheduled jobs are notoriously hard to monitor without proper instrumentation. With observability in place, we can now monitor their execution, volume, and any edge-case failures without writing one-off scripts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs0ltijfpistme3709013.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs0ltijfpistme3709013.png" alt=" " width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, observability has created a direct line of visibility for leadership. &lt;strong&gt;High-level statistics&lt;/strong&gt;, such as total active subscriptions, partner-triggered event rates, or renewal success ratios, are now exposed through curated summaries and dashboards. This allows &lt;strong&gt;management to make decisions based on data&lt;/strong&gt;, without relying on delayed reports or engineering cycles to extract insights.&lt;/p&gt;

&lt;p&gt;In short, observability hasn’t just improved how we handle data — it’s improved how the entire team communicates, collaborates, and aligns around it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Future Improvements in Terms of AI
&lt;/h3&gt;

&lt;p&gt;As our use of data observability matures, we’re looking to expand our AI capabilities with a sharper focus on context-awareness, scalability, and practical integration across teams. One of our main objectives is to maintain a dedicated &lt;strong&gt;Small Language Model (SLM)&lt;/strong&gt; that is trained on internal systems, workflows, and vocabulary. This SLM would act as a lightweight, focused assistant — optimized for our operational context — and &lt;strong&gt;continuously refined&lt;/strong&gt; by internal AI teams to stay aligned with evolving business needs.&lt;/p&gt;

&lt;p&gt;A deeper understanding of &lt;strong&gt;model management&lt;/strong&gt; will be essential. Beyond just deploying models, we’re considering a foundation for version control, prompt governance, feedback loops, and performance evaluation in real-world scenarios. The goal is to ensure that the models we rely on not only produce accurate results but also reflect the nuances of our environment and workflows.&lt;/p&gt;

&lt;p&gt;We also plan to &lt;strong&gt;extend decision-making workflows through AI&lt;/strong&gt;. This could include suggesting data fixes, detecting and prioritizing anomalies, and recommending operational actions based on historical patterns. These automations wouldn’t replace human decisions, but rather &lt;strong&gt;amplify the speed and quality of those decisions&lt;/strong&gt;, especially in high-volume or high-pressure contexts.&lt;/p&gt;

&lt;p&gt;Finally, one of the most exciting frontiers is connecting AI to &lt;strong&gt;team priorities and planning&lt;/strong&gt;. We envision tools that can monitor workstreams, identify friction points, and &lt;strong&gt;suggest roadmap improvements&lt;/strong&gt; based on observed data and recurring pain points. By highlighting areas ripe for automation or surfacing issues that consistently slow down delivery, AI can play a role in shaping strategy, not just supporting it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Data is no longer just an output of system behavior; it is the foundation on which modern decisions, automation, and user experiences are built. As systems scale and complexity grows, it becomes crucial not only to collect data but to observe it meaningfully. &lt;strong&gt;Data observability ensures that data is complete, accurate, and timely&lt;/strong&gt;, enabling teams to debug faster, monitor more effectively, and act with confidence.&lt;/p&gt;

&lt;p&gt;But observability is only one side of the equation. The other is &lt;strong&gt;intelligence&lt;/strong&gt; — and this is where AI comes in. By summarizing, interpreting, and recommending actions based on observed data, AI allows teams to move from passive awareness to proactive resolution and strategic foresight. Whether through summaries tailored to users and partners or through workflow-integrated agents that assist with decision-making, AI transforms observability from a monitoring tool into a driver of improvement.&lt;/p&gt;

&lt;p&gt;Together, data observability and AI form a powerful loop: observability provides the clarity needed to understand the system, while AI brings the intelligence needed to optimize it. The future lies in continuously refining both — building systems that not only see clearly, but think ahead.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Improving Deployment Velocity: How We Rebuilt for Speed and Sustainability</title>
      <dc:creator>JOOJO DONTOH</dc:creator>
      <pubDate>Thu, 22 May 2025 09:53:06 +0000</pubDate>
      <link>https://dev.to/joojodontoh/improving-deployment-velocity-how-we-rebuilt-for-speed-and-sustainability-7on</link>
      <guid>https://dev.to/joojodontoh/improving-deployment-velocity-how-we-rebuilt-for-speed-and-sustainability-7on</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;When we talk about engineering performance, &lt;strong&gt;deployment velocity&lt;/strong&gt; may be one of the clearest indicators of how effectively a team delivers software. At its core, deployment velocity measures how often code changes are pushed to production. It reflects a team's ability to &lt;strong&gt;move fast without breaking things often&lt;/strong&gt;, respond to change, and continuously improve. High deployment velocity means features, fixes, and improvements reach consumers more quickly, which directly benefits product delivery. For engineers, it creates a healthy rhythm of execution and feedback. It reduces the pressure of large, infrequent releases and gives developers a sense of momentum and progress. When velocity is high and sustainable, it usually points to a team that’s well-organized, technically sound, and empowered to ship confidently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tracking What Matters: Deployment as a Reflection of Team Growth
&lt;/h2&gt;

&lt;p&gt;One of the clearest signs of progress we’ve made as a team has been the improvement in our &lt;strong&gt;deployment velocity&lt;/strong&gt;—a reflection not just of speed, but of how well we’ve grown in our ability to plan, execute, and deliver. This success isn’t mine alone; it’s largely the result of a committed, teachable, and resilient team that embraced change and moved with it, and I’m truly grateful to them. From a measurement standpoint, we were in a good position: our team was already using &lt;strong&gt;Jira’s ecosystem effectively&lt;/strong&gt;, with structured &lt;strong&gt;ticket creation&lt;/strong&gt;, &lt;strong&gt;deployment tracking through Bitbucket&lt;/strong&gt;, and &lt;strong&gt;clear release versioning&lt;/strong&gt;. This meant that we had a consistent stream of data about our work, which gave us a solid foundation to assess progress. Having access to this kind of visibility is crucial as it sets the stage not only for identifying what’s going well, but also for spotting where things might need attention. It helps create a culture where improvement isn’t guesswork—it’s informed and intentional.&lt;/p&gt;

&lt;p&gt;To evaluate our delivery progress, I extracted deployment data from Jira’s Deployment Panel and analyzed two distinct 9-month periods: one prior to my joining (October 2023 – July 2024), and one after (August 2024 – May 2025). The analysis focused exclusively on &lt;strong&gt;successful production deployments&lt;/strong&gt;, ensuring that only &lt;strong&gt;one deployment per day&lt;/strong&gt; was acknowledged to avoid overcounting batch releases or automated retries.&lt;/p&gt;

&lt;p&gt;We measured progress by calculating the &lt;strong&gt;average number of production deployments per week&lt;/strong&gt; — a clear, time-normalized metric that reflects delivery cadence. In the 9 months before I joined, there were &lt;strong&gt;4 unique and major production deployments&lt;/strong&gt;, averaging &lt;strong&gt;0.09 deployments per week&lt;/strong&gt;. In the 9 months following my onboarding, that number grew to &lt;strong&gt;28&lt;/strong&gt;, with a corresponding &lt;strong&gt;weekly velocity of 0.67 deployments&lt;/strong&gt;. This represents a &lt;strong&gt;0.58 increase in weekly production deployments&lt;/strong&gt;, or a &lt;strong&gt;600% improvement in deployment velocity&lt;/strong&gt; — a strong signal of enhanced team autonomy, release confidence, and operational maturity.&lt;/p&gt;
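
&lt;p&gt;As a sanity check, the velocity arithmetic can be reproduced in a few lines of JavaScript. Note that computing from the raw counts (4 versus 28 over equal-length windows) yields exactly a 600% improvement, while computing from the rounded per-week rates shifts the figure slightly; the weeks-per-period constant below is an approximation, not the exact calendar week count:&lt;/p&gt;

```javascript
// Back-of-the-envelope velocity math. The weeks figure assumes
// 9 months of roughly 4.33 weeks each; exact calendar weeks differ.
function weeklyVelocity(deployments, weeks) {
  return deployments / weeks;
}

function percentImprovement(before, after) {
  return ((after - before) / before) * 100;
}

const WEEKS_PER_PERIOD = 9 * (52 / 12); // ≈ 39 weeks

const before = weeklyVelocity(4, WEEKS_PER_PERIOD);
const after = weeklyVelocity(28, WEEKS_PER_PERIOD);

// With equal-length periods the percentage reduces to the raw counts:
// (28 - 4) / 4 = 6, i.e. a 600% improvement.
console.log(percentImprovement(before, after).toFixed(0)); // prints "600"
```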

&lt;p&gt;But these numbers don’t exist in a vacuum. They represent a deeper story of teamwork, trust, and continuous learning. They reflect the changes we made together: better processes, clearer workflows, more confident code, and a shared commitment to improving how we deliver. The steep increase in velocity also reflects that we enabled deployments for more services as the team's portfolio expanded. Tracking this wasn’t about proving a point—it was about understanding our pace, staying accountable, and creating space for sustainable growth.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Far4rcgk6i4wlptchp972.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Far4rcgk6i4wlptchp972.png" alt=" " width="630" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Meeting the Team That Made It Possible
&lt;/h2&gt;

&lt;p&gt;When I first joined the team, I walked into a group of individuals who, despite their different levels of experience, were deeply committed to getting things done. My manager was a key pillar—&lt;strong&gt;resourceful and responsive&lt;/strong&gt;, always quick to remove blockers and bridge communication with upper management so I could focus on solving problems. Our &lt;strong&gt;scrum master&lt;/strong&gt; brought structure and consistency, especially in cross-team coordination, which was critical for syncing dependencies and moving work forward. I also had the support of &lt;strong&gt;two highly detail-oriented QA engineers&lt;/strong&gt; who ensured we maintained quality even under tight timelines. Then there were the &lt;strong&gt;engineers—young, talented, and incredibly teachable&lt;/strong&gt;. While some were more senior and confident in their technical abilities, others were still finding their footing, but all of them shared a willingness to learn and improve. A few had an impressive grasp of the product and its edge cases, which was a huge help in my early days—they helped accelerate my understanding of the system far more than any documentation could have. Looking back, I’m reminded that transformation doesn’t start with tools—it starts with people. And I was fortunate to walk into a team that had the right mix of curiosity, humility, and heart.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding the Codebase and the System We Serve
&lt;/h2&gt;

&lt;p&gt;When I first met the codebase, I was stepping into a system built to solve a very specific and critical set of problems—&lt;strong&gt;managing user viewership access across multiple OTT partners&lt;/strong&gt;, syncing that with a central CRM, and surfacing valuable data for the analytics team. At its core, the software ensures that when a user is granted access to a service like Prime or Viu, that entitlement is correctly handled, tracked, and communicated across platforms. The stack was familiar: &lt;strong&gt;JavaScript (Node.js)&lt;/strong&gt; on the backend, &lt;strong&gt;DynamoDB and RDS&lt;/strong&gt; for storage, and a broad use of &lt;strong&gt;AWS services&lt;/strong&gt; to handle deployment and orchestration. What made it more interesting, though, was the fact that I joined during a pivotal architectural transition. The team was &lt;strong&gt;shifting from a fragmented service-per-OTT model to a unified, partner-agnostic architecture&lt;/strong&gt;—something that not only streamlined logic, but allowed for better reusability and maintenance. We were also moving away from long-running &lt;strong&gt;EC2-based services&lt;/strong&gt; toward a &lt;strong&gt;modular, event-driven architecture powered by AWS Lambda&lt;/strong&gt;, which significantly reduced costs and simplified scaling.&lt;/p&gt;

&lt;p&gt;The codebase itself was structured as a &lt;strong&gt;collection of discrete Lambda functions&lt;/strong&gt;, each mapped to specific handlers and responsibilities. Shared logic and utilities were published and reused across functions using &lt;strong&gt;private NPM packages&lt;/strong&gt;, allowing for cleaner separation and less duplication. The entire deployment flow was managed using the &lt;strong&gt;Serverless Framework&lt;/strong&gt;, which abstracted much of the infrastructure creation. Serverless allowed us to define shared AWS resources—&lt;strong&gt;API Gateways, IAM roles, queues, and more&lt;/strong&gt;—and expose them cleanly across services, making infrastructure both declarative and portable. It was clear that the building blocks were there. The challenge now was to refactor and elevate what existed, without disrupting what already worked.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Technical Gaps That Slowed Us Down
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Structure and Duplication&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;One of the first things I noticed was the lack of a clear and robust file structure. It wasn’t always obvious where functionality lived, and in some cases, versioning was misunderstood. New features were simply added as "v2" or "v3" rather than being named appropriately. More critically, logic was &lt;strong&gt;heavily duplicated&lt;/strong&gt; across the codebase. Similar functions existed in multiple places, often slightly tweaked but essentially performing the same task. This made maintenance time-consuming and error-prone—changing one behavior often meant hunting down and editing several versions of the same logic.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Configuration Chaos&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The handling of configurations posed a major challenge. Frequently changing values—such as partner IDs, environment toggles, or feature switches—were &lt;strong&gt;hardcoded directly in the code&lt;/strong&gt;. This led to repeated declarations and multiple sources of truth, making even minor updates feel fragile. Without a centralized config management system, engineers had to manually trace where each variable lived and whether it was safe to change—adding unnecessary complexity to what should’ve been routine work.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Readability and Coupling&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The code itself was often difficult to reason about. Naming conventions lacked consistency, semantics were unclear, and logic wasn’t always placed where you’d expect. This made the onboarding experience slower and raised the cost of every change. On top of that, many components were &lt;strong&gt;tightly coupled&lt;/strong&gt;—meaning a small update in one area could cause unexpected issues elsewhere. Without clear boundaries or separation of concerns, engineers were sometimes forced to write new solutions for problems that had already been solved elsewhere—just because the existing ones weren’t reusable or discoverable.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4. Testing and CI/CD Gaps&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Another big contributor to slow delivery was the &lt;strong&gt;lack of automated testing&lt;/strong&gt;. There were no unit tests or integration tests, so regressions were common. Every change carried risk, and confidence was low. The CI/CD pipeline also wasn’t set up to support iterative development. There was &lt;strong&gt;no continuous delivery flow&lt;/strong&gt;, and previous working features in production were sometimes overwritten by newer, unstable releases. These issues made velocity unpredictable, and it was clear that test coverage and release automation needed to be addressed before we could move faster.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;5. Environment Bottlenecks&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Finally, the absence of a local development and testing environment severely limited parallel work. Engineers were forced to deploy to shared dev or staging environments just to verify basic functionality—often waiting in line to test their code. This not only delayed releases but also introduced friction into everyday development. It was clear that &lt;strong&gt;having a local sandbox&lt;/strong&gt; wasn’t just a convenience—it was a requirement for a healthy, high-velocity engineering workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Operational Gaps That Introduced Bottlenecks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Gaps in Requirement Gathering and Design Planning&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Before I joined, there was no dedicated architect or technical lead guiding the product-engineering process. As a result, &lt;strong&gt;requirement gathering was often skipped or done informally&lt;/strong&gt;. Even after stepping in, shifting this habit took time. In the absence of structured discovery, requirements were sometimes misaligned or incomplete—leading to features being built with &lt;strong&gt;incorrect assumptions&lt;/strong&gt; or missing critical edge cases. Key stakeholders were not always engaged early enough, which meant that essential business details were occasionally left out. Additionally, &lt;strong&gt;non-functional requirements&lt;/strong&gt;—like performance, scalability, and maintainability—were rarely discussed, which impacted architectural decisions. There was little focus on translating requirements into thoughtful &lt;strong&gt;system designs&lt;/strong&gt;, leaving modularity, reusability, and extensibility by the wayside.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Inefficient QA Feedback Loops&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Our testing process also posed a challenge to velocity. Because there was limited automated test coverage and no structured regression suite, &lt;strong&gt;QA engineers had to manually retest large parts of the system—even for small changes&lt;/strong&gt;. This led to longer feedback loops, bottlenecks in the staging environment, and delays in releases. The manual nature of testing also made it difficult to move quickly and safely, especially when features or bug fixes affected shared areas of the codebase. As a result, a lot of time was spent in the validation phase, even for otherwise minor adjustments.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Ambiguous or Incomplete Tickets&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Many Jira tickets lacked &lt;strong&gt;clear acceptance criteria&lt;/strong&gt;, which caused confusion during implementation and validation. Engineers often had to chase down clarifications or interpret the requirements on their own, which led to misalignment and rework. For QA, the absence of well-defined success criteria made it harder to validate whether a feature was complete or working as intended. This ambiguity not only slowed development—it also created uncertainty around what “done” actually meant, which is critical when working in a fast-paced environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cleanup and Restructuring
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Laying the Groundwork&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Before diving into any cleanup or restructuring, I dedicated the &lt;strong&gt;first few weeks&lt;/strong&gt; to simply &lt;strong&gt;understanding the system and the product&lt;/strong&gt;. It was important to take a step back and observe—&lt;strong&gt;not just the code&lt;/strong&gt;, but the &lt;strong&gt;broader domain we were operating in&lt;/strong&gt;, how the existing &lt;strong&gt;architecture was structured&lt;/strong&gt;, and where the boundaries between &lt;strong&gt;what could be changed&lt;/strong&gt; and &lt;strong&gt;what needed to be worked around&lt;/strong&gt; actually lay. This initial period was essential for building context: what the service was meant to do, how different OTT integrations functioned, and where the pain points lived—both technically and operationally. I also took time to align with stakeholders on &lt;strong&gt;current deliverables&lt;/strong&gt; and expectations. One of the first pressing tasks was to &lt;strong&gt;lead the removal of the payment functionality&lt;/strong&gt; from our service. This part of the system was no longer relevant as it had been marked for migration to the CRM—and its presence was adding unnecessary complexity and risk. Taking ownership of that cleanup gave me an early opportunity to untangle a critical path, work closely with the team, and begin setting a standard for the kind of change we were about to make together.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Reshaping the Codebase&lt;/strong&gt;
&lt;/h3&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2.1 Establishing a Consistent Foundation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The first step in cleaning up the codebase was to bring in some &lt;strong&gt;consistency and formatting discipline&lt;/strong&gt;. I introduced &lt;strong&gt;Prettier&lt;/strong&gt; across the entire repository and enforced a standard configuration so all contributors were working from the same baseline. This removed noise from pull requests and made the code easier to read and review. While cosmetic, this change set the tone for a more maintainable codebase and gave us a common starting point as we prepared for deeper structural changes.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2.2 Introducing Safe Refactoring Through Testing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Given how &lt;strong&gt;coupled and fragile&lt;/strong&gt; parts of the system were, it wasn’t safe to dive straight into large refactors. To address this, I &lt;strong&gt;set up a unit testing framework&lt;/strong&gt; and added some &lt;strong&gt;base tests&lt;/strong&gt; as a starting point. I then created &lt;strong&gt;unit test jobs in the CI pipeline&lt;/strong&gt;, and hosted a walkthrough with the team to align on how this would work within our development flow. To encourage meaningful adoption, I added a &lt;strong&gt;coverage enforcement check&lt;/strong&gt; that allowed pull requests to pass only if test coverage increased compared to the current baseline. This ensured that every MR helped improve the safety net, bit by bit.&lt;/p&gt;
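
&lt;p&gt;The coverage-enforcement idea can be sketched as a small ratchet function. This is a minimal illustration, not the team's actual CI script: the function names are made up, and in a real pipeline the current percentage would come from the test runner's coverage summary (for example nyc's &lt;code&gt;coverage-summary.json&lt;/code&gt;):&lt;/p&gt;

```javascript
// Coverage ratchet sketch: a merge request passes only if total
// coverage meets or beats the stored baseline, and a pass moves the
// baseline up so coverage can only grow over time.
function coverageGate(currentPct, baselinePct) {
  const passed = currentPct >= baselinePct;
  return {
    passed,
    // On a pass, record the new (higher or equal) baseline.
    newBaseline: passed ? currentPct : baselinePct,
  };
}

// CI wrapper: fail the job loudly when the gate does not pass.
function runGate(currentPct, baselinePct) {
  const result = coverageGate(currentPct, baselinePct);
  if (!result.passed) {
    throw new Error(
      `Coverage ${currentPct}% is below the baseline ${baselinePct}%`
    );
  }
  return result.newBaseline;
}
```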

&lt;h3&gt;
  
  
  &lt;strong&gt;2.3 Reinforcing Testing Through Example and Guidance&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To avoid wasting engineering effort or creating resistance, I &lt;strong&gt;took the lead in writing initial tests&lt;/strong&gt; for some of the more complex or obscure sections of the code. This helped show what good tests could look like and made it easier for others to follow. I also used &lt;strong&gt;TODO markers within the code&lt;/strong&gt; to flag key functions that needed coverage as they were updated during feature work. Rather than enforcing testing through policy alone, I made a habit of using &lt;strong&gt;code reviews as an opportunity to reinforce quality practices&lt;/strong&gt;—encouraging things like early returns, meaningful naming, modular design, and reusability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1rnhepr6zz5bag7r9a4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1rnhepr6zz5bag7r9a4.png" alt=" " width="800" height="544"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2.4 Cleaning Up Config and Reducing Duplication&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;As work progressed, one persistent pain point was the handling of configuration values. Critical settings were &lt;strong&gt;hardcoded in multiple places&lt;/strong&gt;, leading to duplication and the risk of inconsistency. To solve this, I wrote &lt;strong&gt;utility scripts that centralized config management into a single folder&lt;/strong&gt;, making updates easier and safer. This drastically reduced context-switching for engineers and helped eliminate a common source of friction. Together, these efforts gave the team a cleaner, more predictable development experience—making it easier to deliver with confidence and iterate quickly.&lt;/p&gt;
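
&lt;p&gt;A minimal sketch of what that centralization can look like (the partner names, keys, and accessor are made up for illustration): one module owns every frequently-changed value, and the rest of the code imports from it instead of hardcoding:&lt;/p&gt;

```javascript
// Single source of truth for frequently-changed values. Everything
// here is illustrative; the real folder held the project's own keys.
const config = {
  partners: {
    prime: { id: 'PRIME_PARTNER_ID', enabled: true },
    viu: { id: 'VIU_PARTNER_ID', enabled: true },
  },
  features: {
    autoRenewal: true,
  },
};

// One accessor with a loud failure mode, so a typo'd key is caught
// immediately instead of silently reading undefined.
function getConfig(path) {
  return path.split('.').reduce((node, key) => {
    if (node === undefined || node[key] === undefined) {
      throw new Error(`Unknown config key: ${path}`);
    }
    return node[key];
  }, config);
}

module.exports = { getConfig };
```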

&lt;h2&gt;
  
  
  &lt;strong&gt;3. Integration and Delivery&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3.1 Reworking the Branching Strategy&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When I joined, all deployments were happening directly from the &lt;strong&gt;&lt;code&gt;dev&lt;/code&gt; branch&lt;/strong&gt;, which made it hard to manage stability or separate experimental changes from production-ready features. To restore control, I took the &lt;strong&gt;last known stable release branch&lt;/strong&gt;, merged it into &lt;strong&gt;&lt;code&gt;master&lt;/code&gt;&lt;/strong&gt;, and then &lt;strong&gt;rebased &lt;code&gt;master&lt;/code&gt; onto &lt;code&gt;dev&lt;/code&gt;&lt;/strong&gt; to realign the branches. Going forward, we used &lt;code&gt;dev&lt;/code&gt; as a &lt;strong&gt;long-term integration space&lt;/strong&gt;—a place for ongoing cleanup, experimentation, and quick tests—while &lt;code&gt;master&lt;/code&gt; served as the canonical source for release-ready code. This branching model created clear boundaries between work in progress and what was considered deployable, which was a crucial step toward predictable delivery.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3.2 Enabling Local Development&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;One of the biggest blockers to delivery was the lack of a &lt;strong&gt;local development environment&lt;/strong&gt;. Engineers were forced to &lt;strong&gt;deploy to shared environments just to test their code&lt;/strong&gt;, meaning only one person could realistically test changes at a time. Since the system was built on a &lt;strong&gt;serverless architecture&lt;/strong&gt;, the team hadn’t yet figured out how to simulate AWS Lambda behavior locally. To solve this, I built a &lt;strong&gt;lightweight Express server&lt;/strong&gt; that mimicked the Lambda runtime. I wired up routes to invoke the existing Lambda handlers and moved all environment variables for staging and dev into &lt;strong&gt;gitignored &lt;code&gt;.env&lt;/code&gt; files&lt;/strong&gt;, using &lt;strong&gt;dotenv&lt;/strong&gt; for local support. I also refactored the handlers to support &lt;strong&gt;dual execution&lt;/strong&gt;—as both Lambda functions and Express route handlers. This allowed engineers to run and test features entirely offline. I documented this setup and added &lt;strong&gt;README steps&lt;/strong&gt;, making it easy for anyone to spin it up and start testing immediately.&lt;/p&gt;
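
&lt;p&gt;The dual-execution idea boils down to keeping handlers in a Lambda-style shape and adding a thin adapter for Express. The handler name and event fields below are illustrative, not the actual project code:&lt;/p&gt;

```javascript
// A Lambda-style entry point: takes an event, returns { statusCode, body }.
async function getEntitlementHandler(event) {
  const userId = event.pathParameters.userId;
  return {
    statusCode: 200,
    body: JSON.stringify({ userId, entitled: true }),
  };
}

// Adapter: build a Lambda-style event from an Express request, invoke
// the handler, and write its result back as the HTTP response. This is
// what lets the same function run offline behind a local Express server.
function asExpressRoute(handler) {
  return async (req, res) => {
    const event = { pathParameters: req.params, body: req.body };
    const result = await handler(event);
    res.status(result.statusCode).send(result.body);
  };
}

// In the local dev server this would be wired up roughly as:
//   app.get('/entitlements/:userId', asExpressRoute(getEntitlementHandler));

module.exports = { getEntitlementHandler, asExpressRoute };
```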

&lt;h3&gt;
  
  
  &lt;strong&gt;3.3 Expanding Testing Capacity with an Additional Environment&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;With dev stabilized and local testing unlocked, the next bottleneck was &lt;strong&gt;staging&lt;/strong&gt;. QA typically validated features in this environment, but with multiple releases in play, it often became &lt;strong&gt;a single point of contention&lt;/strong&gt;. To ease the pressure, I created an &lt;strong&gt;additional staging-like environment&lt;/strong&gt; that mirrored the original setup. This provided a second testing lane, allowing the QA team to test features in parallel and helping us reduce wait times during release cycles. It was a simple change with immediate impact—engineers no longer had to wait for the “main” test environment to free up before validating their work.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3.4 Building Integration Testing from the Ground Up&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Integration testing was completely absent when I arrived, which meant QA had to &lt;strong&gt;manually retest wide portions of the system&lt;/strong&gt; for even minor changes. To fix this, I created a &lt;strong&gt;dedicated integration test repository&lt;/strong&gt;. I made the tests compact and easy to run by &lt;strong&gt;embedding encrypted environment variables into the repo&lt;/strong&gt;, so engineers could decrypt and run them out of the box. The test structure mirrored the system’s endpoint layout, making them easy to navigate and extend. To drive adoption, I began pairing integration test tickets with each feature or bug fix ticket, so tests could be written alongside product work. And anytime QA uncovered an issue, we didn’t just fix it—we wrote a test for it. This wasn’t easy to automate initially, but it steadily matured into a reliable system that reduced regression risk and increased deployment confidence across the board.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzc3p5l1bvm1gi5bizksd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzc3p5l1bvm1gi5bizksd.png" alt=" " width="359" height="563"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3.5 Automating the Pipeline and Parallelizing Tests&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To bring all the pieces together, I turned my attention to the &lt;strong&gt;CI/CD pipeline&lt;/strong&gt;, which needed significant cleanup to support the environments and workflows we were building. I streamlined the pipeline configuration to properly reflect &lt;strong&gt;all available environments&lt;/strong&gt; and automated critical stages of the deployment process. I integrated &lt;strong&gt;Jira deployments&lt;/strong&gt;, allowing us to track releases directly from our task board. I also ensured that &lt;strong&gt;unit tests would run on every commit and every new merge request&lt;/strong&gt;, creating faster feedback loops and encouraging engineers to catch issues early.&lt;/p&gt;

&lt;p&gt;As we started integrating end-to-end tests, we noticed a drop in regression issues—but also a &lt;strong&gt;slowdown in build times&lt;/strong&gt;, especially since the system handled similar functionality across multiple partners. To address this, I parallelized the test suite by &lt;strong&gt;running tests separately per partner&lt;/strong&gt;, each in its own job. This was done by checking out the repo in multiple runners, tagging test files with &lt;strong&gt;partner-specific annotations&lt;/strong&gt;, and using &lt;strong&gt;Mocha&lt;/strong&gt; to selectively run the right set of tests for each parallel job. The result was a dramatic reduction in test execution time and an overall &lt;strong&gt;increase in velocity&lt;/strong&gt;.&lt;/p&gt;
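
&lt;p&gt;A rough sketch of the per-partner split (the tag format is illustrative): each test title carries a partner annotation, and every parallel CI job greps for its own tag so Mocha runs only that partner's tests, e.g. &lt;code&gt;mocha --grep "\[prime\]"&lt;/code&gt;:&lt;/p&gt;

```javascript
// Partner tag carried in each test title, e.g. "[prime] renews plan".
function partnerTag(partner) {
  return `[${partner}]`;
}

// Mirror of what Mocha's --grep does for one CI job: keep only the
// test titles annotated for a single partner.
function selectTests(titles, partner) {
  const tag = partnerTag(partner);
  return titles.filter((title) => title.includes(tag));
}
```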

&lt;p&gt;Finally, I added &lt;strong&gt;dedicated pipeline jobs for different environments&lt;/strong&gt;, as well as manual release triggers. This gave the team &lt;strong&gt;an abstracted, automated delivery flow&lt;/strong&gt;, where engineers no longer had to manually intervene or piece together build steps. They simply pushed their code, opened a merge request, and the pipeline took care of the rest—only requiring clicks when human validation or release approvals were necessary.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Workflow
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4.1 Requirements with Architecture in Mind&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A major part of increasing delivery efficiency came from getting &lt;strong&gt;ahead of the work with clear requirements&lt;/strong&gt;. I made it a point to &lt;strong&gt;collaborate closely with stakeholders&lt;/strong&gt; early in the process—aligning on what needed to be built, freezing requirements where possible, and translating them into &lt;strong&gt;system architecture diagrams&lt;/strong&gt;. These diagrams weren’t just for me—they became a visual communication tool to bounce ideas off other engineers and architects, ensuring the design made sense before we wrote a line of code. Once confident, I broke the requirements into &lt;strong&gt;Jira tickets&lt;/strong&gt;, often with &lt;strong&gt;partial implementations or code snippets&lt;/strong&gt; inside to give engineers a head start and illustrate what clean, modular implementation could look like. When necessary, I would even join the implementation directly, which helped &lt;strong&gt;move things faster&lt;/strong&gt; and reduced context-switching across the team.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0ozs2qssaemp41a62pu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0ozs2qssaemp41a62pu.png" alt=" " width="800" height="542"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff49e0fqpckxdjbzinsva.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff49e0fqpckxdjbzinsva.png" alt=" " width="632" height="506"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4.2 Streamlining Workflows with Jira Automation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To reduce time spent on task management and coordination, I introduced &lt;strong&gt;lightweight automations in Jira&lt;/strong&gt; that aligned with how we actually worked. Our flow moved from &lt;strong&gt;TODO → In Progress → Review → QA → Testing → Done&lt;/strong&gt;, and my scrum master configured Jira so that &lt;strong&gt;tickets automatically moved to “Review” and were assigned to me&lt;/strong&gt; when a merge request was opened. This meant engineers could stay focused on the task itself, without having to manually update the ticket status or chase reviewers. It also helped me stay on top of what needed to be reviewed without delay.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4.3 Handling QA Feedback with Structured Ticketing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;During QA testing, we often uncovered bugs or edge cases that weren’t initially accounted for. To manage this smoothly, we established a routine: &lt;strong&gt;categorize the issue&lt;/strong&gt;, assess its impact, and take immediate action. If it was a &lt;strong&gt;functionality break&lt;/strong&gt;, we created a &lt;strong&gt;bug ticket in the current sprint&lt;/strong&gt;. If it was a newly discovered &lt;strong&gt;edge case&lt;/strong&gt;, we’d revisit the requirements, update them if needed, and either add a ticket to the sprint or backlog. For &lt;strong&gt;architectural improvements or design gaps&lt;/strong&gt;, we created &lt;strong&gt;spike issues&lt;/strong&gt; that I usually handled personally. This workflow ensured that feedback loops were tight and transparent—and most importantly, nothing fell through the cracks.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4.4 Evolving into Automated Release Branching&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;As we matured, we moved toward a &lt;strong&gt;release branching strategy&lt;/strong&gt;, but I wanted to validate whether it fit the team’s workflow before enforcing it. So, for about &lt;strong&gt;10–40 releases&lt;/strong&gt;, we did it manually—tracking how the team responded and whether it introduced friction. Once I saw the team was comfortable, I automated the entire release flow using a &lt;strong&gt;small serverless function&lt;/strong&gt;. This script was triggered each time I created a release in Jira and handled the branching logic end-to-end. Automating this step eliminated manual effort, removed the chance of errors, and further &lt;strong&gt;increased velocity by streamlining how we shipped code&lt;/strong&gt;.&lt;/p&gt;
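
&lt;p&gt;As a rough illustration of the idea (not the actual function; the webhook payload shape and the git-host call here are assumptions), the core of such a trigger can be tiny: read the release name from the Jira event, derive a branch name, and ask the hosting API to cut the branch:&lt;/p&gt;

```python
# Hypothetical sketch of a release-branching serverless handler.
# The event shape and `create_branch` callable are illustrative assumptions.

def release_branch_name(version_name: str) -> str:
    """Map a Jira release version like '2.3.0' to a branch name."""
    return f"release/{version_name.strip()}"

def handle_release_created(event: dict, create_branch) -> str:
    """Entry point a serverless runtime could invoke on a Jira
    'version created' webhook. `create_branch` is injected so the
    git-hosting API (GitHub, GitLab, Bitbucket) stays pluggable."""
    version = event["version"]["name"]    # e.g. "2.3.0"
    branch = release_branch_name(version)
    create_branch(branch, source="main")  # cut the branch off the mainline
    return branch
```

&lt;p&gt;Injecting the &lt;code&gt;create_branch&lt;/code&gt; call keeps the branching logic itself trivially testable, which is part of why automating this step could safely remove the chance of manual errors.&lt;/p&gt;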

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyezqyn0ac8sgr90cqhr5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyezqyn0ac8sgr90cqhr5.png" alt=" " width="800" height="84"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk27spfsg4kvaez8ml0mm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk27spfsg4kvaez8ml0mm.png" alt=" " width="661" height="211"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  5 Enhancing the eagle's eye
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;5.1 The Problem: Limited Visibility and High Debugging Cost&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Before we had proper observability, investigating issues in the system was a time-consuming process. Engineers often had to &lt;strong&gt;manually query databases or comb through log streams&lt;/strong&gt; just to gather basic information. There was no easy way to trace a user's history, understand recent changes, or view how a partner integration behaved at a specific point in time. Even something as essential as &lt;strong&gt;subscription change history didn’t exist&lt;/strong&gt;, which made debugging regressions or investigating edge cases particularly difficult. This lack of visibility slowed us down in moments when speed and clarity were most needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;5.2 The Solution: Observability Dashboard and Data Aggregation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To solve this, I built a &lt;strong&gt;custom observability backend and an internal-only dashboard&lt;/strong&gt; that consolidated the most critical system and user data in one place. At a high level, it provided an &lt;strong&gt;overview of all partner-related activity&lt;/strong&gt;, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total subscriptions per partner&lt;/li&gt;
&lt;li&gt;Sales channel distribution&lt;/li&gt;
&lt;li&gt;Monthly subscription trends and breakdowns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It also gave stakeholders powerful tools for &lt;strong&gt;user-level investigation&lt;/strong&gt;. By searching a user, they could instantly access:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identity (minus sensitive info)&lt;/li&gt;
&lt;li&gt;Device and eligibility details&lt;/li&gt;
&lt;li&gt;Subscription status and full change history&lt;/li&gt;
&lt;li&gt;All push notifications triggered by the CRM&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This dramatically &lt;strong&gt;reduced the turnaround time&lt;/strong&gt; for debugging and helped teams get to the root of issues without needing backend support or deep system access.&lt;/p&gt;
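
&lt;p&gt;To make the aggregation concrete, here is a hypothetical sketch (record and field names are invented for illustration) of how raw subscription records could be rolled up into the per-partner totals and monthly breakdowns described above:&lt;/p&gt;

```python
from collections import Counter

def partner_totals(subscriptions):
    """Total subscriptions per partner."""
    return Counter(s["partner"] for s in subscriptions)

def monthly_breakdown(subscriptions, partner):
    """Subscription counts per 'YYYY-MM' month for one partner,
    assuming ISO 'YYYY-MM-DD' creation dates."""
    months = (s["created_at"][:7] for s in subscriptions if s["partner"] == partner)
    return dict(sorted(Counter(months).items()))
```

&lt;p&gt;The real backend did this server-side over the stores involved, but the shape of the answer—counts keyed by partner and by month—is the same.&lt;/p&gt;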

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhz9w14xowmkkrm75hnmk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhz9w14xowmkkrm75hnmk.png" alt=" " width="800" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;5.3 Impact: Faster Resolution and Strategic Insight&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In addition to reducing debugging overhead, the dashboard became a &lt;strong&gt;valuable source of insight for product managers and upper leadership&lt;/strong&gt;. It helped them monitor &lt;strong&gt;subscription growth across partners&lt;/strong&gt;, identify patterns in user behavior, and assess the effectiveness of CRM events and entitlements. The real-time &lt;strong&gt;event tracking view&lt;/strong&gt; made it easier to confirm whether user actions had triggered expected flows—or pinpoint where something had silently failed. What started as a tool for engineering observability quickly became a &lt;strong&gt;shared knowledge surface&lt;/strong&gt; across teams, enabling faster collaboration, smarter decisions, and a stronger sense of control over a complex ecosystem.&lt;/p&gt;

&lt;h2&gt;
  
  
  6 Alerting System
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;6.1 The Challenge: No Central Alerting System&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;At the time I joined, the system had &lt;strong&gt;no unified alerting mechanism&lt;/strong&gt;, and critical issues often went unnoticed until they became user-facing or required manual inspection. There was no structured way to monitor key system failures or event anomalies, which made it difficult to respond quickly in moments that required urgent action. The absence of real-time visibility into failures not only delayed incident resolution but also made the system feel opaque for both engineers and stakeholders.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;6.2 The Solution: Centralized, Reusable Notification System&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To address this, I built a &lt;strong&gt;centralized error logging and alerting mechanism&lt;/strong&gt;, powered by &lt;strong&gt;AWS SNS&lt;/strong&gt;. Critical system errors and high-priority events were published to a single topic with &lt;strong&gt;filtered subscribers&lt;/strong&gt;—allowing me to fan out alerts to various consumers (emails, logs, dashboards, etc.) without duplicating logic or tightly coupling components. This architecture ensured the system remained &lt;strong&gt;modular and reusable&lt;/strong&gt;, enabling new subscribers to plug into alert streams effortlessly. More importantly, it gave key stakeholders real-time visibility into what was happening, so they could respond to incidents faster and with context.&lt;/p&gt;
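
&lt;p&gt;A hedged sketch of the pattern (the attribute names and policies here are illustrative, not our production values): each alert carries message attributes, and each subscriber's filter policy selects only the alerts it cares about. The matcher below is a simplified version of what SNS evaluates server-side:&lt;/p&gt;

```python
def build_alert(severity: str, source: str, message: str) -> dict:
    """Shape of an alert published to the shared topic.
    Attribute names ('severity', 'source') are illustrative."""
    return {
        "Message": message,
        "MessageAttributes": {
            "severity": {"DataType": "String", "StringValue": severity},
            "source": {"DataType": "String", "StringValue": source},
        },
    }

# With boto3 the publish itself would be roughly:
#   boto3.client("sns").publish(TopicArn=TOPIC_ARN, **build_alert(...))

def matches(filter_policy: dict, attributes: dict) -> bool:
    """Simplified filter-policy matching: every key in the policy must
    have the message attribute's value in its allow-list."""
    return all(
        attributes.get(key, {}).get("StringValue") in allowed
        for key, allowed in filter_policy.items()
    )
```

&lt;p&gt;Because the topic is the only coupling point, adding a new consumer is just a new subscription with its own filter policy—no publisher code changes.&lt;/p&gt;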

&lt;h3&gt;
  
  
  &lt;strong&gt;6.3 User-Facing Notification View&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To make alerts even more accessible, I also built a &lt;strong&gt;notifications view directly into the dashboard&lt;/strong&gt;, giving users the option to &lt;strong&gt;opt in or out&lt;/strong&gt; of in-app alerts. This view allowed team members and stakeholders to see critical system activity and messages &lt;strong&gt;without relying solely on email&lt;/strong&gt;, creating a more intuitive and centralized experience. By surfacing this information in a user-friendly way, we gave everyone—engineers, QA, product leads—&lt;strong&gt;a shared awareness of system health&lt;/strong&gt;, directly within the tools they already used day-to-day.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F61zlmyu01w97xy3iu7tg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F61zlmyu01w97xy3iu7tg.png" alt=" " width="440" height="180"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  7 The scheduler
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;7.1 The Problem: Scattered, Rigid Async Logic&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When I joined, there were already some solutions in place for handling &lt;strong&gt;asynchronous activity&lt;/strong&gt;, but they were tightly coupled to specific actions—like &lt;strong&gt;subscription renewals&lt;/strong&gt; or &lt;strong&gt;email notifications&lt;/strong&gt;. While these worked in isolation, they weren’t scalable. If a new async task needed to be introduced—say, for downgrading a subscription or sending reminders—&lt;strong&gt;a brand new solution had to be built from scratch&lt;/strong&gt;. For a middleware team expected to handle a wide range of integrations and business workflows, this wasn’t sustainable. We needed something that could adapt with us.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;7.2 The Solution: Designing a Scalable, Generic Scheduler&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To solve this, I took the initiative to design and implement a &lt;strong&gt;reusable, extensible scheduling service&lt;/strong&gt;—a topic I cover in more detail in another &lt;a href="https://dev.to/joojodontoh/building-a-scalable-reliable-and-cost-effective-event-scheduler-for-asynchronous-jobs-2ac3"&gt;article&lt;/a&gt;. This new &lt;strong&gt;scheduler was built to be action-agnostic&lt;/strong&gt;. Any asynchronous activity could be represented as a scheduled "action" with its own configuration: execution time, repeat logic, and stop condition. It now handles everything from &lt;strong&gt;subscription terminations and downgrades&lt;/strong&gt; to &lt;strong&gt;reminders and notifications&lt;/strong&gt;—all in a single system. On top of that, I integrated it with our dashboard so we could get &lt;strong&gt;hourly visibility&lt;/strong&gt; into scheduled activity, giving us a strong signal on system health and operational progress. This wasn’t just a technical upgrade—it gave us clarity and control.&lt;/p&gt;
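
&lt;p&gt;The full design is in the linked article, but the core idea can be sketched in a few lines (a simplified, illustrative model, not the production code): every action is just a record with an execution time, an optional repeat interval, and a stop condition, and the scheduler's tick is a pure function over those records:&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ScheduledAction:
    name: str                        # e.g. "subscription.downgrade"
    run_at: float                    # next execution time (epoch seconds)
    repeat_every: Optional[float]    # seconds between runs; None = one-shot
    should_stop: Callable[[], bool]  # stop condition, checked each tick

def tick(actions, now):
    """Return the actions due at `now`, rescheduling repeating ones and
    dropping anything whose stop condition is met."""
    due, remaining = [], []
    for a in actions:
        if a.should_stop():
            continue                  # stop condition met: drop the action
        if now >= a.run_at:
            due.append(a)
            if a.repeat_every is not None:
                a.run_at = now + a.repeat_every
                remaining.append(a)   # repeating actions stay alive
        else:
            remaining.append(a)
    return due, remaining
```

&lt;p&gt;Because nothing in this model knows what an action &lt;em&gt;does&lt;/em&gt;, adding a new async flow is just registering a new action name with its fulfillment logic—which is exactly the property that made the real scheduler reusable.&lt;/p&gt;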

&lt;h3&gt;
  
  
  &lt;strong&gt;7.3 The Impact: A Reusable Core That Keeps Improving&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This scheduler became a &lt;strong&gt;central piece of our architecture&lt;/strong&gt;, dramatically reducing the time needed to implement new async flows. Instead of reinventing the wheel, all we needed to do was &lt;strong&gt;schedule a new action&lt;/strong&gt; or &lt;strong&gt;extend the fulfillment logic&lt;/strong&gt;. It became a flexible engine that &lt;strong&gt;directly improved our delivery speed&lt;/strong&gt;, because it removed the need for repetitive, boilerplate async infrastructure. That said, the journey wasn’t without challenges. We faced (and still refine) issues around &lt;strong&gt;load handling&lt;/strong&gt;, &lt;strong&gt;concurrency&lt;/strong&gt;, &lt;strong&gt;rate limiting&lt;/strong&gt;, and &lt;strong&gt;deduplication&lt;/strong&gt;. But the difference now is that we’re improving a single, unified system—not stitching together new ones with every requirement. The scheduler turned a scattered pattern into a strategic capability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiex8dm5no3oj7086h2oz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiex8dm5no3oj7086h2oz.png" alt=" " width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;The Challenges&lt;/strong&gt;
&lt;/h1&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Understanding a Complex System Without a Map&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;One of the toughest parts of stepping into this role was &lt;strong&gt;grasping the entire system end-to-end&lt;/strong&gt;—not because the business model was deeply complex, but because the supporting &lt;strong&gt;documentation was sparse or outdated&lt;/strong&gt;. There were gaps between what the system &lt;em&gt;was supposed&lt;/em&gt; to do and what the code actually did. That disconnect made onboarding harder than it needed to be. I’ve always believed that the best documentation is often the code itself, but that only works when the code is &lt;strong&gt;readable, modular, and semantically meaningful&lt;/strong&gt;. In this case, I was dealing with a codebase that had accumulated &lt;strong&gt;poor naming conventions, logic sprawl, and limited structure&lt;/strong&gt;, which made it feel like I was reverse-engineering behavior instead of working with an intentionally designed system. It took me a while to mentally map how each function connected to a business workflow, and often I had to rely on multiple sources—logs, QA inputs, and even trial-and-error debugging—to fully understand the purpose of certain components. That cognitive overhead slowed me down initially and made early decisions riskier than I liked.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Balancing Leadership, Reviews, and Individual Contribution&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Another major challenge was &lt;strong&gt;balancing technical leadership with hands-on contribution&lt;/strong&gt;. To move the transformation forward at a sustainable pace, I had to go beyond guiding the work—I had to get involved in the work. I love writing code, and during the early stages of cleanup, &lt;strong&gt;I was actively implementing changes, setting up tools, fixing tests, and writing automation&lt;/strong&gt;. But that came with a cost. As a lead, I was also pulled into multiple meetings—syncs with stakeholders, platform discussions, issue triage, and architecture planning. Add to that &lt;strong&gt;the weight of code reviews&lt;/strong&gt;, planning sessions, and mentoring, and it became increasingly difficult to manage my time. While it was rewarding to stay hands-on, it required constant context switching and discipline to ensure I wasn’t bottlenecking others or burning out myself. I had to build boundaries around deep work time and become more intentional about &lt;strong&gt;prioritizing leadership tasks without losing my engineering rhythm&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Driving Cultural Change Through Code Reviews&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Introducing cultural change is never instant—especially when it comes to &lt;strong&gt;engineering discipline and code quality expectations&lt;/strong&gt;. Early on, many of our review sessions turned into &lt;strong&gt;mini workshops&lt;/strong&gt;, where I’d explain why we needed early returns, how to name things clearly, or why separating concerns was critical for reusability. While the team was wonderfully teachable and open-minded, these sessions often &lt;strong&gt;made reviews longer and more involved&lt;/strong&gt;. It wasn’t just about green checks—it was about transferring thinking patterns and reshaping habits. I didn’t want to enforce standards through silence or bureaucracy; I wanted to help the team see the “why” behind each change. Over time, this started to stick—engineers began reflecting those practices in their pull requests, asking better questions, and thinking more critically about structure. But the &lt;strong&gt;emotional and cognitive load of being both a gatekeeper and a teacher&lt;/strong&gt; was something I had to carry consistently, and it’s one of the less visible but most persistent challenges in trying to build a better culture.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion: From Foundation to Flow
&lt;/h1&gt;

&lt;p&gt;Looking back, this journey wasn’t just about speeding up deployments or cleaning up a codebase—it was about building &lt;strong&gt;clarity, culture, and confidence&lt;/strong&gt; into a system and a team that were already doing their best with what they had. When I joined, the signs of potential were everywhere: a team that cared, a product with purpose, and a system that—despite its complexity—had survived real-world pressure. But to go from surviving to thriving, we had to be intentional. We had to understand what was slowing us down, challenge it at its roots, and rebuild with scale and sustainability in mind.&lt;/p&gt;

&lt;p&gt;The improvements didn’t happen overnight. From setting up local development environments and refactoring legacy code, to establishing CI/CD pipelines and writing test coverage policies, every step required focus, patience, and a willingness to collaborate. We untangled infrastructure, designed reusable patterns, centralized configurations, and introduced observability tools that brought transparency to everyone—from engineers to product leads. We shifted away from reactive firefighting to proactive design, and began using data to guide our decisions, track our growth, and prove our value.&lt;/p&gt;

&lt;p&gt;But perhaps the most meaningful transformation wasn’t in the code—it was in the team. We evolved how we work together. Engineers became more confident, more consistent, and more aware of their impact. QA gained tools to test smarter and faster. Stakeholders got visibility into what's really happening. And as for me, I got to witness the kind of change that can only happen when &lt;strong&gt;people trust the process and commit to the long haul&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;There’s still more to do. There always will be. But the foundation is solid now, and the flow has begun. We’re no longer just delivering—we’re delivering &lt;strong&gt;well&lt;/strong&gt;, and that’s the kind of velocity that matters most.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Understanding My Relationship with AI Through the Lens of TAM</title>
      <dc:creator>JOOJO DONTOH</dc:creator>
      <pubDate>Sat, 10 May 2025 06:33:29 +0000</pubDate>
      <link>https://dev.to/joojodontoh/understanding-my-relationship-with-ai-through-the-lens-of-tam-368e</link>
      <guid>https://dev.to/joojodontoh/understanding-my-relationship-with-ai-through-the-lens-of-tam-368e</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Today I want to talk about something a little different. During my master’s degree, I wrote a short paper on technology acceptance. Seeing parts of that thinking show up in how people are adopting AI today led me to write this piece.&lt;/p&gt;

&lt;p&gt;We all have our own way of deciding whether a new piece of technology is worth using. Sometimes it just clicks (PlayStation, anyone?). Other times, we hesitate, question it, or drop it altogether. That’s where models come in—to help explain how and why people adopt new tech.&lt;/p&gt;

&lt;p&gt;There are plenty of these models out there. One of the most well-known is the &lt;strong&gt;Technology Acceptance Model (TAM)&lt;/strong&gt;. Another is the &lt;strong&gt;Innovation Diffusion Theory (IDT)&lt;/strong&gt;. A few years ago, researchers combined them to better understand how people were responding to blockchain technology. Their model, still in the hypothesis stage as of 2017, laid out a bunch of helpful ideas that feel just as relevant today—especially when it comes to AI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhvezq9gr71n8o1dnnhll.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhvezq9gr71n8o1dnnhll.JPG" alt=" " width="800" height="322"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This article doesn’t try to prove the model’s hypotheses, but I’ll use them to explain how certain ideas from the model have shown up in my personal adoption of AI.&lt;/p&gt;

&lt;p&gt;The model looks at how useful and easy a technology feels, how well it fits into our existing habits (compatibility), how much better it is than what came before (relative advantage), and how complex it seems (complexity). Together, these things shape our attitude, our intention, and ultimately whether we actually use the technology.&lt;/p&gt;

&lt;p&gt;Here’s a quick summary of what the model suggests:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If something feels &lt;strong&gt;useful&lt;/strong&gt; and &lt;strong&gt;easy&lt;/strong&gt;, we’re more likely to use it.&lt;/li&gt;
&lt;li&gt;If it &lt;strong&gt;fits our needs&lt;/strong&gt; and seems &lt;strong&gt;better than what we had&lt;/strong&gt;, that boosts our interest.&lt;/li&gt;
&lt;li&gt;If it feels &lt;strong&gt;too complex&lt;/strong&gt;, that can get in the way—even if it’s powerful.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even though the original model focused on blockchain, I’ve found that it reflects a lot of how I’ve come to use AI in my own life and work. So in this article, I want to walk through how these ideas apply to &lt;strong&gt;my relationship with AI&lt;/strong&gt;—what pulled me in, what slowed me down, and what finally made it feel worth using.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Deeper dive&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before diving into how this applies to AI, let’s take a closer look at the model. The &lt;strong&gt;Technology Acceptance Model (TAM)&lt;/strong&gt;—especially when extended with ideas from the Innovation Diffusion Theory—tries to explain why we accept or reject new technologies. It starts with three foundational factors:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Compatibility&lt;/strong&gt;&lt;br&gt;
This refers to how well the technology fits into your existing habits, tools, or lifestyle. If something aligns with how you already think or work, you're far more likely to adopt it. For example, if you're already using cloud tools, switching to a new cloud-based AI service feels natural—it’s compatible.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Relative Advantage&lt;/strong&gt;&lt;br&gt;
This is about whether the new technology clearly offers something &lt;em&gt;better&lt;/em&gt; than what came before. It could be faster, cheaper, more accurate, or just more convenient. If the benefits are obvious and meaningful, people are more motivated to try it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Complexity&lt;/strong&gt;&lt;br&gt;
This looks at how difficult the technology feels to understand or use. The more confusing or overwhelming it is, the less likely people are to embrace it—no matter how powerful it is. Simplicity matters, especially early on.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These three factors shape how we experience the technology in two major ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Perceived Usefulness&lt;/strong&gt; – Does this actually help me get things done better or faster?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Perceived Ease of Use&lt;/strong&gt; – Is this simple enough for me to pick up without frustration?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together, these perceptions form the backbone of our &lt;strong&gt;attitude toward using&lt;/strong&gt; the technology. If that attitude is positive, it leads to &lt;strong&gt;behavioral intention&lt;/strong&gt; (the decision to try or continue using it), which then leads to &lt;strong&gt;actual use&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In short, TAM lays out a kind of emotional and cognitive journey:&lt;br&gt;
&lt;strong&gt;Fit → Benefits → Simplicity → Attitude → Intention → Action.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It’s not just about what the tech can do—it’s about how it feels to use it.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How do these factors apply to me&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Compatibility (How well it fits)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To make sense of how compatibility plays out in my relationship with AI, I’m going to break it down into &lt;strong&gt;entity pairs&lt;/strong&gt;—essentially, how well two things work together. This approach mirrors the basic idea behind compatibility: &lt;em&gt;does this new thing align with what already exists?&lt;/em&gt; So I’ll be looking at how AI fits with &lt;strong&gt;me as a person&lt;/strong&gt;, how it fits with the &lt;strong&gt;tools and devices I use&lt;/strong&gt;, and how I, in turn, relate to those tools. These pairings help unpack not just whether AI &lt;em&gt;works&lt;/em&gt;, but whether it &lt;em&gt;fits&lt;/em&gt;—with my mindset, my habits, and my environment.&lt;/p&gt;

&lt;p&gt;First, there’s no inner conflict between me and the concept of AI. I’ve been in software engineering for years, so the idea of feeding structured/semi-structured data into a system and getting meaningful output doesn’t feel foreign—it feels normal. In that sense, AI and I are a good match. &lt;br&gt;
Then there’s the compatibility between AI and the devices I rely on. Whether it’s my phone, MacBook, iPad, or smart speakers, AI already integrates with many of them, often invisibly, through personal assistants and smart apps. &lt;br&gt;
Finally, there’s the link between me and those devices themselves. I spend a lot of time working with technology and virtual assistants, which makes the transition to AI-enhanced workflows almost frictionless. In short, AI fits well into both my mindset and my digital environment—it didn’t need to force its way in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Relative Advantage (How it’s better than before)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Relative advantage&lt;/strong&gt; is all about whether AI gives me a &lt;strong&gt;noticeable improvement&lt;/strong&gt; over the way I used to work, think, or solve problems. It’s not just about what AI can do—it’s about what it can do &lt;strong&gt;better&lt;/strong&gt;, faster, or more intelligently compared to my previous tools or workflows. In this section, I’ll explore this idea by looking at how AI has helped me in &lt;strong&gt;both professional and personal contexts&lt;/strong&gt;, using a few practical examples to show where the advantage really shows up.&lt;/p&gt;

&lt;p&gt;At work, AI has become more than a novelty; tbh it’s a &lt;strong&gt;strategic companion&lt;/strong&gt;. I use it to break down complex technical problems into more manageable parts. It gives me access to a curated knowledge base that spans domains and industries—something no single search or documentation site can offer. When I'm stuck on a design decision, I can use AI as a sounding board to validate assumptions or explore alternatives. It has helped me &lt;strong&gt;spot edge cases&lt;/strong&gt; I hadn’t considered in architectural discussions, and has become a go-to tool for &lt;strong&gt;code generation&lt;/strong&gt;, &lt;strong&gt;implementation strategies&lt;/strong&gt;, and even &lt;strong&gt;migration planning&lt;/strong&gt;. In moments when I need to move fast without losing depth, AI offers clarity and speed.&lt;/p&gt;

&lt;p&gt;Outside of work, the relative advantage shows up in &lt;strong&gt;small but meaningful ways&lt;/strong&gt;. I often use AI to answer quick questions or explore unfamiliar topics—replacing the old process of jumping through multiple tabs or forums. It helps me refine emails (honestly this is a big one), especially when I want to strike the right tone or simplify a message. And sometimes, when I need dinner ideas, it becomes a personalized recipe engine that saves me time and mental energy. My current banana cake recipe is AI generated.&lt;/p&gt;

&lt;p&gt;In all of these cases, AI doesn’t just replicate what I used to do—it &lt;strong&gt;elevates it&lt;/strong&gt;, and I think this is one of the most important bits: enhancement, not replacement. That’s what makes it relatively advantageous: it extends my capability, accelerates my thinking, and fills in knowledge gaps I didn’t even know I had.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Complexity (How hard it is to understand or use)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AI is, by nature, incredibly complex. Behind the scenes, there are &lt;strong&gt;machine learning models&lt;/strong&gt;, &lt;strong&gt;massive datasets&lt;/strong&gt;, and layers of statistical reasoning that all work together to produce something that feels intelligent. It involves everything from neural networks to natural language processing, and these systems don’t just respond—they learn, adapt, and generate context-aware answers. When you think about the engineering and math that goes into building these systems, it’s no surprise that AI can seem intimidating.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwuvqc3y1u6xwdbul6hnz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwuvqc3y1u6xwdbul6hnz.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But here’s the thing: all that complexity has been &lt;strong&gt;beautifully abstracted&lt;/strong&gt;. Though I have a fair understanding of how things like transformer models, embeddings, vectors, or search algorithms work under the hood, I really appreciate that the apps and platforms I use don’t require me to understand how embeddings represent meaning in high-dimensional space. Instead, I interact with AI through &lt;strong&gt;simple input boxes&lt;/strong&gt;, I get help building &lt;strong&gt;context naturally&lt;/strong&gt; through prompts and clarifications, and I receive &lt;strong&gt;surprisingly meaningful outputs&lt;/strong&gt;—whether I’m asking for architectural advice or rewriting a sentence. Pure gold.&lt;/p&gt;

&lt;p&gt;This abstraction removes the complexity from my day-to-day experience almost entirely. As a user, I don’t feel overwhelmed or buried in technical detail. I just use it—and it works. In many ways, it’s like driving a high-performance car without needing to know how the engine is tuned.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;All roads lead to actual use&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When I reflect on why AI has become part of my life, it’s because of how well it fits for certain tasks, how much better it makes things, and how effortlessly I’m able to interact with it. AI feels like a seamless extension of the tools I already use. I don’t have to jump through hoops to access it or figure out how it works. The input-prompt model mirrors how I’ve always used search engines, which makes it feel familiar and keeps the barrier to adoption low for me. Beyond that, AI actually helps me get where I want to go. In my work, it allows me to reach my goals efficiently and effectively. Outside of work, it’s just as valuable—whether I’m refining an email, learning something new, or figuring out what to cook. The tools I use abstract away all the complexity happening under the hood, so I never feel intimidated or overwhelmed. It doesn’t feel like I’m using advanced machine learning models; it feels like I’m just having a conversation. And thanks to my small background in machine learning, I have an idea of how complex these systems really are, which only deepens my appreciation for how simple they’ve made the experience.&lt;/p&gt;

&lt;p&gt;All of that naturally shapes how I feel about using AI. I have a positive attitude toward it because my experience has consistently been smooth, helpful, and efficient. That attitude makes me more likely to keep using it, and more open to exploring new ways it could support my work and life. It’s not a tool I’m testing anymore—it’s a tool I &lt;em&gt;intend&lt;/em&gt; to use safely and regularly. And that intention shows up in my behavior. The line between trying it out and actually using it has quietly disappeared. Heck I pay for it 😂&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Spoof, the bad and the fugly&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The journey through the TAM model hasn’t always been smooth. One thing that’s easy to overlook is that TAM isn’t a one-time path you walk through and complete. It’s not a straight line from "this is useful" to "I use it forever." Instead, it’s &lt;strong&gt;dynamic&lt;/strong&gt;—your perceptions can shift, and in some cases, the very things that once made a tool feel useful and easy can start to erode. Over time, the &lt;strong&gt;good aspects need to be reinforced&lt;/strong&gt;, or else the &lt;strong&gt;bad ones begin to grow&lt;/strong&gt;, especially as the product evolves—or fails to.&lt;/p&gt;

&lt;p&gt;In my experience with AI, there have been clear breaches in that positive flow. I have to admit, there are moments when the system just doesn’t deliver. During research, I’ve encountered &lt;strong&gt;hallucinations&lt;/strong&gt;—responses that sound confident but are completely made up. When I push deeper with more nuanced or technical questions, I start to see the &lt;strong&gt;limits in reasoning&lt;/strong&gt; or coherence. Sometimes, AI forgets the context I’ve already provided, or applies it incorrectly, forcing me to rephrase, reframe, or start again. And in those moments, my perception of ease and usefulness takes a hit. In moments like these, it feels more brittle than smooth.&lt;/p&gt;

&lt;p&gt;When that happens, I find myself taking a step back. I double-check everything. I slow down. I cross-reference with my own knowledge or other sources. That extra cognitive load—the need to &lt;strong&gt;proofread, verify, and interpret&lt;/strong&gt;—interrupts the simplicity I had come to expect. Thankfully, things have improved over time. The models have gotten better, and so have the interfaces that help manage these limitations. But those setbacks still remind me that &lt;strong&gt;usefulness is fragile&lt;/strong&gt; and &lt;strong&gt;ease of use is conditional&lt;/strong&gt;. Those are two of the key things to note in my experience with this model.&lt;/p&gt;

&lt;p&gt;There’s also something more personal I’ve noticed. I’ve come to believe that keeping a &lt;strong&gt;healthy distance from AI&lt;/strong&gt; is important—not just for accuracy, but for maintaining my own cognitive edge. It’s easy to slip into a mode of over-reliance, especially when a tool is so capable and responsive. But leaning on AI for things I already know—or could figure out with a bit of effort—can dull that edge over time. I think there’s value in doing things the “manual” way sometimes. Recalling knowledge. Struggling a bit. Solving things on my own. That’s not a rejection of AI; it’s a reminder that &lt;strong&gt;human ability still matters&lt;/strong&gt;, and that AI is a tool—not a crutch.&lt;/p&gt;

&lt;p&gt;So yes, the TAM can work in reverse. If a system starts to feel less reliable, or too controlling, or too limiting, it can cause attitudes to shift. Intention drops. Actual use fades. And unless those gaps are acknowledged and improved, even the most impressive tech can lose its place in your workflow. For me, staying conscious of these cracks helps me use AI wisely—not blindly. It keeps the relationship healthy.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;OK, the Technology Acceptance Model (TAM) matters. But why should we care?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Startups and Product Design&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For startups, TAM isn’t just a theoretical model—it’s a tactical advantage. In the early stages of building a product, every decision counts. TAM provides a structured way to think about how real users will perceive and interact with the product. Startups can intentionally design their tools to be more &lt;strong&gt;compatible&lt;/strong&gt; with users’ existing habits and workflows, reducing the barrier to entry. They can also focus on delivering a &lt;strong&gt;clear relative advantage&lt;/strong&gt; over existing solutions—whether it's faster execution, better insights, or increased convenience. At the same time, simplifying onboarding and usage helps reduce &lt;strong&gt;perceived complexity&lt;/strong&gt;, allowing users to quickly experience value without feeling overwhelmed.&lt;/p&gt;

&lt;p&gt;By keeping these factors in mind—compatibility, advantage, and simplicity—startups are better positioned to shape user perceptions of usefulness and ease. These are critical to early adoption. Additionally, TAM helps startups test whether they’re truly solving a meaningful problem in a way that users will embrace. If users don’t find it useful or easy to adopt, that’s a signal for iteration. TAM trends can also be used to forecast whether the product will gain traction in a competitive market or fall into the trap of being too niche, too hard to use, or just not compelling enough.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Understanding Technology Adoption Over Time&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;TAM, along with the models that have extended or evolved from it, offers a window into how people have historically accepted (or resisted) new technology. What makes TAM especially interesting is that it doesn't isolate adoption to just features or technical specs—it looks at &lt;strong&gt;human perception&lt;/strong&gt;, which is often influenced by much more than just functionality. Over time, researchers have built on TAM to incorporate new dimensions like trust, user experience, hype, and social influence.&lt;/p&gt;

&lt;p&gt;These models reveal a lot about how &lt;strong&gt;technology acceptance isn’t always driven by individual logic&lt;/strong&gt;. In many cases, people use new tools because they’re &lt;strong&gt;socially pressured&lt;/strong&gt; to, or because someone they trust has validated the tool. This is especially true in workplace environments or tight-knit communities. Influencers, managers, or even teammates can significantly affect whether a tool is picked up or pushed aside. TAM captures this broader narrative—that &lt;strong&gt;adoption is both individual and collective&lt;/strong&gt;, and shaped by shifting social, technical, and emotional factors. Understanding these patterns helps us predict not only &lt;em&gt;if&lt;/em&gt; a tool will be adopted, but &lt;em&gt;why&lt;/em&gt; and &lt;em&gt;under what conditions&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Enterprise Software Adoption&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In larger organizations, where buying and rolling out software affects hundreds or thousands of users, TAM becomes a critical decision-making tool. It allows companies to think beyond cost and features, and instead ask: &lt;em&gt;Will our employees actually use this?&lt;/em&gt; Companies can apply TAM to assess internal attitudes toward new tools—either through surveys, small pilots, or feedback loops—and use that insight to design better onboarding, training, and internal communication strategies. For example, if employees perceive a tool as too complex or irrelevant, adoption rates will be low, no matter how powerful the tool is on paper. A clear example of this for me was the timesheet software at one of my earlier companies.&lt;/p&gt;

&lt;p&gt;To counteract this, some companies introduce &lt;strong&gt;gamification&lt;/strong&gt;—reward systems, challenges, or social features—to increase engagement and shift employee attitudes toward the tool. Over time, these tactics can improve perceived ease of use and usefulness, which in turn boosts intention and actual use. These days, businesses don’t have to start from scratch. They can look at how other companies in their industry have applied TAM principles to software adoption, learning from their successes or failures. In this way, TAM becomes not just a research tool, but a &lt;strong&gt;practical playbook for driving internal tech adoption&lt;/strong&gt; at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;My journey with AI has mirrored many of the ideas in the Technology Acceptance Model—how something feels useful, how easy it is to use, how well it fits into my life, and how much better it makes things compared to what came before. But just like any relationship with technology, it hasn’t been perfect. There have been moments of friction, doubt, and necessary distance. What this model reminds me is that adoption isn’t a single decision—it’s a continuous process. As AI continues to evolve, so will my experience with it. And by staying aware of what makes it valuable—and where it falls short—I can keep using it in a way that supports me, not replaces me.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Building a Scalable, Reliable, and Cost-Effective Event Scheduler for Asynchronous Jobs</title>
      <dc:creator>JOOJO DONTOH</dc:creator>
      <pubDate>Mon, 20 Jan 2025 09:32:04 +0000</pubDate>
      <link>https://dev.to/joojodontoh/building-a-scalable-reliable-and-cost-effective-event-scheduler-for-asynchronous-jobs-2ac3</link>
      <guid>https://dev.to/joojodontoh/building-a-scalable-reliable-and-cost-effective-event-scheduler-for-asynchronous-jobs-2ac3</guid>
      <description>&lt;h3&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Welcome back to my blog! 😁 This is where I talk to myself—and hopefully, to you—about the engineering problems I solve at work. I do this mainly because finding solutions excites me. My journey of identifying inefficiencies, bottlenecks, and challenges has led me to tackle a common yet critical problem in software engineering.&lt;/p&gt;

&lt;p&gt;That problem is the need to execute actions asynchronously—often with precise timing and sometimes on a recurring basis. Following my core approach to problem-solving (across space and time), I decided to build a solution that wasn’t just tailored to a single action but was extendable to various use cases. Whether it's sending notifications, processing transactions, or triggering system workflows, many tasks require scheduled execution. Without a robust scheduling mechanism, handling these jobs efficiently can quickly become complex, unreliable, and costly.&lt;/p&gt;

&lt;p&gt;To address this, I set out to build a &lt;strong&gt;scalable, reliable, and cost-effective event scheduler&lt;/strong&gt;—one that could manage delayed, immediate and recurring actions seamlessly.  &lt;/p&gt;

&lt;p&gt;In this article, I’ll walk you through:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The problem that led to the need for an event scheduler
&lt;/li&gt;
&lt;li&gt;The functional and non-functional requirements for an ideal solution
&lt;/li&gt;
&lt;li&gt;The system design and architecture decisions behind the implementation
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By the end, you’ll have a clear understanding of how to build a &lt;strong&gt;serverless scheduled actions system&lt;/strong&gt; that ensures &lt;strong&gt;accuracy, durability, and scalability&lt;/strong&gt; while keeping costs in check. Let’s dive in!  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.giphy.com%2Fmedia%2FD8wcIurZrew5drwWdl%2Fgiphy.gif%3Fcid%3D790b76110iuxn9y1efgjo6yhbi0d7hazjc2dsez1ufa0i6ju%26ep%3Dv1_gifs_search%26rid%3Dgiphy.gif%26ct%3Dg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.giphy.com%2Fmedia%2FD8wcIurZrew5drwWdl%2Fgiphy.gif%3Fcid%3D790b76110iuxn9y1efgjo6yhbi0d7hazjc2dsez1ufa0i6ju%26ep%3Dv1_gifs_search%26rid%3Dgiphy.gif%26ct%3Dg" alt="GIF" width="480" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Problem: Managing Subscription Changes Across a Calendar Cycle&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Subscription management comes with unique challenges, especially when handling cancellations or downgrades 😭. Users can request these changes at any time during their billing cycle, but due to the &lt;strong&gt;prepaid nature of subscriptions&lt;/strong&gt;, such modifications can only take effect at the &lt;strong&gt;end of the cycle&lt;/strong&gt;. This delay introduces a need for &lt;strong&gt;asynchronous execution&lt;/strong&gt;—a system that can record these requests immediately but defer their execution until the appropriate time.  &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Solution: A proper scheduling mechanism&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Without a proper scheduling mechanism, managing these deferred actions efficiently becomes complex. The system must ensure that every request is executed &lt;strong&gt;at the right time&lt;/strong&gt; while preventing missed or duplicate actions. Furthermore, frequent executions—such as batch processing of multiple scheduled changes—must be handled without overwhelming the system. To address this, we needed a &lt;strong&gt;reliable, scalable, and cost-effective scheduler&lt;/strong&gt; capable of handling &lt;strong&gt;delayed and recurring execution&lt;/strong&gt; seamlessly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu950h0gavjrck1gcmtgm.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu950h0gavjrck1gcmtgm.jpeg" alt=" " width="800" height="392"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Functional Requirements: Defining the Core Capabilities
&lt;/h3&gt;

&lt;p&gt;A robust and scalable scheduled actions system must be able to efficiently schedule, execute, update, monitor, and retry actions while ensuring reliability and flexibility.  &lt;/p&gt;

&lt;h3&gt;
  
  
  1. Scheduling and Creating Actions
&lt;/h3&gt;

&lt;p&gt;The system must allow users to schedule actions with:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Action type, execution time, execution data, and metadata as required fields.
&lt;/li&gt;
&lt;li&gt;Optional fields like repeat, frequency, and execution remainder.
&lt;/li&gt;
&lt;li&gt;Early validation to ensure actions conform to their fulfillment requirements.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Updating or Deleting an Action
&lt;/h3&gt;

&lt;p&gt;Users can update or delete an action before it is locked (2 minutes before execution). Once locked, no external changes are allowed.  &lt;/p&gt;
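&lt;p&gt;A minimal sketch of how that lock window could be enforced—the constant and function names here are illustrative, not taken from the actual codebase:&lt;/p&gt;

```typescript
// Hypothetical guard for the lock window described above: an action
// becomes immutable once execution is less than 2 minutes away.
const LOCK_WINDOW_MS = 2 * 60 * 1000;

function isLocked(executionTime: number, now: number = Date.now()): boolean {
  return executionTime - now <= LOCK_WINDOW_MS;
}

// Called before any update or delete request is honoured.
function assertMutable(executionTime: number, now: number = Date.now()): void {
  if (isLocked(executionTime, now)) {
    throw new Error("Action is locked: under 2 minutes to execution");
  }
}
```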

&lt;h3&gt;
  
  
  3. Action Status Management
&lt;/h3&gt;

&lt;p&gt;Each action must have an internally managed status that reflects its execution progress. Status transitions and results must be logged in metadata for tracking.  &lt;/p&gt;

&lt;h3&gt;
  
  
  4. Action Fulfillment Mapping
&lt;/h3&gt;

&lt;p&gt;Every action must map to a specific fulfillment service responsible for its execution. Actions without a matching fulfillment service must be flagged to prevent execution errors.  &lt;/p&gt;
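&lt;p&gt;One way to picture that mapping, as a sketch: &lt;code&gt;SEND_NOTIFICATION&lt;/code&gt; appears in the example payload later in this article, while &lt;code&gt;CANCEL_SUBSCRIPTION&lt;/code&gt; is an assumed action type added purely for illustration.&lt;/p&gt;

```typescript
// Illustrative action-to-fulfillment mapping. Unmapped actions are
// flagged (thrown) instead of being executed.
type Fulfillment = (data: Record<string, unknown>) => Promise<void>;

const fulfillmentMap: Record<string, Fulfillment> = {
  SEND_NOTIFICATION: async (_data) => {
    // would call the notification service here
  },
  CANCEL_SUBSCRIPTION: async (_data) => {
    // would call the billing/subscription service here
  },
};

function resolveFulfillment(action: string): Fulfillment {
  const handler = fulfillmentMap[action];
  if (!handler) {
    throw new Error(`No fulfillment service mapped for action: ${action}`);
  }
  return handler;
}
```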

&lt;h3&gt;
  
  
  5. Retrying Failed Actions
&lt;/h3&gt;

&lt;p&gt;Failed actions must retry using exponential backoff to handle temporary failures. Actions that exceed the maximum retry limit must be flagged for manual intervention.  &lt;/p&gt;
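&lt;p&gt;One way to realise that retry policy, sketched with illustrative constants (the real base delay, cap, and retry limit would be configuration):&lt;/p&gt;

```typescript
// Exponential backoff with a cap. Returning null signals that the
// retry limit is exhausted and the action should be flagged for
// manual intervention rather than retried again.
const BASE_DELAY_MS = 30_000;    // first retry after 30 seconds
const MAX_DELAY_MS = 3_600_000;  // never wait longer than 1 hour
const MAX_RETRIES = 5;

function nextRetryDelayMs(retryCount: number): number | null {
  if (retryCount >= MAX_RETRIES) return null; // exceeded: flag, don't retry
  return Math.min(BASE_DELAY_MS * 2 ** retryCount, MAX_DELAY_MS);
}
```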

&lt;h3&gt;
  
  
  6. Handling Immediate vs. Delayed Actions vs. Repeated Actions
&lt;/h3&gt;

&lt;p&gt;The system must distinguish between immediate and delayed actions to ensure timely execution:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Immediate actions (execution within 2 minutes) must be processed in real-time without scheduling delays.
&lt;/li&gt;
&lt;li&gt;Delayed actions (execution after 2 minutes) must be scheduled and processed at the correct time.&lt;/li&gt;
&lt;li&gt;Repeated actions must be processed the required number of times at the required frequency&lt;/li&gt;
&lt;/ul&gt;
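&lt;p&gt;The dispatch decision above can be sketched as a single predicate on the 2-minute threshold (names are illustrative):&lt;/p&gt;

```typescript
// Anything due within 2 minutes is processed in real time;
// everything else goes through the scheduler.
const IMMEDIATE_THRESHOLD_MS = 2 * 60 * 1000;

type Dispatch = "IMMEDIATE" | "DELAYED";

function classifyAction(executionTime: number, now: number = Date.now()): Dispatch {
  return executionTime - now <= IMMEDIATE_THRESHOLD_MS ? "IMMEDIATE" : "DELAYED";
}
```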




&lt;h3&gt;
  
  
  &lt;strong&gt;Non-Functional Requirements (NFR): Ensuring a Reliable and Scalable System&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A scheduled actions system must meet key &lt;strong&gt;NFRs&lt;/strong&gt; to guarantee reliability, scalability, security, and maintainability.  &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Reliability and Durability&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Actions must execute &lt;strong&gt;correctly and on time&lt;/strong&gt; (±2 minutes).
&lt;/li&gt;
&lt;li&gt;Repeating actions must execute &lt;strong&gt;exactly as scheduled&lt;/strong&gt; with the correct frequency.
&lt;/li&gt;
&lt;li&gt;Failed actions must &lt;strong&gt;retry with exponential backoff&lt;/strong&gt;, and non-repeating actions must execute &lt;strong&gt;only once&lt;/strong&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Scalability&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The system must &lt;strong&gt;scale dynamically&lt;/strong&gt; to handle high request loads.
&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;serverless architecture&lt;/strong&gt; ensures cost-efficiency and flexibility.
&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;queue-based approach&lt;/strong&gt; (e.g., AWS SQS) must regulate execution frequency to prevent overloading downstream services.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Availability&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The system must be &lt;strong&gt;always available&lt;/strong&gt; with &lt;strong&gt;no cold starts&lt;/strong&gt;, ensuring immediate execution when needed. A serverless architecture supports this at a reasonable cost.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4. Security&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Signature-based validation&lt;/strong&gt; must secure requests and prevent unauthorized execution.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;5. Maintainability&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The system must be &lt;strong&gt;modular, encapsulated, and organized&lt;/strong&gt; within a &lt;strong&gt;single repository&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure and database indexing rules&lt;/strong&gt; must be codified.
&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;typed language&lt;/strong&gt; must be used for better reliability.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local testing&lt;/strong&gt; must be enabled with encrypted environment variables.
&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;startup script&lt;/strong&gt; can automate package installation and environment setup.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comprehensive tests&lt;/strong&gt; must ensure safe changes and integration.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;6. Observability&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;API endpoints must expose:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;All scheduled actions&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Actions filtered by status&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retry functionality for failed actions&lt;/strong&gt;.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;A &lt;strong&gt;centralized logging system&lt;/strong&gt; must track execution issues consistently.
&lt;/li&gt;

&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Tools: Powering the Scheduled Actions System&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxp2eblgjfvcy8vuy1i12.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxp2eblgjfvcy8vuy1i12.png" alt="Infrastructure Diagram" width="800" height="157"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;AWS Lambda: Serverless Compute for Execution&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Enables event-driven execution without managing servers.
&lt;/li&gt;
&lt;li&gt;Handles action scheduling and validation.
&lt;/li&gt;
&lt;li&gt;Processes immediate actions using real-time event streams.
&lt;/li&gt;
&lt;li&gt;Executes delayed actions at the scheduled time.
&lt;/li&gt;
&lt;li&gt;Manages fulfillment tasks based on the action type.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Amazon EventBridge: Managing Scheduled Execution&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Acts as a scheduler for delayed actions.
&lt;/li&gt;
&lt;li&gt;Polls for due pending actions every 5 minutes and enqueues them for processing.
&lt;/li&gt;
&lt;li&gt;Ensures execution happens within ±2 minutes of the scheduled time.
&lt;/li&gt;
&lt;/ul&gt;
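&lt;p&gt;The core of that poll step can be sketched as pure logic—&lt;code&gt;ActionRow&lt;/code&gt; here is a stand-in for the DynamoDB record, not the real model:&lt;/p&gt;

```typescript
// From the pending actions, pick those due within the 2-minute
// tolerance so the poller can enqueue them for processing.
interface ActionRow {
  id: string;
  executionTime: number; // epoch milliseconds
  status: string;
}

const TOLERANCE_MS = 2 * 60 * 1000;

function selectDueActions(rows: ActionRow[], now: number): ActionRow[] {
  return rows.filter(
    (row) => row.status === "PENDING" && row.executionTime <= now + TOLERANCE_MS
  );
}
```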

&lt;h3&gt;
  
  
  &lt;strong&gt;Amazon SQS: Queueing Actions for Scalability&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Decouples execution workloads by handling scheduled actions asynchronously.
&lt;/li&gt;
&lt;li&gt;Controls fulfillment request frequency to prevent system overload.
&lt;/li&gt;
&lt;li&gt;Uses FIFO (First-In-First-Out) processing to maintain execution order and prevent duplicate executions.
&lt;/li&gt;
&lt;/ul&gt;
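&lt;p&gt;A sketch of what the FIFO enqueue parameters could look like: reusing the action id as the deduplication id means a re-polled action cannot be queued twice, and a single message group preserves order. The field names mirror the SQS &lt;code&gt;SendMessage&lt;/code&gt; API; the queue URL is a placeholder.&lt;/p&gt;

```typescript
// Builds the parameter object for an SQS FIFO SendMessage call.
function buildFifoMessage(actionId: string, payload: Record<string, unknown>) {
  return {
    QueueUrl: "https://sqs.eu-west-1.amazonaws.com/000000000000/scheduled-actions.fifo",
    MessageBody: JSON.stringify(payload),
    MessageGroupId: "scheduled-actions",
    MessageDeduplicationId: actionId, // same id => SQS drops the duplicate
  };
}
```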

&lt;h3&gt;
  
  
  &lt;strong&gt;Amazon DynamoDB: Storing Scheduled Actions&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Serves as the primary database for storing scheduled actions.
&lt;/li&gt;
&lt;li&gt;Provides fast read/write operations for handling high workloads.
&lt;/li&gt;
&lt;li&gt;Stores metadata for tracking execution status, retries, and results.
&lt;/li&gt;
&lt;li&gt;Uses DynamoDB Streams to trigger immediate executions.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Amazon API Gateway: Exposing Endpoints for Management&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Provides HTTP endpoints for creating, updating, and deleting scheduled actions.
&lt;/li&gt;
&lt;li&gt;Exposes monitoring endpoints to retrieve actions by status and retry failed actions.
&lt;/li&gt;
&lt;li&gt;Ensures secure access with authentication and authorization mechanisms.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;System Design: Database Schema for Scheduled Actions&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;id&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Unique identifier for each scheduled action.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;data&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Stores execution-specific details.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;action&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Defines the type of action to execute.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;executionTime&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Specifies when the action should run.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;repeat&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Indicates if the action should repeat.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;frequency&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Defines the interval for recurring actions.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;executionRemainder&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tracks the remaining number of executions.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;status&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Execution state ("PENDING", "IN_PROGRESS", "COMPLETED", "FAILED").&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;createdAt&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Timestamp when the action was created.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;updatedAt&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Last modified timestamp.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;retryCount&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Counts failed execution retries.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;metadata&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Stores logs and additional execution details.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
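&lt;p&gt;The schema above, expressed as a TypeScript type. This is a sketch: the status union comes straight from the table, and the optionality of the repeat-related fields is inferred from the "optional fields" requirement earlier.&lt;/p&gt;

```typescript
type ActionStatus = "PENDING" | "IN_PROGRESS" | "COMPLETED" | "FAILED";

interface ScheduledAction {
  id: string;
  data: Record<string, unknown>;   // execution-specific details
  action: string;                  // e.g. "SEND_NOTIFICATION"
  executionTime: number;           // epoch milliseconds
  repeat?: boolean;
  frequency?: string;              // e.g. "DAILY"
  executionRemainder?: number;     // executions left for repeating actions
  status: ActionStatus;
  createdAt: number;
  updatedAt: number;
  retryCount: number;
  metadata: Record<string, unknown>; // logs and execution results
}
```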

&lt;h3&gt;
  
  
  &lt;strong&gt;Example: Scheduled Notification Action&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"mobile"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"60123456789"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"subject"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Test"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Joojo"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"templateType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"USER_LATE_PAYMENT_NOTIFICATION"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"notificationType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SMS"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"repeat"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"frequency"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DAILY"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"executionRemainder"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SEND_NOTIFICATION"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"executionTime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1736930117120&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Project structure&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I used a &lt;strong&gt;layered-modular approach&lt;/strong&gt; for maintainability, scalability, and ease of change. Many times, different teams may want to extend changes in a service without introducing unintended side effects. I tried to achieve this by organizing components into distinct modules. Let's dive deeper below&lt;/p&gt;

&lt;h5&gt;
  
  
  &lt;strong&gt;1. Single Application with a Modular Design&lt;/strong&gt;
&lt;/h5&gt;

&lt;p&gt;The entire system is built as a &lt;strong&gt;single application&lt;/strong&gt;, but with a &lt;strong&gt;modular structure&lt;/strong&gt; that separates concerns. Each module is responsible for a specific aspect of the system, making the codebase easier to navigate and modify.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./src
├── app.ts
├── clients
├── config
├── controllers
├── handlers
├── helpers
├── middleware
├── models
├── routes
├── service
├── types
└── utils
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h5&gt;
  
  
  &lt;strong&gt;2. Serverless Handlers for Distributed Execution&lt;/strong&gt;
&lt;/h5&gt;

&lt;p&gt;The project is designed around AWS Lambda, with different &lt;strong&gt;handlers&lt;/strong&gt; exported and structured to allow seamless execution of scheduled actions. These handlers ensure that various tasks are processed independently, improving fault tolerance and scalability.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Action Handlers&lt;/strong&gt;: Manage creating, scheduling, retrieving, updating, deleting, and processing scheduled actions. This keeps all action-related logic centralized, making it easy to modify without affecting other parts of the system.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delayed Action Handlers&lt;/strong&gt;: Specifically handle actions that need to be initiated &lt;strong&gt;at a later time&lt;/strong&gt;. This separation ensures that delayed actions are efficiently scheduled and processed without interfering with real-time execution.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Immediate Action Handlers&lt;/strong&gt;: Trigger execution for actions that &lt;strong&gt;must start within 2 minutes&lt;/strong&gt;, using &lt;strong&gt;DynamoDB Streams&lt;/strong&gt; to detect changes and initiate execution instantly. This ensures timely processing of urgent tasks.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fulfillment Handlers&lt;/strong&gt;: Ensure that scheduled actions are executed properly by interacting with the appropriate fulfillment services. This design allows fulfillment logic to evolve independently of action scheduling.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;├── handlers
│   ├── fulfillment.ts
│   ├── http-apis.ts
│   ├── initiate-scheduled-actions.ts
│   ├── initiate-stream-actions.ts
│   └── process.ts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h5&gt;
  
  
  &lt;strong&gt;3. Maintainability through Separation of Concerns&lt;/strong&gt;
&lt;/h5&gt;

&lt;p&gt;Each module in the project is &lt;strong&gt;self-contained&lt;/strong&gt;, meaning changes to one component do not directly impact others. This reduces the risk of breaking existing functionality and simplifies debugging.  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Controllers&lt;/strong&gt; handle request routing and execution logic.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Services&lt;/strong&gt; manage business logic and data interactions.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clients&lt;/strong&gt; interact with external services like databases, queues, and APIs.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Models&lt;/strong&gt; define the data structures used across the system.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Middleware&lt;/strong&gt; ensures that requests pass through validation and authentication layers.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Utilities&lt;/strong&gt; provide reusable helper functions for logging, error handling, and retries.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./src
├── clients
├── controllers
├── middleware
├── models
├── service
└── utils
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h5&gt;
  
  
  &lt;strong&gt;4. Ease of Extensibility&lt;/strong&gt;
&lt;/h5&gt;

&lt;p&gt;With a modular design, &lt;strong&gt;new features can be added without modifying core components&lt;/strong&gt;. For example:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A new type of scheduled action can be introduced by &lt;strong&gt;adding a new action in the fulfillment service&lt;/strong&gt; without modifying the existing scheduling or queuing logic.
&lt;/li&gt;
&lt;li&gt;A new external service integration can be implemented by &lt;strong&gt;extending the clients module&lt;/strong&gt;, ensuring seamless communication with third-party systems.
&lt;/li&gt;
&lt;/ul&gt;
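&lt;p&gt;As a sketch of the second point, a new client could live under &lt;code&gt;./src/clients&lt;/code&gt; behind a small interface. The names below (&lt;code&gt;NotificationClient&lt;/code&gt;, &lt;code&gt;WebhookNotificationClient&lt;/code&gt;) are illustrative, not taken from the actual codebase:&lt;/p&gt;

```typescript
// Hypothetical sketch of a new external-service client added under
// ./src/clients without touching scheduling or queuing logic.
export interface NotificationClient {
  send(recipient: string, message: string): Promise<boolean>;
}

export class WebhookNotificationClient implements NotificationClient {
  constructor(private readonly endpoint: string) {}

  async send(recipient: string, message: string): Promise<boolean> {
    // A real client would POST to the endpoint; this sketch only
    // illustrates the shape of the integration point.
    console.log(`POST ${this.endpoint}`, { recipient, message });
    return true;
  }
}
```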

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ageha1u742heply4okf.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ageha1u742heply4okf.jpeg" alt=" " width="800" height="360"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Delayed Execution: Ensuring Timely Processing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The system processes scheduled actions through &lt;strong&gt;periodic execution&lt;/strong&gt;, ensuring that pending actions are picked up and executed on time.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Periodic Execution for Scheduled Actions&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;Lambda function&lt;/strong&gt; periodically scans the database for actions with &lt;strong&gt;PENDING status&lt;/strong&gt; and an &lt;strong&gt;executionTime&lt;/strong&gt; that is due.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon EventBridge&lt;/strong&gt; acts as a scheduler, triggering this Lambda function &lt;strong&gt;every 5 minutes&lt;/strong&gt; to ensure that actions are picked up on time.
&lt;/li&gt;
&lt;li&gt;The function &lt;strong&gt;enqueues these pending actions into Amazon SQS&lt;/strong&gt;, ensuring a reliable and scalable execution pipeline.
&lt;/li&gt;
&lt;/ul&gt;
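&lt;p&gt;The flow above can be sketched as a small handler. The injected dependency names (&lt;code&gt;scanDueActions&lt;/code&gt;, &lt;code&gt;enqueue&lt;/code&gt;) are placeholders, not the project's real functions:&lt;/p&gt;

```typescript
// Illustrative sketch of the periodic pickup: EventBridge triggers this
// handler every 5 minutes; it scans for due PENDING actions and enqueues
// them into SQS. Dependency names here are placeholders.
type Action = { id: string; status: string; executionTime: number };

export const initiateScheduledActions = async (
  scanDueActions: (now: number) => Promise<Action[]>,
  enqueue: (action: Action) => Promise<void>,
): Promise<number> => {
  const now = Date.now();
  // PENDING actions whose executionTime is due
  const due = await scanDueActions(now);
  await Promise.all(due.map((action) => enqueue(action)));
  return due.length;
};
```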

&lt;p&gt;&lt;strong&gt;Why This Approach Works&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Efficient batch processing&lt;/strong&gt; ensures that multiple actions can be picked up at once.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt; is maintained by decoupling execution with &lt;strong&gt;SQS&lt;/strong&gt;, preventing system overload. Queues are critical for smoothing load on downstream systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State Management:&lt;/strong&gt; Actions follow a lifecycle (&lt;code&gt;PENDING → IN_PROGRESS → COMPLETED/FAILED/NO_ACTION&lt;/code&gt;), with each state &lt;strong&gt;persisted in the database&lt;/strong&gt; for tracking and recovery.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution Handling:&lt;/strong&gt; Successful executions are marked &lt;strong&gt;COMPLETED&lt;/strong&gt;, failures are marked &lt;strong&gt;FAILED&lt;/strong&gt;, and recurring actions update their execution remainder before resetting to &lt;strong&gt;PENDING&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic Retries:&lt;/strong&gt; Failed actions use &lt;strong&gt;exponential backoff&lt;/strong&gt; for retries. If retries exceed the limit, the action remains &lt;strong&gt;FAILED&lt;/strong&gt; until manually reset.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Idempotency &amp;amp; Data Integrity:&lt;/strong&gt; Execution remainders prevent duplicate executions, and invalid operations (e.g., negative remainders) are blocked.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability:&lt;/strong&gt; Metadata stores logs, execution timestamps, API responses, and failure reasons for easy debugging.
&lt;/li&gt;
&lt;/ul&gt;
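&lt;p&gt;The retry rule can be expressed in a few lines. The constants here (a 1-minute base delay and a retry cap of 5) are assumptions for illustration, not the system's actual configuration:&lt;/p&gt;

```typescript
// A minimal sketch of the exponential backoff rule, assuming a 1-minute
// base delay and a retry cap of 5; both constants are illustrative.
const BASE_DELAY_MS = 60_000;
const MAX_RETRY = 5;

export const nextRetryDelay = (retryCount: number): number | null => {
  // Past the cap the action stays FAILED until manually reset.
  if (retryCount >= MAX_RETRY) return null;
  return BASE_DELAY_MS * 2 ** retryCount; // 1m, 2m, 4m, 8m, 16m
};
```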

&lt;h3&gt;
  
  
  &lt;strong&gt;Immediate Execution: Handling Time-Sensitive Actions Using DynamoDB Streams&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Some scheduled actions require &lt;strong&gt;immediate execution&lt;/strong&gt; if their execution time is within &lt;strong&gt;2 minutes&lt;/strong&gt; of creation. To handle these efficiently, the system leverages &lt;strong&gt;DynamoDB Streams&lt;/strong&gt; and AWS Lambda for real-time processing.  &lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;1. How Immediate Execution Works&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DynamoDB Streams&lt;/strong&gt; detect changes in the database when a new action is inserted or modified.
&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;Lambda function listens to these changes&lt;/strong&gt;, processes new actions, and determines whether they require immediate execution.
&lt;/li&gt;
&lt;li&gt;If an action is scheduled to execute &lt;strong&gt;within 2 minutes&lt;/strong&gt;, the Lambda function &lt;strong&gt;enqueues it into Amazon SQS&lt;/strong&gt; for execution.
&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;2. Breakdown of the Processing Logic&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Listening to DynamoDB Stream Events&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The function &lt;code&gt;initiateProcessFromDynamoStream&lt;/code&gt; is triggered whenever a new record is &lt;strong&gt;INSERTED&lt;/strong&gt; or &lt;strong&gt;MODIFIED&lt;/strong&gt; in DynamoDB.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;initiateProcessFromDynamoStream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;DynamoDBStreamEvent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Records&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;Records&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;Records&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;No records to process in DynamoDB stream event.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;Records&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; records received.`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The function &lt;strong&gt;checks if there are new records&lt;/strong&gt; in the event.
&lt;/li&gt;
&lt;li&gt;If no records exist, the function exits early.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Processing Each Record&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The function &lt;strong&gt;loops through each record&lt;/strong&gt;, extracts its details, and determines whether it needs to be processed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;processingPromises&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;Records&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;record&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;DynamoDBRecord&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;eventName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;dynamodb&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;record&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;dynamodb&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;NewImage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Skipping record: Missing NewImage.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cleanedImage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;unmarshall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;dynamodb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;NewImage&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;Record&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Cleaned NewImage object:&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cleanedImage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;INSERT&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;MODIFY&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;eventName&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Skipping record with eventName &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;eventName&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Processing record with eventName: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;eventName&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The function extracts &lt;strong&gt;newly inserted or modified data&lt;/strong&gt; from the DynamoDB stream.
&lt;/li&gt;
&lt;li&gt;It &lt;strong&gt;filters out irrelevant records&lt;/strong&gt; (i.e., records that don’t have a &lt;code&gt;NewImage&lt;/code&gt; or are not newly inserted/modified).
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Checking for Immediate Execution&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The function then checks whether the action &lt;strong&gt;needs immediate execution&lt;/strong&gt; by calculating the time difference.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;retryCount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;executionTime&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;cleanedImage&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Check buffer time logic&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;currentTime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;timeUntilExecution&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;executionTime&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;currentTime&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;timeUntilExecution&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;TWO_MINUTES_IN_MS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;`Skipping record with id &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;: Execution time is outside the 2-minute buffer window.`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;current time&lt;/strong&gt; is compared with the action’s &lt;strong&gt;executionTime&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;If the action is &lt;strong&gt;more than 2 minutes away&lt;/strong&gt;, it is &lt;strong&gt;skipped&lt;/strong&gt; (it will be picked up later by the periodic execution).
&lt;/li&gt;
&lt;li&gt;If the action &lt;strong&gt;needs immediate execution&lt;/strong&gt;, it continues processing.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Ensuring Valid Status and Retry Limits&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;STATUSES&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;PENDING&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Skipping record: Missing or invalid status.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;retryCount&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;retryCount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;CONSTANTS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;MAX_RETRY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;`Skipping record with retryCount exceeding limit: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;retryCount&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Skipping record: Missing id.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Ensures the action &lt;strong&gt;has a valid &lt;code&gt;PENDING&lt;/code&gt; status&lt;/strong&gt; before processing.
&lt;/li&gt;
&lt;li&gt;Checks whether the &lt;strong&gt;retry limit has been exceeded&lt;/strong&gt; to prevent infinite retries.
&lt;/li&gt;
&lt;li&gt;Ensures the &lt;strong&gt;action has a valid ID&lt;/strong&gt; before sending it to the queue.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Sending the Action to SQS for Execution&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;sendMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cleanedImage&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;`Failed to add action to the queue: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;If the action qualifies for &lt;strong&gt;immediate execution&lt;/strong&gt;, it is &lt;strong&gt;sent to SQS&lt;/strong&gt;, where it will be processed by the fulfillment service.
&lt;/li&gt;
&lt;li&gt;If &lt;strong&gt;SQS fails&lt;/strong&gt;, the action is &lt;strong&gt;marked as FAILED&lt;/strong&gt; and logged for debugging.
&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;3. Why This Approach Is Reliable&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-Time Execution:&lt;/strong&gt; Actions scheduled within 2 minutes execute immediately instead of waiting for periodic polling.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic Filtering:&lt;/strong&gt; Actions scheduled for later execution are &lt;strong&gt;skipped&lt;/strong&gt; and processed by EventBridge at the right time.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error Handling:&lt;/strong&gt; If an action &lt;strong&gt;fails to enqueue in SQS&lt;/strong&gt;, it is marked as &lt;strong&gt;FAILED&lt;/strong&gt; instead of being lost.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability:&lt;/strong&gt; The Lambda function can &lt;strong&gt;process multiple events concurrently&lt;/strong&gt;, ensuring no action is delayed.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Repeated Execution: Managing Recurring Actions&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Some scheduled actions need to &lt;strong&gt;execute multiple times&lt;/strong&gt; at fixed intervals. The system handles repeated execution using three key fields:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;repeat&lt;/strong&gt; – Indicates whether the action should run multiple times.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;executionRemainder&lt;/strong&gt; – Tracks how many more times the action should execute.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;frequency&lt;/strong&gt; – Defines the time interval between executions.
&lt;/li&gt;
&lt;/ul&gt;
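&lt;p&gt;These fields might be modeled as follows. The &lt;code&gt;Frequency&lt;/code&gt; values and the &lt;code&gt;getFrequencyInMilliseconds&lt;/code&gt; implementation are assumptions based on how the helper is used, not the project's actual code:&lt;/p&gt;

```typescript
// Sketch of the three recurrence fields, plus a hypothetical
// frequency-to-milliseconds helper; the Frequency values and the
// helper's implementation are assumptions.
type Frequency = "HOURLY" | "DAILY" | "WEEKLY";

interface RecurringAction {
  repeat: boolean;            // should the action run more than once?
  executionRemainder: number; // executions still outstanding
  frequency?: Frequency;      // interval between executions
}

export const getFrequencyInMilliseconds = (
  frequency: Frequency,
): number | undefined =>
  ({
    HOURLY: 60 * 60 * 1000,
    DAILY: 24 * 60 * 60 * 1000,
    WEEKLY: 7 * 24 * 60 * 60 * 1000,
  })[frequency];
```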




&lt;h3&gt;
  
  
  &lt;strong&gt;1. Handling Repeated Execution&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The function &lt;code&gt;complete(id, notes)&lt;/code&gt; is responsible for managing the completion of actions. If an action is &lt;strong&gt;set to repeat&lt;/strong&gt;, it updates the execution time and tracks how many executions remain.  &lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Deducting Execution Remainder&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;newExecutionRemainder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;repeat&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;executionRemainder&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;executionRemainder&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;newExecutionRemainder&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Execution remainder cannot be negative.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;If the action is &lt;strong&gt;repeating&lt;/strong&gt;, the execution remainder &lt;strong&gt;decreases by 1&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;If the remainder is &lt;strong&gt;less than 0&lt;/strong&gt;, an error is thrown to prevent unintended behavior.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;2. Completing the Final Execution&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;repeat&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;newExecutionRemainder&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;STATUSES&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;COMPLETED&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;ttl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;calculateTTL&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;executionRemainder&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;executionResponses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[...(&lt;/span&gt;&lt;span class="nx"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;executionResponses&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="p"&gt;[]),&lt;/span&gt; &lt;span class="nx"&gt;notes&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Final execution completed successfully:&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;If &lt;strong&gt;no executions remain&lt;/strong&gt;, the action is marked as &lt;strong&gt;COMPLETED&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;TTL (Time-To-Live)&lt;/strong&gt; is set to &lt;strong&gt;delete the record after two weeks&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;Execution metadata is updated for tracking and observability.
&lt;/li&gt;
&lt;/ul&gt;
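&lt;p&gt;A minimal sketch of what &lt;code&gt;calculateTTL&lt;/code&gt; could look like, assuming DynamoDB's TTL attribute (an epoch timestamp in seconds) and the two-week retention described above:&lt;/p&gt;

```typescript
// Hypothetical calculateTTL: DynamoDB TTL expects an epoch timestamp
// in seconds; completed records are kept for two weeks, then expired.
const TWO_WEEKS_IN_SECONDS = 14 * 24 * 60 * 60;

export const calculateTTL = (): number =>
  Math.floor(Date.now() / 1000) + TWO_WEEKS_IN_SECONDS;
```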

&lt;h4&gt;
  
  
  &lt;strong&gt;3. Scheduling the Next Execution&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;If the action &lt;strong&gt;still has remaining executions&lt;/strong&gt;, the function schedules the next execution.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;repeat&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;newExecutionRemainder&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;frequency&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;frequencyInMs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getFrequencyInMilliseconds&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;frequency&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;frequencyInMs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Invalid frequency: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;frequency&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;STATUSES&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;PENDING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;executionTime&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;executionTime&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;frequencyInMs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;executionRemainder&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;newExecutionRemainder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;executionResponses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[...(&lt;/span&gt;&lt;span class="nx"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;executionResponses&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="p"&gt;[]),&lt;/span&gt; &lt;span class="nx"&gt;notes&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Recurring action updated successfully:&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;execution time is updated&lt;/strong&gt; by adding the interval from the frequency field.
&lt;/li&gt;
&lt;li&gt;The action status is set to &lt;strong&gt;PENDING&lt;/strong&gt; so it can be picked up again.
&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;metadata is updated&lt;/strong&gt; to log execution history.
&lt;/li&gt;
&lt;/ul&gt;




&lt;h4&gt;
  
  
  &lt;strong&gt;4. Frequency Conversion&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The system converts predefined frequencies into milliseconds to update the execution time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;frequencyDurations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Record&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;TEN_MINS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;HOURLY&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;DAILY&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;WEEKLY&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;MONTHLY&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;//Not accurate and for demonstration purposes&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This allows flexible scheduling based on predefined intervals.  &lt;/p&gt;
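The lookup helper itself is not shown in the article, but assuming `getFrequencyInMilliseconds` is a plain lookup into the `frequencyDurations` map above (repeated here so the sketch is self-contained), it could look like this:

```typescript
// Durations in milliseconds, mirroring the map shown above.
const frequencyDurations: Record<string, number> = {
  TEN_MINS: 10 * 60 * 1000,
  HOURLY: 60 * 60 * 1000,
  DAILY: 24 * 60 * 60 * 1000,
  WEEKLY: 7 * 24 * 60 * 60 * 1000,
  MONTHLY: 30 * 24 * 60 * 60 * 1000, // approximation, as noted above
};

// Sketch: the real body is an assumption. Returning undefined for an
// unknown frequency lets the caller raise its "Invalid frequency" error.
function getFrequencyInMilliseconds(frequency: string): number | undefined {
  return frequencyDurations[frequency];
}
```

Returning `undefined` rather than throwing keeps validation in one place: the scheduling code shown earlier already guards with `if (!frequencyInMs) { throw ... }`.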




&lt;h4&gt;
  
  
  &lt;strong&gt;5. Ensuring Idempotency and Data Integrity&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;For non-repeating actions, the system ensures that execution happens &lt;strong&gt;only once&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;STATUSES&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;COMPLETED&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;ttl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;calculateTTL&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
  &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;notes&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Action completed successfully with TTL:&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Actions that &lt;strong&gt;do not repeat&lt;/strong&gt; are marked &lt;strong&gt;COMPLETED&lt;/strong&gt; immediately.
&lt;/li&gt;
&lt;li&gt;The TTL ensures that &lt;strong&gt;data is retained for a limited time before deletion&lt;/strong&gt;.
&lt;/li&gt;
&lt;/ul&gt;
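The article uses `calculateTTL()` without showing its body. Assuming a DynamoDB-style store where TTL is an epoch timestamp in seconds, a minimal sketch matching the two-week retention described above might be:

```typescript
// Two weeks of retention, expressed in seconds.
const TWO_WEEKS_IN_SECONDS = 14 * 24 * 60 * 60;

// Sketch of calculateTTL: this body is an assumption, not the article's code.
// Date.now() returns milliseconds, but TTL attributes are compared against
// epoch *seconds*, hence the division before adding the retention window.
function calculateTTL(nowMs: number = Date.now()): number {
  return Math.floor(nowMs / 1000) + TWO_WEEKS_IN_SECONDS;
}
```

A common bug here is storing milliseconds instead of seconds, which pushes expiry roughly 30,000 years into the future, so the unit conversion is worth calling out.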




&lt;h4&gt;
  
  
  &lt;strong&gt;Why This Approach Works&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automated Rescheduling&lt;/strong&gt; – The system automatically sets the next execution time.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Preventing Overexecution&lt;/strong&gt; – Execution stops when the remainder reaches &lt;strong&gt;zero&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Efficient Tracking&lt;/strong&gt; – Each execution updates metadata for debugging and observability.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Integrity&lt;/strong&gt; – Ensures that frequency values are valid and that the execution remainder is correctly decremented. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fua50x9rwhvciuekxxmhr.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fua50x9rwhvciuekxxmhr.jpeg" alt="Process" width="800" height="439"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Action Processing: Ensuring Reliable Execution with Deduplication&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The system processes scheduled actions using &lt;strong&gt;Amazon SQS FIFO Queues&lt;/strong&gt; or &lt;strong&gt;Redis-based deduplication&lt;/strong&gt; to ensure each action is executed only once, preventing duplicate processing.  &lt;/p&gt;




&lt;h4&gt;
  
  
  &lt;strong&gt;1. Handling Action Processing with SQS FIFO&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Messages are sent to an &lt;strong&gt;Amazon SQS FIFO queue&lt;/strong&gt;, ensuring actions are processed in &lt;strong&gt;first-in-first-out order&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FIFO queues deduplicate messages&lt;/strong&gt; within a five-minute window, preventing the same message from being processed multiple times.
&lt;/li&gt;
&lt;li&gt;This approach is &lt;strong&gt;ideal for strict ordering and exactly-once processing&lt;/strong&gt;.
&lt;/li&gt;
&lt;/ul&gt;
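No FIFO snippet appears in the article, so here is a sketch of how the send parameters would differ from the standard-queue example later on. The message shape and `queueUrl` name are assumptions; FIFO queues require a `MessageGroupId` (the ordering scope) and accept a `MessageDeduplicationId` that SQS enforces within a five-minute window:

```typescript
// Assumed message shape, based on the deduplication ID used elsewhere.
interface ScheduledActionMessage {
  id: string;
  status: string;
  retryCount?: number;
}

// Builds the parameter object that would be passed to
// `new SendMessageCommand(params)` from @aws-sdk/client-sqs.
function buildFifoSendParams(queueUrl: string, messageBody: ScheduledActionMessage) {
  return {
    QueueUrl: queueUrl, // FIFO queue URLs end in ".fifo"
    MessageBody: JSON.stringify(messageBody),
    // Messages sharing a group ID are delivered in order relative to each other.
    MessageGroupId: messageBody.id,
    // Same ID within 5 minutes => SQS drops the duplicate.
    MessageDeduplicationId: `${messageBody.id}-${messageBody.status}-${messageBody.retryCount || 0}`,
  };
}
```

Using the action's ID as the group ID keeps retries of one action ordered without serializing unrelated actions behind each other.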




&lt;h4&gt;
  
  
  &lt;strong&gt;2. Alternative Deduplication Using Redis&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;If &lt;strong&gt;a FIFO queue is not used&lt;/strong&gt;, the system leverages &lt;strong&gt;Redis&lt;/strong&gt; to manage deduplication before sending messages to a standard SQS queue.  &lt;/p&gt;

&lt;h5&gt;
  
  
  &lt;strong&gt;How Redis Deduplication Works&lt;/strong&gt;
&lt;/h5&gt;

&lt;ul&gt;
&lt;li&gt;Each message is assigned a &lt;strong&gt;deduplication ID&lt;/strong&gt; based on:

&lt;ul&gt;
&lt;li&gt;The action’s &lt;strong&gt;unique ID&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;status&lt;/strong&gt; of the action
&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;retry count&lt;/strong&gt; (if applicable)
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;deduplicationId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;messageBody&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;messageBody&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;messageBody&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;retryCount&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;redisKey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`sqs-deduplication:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;deduplicationId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Before sending a message to SQS, Redis &lt;strong&gt;checks if the deduplication ID exists&lt;/strong&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;redisCheck&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;RedisClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;redisKey&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;redisCheck&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;success&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;redisCheck&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;`Duplicate message detected. Skipping send for ID: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;deduplicationId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;If a duplicate is &lt;strong&gt;detected&lt;/strong&gt;, the message is &lt;strong&gt;not sent&lt;/strong&gt;, avoiding redundant processing.
&lt;/li&gt;
&lt;/ul&gt;




&lt;h4&gt;
  
  
  &lt;strong&gt;3. Sending Messages to SQS&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;If the action is &lt;strong&gt;not a duplicate&lt;/strong&gt;, it is sent to the &lt;strong&gt;SQS queue&lt;/strong&gt; for processing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;command&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SendMessageCommand&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;QueueUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;queueUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;MessageBody&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;messageBody&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;executeSQSCommand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;command&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Message sent successfully to &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;queueUrl&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The system ensures messages are delivered &lt;strong&gt;without unnecessary duplicates&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;Actions proceed to the &lt;strong&gt;fulfillment stage&lt;/strong&gt; after entering the queue.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;4. Storing Deduplication Data in Redis&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;After sending a message, the deduplication ID is &lt;strong&gt;stored in Redis&lt;/strong&gt; with a &lt;strong&gt;TTL of 5 minutes&lt;/strong&gt; to ensure temporary deduplication.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;RedisClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;redisKey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// 300 seconds = 5 minutes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;short expiration time&lt;/strong&gt; means a &lt;strong&gt;later retry&lt;/strong&gt; of the same action can still be sent once the key expires.
&lt;/li&gt;
&lt;li&gt;Redis helps manage &lt;strong&gt;temporary deduplication&lt;/strong&gt; without affecting long-term action execution.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;5. Handling Errors Gracefully&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;If sending a message to SQS &lt;strong&gt;fails&lt;/strong&gt;, errors are handled based on the failure type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;TimeoutError&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AppError&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;CommonErrors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;REQUEST_TIMEOUT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Timeout occurred while sending the message to SQS.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;queueUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;messageBody&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AppError&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;CommonErrors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;INTERNAL_SERVER_ERROR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Failed to send message to SQS.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;queueUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;messageBody&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Timeouts&lt;/strong&gt; raise a dedicated &lt;code&gt;REQUEST_TIMEOUT&lt;/code&gt; error so callers can apply a retry strategy.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Other failures&lt;/strong&gt; log metadata to help diagnose issues.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Why This Approach Works&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FIFO queues&lt;/strong&gt; ensure &lt;strong&gt;strict ordering and deduplication&lt;/strong&gt; for time-sensitive tasks.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redis deduplication&lt;/strong&gt; prevents &lt;strong&gt;unnecessary duplicate processing&lt;/strong&gt; when a FIFO queue is not available.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error handling mechanisms&lt;/strong&gt; ensure messages are retried when necessary.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Action Fulfillment: Processing Scheduled Actions from SQS&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once scheduled actions reach their execution time, they are &lt;strong&gt;processed by a Lambda function&lt;/strong&gt; that reads messages from &lt;strong&gt;Amazon SQS&lt;/strong&gt;. The function ensures that actions are executed correctly, updates their status accordingly, and handles errors or retries when needed.  &lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;1. Processing Actions from SQS&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;fulfill&lt;/code&gt; function listens for &lt;strong&gt;SQS events&lt;/strong&gt;, where each record represents a scheduled action that needs execution.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fulfill&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;Records&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;SQSRecord&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="p"&gt;}):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Records&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;Records&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;Records&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;No records to process in SQS event.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The function checks if there are &lt;strong&gt;new records to process&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;If no records exist, it exits early.
&lt;/li&gt;
&lt;/ul&gt;
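The snippets that follow use `id`, `retryCount`, `metadata`, and `receiptHandle` without showing where they come from. Presumably each record's `body` is parsed back into the scheduled action inside a loop; a sketch of that step (field names are assumptions based on the surrounding code):

```typescript
// Minimal shape of the SQS record fields used here.
interface SQSRecord {
  body: string;
  receiptHandle: string;
}

// Sketch: parse each record's JSON body and pull out the fields the
// fulfillment loop needs. The receiptHandle comes from the record itself,
// since it is required later to delete the message from the queue.
function parseRecords(records: SQSRecord[]) {
  return records.map((record) => {
    const scheduledAction = JSON.parse(record.body);
    return {
      id: scheduledAction.id,
      retryCount: scheduledAction.retryCount || 0,
      metadata: scheduledAction.metadata,
      receiptHandle: record.receiptHandle,
      scheduledAction,
    };
  });
}
```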

&lt;h4&gt;
  
  
  &lt;strong&gt;2. Ensuring Actions are Executed Correctly&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;For each action in the queue:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the action has exceeded the &lt;strong&gt;maximum retry attempts&lt;/strong&gt;, it is &lt;strong&gt;marked as FAILED&lt;/strong&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;retryCount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nx"&gt;CONSTANTS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;MAX_RETRY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;handleFailure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;receiptHandle&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;retryReason&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Exceeded maximum retry attempts&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;If the action is &lt;strong&gt;being executed for the first time&lt;/strong&gt;, its status is set to &lt;strong&gt;IN_PROGRESS&lt;/strong&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;retryCount&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;strong&gt;3. Handling Different Types of Actions&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Even though the &lt;strong&gt;scheduler does not allow an action to be scheduled without a valid fulfillment service&lt;/strong&gt;, there is a &lt;strong&gt;defensive mechanism (NO_ACTION)&lt;/strong&gt; to handle cases where an action is &lt;strong&gt;manually altered or corrupted in the database&lt;/strong&gt;.  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the &lt;strong&gt;action type is not recognized&lt;/strong&gt;, it is marked as &lt;strong&gt;NO_ACTION&lt;/strong&gt; and removed from the queue.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nb"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ACTIONS&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scheduledAction&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;action&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;Actions&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;noAction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;deleteMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;receiptHandle&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;If the action is &lt;strong&gt;valid&lt;/strong&gt;, it is processed based on its type.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Processing a General Task&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="nx"&gt;ACTIONS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;EXECUTE_TASK&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;taskExecutionService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;performTask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scheduledAction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Calls an &lt;strong&gt;external service&lt;/strong&gt; to execute a &lt;strong&gt;generic task&lt;/strong&gt; (e.g., processing a user request).
&lt;/li&gt;
&lt;li&gt;Marks the action as &lt;strong&gt;COMPLETED&lt;/strong&gt; once finished.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Processing a Notification&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="nx"&gt;ACTIONS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;SEND_ALERT&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;recipient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;messageType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;messageData&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;scheduledAction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;processedMessageData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;processMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;messageData&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;notificationService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nx"&gt;recipient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;messageType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;processedMessageData&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Sends an &lt;strong&gt;alert or notification&lt;/strong&gt; using a &lt;strong&gt;notification service&lt;/strong&gt;. &lt;/li&gt;
&lt;li&gt;Marks the action as &lt;strong&gt;COMPLETED&lt;/strong&gt; after execution.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4. Handling Failures and Retries&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If an action fails, the system &lt;strong&gt;applies exponential backoff&lt;/strong&gt; and &lt;strong&gt;retries the execution&lt;/strong&gt; before marking it as permanently failed.  &lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Marking an Action as Failed&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;handleFailure&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;receiptHandle&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Action with id: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; failed after maximum retries. Reason: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;deleteMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;receiptHandle&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;If an action &lt;strong&gt;exceeds retry limits&lt;/strong&gt;, it is &lt;strong&gt;marked as FAILED&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;The message is removed from the queue to prevent further processing.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Retrying an Action with Backoff&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;handleProcessingError&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;receiptHandle&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;retryCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Error processing message with id: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;. Error: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;applyExponentialBackoff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;retryCount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;actionMarkedForRetry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;retry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;error&lt;/span&gt; &lt;span class="k"&gt;instanceof&lt;/span&gt; &lt;span class="nx"&gt;AppError&lt;/span&gt;
      &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;code&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
      &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Processing error : &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;sendMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;actionMarkedForRetry&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;deleteMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;receiptHandle&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;If an error occurs, the system &lt;strong&gt;applies exponential backoff&lt;/strong&gt; and &lt;strong&gt;increments the retry count&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;The failed action is &lt;strong&gt;re-enqueued into SQS&lt;/strong&gt; for a retry.
&lt;/li&gt;
&lt;/ul&gt;
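&lt;p&gt;The &lt;code&gt;applyExponentialBackoff&lt;/code&gt; helper is not shown above, so here is a minimal sketch of what it might look like. The base delay and cap are assumptions for illustration, not the system's actual constants:&lt;/p&gt;

```typescript
// Hypothetical sketch of the backoff helper; delay constants are assumptions.
const BASE_DELAY_MS = 1000;
const MAX_DELAY_MS = 30000;

// The delay doubles with each retry and is capped so retries never stall too long.
const backoffDelay = (retryCount: number): number =>
  Math.min(BASE_DELAY_MS * 2 ** retryCount, MAX_DELAY_MS);

const applyExponentialBackoff = async (retryCount: number, id: string) => {
  const delayMs = backoffDelay(retryCount);
  console.log(`Waiting ${delayMs}ms before retrying action ${id}`);
  await new Promise((resolve) => setTimeout(resolve, delayMs));
};
```

&lt;p&gt;With these numbers, retry 0 waits 1s, retry 2 waits 4s, and everything past retry 5 waits the capped 30s.&lt;/p&gt;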

&lt;h3&gt;
  
  
  &lt;strong&gt;5. Finalizing Execution&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Once an action &lt;strong&gt;completes successfully&lt;/strong&gt;, it is removed from SQS.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;deleteMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;receiptHandle&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Message with id: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; processed successfully.`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Once &lt;strong&gt;all records&lt;/strong&gt; from the SQS event have been processed, a summary log is printed.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;Records&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; records from the SQS event have been processed.`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Why This Approach Works&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ensures Actions are Always Executed&lt;/strong&gt; – Each action is &lt;strong&gt;retried with backoff&lt;/strong&gt; before failing permanently.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handles Different Action Types&lt;/strong&gt; – Supports &lt;strong&gt;notifications, tasks, subscription updates, and other scheduled jobs&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prevents Duplicate Execution&lt;/strong&gt; – Uses &lt;strong&gt;SQS FIFO or Redis deduplication&lt;/strong&gt; to avoid duplicate processing.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliable State Management&lt;/strong&gt; – Updates the database with &lt;strong&gt;IN_PROGRESS, COMPLETED, FAILED, or NO_ACTION&lt;/strong&gt; statuses.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Defensive Handling&lt;/strong&gt; – &lt;strong&gt;NO_ACTION&lt;/strong&gt; is a safeguard in case an action is altered manually in the database.
&lt;/li&gt;
&lt;/ul&gt;
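&lt;p&gt;The deduplication point can be made concrete with SQS FIFO: a stable &lt;code&gt;MessageDeduplicationId&lt;/code&gt; makes the queue drop duplicate sends that arrive within its 5-minute deduplication window. This is a hedged sketch; the action shape, group id, and id scheme below are illustrative assumptions, not the article's actual code:&lt;/p&gt;

```typescript
// Illustrative shape of a scheduled action (assumed for this sketch).
interface ScheduledAction {
  id: string;
  action: string;
  executeAt: string;
}

// Builds the parameters for an SQS FIFO SendMessage call. SQS FIFO drops
// any message whose MessageDeduplicationId matches one already seen in the
// 5-minute deduplication window, so re-sending the same action is safe.
const buildFifoMessageParams = (queueUrl: string, action: ScheduledAction) => ({
  QueueUrl: queueUrl,
  MessageBody: JSON.stringify(action),
  MessageGroupId: "scheduled-actions",
  MessageDeduplicationId: `${action.id}-${action.executeAt}`,
});
```

&lt;p&gt;Keying the deduplication id on both the action id and its execution time means a rescheduled action still goes through, while a double-enqueue of the same run does not.&lt;/p&gt;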

&lt;h3&gt;
  
  
  &lt;strong&gt;Presentation Layer: Exposing Endpoints for Observability&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;presentation layer&lt;/strong&gt; consists of a &lt;strong&gt;single Lambda function&lt;/strong&gt; that serves as an API, exposing HTTP endpoints through &lt;strong&gt;Amazon API Gateway&lt;/strong&gt;. These endpoints allow users to &lt;strong&gt;observe, manage, and interact&lt;/strong&gt; with scheduled actions, ensuring real-time monitoring and control.  &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Exposing HTTP Endpoints via API Gateway&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;serverless API&lt;/strong&gt; is built using &lt;strong&gt;AWS Lambda&lt;/strong&gt; and &lt;strong&gt;API Gateway&lt;/strong&gt;, providing access to key functionalities related to scheduled actions.  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Get Actions by Status and Counts&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fetches actions grouped by their &lt;strong&gt;current status&lt;/strong&gt; (e.g., pending, completed, failed).
&lt;/li&gt;
&lt;li&gt;Provides &lt;strong&gt;count summaries&lt;/strong&gt; to track execution trends.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Initiate Failed Actions&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Allows users to &lt;strong&gt;retry&lt;/strong&gt; failed actions manually.
&lt;/li&gt;
&lt;li&gt;Ensures failed jobs can be &lt;strong&gt;reprocessed&lt;/strong&gt; without waiting for an automated retry cycle.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Delete Actions&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provides an endpoint to &lt;strong&gt;remove old or unnecessary actions&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;Helps maintain a &lt;strong&gt;clean database&lt;/strong&gt; by managing expired records.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
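&lt;p&gt;As a rough sketch of the core of the "counts" endpoint (the statuses come from the article; the routing and data access are omitted, and the shapes here are assumptions):&lt;/p&gt;

```typescript
// Statuses used by the scheduled actions system, per the article.
type ActionStatus = "PENDING" | "IN_PROGRESS" | "COMPLETED" | "FAILED" | "NO_ACTION";

// Groups actions by status and counts them -- the summary the dashboard
// would render from GET /actions/counts (endpoint path is an assumption).
const countsByStatus = (actions: { status: ActionStatus }[]) => {
  const counts: { [status: string]: number } = {};
  for (const a of actions) {
    counts[a.status] = (counts[a.status] ?? 0) + 1;
  }
  return counts;
};
```

&lt;p&gt;Inside the Lambda, this result would simply be serialized into the API Gateway response body.&lt;/p&gt;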

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Integrating with a Monitoring Dashboard&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The data exposed by these endpoints can be &lt;strong&gt;visualized on a dashboard&lt;/strong&gt; for real-time observability.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqljjjz6l312dy5tzpscp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqljjjz6l312dy5tzpscp.png" alt=" " width="800" height="131"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Displays &lt;strong&gt;action counts by status&lt;/strong&gt; to track performance.
&lt;/li&gt;
&lt;li&gt;Allows users to &lt;strong&gt;manually retry or delete actions&lt;/strong&gt; via an interface.
&lt;/li&gt;
&lt;li&gt;Provides insights into &lt;strong&gt;system health and execution reliability&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75xu0fpixwz9pvxoqunp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75xu0fpixwz9pvxoqunp.png" alt=" " width="800" height="519"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Challenges in Building the Scheduled Actions System&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Developing a reliable and scalable scheduled actions system comes with several challenges that need to be carefully addressed.  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Race Conditions&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When multiple processes attempt to update or execute the same action simultaneously, inconsistencies can occur.
&lt;/li&gt;
&lt;li&gt;Proper locking mechanisms, deduplication, and FIFO queues help prevent duplicate execution.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Load Testing at Every Point of the Cycle&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The system must be tested under &lt;strong&gt;high loads&lt;/strong&gt; to ensure that scheduling, execution, retries, and fulfillment scale properly.
&lt;/li&gt;
&lt;li&gt;Testing includes &lt;strong&gt;database performance, SQS message handling, Lambda execution limits, and API response times&lt;/strong&gt;.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Jobs Being Picked Up by Both Streams and the Scheduler&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Actions scheduled within &lt;strong&gt;2 minutes of execution&lt;/strong&gt; are processed by &lt;strong&gt;DynamoDB Streams&lt;/strong&gt;, while others rely on &lt;strong&gt;EventBridge&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;Without proper coordination, &lt;strong&gt;duplicate executions&lt;/strong&gt; may occur. Ensuring actions transition correctly between pending, in-progress, and completed states prevents this issue.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Understanding Limitations of Tools&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Concurrency Handling:&lt;/strong&gt; AWS Lambda &lt;strong&gt;scales automatically&lt;/strong&gt;, but high concurrency can lead to &lt;strong&gt;throttling and delays&lt;/strong&gt; in processing.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lambda Runtime Limits:&lt;/strong&gt; Since &lt;strong&gt;Lambda has a max execution time&lt;/strong&gt;, long-running tasks must be &lt;strong&gt;broken into smaller executions or offloaded to a worker service&lt;/strong&gt;.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
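&lt;p&gt;One way to picture the state-transition guard that stops double pickup: a DynamoDB conditional update that only moves an action from &lt;strong&gt;PENDING&lt;/strong&gt; to &lt;strong&gt;IN_PROGRESS&lt;/strong&gt; if no other consumer has claimed it first. This is a sketch under assumptions; the table and attribute names are invented for illustration:&lt;/p&gt;

```typescript
// Builds the params for a DynamoDB UpdateItem call that "claims" an action.
// The ConditionExpression makes the write succeed for exactly one caller:
// if the stream consumer and the scheduler race, the loser gets a
// ConditionalCheckFailedException and simply skips the action.
const buildClaimParams = (tableName: string, id: string) => ({
  TableName: tableName,
  Key: { id },
  UpdateExpression: "SET #s = :inProgress",
  ConditionExpression: "#s = :pending",
  ExpressionAttributeNames: { "#s": "status" },
  ExpressionAttributeValues: {
    ":inProgress": "IN_PROGRESS",
    ":pending": "PENDING",
  },
});
```

&lt;p&gt;Because the condition is evaluated atomically on the table, this works even when both paths fire within milliseconds of each other.&lt;/p&gt;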

&lt;h3&gt;
  
  
  &lt;strong&gt;Future Improvements&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Webhooks for Real-Time Notifications&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implementing &lt;strong&gt;webhooks&lt;/strong&gt; would allow external services that schedule actions to receive &lt;strong&gt;real-time updates&lt;/strong&gt; when an action &lt;strong&gt;executes, fails, or retries&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;This reduces the need for polling and improves system responsiveness.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Handler for Missed Actions&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A dedicated handler to &lt;strong&gt;detect and process actions&lt;/strong&gt; that are still in a &lt;strong&gt;PENDING&lt;/strong&gt; state but have an &lt;strong&gt;execution time in the past&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;This ensures no scheduled action is permanently missed due to &lt;strong&gt;system failures, delays, or scaling issues&lt;/strong&gt;.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
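&lt;p&gt;The proposed missed-action handler could start from a check like this. The field names are assumptions for the sketch, not the system's actual schema:&lt;/p&gt;

```typescript
// Assumed shape of a stored action for this sketch.
interface PendingAction {
  id: string;
  status: string;
  executeAt: number; // scheduled execution time, epoch milliseconds
}

// An action is "missed" when it is still PENDING but its execution time
// has already passed. A periodic sweep would re-enqueue whatever this finds.
const findMissedActions = (actions: PendingAction[], now: number) =>
  actions.filter((a) => {
    if (a.status !== "PENDING") return false;
    return now > a.executeAt;
  });
```

&lt;p&gt;In practice this query would run against the database on a schedule (e.g. via EventBridge), with each hit pushed back onto the queue.&lt;/p&gt;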

&lt;p&gt;These improvements would &lt;strong&gt;enhance reliability, observability, and integration&lt;/strong&gt; with external systems, making the scheduling system even more robust.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Conclusion: Scheduling, Scaling, and Sanity&lt;/strong&gt; 🚀
&lt;/h4&gt;

&lt;p&gt;Building a &lt;strong&gt;reliable, scalable, and fault-tolerant&lt;/strong&gt; scheduled actions system isn’t just about setting up a cron job and hoping for the best—it’s about &lt;strong&gt;engineering resilience&lt;/strong&gt; into every step of the process. From &lt;strong&gt;scheduling and execution to retries and observability&lt;/strong&gt;, every component must work together to ensure that &lt;strong&gt;no action is lost, no notification is forgotten, and no subscription goes unmanaged.&lt;/strong&gt;  &lt;/p&gt;

&lt;p&gt;Through this journey, we've tackled:&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Dynamic scheduling&lt;/strong&gt; with &lt;strong&gt;DynamoDB Streams, EventBridge, and SQS&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Precise execution&lt;/strong&gt; using a mix of &lt;strong&gt;immediate and delayed actions&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Reliable processing&lt;/strong&gt; with &lt;strong&gt;deduplication, retries, and exponential backoff&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Scalability&lt;/strong&gt; by leveraging &lt;strong&gt;serverless architecture&lt;/strong&gt; to handle high loads.&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Observability&lt;/strong&gt; with APIs to &lt;strong&gt;monitor, retry, and delete scheduled actions&lt;/strong&gt;.  &lt;/p&gt;

&lt;p&gt;Of course, every system has its &lt;strong&gt;quirks and challenges&lt;/strong&gt;—race conditions, tool limitations, and unexpected failures—but with the right &lt;strong&gt;design patterns, defensive coding, and future improvements (like webhooks and missed action recovery)&lt;/strong&gt;, this system can evolve into an even more &lt;strong&gt;powerful, intelligent, and autonomous scheduler.&lt;/strong&gt;  &lt;/p&gt;

&lt;p&gt;At the end of the day, &lt;strong&gt;automation is about making life easier&lt;/strong&gt;—whether it’s managing user subscriptions, sending notifications, or processing time-sensitive transactions. And while &lt;strong&gt;computers never sleep, we certainly need to&lt;/strong&gt;, which is why designing a system that can handle its own problems &lt;strong&gt;before waking us up at 3 AM&lt;/strong&gt; is always worth the effort.  &lt;/p&gt;

&lt;p&gt;So here’s to building &lt;strong&gt;systems that work while we don’t!&lt;/strong&gt; 🎉&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Building a Robust and Cost effective Token Verification Service: A Guide to Secure API Integrations</title>
      <dc:creator>JOOJO DONTOH</dc:creator>
      <pubDate>Sun, 10 Nov 2024 11:56:26 +0000</pubDate>
      <link>https://dev.to/joojodontoh/building-a-robust-and-cost-effective-token-verification-service-a-guide-to-secure-api-integrations-35ka</link>
      <guid>https://dev.to/joojodontoh/building-a-robust-and-cost-effective-token-verification-service-a-guide-to-secure-api-integrations-35ka</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In today’s interconnected systems, secure communication between services is very important 🤔. APIs are increasingly used to bridge services across organizational boundaries, often involving sensitive data exchanges such as top secret pancake recipes😅. This has amplified the need for robust mechanisms to verify and authenticate requests between systems securely. Imagine a token verification service that not only signs requests but also automates key rotation, providing security and convenience for API integrations.&lt;/p&gt;

&lt;p&gt;In this article, we’ll walk through a solution architecture to build a scalable and secure token verification service using AWS tools like Lambda, Secrets Manager, and API Gateway 🎉. This service is designed to enable partners and internal teams to verify requests originating from your service while maintaining high availability and resilience. The solution balances functional requirements with performance and cost-effectiveness, making it an ideal addition to your organization’s security toolkit. Now let's look at the problem 😢&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;&lt;br&gt;
Imagine needing to authenticate and authorize requests from internal or external services, sign requests, or implement role-based access control. A straightforward solution could be to integrate symmetric verification endpoints into an existing service and expose them to other services that require these capabilities. &lt;br&gt;
This approach provides a quick and accessible way to add secure, verifiable authentication without extensive restructuring.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.giphy.com%2Fmedia%2F9104Ok6PDF4SCzNwvm%2Fgiphy.gif%3Fcid%3D790b76116w6xgek5169rsb8zl0wv3d893s0bg3cecmurmy9i%26ep%3Dv1_gifs_search%26rid%3Dgiphy.gif%26ct%3Dg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.giphy.com%2Fmedia%2F9104Ok6PDF4SCzNwvm%2Fgiphy.gif%3Fcid%3D790b76116w6xgek5169rsb8zl0wv3d893s0bg3cecmurmy9i%26ep%3Dv1_gifs_search%26rid%3Dgiphy.gif%26ct%3Dg" width="480" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But wait! While this quick fix might address the immediate problem, it may not scale well or handle similar challenges beyond the current scope. Here’s why: if the verification is symmetric, meaning the same key is used for both signing and verifying requests, it can create inefficiencies. Depending on the architecture of the host service, this could quickly turn into a bottleneck, especially under high load or across multiple services needing validation.&lt;br&gt;
So, what’s the alternative? Imagine solving authentication and authorization issues “across space and time.” 🧐&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/v1.Y2lkPTc5MGI3NjExNWY2OXNqbjZxODV6d2s3ZWRjdGFsd3lweG5sdHF2aW94aGljNXg2dyZlcD12MV9naWZzX3NlYXJjaCZjdD1n/wpvrvjjDf7uRoM61w5/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/v1.Y2lkPTc5MGI3NjExNWY2OXNqbjZxODV6d2s3ZWRjdGFsd3lweG5sdHF2aW94aGljNXg2dyZlcD12MV9naWZzX3NlYXJjaCZjdD1n/wpvrvjjDf7uRoM61w5/giphy.gif" width="500" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Solving problems "across space" means designing a solution that addresses the current issue not just for a single team or service, but for any team or service that might face the same challenge. Meanwhile, solving "across time" involves creating a solution that can address similar problems that may have existed in the past or could arise in the future—a versatile solution that can be applied in various scenarios or extended to tackle new issues.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So, how do we devise a solution that achieves this?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.giphy.com%2Fmedia%2FSKGo6OYe24EBG%2Fgiphy.gif%3Fcid%3D790b7611py0tcnknm0u2myrmffdt0yqhnql8trqgqgd3gpuk%26ep%3Dv1_gifs_search%26rid%3Dgiphy.gif%26ct%3Dg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.giphy.com%2Fmedia%2FSKGo6OYe24EBG%2Fgiphy.gif%3Fcid%3D790b7611py0tcnknm0u2myrmffdt0yqhnql8trqgqgd3gpuk%26ep%3Dv1_gifs_search%26rid%3Dgiphy.gif%26ct%3Dg" width="400" height="287"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;We’ll create a service with the following capabilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exposing public keys for token verification&lt;/li&gt;
&lt;li&gt;Signing payloads with private keys on a secure, protected endpoint&lt;/li&gt;
&lt;li&gt;Automating key rotation to avoid manual intervention&lt;/li&gt;
&lt;li&gt;Ensuring high availability, security, and cost-effectiveness&lt;/li&gt;
&lt;/ul&gt;
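&lt;p&gt;To make the asymmetric idea concrete, here is a minimal sketch using Node's built-in &lt;code&gt;crypto&lt;/code&gt; module: the private key signs, and anyone holding the public key can verify. In the real service the private key would live in Secrets Manager rather than in process memory; the key generation here stands in for that:&lt;/p&gt;

```typescript
import { generateKeyPairSync, createSign, createVerify } from "node:crypto";

// Stand-in for keys fetched from Secrets Manager / the public endpoint.
const { publicKey, privateKey } = generateKeyPairSync("rsa", { modulusLength: 2048 });

// Signing happens on the protected endpoint, with the private key.
const signPayload = (payload: string): string => {
  const signer = createSign("RSA-SHA256");
  signer.update(payload);
  return signer.sign(privateKey, "base64");
};

// Verification needs only the public key, so any partner can do it
// without ever touching the signing secret.
const verifyPayload = (payload: string, signature: string): boolean => {
  const verifier = createVerify("RSA-SHA256");
  verifier.update(payload);
  return verifier.verify(publicKey, signature, "base64");
};
```

&lt;p&gt;This is the property that removes the bottleneck of the symmetric approach: verification scales out to every consumer, while signing stays locked down in one place.&lt;/p&gt;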




&lt;p&gt;&lt;strong&gt;Solution Overview&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Our proposed solution leverages AWS’s serverless offerings—such as Secrets Manager, Lambda, and API Gateway—to handle token signing, verification, and key rotation efficiently, ensuring security, scalability, and maintainability. Here’s a breakdown of the tools and technologies we’ll use:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fde7wlsl5ofhk863522vm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fde7wlsl5ofhk863522vm.png" alt="Tools" width="800" height="172"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;API Gateway&lt;/strong&gt;: Manages HTTP requests, allowing secure access to endpoints.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Lambda&lt;/strong&gt;: Facilitates secure, on-demand execution for signing, public key retrieval, and key rotation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS S3&lt;/strong&gt;: Stores public keys for easy access through the public endpoint.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Secrets Manager&lt;/strong&gt;: Manages private keys securely and supports automated rotation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Serverless Framework&lt;/strong&gt;: Simplifies deployment and infrastructure management.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This combination allows us to build a service that is secure, highly available, and optimized for fast public key retrieval, making it both lightweight and cost-effective.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.giphy.com%2Fmedia%2FlOfOrezUUVNAhtGveA%2Fgiphy.gif%3Fcid%3D790b7611d4m4vwsj5uwocfih6shkcprz19fmyw0e2o2rfpm2%26ep%3Dv1_gifs_search%26rid%3Dgiphy.gif%26ct%3Dg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.giphy.com%2Fmedia%2FlOfOrezUUVNAhtGveA%2Fgiphy.gif%3Fcid%3D790b7611d4m4vwsj5uwocfih6shkcprz19fmyw0e2o2rfpm2%26ep%3Dv1_gifs_search%26rid%3Dgiphy.gif%26ct%3Dg" width="480" height="270"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Let's scope out the requirements&lt;/p&gt;

&lt;h3&gt;
  
  
  Functional Requirements
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Public Key Retrieval&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A public endpoint will serve public keys for verifying tokens issued by the service, allowing clients to authenticate tokens securely.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Private Key Management&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The service will securely store private keys in AWS Secrets Manager, accessible either through a protected endpoint or directly via authorized IAM roles.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Authorization Control for Signing&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only authorized clients will access the signing endpoint, and each client must have explicit permissions.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Automated Key Rotation&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keys will be rotated automatically to ensure that they remain secure, with an immediate rotation option for emergency cases.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Non-Functional Requirements
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;High Availability&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The service should be designed with minimal downtime to guarantee constant availability.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Robustness and Maintainability&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easy-to-maintain code that other teams can access, use, and adapt as needed.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Security and Performance&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strong security practices, including efficient public key retrieval and endpoint protection, without impacting speed or availability.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/v1.Y2lkPTc5MGI3NjExMTd5dXJodDQ0M3c0dmt5OGV3NTYwbTcxaG42cGFqZ2lyN2loZXJhNCZlcD12MV9naWZzX3NlYXJjaCZjdD1n/usz0fqhUiVxSs6IUKB/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/v1.Y2lkPTc5MGI3NjExMTd5dXJodDQ0M3c0dmt5OGV3NTYwbTcxaG42cGFqZ2lyN2loZXJhNCZlcD12MV9naWZzX3NlYXJjaCZjdD1n/usz0fqhUiVxSs6IUKB/giphy.gif" width="480" height="270"&gt;&lt;/a&gt; &lt;/p&gt;




&lt;h2&gt;
  
  
  Service Architecture and Flow
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Key Components
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;🔐 Token Signing&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Clients authorized for signing can request token generation by calling a protected API endpoint. The process works as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Permissions Check:&lt;/strong&gt; Only clients with explicit permissions can access this endpoint.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Private Key Retrieval:&lt;/strong&gt; The service retrieves the private key securely from AWS Secrets Manager.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token Generation:&lt;/strong&gt; The payload is signed using the private key, and the resulting token is returned to the client.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;🔑 Token Verification&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Token verification is facilitated by a &lt;strong&gt;JWKS (JSON Web Key Set) endpoint&lt;/strong&gt;, allowing clients to retrieve public keys for verifying tokens. The process is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Public Key Access:&lt;/strong&gt; The JWKS endpoint exposes public keys that match the key ID in signed tokens.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verification:&lt;/strong&gt; Clients use the retrieved public keys to validate the authenticity and integrity of the token.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;🔄 Key Rotation&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Key rotation is handled automatically by AWS Secrets Manager:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Periodic Rotation:&lt;/strong&gt; Secrets Manager rotates keys at set intervals.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lambda Handling:&lt;/strong&gt; An AWS Lambda function manages the updates to public and private keys in Secrets Manager, ensuring smooth transitions.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;☁️ Serverless Framework and AWS Integration&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Resource Provisioning&lt;/strong&gt;: Serverless automates AWS Lambda and API Gateway setups, delivering responsive, load-adaptive infrastructure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Environment Configuration&lt;/strong&gt;: Enables secure, isolated environment configurations (e.g., dev, staging, production) with controlled access to AWS Secrets Manager and S3.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD Pipeline&lt;/strong&gt;: Integrates with deployment pipelines for automated, reliable releases without manual intervention.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Efficiency and Scalability&lt;/strong&gt;: Combines AWS Lambda’s pay-as-you-go model with Serverless’s efficient setup, reducing idle costs and scaling resources on demand.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  High-Level Flow
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fir6rh87km4nzeyjc30nn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fir6rh87km4nzeyjc30nn.png" alt="High-Level Flow" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Token Signing Process
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Step 1:&lt;/strong&gt; A client with the necessary authorization initiates a request to sign a payload.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 2:&lt;/strong&gt; The signing endpoint verifies the client’s permissions, retrieves the private key from Secrets Manager, signs the payload, and returns a token with a unique key ID (&lt;code&gt;kid&lt;/code&gt;) in the header for easy verification. If the private key is missing, the service automatically rotates keys to ensure availability before signing the payload.&lt;/li&gt;
&lt;/ul&gt;
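
&lt;p&gt;As a rough sketch of the two steps above: the real service signs with RSA private keys pulled from Secrets Manager, but the shape of the token (base64url header carrying a &lt;code&gt;kid&lt;/code&gt;, then payload, then signature) can be shown self-contained using HMAC instead. All names here are illustrative, not the service's actual API.&lt;/p&gt;

```python
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> str:
    # JWTs use unpadded base64url encoding for each segment.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_token(payload: dict, key: bytes, kid: str) -> str:
    # The kid in the header is what lets verifiers pick the matching
    # public key from the JWKS endpoint later.
    header = {"alg": "HS256", "typ": "JWT", "kid": kid}
    signing_input = b64url(json.dumps(header).encode()) + "." + b64url(json.dumps(payload).encode())
    sig = hmac.new(key, signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + b64url(sig)

token = sign_token({"sub": "client-42"}, b"demo-secret", kid="key-2024-01")
```

&lt;p&gt;The production flow swaps the HMAC call for an RSA signature (e.g. RS256) with the private key fetched from Secrets Manager at sign time.&lt;/p&gt;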

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foznsfon9z9swz6pd71oc.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foznsfon9z9swz6pd71oc.jpeg" alt="Token Signing Process" width="800" height="242"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Token Verification Process
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Step 1:&lt;/strong&gt; A client receives a signed token, which includes a key ID in its header.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 2:&lt;/strong&gt; The client fetches the matching public key from the JWKS endpoint and verifies the token.&lt;/li&gt;
&lt;/ul&gt;
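
&lt;p&gt;Verification hinges on matching the token's &lt;code&gt;kid&lt;/code&gt; against the published key set. A minimal sketch of that lookup, using a hypothetical JWKS document (key IDs and fields invented for illustration):&lt;/p&gt;

```python
# Hypothetical JWKS document, as the public endpoint might serve it.
jwks = {
    "keys": [
        {"kid": "key-2024-01", "kty": "RSA", "alg": "RS256", "e": "AQAB"},
        {"kid": "key-2024-02", "kty": "RSA", "alg": "RS512", "e": "AQAB"},
    ]
}

def find_key(jwks: dict, kid: str) -> dict:
    # Match the kid from the token header against the published set;
    # a miss usually means the verifier's cached JWKS is stale.
    for key in jwks["keys"]:
        if key["kid"] == kid:
            return key
    raise KeyError("no public key with kid " + kid)
```

&lt;p&gt;A real verifier would then build an RSA public key from the JWK fields and check the token signature with a JWT library.&lt;/p&gt;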

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvznuhkis3sz2f5biiq7w.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvznuhkis3sz2f5biiq7w.jpeg" alt="Token Verification Process" width="800" height="217"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Key Rotation Process
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Generates two RSA key pairs (2048 and 4096 bits) with unique key IDs.&lt;/li&gt;
&lt;li&gt;Retrieves existing public keys, appends the new public keys, and saves the updated set of keys.&lt;/li&gt;
&lt;li&gt;Updates or ensures the existence of a private key in AWS Secrets Manager.&lt;/li&gt;
&lt;li&gt;Returns metadata upon successful rotation, with error handling for issues during the key generation or save process.&lt;/li&gt;
&lt;/ul&gt;
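
&lt;p&gt;The append-rather-than-replace step above is what keeps verification working across a rotation: tokens signed with the previous private key stay verifiable as long as the old public key remains in the set. A sketch of that step, with the actual RSA generation and Secrets Manager calls elided:&lt;/p&gt;

```python
import uuid

def rotate(published_keys: list) -> list:
    # Two fresh pairs (2048 and 4096 bits) get unique kids; the real key
    # material would come from an RSA keygen, not shown here.
    new_keys = [{"kid": str(uuid.uuid4()), "bits": bits} for bits in (2048, 4096)]
    # Append rather than replace, so tokens signed before the rotation
    # can still be verified during the transition window.
    return published_keys + new_keys

keys = rotate([{"kid": "key-2024-01", "bits": 2048}])
```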

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4tmt9q18d4ump63kqjkp.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4tmt9q18d4ump63kqjkp.jpg" alt="Key Rotation Process" width="800" height="261"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  🎉 Advantages of the Solution
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;🔒 Security&lt;/strong&gt;: Private keys are securely managed in Secrets Manager, and only authorized clients can access sensitive endpoints.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;🌐 Availability&lt;/strong&gt;: AWS’s serverless infrastructure ensures high availability, even under heavy traffic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;💰 Cost-Effectiveness&lt;/strong&gt;: Serverless architecture allows on-demand scaling, making it a highly cost-effective solution.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;🔄 Resilience&lt;/strong&gt;: Automated key rotation, with the option for immediate rotation, boosts both security and resilience.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;⚙️ Rotation Atomicity&lt;/strong&gt;: Key rotation is atomic, ensuring public keys update first. If public key storage fails, the process stops, keeping the current key pair intact. If private key storage fails, the failure isn't harmful because the current key pair is intact; however, a configured alert will call for manual intervention.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;🔧 Key Signing Dynamism&lt;/strong&gt;: The sign endpoint supports flexible payload signing, allowing dynamic expiry, key length, and algorithm selection.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;📜 Key Signing Availability&lt;/strong&gt;: If the private key is missing, the sign endpoint triggers a key rotation, ensuring keys are always available for signing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;📉 Pay-as-You-Go Model&lt;/strong&gt;: AWS Lambda charges only for compute time, making this solution ideal for short-lived signing and verification tasks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;🚫 No Dedicated Infrastructure&lt;/strong&gt;: Serverless architecture eliminates the need for infrastructure management, as AWS handles scaling and security patches.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;🔐 Efficient Secret Management&lt;/strong&gt;: AWS Secrets Manager securely stores and rotates private keys, reducing the risks and costs of manual management.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;📦 Scalable Storage with Amazon S3&lt;/strong&gt;: Public keys stored on Amazon S3 enjoy durable, scalable storage. Paired with CloudFront, this minimizes data transfer costs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;⚖️ Dynamic Scaling &amp;amp; No Downtime Costs&lt;/strong&gt;: AWS’s serverless infrastructure scales automatically, eliminating downtime costs and avoiding idle capacity.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
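
&lt;p&gt;The rotation-atomicity ordering described in point 5 can be sketched with two stand-in stores; the store class and return strings are illustrative only, not the service's real interfaces:&lt;/p&gt;

```python
class Store:
    """Stand-in for S3 (public keys) or Secrets Manager (private key)."""
    def __init__(self, fail=False):
        self.fail = fail
        self.items = []

    def put(self, item):
        if self.fail:
            raise IOError("store unavailable")
        self.items.append(item)

def rotate(public_store, private_store):
    # Publish the public key first: if this fails, abort and leave the
    # current key pair fully intact.
    try:
        public_store.put("new-public-key")
    except IOError:
        return "aborted-old-pair-intact"
    # A private-key failure afterwards is non-fatal (the old pair still
    # signs), but should raise an alert for manual intervention.
    try:
        private_store.put("new-private-key")
    except IOError:
        return "rotated-public-only-alert-raised"
    return "rotated"
```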

&lt;p&gt;&lt;a href="https://i.giphy.com/media/v1.Y2lkPTc5MGI3NjExODFsdDRjNnUyZDBoNXA5azNuZDAweXF5MmNlamprYzhxczVxajg3ZiZlcD12MV9naWZzX3NlYXJjaCZjdD1n/cdXpgeB32BekIGzBNh/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/v1.Y2lkPTc5MGI3NjExODFsdDRjNnUyZDBoNXA5azNuZDAweXF5MmNlamprYzhxczVxajg3ZiZlcD12MV9naWZzX3NlYXJjaCZjdD1n/cdXpgeB32BekIGzBNh/giphy.gif" width="500" height="281"&gt;&lt;/a&gt; &lt;/p&gt;




&lt;h3&gt;
  
  
  🛠️ Use Cases
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;🔑 Access and Refresh Token Management&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Securely issues and manages access and refresh tokens, allowing clients to verify token validity with ease.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;📜 API Request Signing&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enables clients to securely sign and validate API requests, ensuring the integrity and authenticity of requests.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;👥 Role-Based Authentication&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Supports role-specific authentication for actions, restricting access to authorized roles only.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;🚨 Emergency Key Rotation and Compliance&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Allows for immediate key rotation in case of a security incident, ensuring quick compliance with security best practices.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.giphy.com%2Fmedia%2Fa5viI92PAF89q%2Fgiphy.gif%3Fcid%3D790b7611qhw7luy16p7g19ktfiwfmmglbh2kpt931jcbeexn%26ep%3Dv1_gifs_search%26rid%3Dgiphy.gif%26ct%3Dg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia.giphy.com%2Fmedia%2Fa5viI92PAF89q%2Fgiphy.gif%3Fcid%3D790b7611qhw7luy16p7g19ktfiwfmmglbh2kpt931jcbeexn%26ep%3Dv1_gifs_search%26rid%3Dgiphy.gif%26ct%3Dg" width="500" height="345"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Creating a token verification service with automated key rotation and secure signing endpoints is essential in modern API ecosystems. Our solution not only provides a robust token verification framework but also ensures security, availability, and cost-effectiveness. Leveraging AWS’s serverless and managed services makes it easy to scale, maintain, and secure the service, giving partners and teams the confidence to rely on the service for request verification.&lt;/p&gt;

&lt;p&gt;This service blueprint provides a powerful framework for implementing secure API integrations. By focusing on both functional and non-functional requirements, you can ensure that your token verification process is resilient, fast, and secure, setting the stage for a more connected and protected API ecosystem.&lt;/p&gt;

&lt;p&gt;PS: Though the idea, solution, and implementation are mine, this article was reviewed and enhanced by my Uncle: Mr Chat GPT.😁&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/v1.Y2lkPTc5MGI3NjExaXlkODY2djlwazFkdm9ydG45MHhwMXRmY2podzFzaDU4ajlta2ljNiZlcD12MV9naWZzX3NlYXJjaCZjdD1n/12noFudALzfIynHuUp/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/v1.Y2lkPTc5MGI3NjExaXlkODY2djlwazFkdm9ydG45MHhwMXRmY2podzFzaDU4ajlta2ljNiZlcD12MV9naWZzX3NlYXJjaCZjdD1n/12noFudALzfIynHuUp/giphy.gif" width="480" height="270"&gt;&lt;/a&gt; &lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
