<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Karel Vanden Bussche</title>
    <description>The latest articles on DEV Community by Karel Vanden Bussche (@karelvandenbussche).</description>
    <link>https://dev.to/karelvandenbussche</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F841884%2F011c5edc-2de3-4d83-ae8a-cdc1dbb15739.jpeg</url>
      <title>DEV Community: Karel Vanden Bussche</title>
      <link>https://dev.to/karelvandenbussche</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/karelvandenbussche"/>
    <language>en</language>
    <item>
      <title>Agent Memory Is Just a Database You Forgot to Index</title>
      <dc:creator>Karel Vanden Bussche</dc:creator>
      <pubDate>Tue, 24 Mar 2026 07:47:45 +0000</pubDate>
      <link>https://dev.to/karelvandenbussche/agent-memory-is-just-a-database-you-forgot-to-index-b5d</link>
      <guid>https://dev.to/karelvandenbussche/agent-memory-is-just-a-database-you-forgot-to-index-b5d</guid>
      <description>&lt;p&gt;In my previous article, I explored why personas still matter when working with AI agents. Distinct perspectives shape output in ways that raw context alone cannot replicate. But I also raised a limitation that I want to address head-on: every fresh context window starts from zero. The persona needs to rebuild its understanding of your codebase from scratch, every single time.&lt;/p&gt;

&lt;p&gt;This is not just an inconvenience. It is a structural problem. If every session begins with a blank slate, your agents spend tokens re-discovering things they already knew. They scan files they have already read. They infer conventions they have already been told about. And the more complex your codebase becomes, the worse this gets. It does not scale.&lt;/p&gt;

&lt;p&gt;So how do you give an agent persistent understanding without drowning it in context?&lt;/p&gt;

&lt;p&gt;Let me start with an analogy. Think about how you would use an encyclopedia. You would not read it cover to cover to find a single fact. You would go to the index, look up your term, and jump to the right page. If the exact term is not listed, you scan nearby entries for related concepts and try again. The index does not contain the knowledge itself. It tells you where to find it. And that distinction, between holding all the information and knowing where to look, is exactly what most people get wrong when setting up memory for AI agents.&lt;/p&gt;

&lt;p&gt;This article is about treating agent memory the way you would treat a database. Not as a dumping ground for everything the agent might need, but as a structured, queryable system where the right information is retrievable at the right time. Because the difference between an agent that spins its wheels and one that moves with precision is rarely intelligence. It is navigation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Context Pollution
&lt;/h2&gt;

&lt;p&gt;The instinct most people have when working with AI agents is to give them more. More documentation. More files. More history. The reasoning is straightforward: if the model has access to everything, it should be able to figure out what is relevant. And on the surface, this makes sense. More data, better decisions.&lt;/p&gt;

&lt;p&gt;But this is where the database analogy already starts to apply. Imagine a database with no indexes, no query optimisation, and no schema. You dump every record into a single table and hope that the right rows surface when you need them. Technically, all the information is there. Practically, your queries are slow, your results are noisy, and the system wastes most of its resources sifting through data that is not relevant to the question at hand.&lt;/p&gt;

&lt;p&gt;This is exactly what happens when you overload an agent's context. The model does not magically zero in on the three files that matter. It processes everything you gave it, weighing each piece of information against the task. The more noise you introduce, the harder it becomes for the model to identify the signal. I already hear you thinking: "But the model is smart enough to filter." It is, to a degree. But filtering has a cost. Every irrelevant paragraph the model needs to evaluate is a paragraph that competes for attention with the relevant ones.&lt;/p&gt;

&lt;p&gt;In machine learning terms, you are overfitting the agent on context. You give it so many data points that it starts treating noise as signal. The result is not a smarter agent. It is an agent that hedges, that qualifies every statement, that tries to account for information it should never have been given in the first place. The output becomes generic precisely because the input was too broad.&lt;/p&gt;

&lt;p&gt;The opposite extreme is equally problematic. Give an agent no persistent context at all, and it starts every session as a stranger to your codebase. It asks questions you have already answered. It makes assumptions that contradict decisions you made weeks ago. It produces code that is technically correct but stylistically inconsistent with everything around it. You end up spending more time correcting the agent than you would have spent writing the code yourself.&lt;/p&gt;

&lt;p&gt;The sweet spot is somewhere in the middle. Not everything, not nothing. Structured, relevant, navigable context that gives the agent exactly what it needs to orient itself without burying it in information it does not need. The question is how you build that: something like a codebase Google for your AI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Memory as a Loosely Coupled Database
&lt;/h2&gt;

&lt;p&gt;If you take a step back and look at what agent memory actually does, it maps surprisingly well to database concepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Storage&lt;/strong&gt; — where the information lives&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval&lt;/strong&gt; — how the agent finds what it needs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structure&lt;/strong&gt; — how the information is organised so that retrieval is efficient&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scope&lt;/strong&gt; — what gets stored in the first place and what gets left out&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The reason we need this in the first place is practical: you cannot yet fit an entire codebase into the context window of an LLM. But even if you could, it would not solve the problem. As we covered in the previous section, more context does not mean better context. An agent with your entire codebase in its window would still suffer from context pollution in any non-trivial project. The signal-to-noise ratio would make it harder, not easier, to reason about a specific task.&lt;/p&gt;

&lt;p&gt;In a traditional database, you would not store every possible piece of data in a single unstructured blob. You would design a schema. You would define tables, relationships, and indexes. You would think about what queries you need to support and optimise for those access patterns. The schema is not overhead. It is what makes the database useful.&lt;/p&gt;

&lt;p&gt;Agent memory works the same way. The "database" is the collection of files your agent can access. The "schema" is how those files are structured and named. The "indexes" are the reference documents that tell the agent where to look. And the "queries" are the prompts and instructions that trigger the agent to retrieve specific information.&lt;/p&gt;

&lt;p&gt;The difference is that agent memory does not need the rigidity of a relational database. You are not writing SQL. You do not need foreign keys or normalisation. What you need is semi-structured data. Enough structure to be navigable, enough flexibility to evolve as your project changes.&lt;/p&gt;

&lt;p&gt;This is where Markdown files become the natural fit. They are human-readable, which means you can maintain them without tooling. They are parseable, which means agents can extract information from them reliably. And because they are written in natural language, they play directly into how large language models work. At their core, these models are word predictors. Semi-structured text in a natural language format is exactly the kind of input they are optimised to reason over. A JSON schema might be more precise, but a well-written Markdown file hits the sweet spot between machine-parseable and model-friendly. On top of that, they support just enough structure through headings, lists, and front matter to organise information without imposing a rigid schema.&lt;/p&gt;

&lt;p&gt;Here is a non-exhaustive list of what this looks like in practice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Project context files&lt;/strong&gt; — what the project is, what the key conventions are, what decisions have been made. This is the equivalent of your database's master reference table.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Index files&lt;/strong&gt; — pointers to where specific information lives. Not the information itself, just the location. Like a database index that maps a key to a row.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Glossary files&lt;/strong&gt; — domain-specific terms and their definitions. This eliminates ambiguity that would otherwise cost the agent tokens to infer or guess.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decision logs&lt;/strong&gt; — why certain architectural or design choices were made. This prevents the agent from re-litigating decisions that have already been settled.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these files need to be large. A good index file might be 30 lines. A glossary might be 20 entries. The value is not in the volume of what you store. It is in the precision of what you make retrievable.&lt;/p&gt;

&lt;p&gt;As such, agent memory is not about giving the model a photographic memory of your entire codebase. It is about building a lightweight, queryable layer that lets the agent navigate with intent rather than brute force.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Abstraction Layer
&lt;/h2&gt;

&lt;p&gt;If you have worked with any modern web framework, you have probably used an ORM. An Object-Relational Mapper sits between your application code and the database. You do not write raw SQL for every query. Instead, you interact with objects and methods that translate your intent into database operations. The ORM abstracts away the complexity of the underlying system so you can focus on what you are trying to do, not how the storage engine works.&lt;/p&gt;

&lt;p&gt;Agent memory follows the same pattern. The "database" is your codebase, your file system, your project history. The raw data is all there. But an agent interacting with that raw data directly is like writing raw SQL for every operation. It works, but it is slow, error-prone, and forces the agent to deal with complexity that has nothing to do with the task at hand.&lt;/p&gt;

&lt;p&gt;The memory layer, your Markdown files, your indexes, your glossaries, acts as the ORM. It provides a clean interface between the agent and the underlying complexity. Instead of scanning every file in &lt;code&gt;src/&lt;/code&gt; to understand what the project does, the agent reads a reference document that tells it. Instead of inferring what "DRS" means from context clues scattered across twelve files, it checks the glossary. Instead of guessing which module handles authentication, it looks it up in the index.&lt;/p&gt;

&lt;p&gt;This is not just a convenience. It is a fundamental shift in how the agent spends its tokens. Without the abstraction layer, the agent's cognitive budget goes toward navigation: finding files, inferring structure, building a mental model from scratch. With the abstraction layer, that budget goes toward the actual task: writing code, reviewing architecture, spotting bugs.&lt;/p&gt;

&lt;p&gt;Put it this way: every token your agent spends figuring out where things are is a token it is not spending on figuring out what to do with them. The abstraction layer is what shifts the balance from navigation to execution.&lt;/p&gt;

&lt;p&gt;It might sound like an odd place to apply the ORM analogy, but it holds up. An ORM does not make the database simpler. The database is still there, with all its complexity. The ORM just means your application code does not need to care about that complexity. Agent memory does the same thing. Your codebase is still complex. The memory layer just means your agent does not need to rediscover that complexity every time it starts a new session.&lt;/p&gt;

&lt;h2&gt;
  
  
  Index Files: The Encyclopedia Pattern
&lt;/h2&gt;

&lt;p&gt;Let's take a deeper dive into what these reference files actually look like, because the concept is only useful if you can implement it in five minutes and start seeing results immediately.&lt;/p&gt;

&lt;p&gt;The simplest and most effective pattern is the index file. Think of it as the table of contents for your codebase. It does not explain how things work. It tells the agent where things are and what they do in one line. The agent reads this file first, orients itself, and then navigates directly to the relevant source files.&lt;/p&gt;

&lt;p&gt;Here is what a realistic reference file looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# REFERENCE.md — Project Index&lt;/span&gt;

&lt;span class="gu"&gt;## Core Scripts&lt;/span&gt;

| Script | Purpose |
|--------|---------|
| &lt;span class="sb"&gt;`src/api/server.py`&lt;/span&gt; | API entry point, initialises routes and middleware |
| &lt;span class="sb"&gt;`src/workers/processor.py`&lt;/span&gt; | Background job processor, handles async task queue |
| &lt;span class="sb"&gt;`src/services/billing.py`&lt;/span&gt; | Stripe integration, subscription lifecycle management |
| &lt;span class="sb"&gt;`src/services/notifications.py`&lt;/span&gt; | Multi-channel notification dispatch (email, Slack, webhook) |
| &lt;span class="sb"&gt;`src/auth/middleware.py`&lt;/span&gt; | JWT validation, role-based access control |
| &lt;span class="sb"&gt;`scripts/migrate.sh`&lt;/span&gt; | Database migration runner with rollback support |

&lt;span class="gu"&gt;## Configuration&lt;/span&gt;

| File | Purpose |
|------|---------|
| &lt;span class="sb"&gt;`config/settings.yaml`&lt;/span&gt; | Environment-specific app configuration |
| &lt;span class="sb"&gt;`config/permissions.yaml`&lt;/span&gt; | Role definitions and resource-level access rules |
| &lt;span class="sb"&gt;`.env`&lt;/span&gt; | API keys and secrets (not committed) |

&lt;span class="gu"&gt;## Key Directories&lt;/span&gt;

| Directory | What lives here |
|-----------|----------------|
| &lt;span class="sb"&gt;`src/services/`&lt;/span&gt; | Business logic, one module per domain |
| &lt;span class="sb"&gt;`src/api/routes/`&lt;/span&gt; | Route handlers grouped by resource |
| &lt;span class="sb"&gt;`src/workers/`&lt;/span&gt; | Background jobs and scheduled tasks |
| &lt;span class="sb"&gt;`migrations/`&lt;/span&gt; | Database migration files, ordered by timestamp |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent does not need to traverse &lt;code&gt;src/&lt;/code&gt; recursively to understand the project. It reads this file, finds the relevant entry, and jumps straight to the source. If the task is about billing, it goes to &lt;code&gt;src/services/billing.py&lt;/code&gt; and &lt;code&gt;config/settings.yaml&lt;/code&gt;. If the task is about access control, it starts in &lt;code&gt;src/auth/middleware.py&lt;/code&gt; and &lt;code&gt;config/permissions.yaml&lt;/code&gt;. No guessing. No scanning.&lt;/p&gt;
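&lt;p&gt;To make the mechanics concrete, here is a minimal sketch of how an index like the one above could be queried programmatically. The &lt;code&gt;parse_index&lt;/code&gt; and &lt;code&gt;lookup&lt;/code&gt; helpers are illustrative assumptions, not part of any real agent framework; the point is only that a flat Markdown table is trivially machine-queryable:&lt;/p&gt;

```python
import re

def parse_index(markdown: str) -> dict[str, str]:
    """Extract path -> purpose pairs from pipe-delimited table rows."""
    entries = {}
    for line in markdown.splitlines():
        # Match rows like: | `src/services/billing.py` | Stripe integration... |
        match = re.match(r"^\|\s*`([^`]+)`\s*\|\s*(.+?)\s*\|$", line)
        if match:
            entries[match.group(1)] = match.group(2)
    return entries

def lookup(index: dict[str, str], keyword: str) -> list[str]:
    """Return every indexed path whose name or purpose mentions the keyword."""
    keyword = keyword.lower()
    return [path for path, purpose in index.items()
            if keyword in path.lower() or keyword in purpose.lower()]

# Two rows from the REFERENCE.md example above.
reference = """
| `src/services/billing.py` | Stripe integration, subscription lifecycle management |
| `src/auth/middleware.py` | JWT validation, role-based access control |
"""

index = parse_index(reference)
print(lookup(index, "billing"))  # -> ['src/services/billing.py']
```

A keyword match over thirty lines is a far cheaper operation than a recursive scan of an entire source tree, which is the whole trade the index makes.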

&lt;p&gt;Now consider what happens when your project uses domain-specific terminology. Every codebase has its own vocabulary. Abbreviations, internal names, concepts that mean something very specific in your context but something different in general usage. An agent encountering these for the first time has two options: infer the meaning from surrounding code, which costs tokens and risks getting it wrong, or check a glossary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# GLOSSARY.md — Domain Terms&lt;/span&gt;

| Term | Definition |
|------|-----------|
| &lt;span class="gs"&gt;**Tenant**&lt;/span&gt; | An isolated customer workspace. All data is scoped to a tenant. Multi-tenancy is enforced at the database and API layer. |
| &lt;span class="gs"&gt;**Seat**&lt;/span&gt; | A billable user within a tenant. Billing is per-seat, tiered by plan. |
| &lt;span class="gs"&gt;**Webhook relay**&lt;/span&gt; | The internal system that fans out events to customer-registered webhook URLs. Retries with exponential backoff. |
| &lt;span class="gs"&gt;**Migration**&lt;/span&gt; | A versioned, ordered database schema change. Migrations are forward-only in production; rollbacks exist for staging only. |
| &lt;span class="gs"&gt;**Job**&lt;/span&gt; | A unit of async work processed by the worker pool. Jobs are enqueued via Redis and processed FIFO per priority tier. |
| &lt;span class="gs"&gt;**Feature flag**&lt;/span&gt; | A runtime toggle for enabling or disabling functionality per tenant. Managed via the admin API, cached locally. |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is six entries. It takes an agent a fraction of a second to read. But the impact is significant. Without this file, the agent encounters "relay" in the code and has to decide: does this mean a network relay, an event relay, or something else entirely? With the glossary, there is no ambiguity. The term is defined once, in one place, and the agent can move on.&lt;/p&gt;
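&lt;p&gt;The same applies to the glossary. As a sketch (the parser below is a hypothetical helper, assuming the two-column table layout shown above), the whole file collapses into a dictionary that can be injected into a prompt or queried per term:&lt;/p&gt;

```python
import re

def parse_glossary(markdown: str) -> dict[str, str]:
    """Extract **Term** -> definition pairs from pipe-delimited table rows."""
    terms = {}
    for line in markdown.splitlines():
        # Match rows like: | **Tenant** | An isolated customer workspace. |
        match = re.match(r"^\|\s*\*\*(.+?)\*\*\s*\|\s*(.+?)\s*\|$", line)
        if match:
            terms[match.group(1)] = match.group(2)
    return terms

# Two rows from the GLOSSARY.md example above.
glossary = """
| **Tenant** | An isolated customer workspace. All data is scoped to a tenant. |
| **Seat** | A billable user within a tenant. Billing is per-seat, tiered by plan. |
"""

terms = parse_glossary(glossary)
print(terms["Tenant"])  # -> An isolated customer workspace. All data is scoped to a tenant.
```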

&lt;p&gt;The beauty of this pattern is that even when the index does not contain exactly what the agent is looking for, it still provides value. Scanning the reference file exposes related terms, nearby modules, and structural patterns that narrow the search. It is the same experience as flipping through an encyclopedia index. You might not find "event dispatch" listed, but you see "webhook relay" and "notifications" and you are already closer to what you need. The index turns a blind search into a guided one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Shared Memory, Individual Perspective
&lt;/h2&gt;

&lt;p&gt;In my previous article, I described how personas activate different frames of reference within the same model. A security engineer and a backend developer look at the same endpoint but see different things. That mechanism works well for a single task. But what happens when multiple personas need to collaborate on a larger piece of work?&lt;/p&gt;

&lt;p&gt;In a real team, this is solved by shared context. Everyone has access to the same codebase, the same documentation, the same project board. That shared layer is what keeps people aligned. But each team member still brings their own expertise, their own priorities, their own mental model. The shared context does not make everyone think the same way. It gives them a common starting point from which to diverge productively.&lt;/p&gt;

&lt;p&gt;Agent personas work the same way. The reference files, the glossary, the decision log, these form the shared memory layer. Every persona reads from the same source of truth. The security engineer knows what "tenant" means because it is in the glossary. The architect knows where the billing logic lives because it is in the index. The integration engineer knows which API versioning strategy was chosen because it is in the decision log. None of them need to rediscover this information. It is already there, structured and accessible.&lt;/p&gt;

&lt;p&gt;But here is where it gets interesting. Even though all personas share the same memory layer, they do not use it the same way. The security engineer reads &lt;code&gt;config/permissions.yaml&lt;/code&gt; and thinks about privilege escalation. The backend developer reads it and thinks about query performance on role lookups. The architect reads it and thinks about whether the permission model will scale when you add a new resource type. Same file, same information, different perspective. The shared memory does not flatten the personas into one. It frees them to focus on what they do best.&lt;/p&gt;

&lt;p&gt;This is also where structured text becomes critical as a communication protocol between personas. When one persona produces output that another needs to build on, that handoff needs to be clean. If the architect writes a design decision as unstructured prose buried in a long conversation, the developer picking it up has to parse intent from paragraphs. But if the architect writes it as a structured entry in a decision log, with the context, the options considered, and the rationale, the developer can consume it in seconds and move on.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## [DECIDED] 2026-03-12 — Authentication strategy&lt;/span&gt;

&lt;span class="gs"&gt;**Context:**&lt;/span&gt; Need to support both API key auth for integrations
and session-based auth for the dashboard.

&lt;span class="gs"&gt;**Options:**&lt;/span&gt; (A) Unified middleware handling both, (B) Separate
middleware per auth type, (C) Gateway-level routing to different
auth handlers.

&lt;span class="gs"&gt;**Decision:**&lt;/span&gt; Option B — separate middleware.

&lt;span class="gs"&gt;**Reason:**&lt;/span&gt; Keeps each auth flow independently testable and
avoids conditional branching that grows with every new auth
method. Gateway routing (C) was overkill for current scale.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not just documentation. It is agent-to-agent communication. The architect persona produced this. The developer persona consumes it. The security persona reviews it. Each one gets exactly what they need from a structured, predictable format. No ambiguity, no re-interpretation, no tokens wasted on figuring out what was decided and why.&lt;/p&gt;
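&lt;p&gt;Because the format is predictable, a consuming persona, or a plain script, can extract the settled decisions mechanically. A minimal sketch, assuming every entry uses the &lt;code&gt;## [DECIDED]&lt;/code&gt; heading shown above (the second log entry is invented for illustration):&lt;/p&gt;

```python
import re

def settled_decisions(log_md: str) -> list[tuple[str, str]]:
    """Return (date, topic) pairs for every [DECIDED] entry in a decision log."""
    pattern = r"^## \[DECIDED\] (\d{4}-\d{2}-\d{2}) — (.+)$"
    return re.findall(pattern, log_md, flags=re.MULTILINE)

log = """\
## [DECIDED] 2026-03-12 — Authentication strategy
**Decision:** Option B — separate middleware.

## [DECIDED] 2026-03-14 — Queue backend
**Decision:** Redis, FIFO per priority tier.
"""

print(settled_decisions(log))
# -> [('2026-03-12', 'Authentication strategy'), ('2026-03-14', 'Queue backend')]
```

A status tag other than &lt;code&gt;[DECIDED]&lt;/code&gt; simply does not match, so superseded or still-open entries stay out of the settled list for free.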

&lt;p&gt;Taking this into account, the pattern becomes clear. Shared memory is the common ground that keeps all personas aligned. Structured text is the protocol that lets them hand off work cleanly. And the individual perspective of each persona is what turns that shared information into distinct, valuable output. You do not need complex orchestration frameworks to make agents collaborate. You need well-structured files and clearly defined roles.&lt;/p&gt;

&lt;p&gt;The complexity of the communication goes down so the complexity of the work can go up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Memory as Optimisation, Not Requirement
&lt;/h2&gt;

&lt;p&gt;Let me be clear about something. Agents do not need memory to work. You can start a fresh session, point an agent at your codebase, and it will figure things out. It will scan files, infer conventions, piece together how the system works. It is not stupid. It will get there eventually.&lt;/p&gt;

&lt;p&gt;But "eventually" has a cost. Every file the agent reads to orient itself is tokens spent on navigation instead of execution. Every convention it infers is a guess that might be wrong. Every decision it re-discovers is a decision that was already made and documented nowhere. The agent works, but it works harder than it needs to.&lt;/p&gt;

&lt;p&gt;Memory changes that equation. Not by making the agent smarter, but by making its environment more navigable. An index file does not add intelligence. It removes friction. A glossary does not teach the agent new concepts. It eliminates ambiguity. A decision log does not improve reasoning. It prevents redundant reasoning.&lt;/p&gt;

&lt;p&gt;This is why I frame memory as an optimisation, not a requirement. It is the same distinction as adding an index to a database. The database works without it. Queries return the right results. But they return them slower, with more resources consumed, and with less predictable performance as the dataset grows. The index does not change what the database can do. It changes how efficiently it does it.&lt;/p&gt;

&lt;p&gt;For AI agents, the benefits compound across three dimensions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Efficiency&lt;/strong&gt; — the agent reaches the relevant code faster because it knows where to look, not because it searched everywhere&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token cost&lt;/strong&gt; — fewer tokens spent on discovery means more tokens available for the actual task, which directly translates to lower cost and better output quality&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context clarity&lt;/strong&gt; — structured, scoped input reduces the chance of context pollution, which means the agent's output stays focused and relevant&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of this is revolutionary. It is the same principle that makes well-organised documentation valuable for human developers. A new team member with access to a good README, a clear project structure, and an up-to-date architecture document ramps up faster than one who has to read every file and ask questions in Slack. The mechanism is identical. The only difference is that your agent joins the team fresh every single session.&lt;/p&gt;

&lt;p&gt;Of course, the real question is not whether you should structure your agent's memory. It is how you maintain that structure as your project evolves. Index files that fall out of sync with the codebase are worse than no index at all, because they introduce confident misinformation. This is a maintenance commitment, not a one-time setup. But if you are already maintaining documentation for your human team, extending it to serve your agents is a marginal effort with outsized returns.&lt;/p&gt;
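&lt;p&gt;That maintenance burden can itself be partially automated. Here is a minimal sketch of a staleness check, assuming index entries wrap file paths in backticks as in the &lt;code&gt;REFERENCE.md&lt;/code&gt; example earlier; something like this could run in CI and fail the build when the index drifts:&lt;/p&gt;

```python
import re
import tempfile
from pathlib import Path

def stale_entries(reference_md: str, root: Path) -> list[str]:
    """Return indexed paths that no longer exist on disk."""
    paths = re.findall(r"`([\w./-]+)`", reference_md)
    return [p for p in paths if not (root / p).exists()]

# Demonstrate against a throwaway project layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "src").mkdir()
    (root / "src" / "server.py").write_text("# entry point\n")

    # The index claims two files; only one actually exists.
    reference = "| `src/server.py` | API entry point |\n| `src/billing.py` | Billing |"
    stale = stale_entries(reference, root)

print(stale)  # -> ['src/billing.py']
```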

&lt;p&gt;This article covered the retrieval layer: how agents navigate and access the knowledge they need. But there is a related question that I have only touched on here. When multiple agents work together on a larger task, how do they coordinate? How does the output of one agent become the input of another without losing context or creating conflicts? That is the orchestration layer, and it deserves its own deep dive.&lt;/p&gt;

&lt;p&gt;For now, start simple. Write an index file for your project, or do it together with your favourite AI friend. Add a glossary for your domain terms. Structure your decisions in a log. These are small files that take minutes to create. But they change the starting position of every agent session from "figure out where everything is" to "here is what you need, now get to work."&lt;/p&gt;

&lt;p&gt;That shift alone is worth more than a larger context window.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Why Personas Still Matter in the Age of AI</title>
      <dc:creator>Karel Vanden Bussche</dc:creator>
      <pubDate>Thu, 19 Mar 2026 07:49:30 +0000</pubDate>
      <link>https://dev.to/karelvandenbussche/why-personas-still-matter-in-the-age-of-ai-26ce</link>
      <guid>https://dev.to/karelvandenbussche/why-personas-still-matter-in-the-age-of-ai-26ce</guid>
      <description>&lt;p&gt;In my previous article, I explored the idea that managing AI is not fundamentally different from managing a team. You scope work, assign it to personas, verify the output through quality gates, and iterate. The structure is identical to how engineering organisations have operated for years. The only thing that changed is who picks up the ticket.&lt;/p&gt;

&lt;p&gt;But that article focused on the mechanics. The workstreams, the quality gates, the feedback loops. It answered the question of how you manage AI as a workforce. What it did not address is a more fundamental question: why would you bother creating distinct personas in the first place?&lt;/p&gt;

&lt;p&gt;I already hear you thinking: "Context windows are getting bigger every month. Models are getting smarter. Won't all of this become unnecessary?" It is a fair question. And the answer, at least today, is no. Personas are not a workaround for limited AI. They are a way of encoding something that raw context alone cannot provide: perspective.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Context Window Problem
&lt;/h2&gt;

&lt;p&gt;Let me start with the practical reality. Every AI model operates within a context window. This is the amount of information it can hold and reason about at any given time. We are rapidly increasing the size of these windows. What used to be a few thousand tokens is now hundreds of thousands, and it will keep growing.&lt;/p&gt;

&lt;p&gt;So why not just dump everything into context and let the model figure it out?&lt;/p&gt;

&lt;p&gt;Because perspective is not the same as information. Think about what makes a senior engineer effective. It is not that they have memorised more documentation than a junior. It is that they have developed a mental model, a frame of reference built from years of experience, that tells them what to pay attention to and what to ignore. That frame of reference is implicit. It shapes every decision they make, from how they structure a module to which edge cases they instinctively check for.&lt;/p&gt;

&lt;p&gt;Now think about how many tokens it would take to fully encode that perspective. Not just the knowledge, but the priorities, the biases, the heuristics, the "I have seen this pattern fail before" instincts. You would need an enormous amount of structured context to replicate what a single persona label activates in a model.&lt;/p&gt;

&lt;p&gt;This is the key insight. When you assign a persona to an AI agent, you are not just giving it a role. You are activating a frame of reference that already exists in the model's training data. The persona acts as a filter, a set of starting weights that shapes how the model interprets and responds to everything in its context. It is compression. You get the equivalent of thousands of tokens of implicit knowledge from a single well-defined persona.&lt;/p&gt;

&lt;p&gt;As such, even as context windows grow, personas remain efficient. They are not a patch for limited memory. They are a fundamentally different mechanism for injecting perspective.&lt;/p&gt;

&lt;h2&gt;
  
  
  Perspective Shapes Output
&lt;/h2&gt;

&lt;p&gt;Here is where it gets practical. The same task, given to the same model, produces meaningfully different output depending on which persona you assign.&lt;/p&gt;

&lt;p&gt;Take something straightforward: building an API endpoint. If I approach this as an integration engineer, my first concerns are the contract, the versioning strategy, the error response format, and how downstream consumers will interact with it. I think about backwards compatibility, rate limiting, and what happens when a third party calls this endpoint with unexpected input.&lt;/p&gt;

&lt;p&gt;If a backend developer builds the same endpoint, they look at very different things. The data model, the query performance, the transaction boundaries, the service layer architecture. Both are building the same endpoint. Both produce working code. But the shape of that code is different because the lens through which they evaluate "good" is different.&lt;/p&gt;

&lt;p&gt;This is experience encoded as perspective. In the real world, you get this naturally from having different people on your team. The frontend engineer catches the UX issue that the backend developer would never notice. The security reviewer spots the vulnerability that the feature developer did not consider. Each person's experience shapes what they see and what they miss.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are a senior Integration Engineer with 10+ years of experience
designing APIs consumed by external teams and third-party systems.

When reviewing or building an API endpoint, your primary concerns are:
- Contract stability and versioning strategy
- Structured, documented error responses
- Backwards compatibility for existing consumers
- Rate limiting and throttling for external callers
- Idempotency guarantees for safe retries

You evaluate "good" through the lens of the downstream consumer.
A well-built endpoint is one that is predictable, resilient to misuse,
and does not break when you ship a new version.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are a senior Backend Developer with deep expertise in service
architecture, data modeling, and system performance.

When reviewing or building an API endpoint, your primary concerns are:
- Query performance and database access patterns
- Transaction boundaries and data consistency
- Adherence to existing service layer conventions
- Correct mapping of domain exceptions to responses
- Test coverage across edge cases and failure modes

You evaluate "good" through the lens of internal system health.
A well-built endpoint is one that is performant, maintainable,
and consistent with the rest of the codebase.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same endpoint. Two personas. Two fundamentally different review priorities. Neither is wrong. Both are valid. But if you only apply one lens, you get blind spots. When you assign a persona, you are choosing which lens the model applies. And when you assign multiple personas to the same task, you get coverage that no single perspective could provide.&lt;/p&gt;

&lt;p&gt;This is closely related to how a real team operates. If you read my previous article, this should feel familiar. The difference is that you are not waiting for five people to find time on their calendar. You are running five perspectives in parallel, each contributing their angle within minutes.&lt;/p&gt;
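&lt;p&gt;To make the mechanics concrete, here is a minimal sketch of running one task through several persona lenses at once. The persona texts are abbreviated, and &lt;code&gt;call_model&lt;/code&gt; is a hypothetical stand-in for whatever LLM client you actually use, not a real API:&lt;/p&gt;

```python
# Sketch: running the same task through multiple persona lenses.
# The persona prompts are abbreviated; `call_model` is a placeholder
# for your real LLM client call (an assumption, not a real API).

PERSONAS = {
    "integration": "You are a senior Integration Engineer. Evaluate through "
                   "the lens of the downstream consumer.",
    "backend": "You are a senior Backend Developer. Evaluate through the "
               "lens of internal system health.",
}

def call_model(system_prompt: str, task: str) -> str:
    """Placeholder: substitute your provider's chat API call here."""
    return f"({system_prompt.split('.')[0]}) notes on: {task}"

def multi_persona_review(task: str, personas: dict[str, str]) -> dict[str, str]:
    """Collect one review of the same task per persona."""
    return {name: call_model(prompt, task) for name, prompt in personas.items()}

reviews = multi_persona_review("Review POST /orders endpoint", PERSONAS)
for name, review in reviews.items():
    print(f"{name}: {review}")
```

&lt;p&gt;The point of the structure is that the task is defined once and the lens varies, which is exactly the coverage argument made above.&lt;/p&gt;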

&lt;h2&gt;
  
  
  The Council Pattern
&lt;/h2&gt;

&lt;p&gt;So you have multiple personas, each with their own perspective. The next logical step is to put them in a room together.&lt;/p&gt;

&lt;p&gt;Think about how decisions get made in a real engineering team. You do not ask one person to design the solution in isolation and hope for the best. You bring people together. A tech lead, a frontend engineer, a security reviewer, maybe someone from ops. Each person looks at the problem through their own lens. They challenge each other. They surface trade-offs the others missed. The output of that meeting is not perfect, but it is significantly closer to the optimal solution than any single person would have produced alone.&lt;/p&gt;

&lt;p&gt;This is exactly what a persona council does. You define the problem, assign multiple personas to evaluate it, and let their perspectives collide. The architect flags that the proposed data model will not scale past a certain threshold. The security engineer points out that the authentication flow has a gap. The frontend developer raises that the API response shape will require three extra transformations on the client side. Each concern is valid. None of them would have surfaced if you had only asked one persona.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo9ml82t8hh7bhe0hsiu5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo9ml82t8hh7bhe0hsiu5.png" alt=" " width="800" height="577"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The important nuance here is what happens after the council. You still need a tech lead. The personas generate perspectives, not decisions. They surface the dimensions you need to consider, but the final call is yours. There are always things to iron out, trade-offs to weigh, constraints that only you know about. The council does not replace your judgement. It informs it.&lt;/p&gt;

&lt;p&gt;In most cases, what comes out of a council session is already 80% of the way to the right solution. The remaining 20% is where your experience as the human in the loop adds the most value. You are not starting from scratch. You are refining a solution that has already been stress-tested from multiple angles. That is a fundamentally different starting position than staring at a blank screen and trying to think of everything yourself.&lt;/p&gt;

&lt;p&gt;Put it this way. Running a council is like hosting a workshop with your team. Except the team shows up instantly, nobody needs to context-switch from another project, and you can run the workshop as many times as you need until the output meets your bar.&lt;/p&gt;
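&lt;p&gt;The council itself can be sketched in a few lines. The persona functions here return canned concerns purely for illustration; in practice each would be a model call carrying its own persona prompt, and the merged list is what lands on the tech lead's desk:&lt;/p&gt;

```python
# Sketch of the council pattern: several personas flag concerns about
# one proposal, and the concerns are merged into a single list for the
# human tech lead to weigh. The returned concerns are canned stand-ins
# for real model output.

def architect(proposal: str) -> list[str]:
    return ["data model may not scale past the projected write volume"]

def security(proposal: str) -> list[str]:
    return ["authentication flow has a gap around token expiry"]

def frontend(proposal: str) -> list[str]:
    return ["response shape forces extra transformations on the client"]

def run_council(proposal: str, members) -> list[str]:
    """Gather every persona's concerns; the decision stays with the human."""
    concerns = []
    for member in members:
        concerns.extend(member(proposal))
    return concerns

concerns = run_council("order-service redesign", [architect, security, frontend])
```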

&lt;h2&gt;
  
  
  Filling Your Own Gaps
&lt;/h2&gt;

&lt;p&gt;Here is where personas become personally valuable, beyond just improving code quality. They help you operate in domains where you have no experience at all.&lt;/p&gt;

&lt;p&gt;I am not a frontend designer. I am not a marketing strategist. I have never run an enterprise sales cycle. But every product needs all of these things. If you are a solo builder or a small team, you cannot hire a specialist for every domain. You either learn it yourself, which takes months, or you skip it entirely and hope it does not matter. In most cases, it does matter.&lt;/p&gt;

&lt;p&gt;This is where personas shift from a productivity tool to something closer to a virtual advisory board. When I need to think about brand positioning, I do not pretend to be a brand strategist. I activate a persona that thinks like one. When I need to evaluate whether my onboarding flow makes sense, I bring in a customer success persona. When I need to understand how an enterprise buyer evaluates software, I consult a persona with that specific lens.&lt;/p&gt;

&lt;p&gt;It might sound like a weird place to apply this, but if you think about it, it really isn't. This is exactly what a CEO does when they build a leadership team. They do not try to be the expert in every domain. They surround themselves with people whose expertise complements their own, and they trust those people to flag what they cannot see. The dynamic is the same. The difference is that your "team" exists inside your development environment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are a Customer Success Lead with experience onboarding
technical products for non-technical users.

Evaluate this onboarding flow from the perspective of a first-time
user who found us through a blog post. They are technical enough
to use an API, but they have never seen this product before.

Flag any step where:
- The user needs context they do not yet have
- The cognitive load is too high for a first interaction
- The value proposition is unclear or delayed
- The user might abandon because the next step is not obvious
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I have used this pattern extensively while building Indiebase. There are entire layers of the product (the positioning, the information architecture, the pricing model) that I would not have been able to think through properly on my own. Not because I lack intelligence, but because I lack the specific experience that shapes good decisions in those domains.&lt;/p&gt;

&lt;p&gt;The personas do not replace that experience entirely. They approximate it. And that approximation, combined with my own judgement and multiple iterations, gets me to a solution that is significantly better than what I would have produced alone. It is not the same as having a real expert on your team. But it is dramatically better than having no perspective at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where It Breaks Down
&lt;/h2&gt;

&lt;p&gt;I want to be honest about the limitations, because overselling this would do you a disservice.&lt;/p&gt;

&lt;p&gt;Personas are not yet at the level of truly creative, experience-driven thinking. A real senior architect does not just apply a checklist. They pattern-match against hundreds of systems they have seen fail. They have gut feelings about complexity that come from years of debugging production incidents at 3 AM. That intuition, the kind that makes someone say "this feels wrong" before they can articulate why, is not something a persona reliably replicates.&lt;/p&gt;

&lt;p&gt;There is a more fundamental limitation too. A real team member builds up experience within your codebase over time. They learn the quirks, the workarounds, the reasons behind odd decisions. They develop mental models specific to your system that inform every review and every suggestion they make. Personas do not have this. Every fresh context window starts from zero. The persona needs to rebuild its understanding of your codebase from scratch, every single time.&lt;/p&gt;

&lt;p&gt;This means that even with large context windows, a persona cannot fully grasp the intricacies that a human team member would absorb over months of working in the same repository. It can read the code. It can infer patterns. But it does not carry the accumulated understanding of why that one service is structured differently, or why that particular abstraction exists despite looking over-engineered. That kind of codebase-specific experience is something personas still cannot retain between sessions.&lt;/p&gt;

&lt;p&gt;What I have found in practice is that personas are excellent at breadth but inconsistent at depth. They will surface ten dimensions you should consider. They will catch the obvious gaps. They will give you a structured starting point that is better than nothing. But the tenth iteration of a nuanced architectural decision still requires your input. You need to push back, ask follow-up questions, reject the first answer, and guide the persona toward the specific context that makes your situation unique.&lt;/p&gt;

&lt;p&gt;This is not a one-shot process. It is interactive. The value does not come from asking a persona once and accepting the output. It comes from the back-and-forth, the iterative refinement where your domain knowledge meets the persona's structured perspective. Each round gets closer to something you could not have reached alone.&lt;/p&gt;

&lt;p&gt;Taking this into account, I want to share what this looked like in practice while building Indiebase. There were moments where a persona council gave me an architecture that looked clean on paper but missed a critical constraint specific to my use case. There were marketing recommendations that sounded textbook-correct but did not fit the audience I was targeting. Every time, the fix was the same: iterate with my own input until the output reflected reality, not just theory.&lt;/p&gt;

&lt;p&gt;But here is the part that genuinely surprised me. Even when the initial output was off, the process of engaging with these perspectives opened my eyes to possibilities I would never have considered on my own. A security persona flagged a threat model I had not thought about. A product persona suggested a pricing structure I would have dismissed if I had been thinking in isolation. The value was not always in the answer. It was in the question the persona forced me to confront.&lt;/p&gt;

&lt;p&gt;Is it perfect? No. Is it already at the level where you can fully delegate creative, experience-driven decisions? Not yet. But it is a tool that, used interactively and iteratively, meaningfully expands what a single person can build.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bridge You Build With
&lt;/h2&gt;

&lt;p&gt;Personas are not a hack. They are not a prompt engineering trick that will become obsolete next year. They are a practical mechanism for injecting structured perspective into a system that does not natively retain it.&lt;/p&gt;

&lt;p&gt;As context windows grow and models develop longer memory, some of what personas provide today will become built-in. The codebase familiarity gap will shrink. The need to explicitly define "think like a security engineer" may fade as models learn to apply multiple lenses automatically. But we are not there yet. And until we are, personas remain one of the most effective tools available to a solo builder or a small team that needs to punch above its weight.&lt;/p&gt;

&lt;p&gt;Of course, the real question is not whether personas work. It is how you put them to work in a system that compounds their value. The articles in this series have covered the management model, the perspective layer, and the practical limitations. What ties all of it together is velocity. The ability to move from an idea to a validated solution to working software without waiting for a team to assemble, a meeting to be scheduled, or a hire to ramp up.&lt;/p&gt;

&lt;p&gt;This is exactly what I am building with &lt;a href="https://indiebase.be" rel="noopener noreferrer"&gt;Indiebase&lt;/a&gt;. A platform designed to compress the distance between idea, validation, and execution. Because the bottleneck for most builders is not talent or tooling. It is the friction between knowing what to build and actually shipping it. If that resonates, take a look. I am building it in public, using every pattern described in this series.&lt;/p&gt;

&lt;p&gt;The personas got me here. They will get you further than you think.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>management</category>
    </item>
    <item>
      <title>Managing AI Or Managing a Team?</title>
      <dc:creator>Karel Vanden Bussche</dc:creator>
      <pubDate>Sat, 14 Mar 2026 08:40:56 +0000</pubDate>
      <link>https://dev.to/karelvandenbussche/managing-ai-or-managing-a-team-c6h</link>
      <guid>https://dev.to/karelvandenbussche/managing-ai-or-managing-a-team-c6h</guid>
      <description>&lt;p&gt;Most people working with AI treat it like a souped-up search bar. You type in a question, you get an answer, you move on. Some have progressed to the "fancy autocomplete" stage, letting it fill in code while they hover over the accept button. Both are valid use cases, but they barely scratch the surface of what becomes possible when you shift your mental model entirely.&lt;/p&gt;

&lt;p&gt;The better way to think about AI, particularly in a software engineering context, is as a team of personas. Not a single assistant, but a set of specialists you assign to workstreams and manage through quality gates. A developer persona, an architect persona, a QA persona, a security reviewer. Each scoped to a concern, each operating within boundaries you define.&lt;/p&gt;

&lt;p&gt;If that sounds familiar, it should. It is how most engineering organisations already work.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Forcing Problem
&lt;/h2&gt;

&lt;p&gt;Let me start with the most common mistake I see people make when working with AI in a development context. They try to force their opinions onto the code. Line by line, function by function, they dictate exactly what the output should look like. The AI becomes a transcription machine for thoughts you already had.&lt;/p&gt;

&lt;p&gt;I already hear you thinking: "But I know the codebase better than the AI does." Of course you do. But knowing the codebase does not mean you need to write every line yourself. Think about how you work with a colleague. You would not hover over their shoulder and dictate every variable name. You would describe the problem, point them to the relevant context, and let them figure out the implementation details.&lt;/p&gt;

&lt;p&gt;The same principle applies to AI. When you micromanage the output, you are essentially paying the cost of writing the code yourself, with the added overhead of translating your intent into prompts. You are operating at the wrong abstraction level. The value of AI is not that it types faster than you. The value is that it can handle the lower level abstractions of the code while you focus on the decisions that actually matter: architecture, constraints, and quality.&lt;/p&gt;

&lt;p&gt;The research points in the same direction. An AI agent can already tackle work that would take a person 8 to 16 hours, roughly two working days of output, delivered while you watch Claude's friendly "Frolicking..." status line for a couple of minutes. This is a different unit of work than we normally plan around.&lt;/p&gt;

&lt;p&gt;Put it this way. If you find yourself rewriting most of what the AI produces, the problem is not the AI. The problem is that you are trying to control the wrong layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Horizontal Scaling
&lt;/h2&gt;

&lt;p&gt;Once you stop thinking of AI as a single assistant and start thinking of it as a workforce, something interesting happens. You realise you are not limited to one agent doing one thing at a time.&lt;/p&gt;

&lt;p&gt;In a traditional team, you would not assign your entire engineering department to a single ticket. You would spread the work across parallel workstreams. One person handles the implementation, another writes tests, a third reviews the architecture. The same model applies to AI. You can run multiple personas in parallel, each scoped to a specific concern, each producing output that feeds into the next.&lt;/p&gt;

&lt;p&gt;The key insight here is not just that you can parallelise. It is that you should start by automating the lowest complexity tasks first. The boilerplate. The test scaffolding. The documentation updates. The linting fixes. These are the tasks where AI produces reliable output with minimal oversight, and where the time savings compound the fastest.&lt;/p&gt;

&lt;p&gt;It shouldn't come as a surprise that this is exactly how you would onboard a new team member. You would not hand them the most critical architectural decision on day one. You would give them well-scoped, lower-risk tasks with clear acceptance criteria, and gradually increase the scope as trust builds.&lt;/p&gt;

&lt;p&gt;The difference with AI is that you can run ten of these "new team members" simultaneously. The horizontal scaling is practically free. What is not free is the management layer on top, and that is where your role shifts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Workstreams, Not Prompts
&lt;/h2&gt;

&lt;p&gt;So if the goal is to manage AI rather than dictate to it, what does that look like in practice? The abstraction shift is this: you stop writing code and start defining workstreams.&lt;/p&gt;

&lt;p&gt;A workstream is a scoped unit of work with clear inputs, constraints, and expected outputs. It is not a prompt. A prompt says "write me a function that does X." A workstream says "here is the problem, here are the files involved, here are the acceptance criteria, and here is how I will verify the result."&lt;/p&gt;

&lt;p&gt;If that sounds like a well-written Jira ticket, that is because it is. The tooling and processes we have been building for years to coordinate human developers turn out to be exactly what AI agents need. User stories, scoped tickets with acceptance criteria, definition of done. None of this is new. The only thing that changed is who picks up the ticket.&lt;/p&gt;

&lt;p&gt;Let me give you a concrete example. Say you need to add input validation to all your API routes. The prompting approach would be to open each file, paste the code, and ask the AI to add Zod schemas. You would do this thirty times, reviewing each output manually.&lt;/p&gt;

&lt;p&gt;The workstream approach looks different. You define the task once:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;task&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backfill-zod-validation&lt;/span&gt;
&lt;span class="na"&gt;scope&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;all POST/PATCH routes in src/app/api/&lt;/span&gt;
&lt;span class="na"&gt;constraints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;use Zod for schema definitions&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;validate request body before any business logic&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;return 400 with structured error on validation failure&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;do not modify response types or business logic&lt;/span&gt;
&lt;span class="na"&gt;acceptance_criteria&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;every mutation route has a Zod schema&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;all schemas are co-located with their route file&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;existing tests still pass&lt;/span&gt;
&lt;span class="na"&gt;verify&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;run lint, type check, and test suite&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not a hypothetical. This is how I run my own projects. The task definition becomes the management artifact. The AI handles the implementation across all thirty files. Your job is to verify the output against the criteria you defined, not to review every line of generated code.&lt;/p&gt;

&lt;p&gt;As such, the skill shifts from "how do I implement this" to "how do I specify this clearly enough that someone else can implement it." If you have ever written a good Jira ticket, you already know how to do this.&lt;/p&gt;

&lt;h2&gt;
  
  
  The AI Workforce Model
&lt;/h2&gt;

&lt;p&gt;To make this more tangible, here is how the model looks when you put it all together. You, as the tech lead, sit at the top. Below you are the personas, each operating in their own swim lane. The quality gates sit between the work and the merge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You (Tech Lead)&lt;/strong&gt; define workstreams for each persona:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Developer Persona&lt;/strong&gt; produces code changes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;QA Persona&lt;/strong&gt; produces test coverage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security Persona&lt;/strong&gt; produces security reviews&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architect Persona&lt;/strong&gt; produces design reviews&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All output flows through &lt;strong&gt;Quality Gates&lt;/strong&gt; (lint, type check, test suite, acceptance criteria) into a &lt;strong&gt;Verification&lt;/strong&gt; step:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pass&lt;/strong&gt; → Merge&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fail&lt;/strong&gt; → Feedback Loop → retry with context → back to the Developer Persona&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The important thing to notice is that this diagram would look exactly the same if you replaced the personas with human engineers. The structure is identical. The rituals are the same: scope the work, assign it, verify the output, iterate on failures. The only difference is the speed at which the cycle runs.&lt;/p&gt;

&lt;p&gt;In most cases, the feedback loop between a failed quality gate and the next attempt takes seconds instead of hours. The cycle that used to take a full review round now completes almost instantly. This is where the real productivity gain lives. Not in writing code faster, but in shortening the iteration loop between "here is my attempt" and "does it meet the criteria."&lt;/p&gt;
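&lt;p&gt;The pass/fail cycle described above can be sketched as a small retry loop. The agent and gates here are toy stand-ins: in practice &lt;code&gt;generate&lt;/code&gt; would be a developer persona producing a change, and each gate would shell out to your linter or test suite:&lt;/p&gt;

```python
# Sketch of the quality-gate feedback loop: generate, run gates,
# feed failures back as context, retry. All components are toy
# stand-ins for real agents and real CI checks.

def feedback_loop(generate, gates, max_attempts=3):
    """Retry generation until every gate passes or attempts run out."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        output = generate(feedback)
        failed = [name for name, check in gates if not check(output)]
        if not failed:
            return output  # Pass -> merge
        # Fail -> feedback loop -> retry with context
        feedback = f"attempt {attempt} failed gates: {', '.join(failed)}"
    raise RuntimeError(f"quality gates still failing: {failed}")

# Toy developer persona: fixes its output once it receives feedback.
def toy_agent(feedback):
    return "code with tests" if feedback else "code"

gates = [
    ("lint", lambda out: True),             # stand-in for a linter run
    ("tests", lambda out: "tests" in out),  # stand-in for the test suite
]

result = feedback_loop(toy_agent, gates)
```

&lt;p&gt;Notice that the loop body is pure management: define the bar, check the output, hand back structured feedback. Nothing in it depends on who, or what, produced the code.&lt;/p&gt;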

&lt;h2&gt;
  
  
  Quality Gates as Management
&lt;/h2&gt;

&lt;p&gt;If AI handles the implementation and you handle the workstream definition, then what sits in between? Quality gates. This is where you as a human add the most value.&lt;/p&gt;

&lt;p&gt;A quality gate is any automated or manual check that decides whether output is good enough to move forward. In most cases, you already have these in place. Your linter, your type checker, your test suite, your CI pipeline. These are not just developer tools anymore. They are your management layer.&lt;/p&gt;

&lt;p&gt;Think about what a good engineering manager actually does. They do not write every line of code. They define what "good" looks like, create the systems to verify it, and step in when the output does not meet the bar. Taking this into account, your role when managing AI personas is the same. You define the acceptance criteria. You build the verification pipeline. You review the edge cases where automated checks are not sufficient.&lt;/p&gt;

&lt;p&gt;Here is a non-exhaustive list of quality gates that work well in this model:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Linting and formatting&lt;/strong&gt; to catch style violations before you even look at the code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Type checking&lt;/strong&gt; to verify structural correctness across the codebase&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated test suites&lt;/strong&gt; to validate behaviour against specifications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security scans&lt;/strong&gt; to flag common vulnerabilities before they reach review&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Acceptance criteria checks&lt;/strong&gt; where you verify the output matches the original ticket&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The interesting thing is that the more robust your quality gates are, the more you can delegate. If your test coverage is high and your CI pipeline is strict, you can trust AI output with less manual review. If your quality gates are weak, you will spend most of your time reviewing code manually, which defeats the purpose.&lt;/p&gt;

&lt;p&gt;As such, investing in your verification infrastructure is not just good engineering practice. It is a direct multiplier on how effectively you can manage your AI workforce.&lt;/p&gt;

&lt;p&gt;One quality gate that deserves special attention is documentation. AI agents go down. Services have outages. Models change behaviour between versions. When that happens, a human developer needs to step in and understand what was built, why it was built that way, and how to modify it. If your AI-generated code has no documentation, no architectural context, and no inline reasoning, that fire drill becomes significantly more expensive. Documentation is not a nice-to-have in this model. It is a quality gate that ensures your codebase remains maintainable regardless of who or what wrote it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Thinking at a Higher Abstraction
&lt;/h2&gt;

&lt;p&gt;Everything we have discussed so far points in one direction: the abstraction level at which you operate is moving up. You are spending less time thinking about how to implement a function and more time thinking about what the function should achieve, how to verify it, and where it fits in the larger system.&lt;/p&gt;

&lt;p&gt;This is not a new phenomenon in software engineering. We have been moving up the abstraction ladder for decades. Assembly gave way to C, C gave way to higher level languages, manual deployments gave way to CI/CD pipelines. Every step reduced the amount of low level detail a developer needed to hold in their head and allowed them to focus on higher order concerns.&lt;/p&gt;

&lt;p&gt;AI is the next step on that ladder. The questions you ask yourself change. Instead of "how do I parse this JSON payload," you ask "what are the validation rules for this endpoint and how do I verify they are enforced." Instead of "how do I write this migration," you ask "what are the data integrity constraints and what does the rollback path look like."&lt;/p&gt;

&lt;p&gt;The skill shift is real, but it is less dramatic than people think. You are still an engineer. You still need to understand the domain, the architecture, and the trade-offs. What changes is where you spend your attention. Less time in the implementation details, more time in the specification and verification layers. Less time writing code, more time reading it critically and deciding whether it meets the bar.&lt;/p&gt;

&lt;p&gt;And here is one thing every developer can relate to. How many hours have you spent staring at an error message in your terminal, tracing it back through five layers of abstraction to find the one line that broke? By moving up the abstraction layer, you spend less time in that debugging loop. The AI persona handles the implementation, your quality gates catch the failures, and the feedback loop resolves them. You get to focus on the parts of engineering that you actually enjoy, the problem solving and the architecture, and less on the parts that drain your energy.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for Jobs
&lt;/h2&gt;

&lt;p&gt;So will AI reshape jobs in software engineering? Yes, it will. But the skills it requires are not as foreign as the headlines suggest.&lt;/p&gt;

&lt;p&gt;If you have ever managed a team, you know how to scope work and define expectations. If you have ever written a user story with clear acceptance criteria, you know how to instruct an AI persona. If you have ever set up a CI pipeline or a code review process, you know how to build quality gates. If you have ever run a sprint retro and adjusted your process based on what went wrong, you know how to iterate on your AI management layer.&lt;/p&gt;

&lt;p&gt;The abstraction level moved up. The skills did not fundamentally change. They just apply to a different layer now.&lt;/p&gt;

&lt;p&gt;Of course, this raises an even bigger question. If AI can handle the implementation layer of software development, what happens when you zoom out further? Past the code, past the architecture, into the strategic layer. What happens when you apply the same model of personas, workstreams, and quality gates to idea validation, market analysis, and product strategy?&lt;/p&gt;

&lt;p&gt;That is what I will explore in the next article. Because the same principles that let you manage a virtual engineering team also let you validate ideas faster than ever before.&lt;/p&gt;




&lt;p&gt;I am currently building multiple projects using this exact model, and they will launch soon. If you want to stay in the loop and see how this plays out in practice, keep an eye on my socials.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>management</category>
    </item>
    <item>
      <title>Designing for e-commerce</title>
      <dc:creator>Karel Vanden Bussche</dc:creator>
      <pubDate>Sun, 08 Mar 2026 13:39:14 +0000</pubDate>
      <link>https://dev.to/karelvandenbussche/designing-for-e-commerce-50p</link>
      <guid>https://dev.to/karelvandenbussche/designing-for-e-commerce-50p</guid>
      <description>&lt;p&gt;Our young but innovative company created an innovative algorithm to bring, the most nutritious food for your beloved cat or dog, to your doorstep at exactly the time you need it. It might sound like something where no IT is needed, right? It's just bags with dogfood, how hard can it be? Well, this article will take you on a journey through what it entails to work for an e-commerce company, specifically Just Russel, as a technical engineer.&lt;/p&gt;

&lt;p&gt;First, we will discuss how the engineering department connects to the larger organisation. Secondly, we will go over the risks and problems that can occur in such organisations and which pitfalls you should keep an eye on. Next, we will discuss how our company, Just Russel, has tackled all of these issues, and why it's not &lt;em&gt;just&lt;/em&gt; e-commerce. Finally, we will show how even an industry as stable as e-commerce can still be disrupted by innovation, especially on the technical side.&lt;/p&gt;

&lt;h2&gt;
  
  
  A journey through e-commerce
&lt;/h2&gt;

&lt;p&gt;When people think e-commerce, they instantly think of the Amazons and AliExpresses of the world. They think of large warehouses filled with stacks of boxes from floor to ceiling. They think of rows and rows of cubicles filled with customer support. They think of fleets upon fleets of delivery vans and drivers.&lt;/p&gt;

&lt;p&gt;All of these are true images and challenges you find in an e-commerce platform. Each of the previously mentioned departments needs to be connected to the others. The core holding all of these processes together in an efficient manner lives in a lesser-known part of e-commerce companies: the engineering department.&lt;/p&gt;

&lt;p&gt;Think of your typical consumption at an e-commerce site.&lt;/p&gt;

&lt;p&gt;It starts with getting you to the website. Marketing optimises this based on data from previous users, retrieved from the website. This data needs to be collected, stored, queried and aggregated into dashboards that can be consumed. A step further is generating automatic proposals for the marketing team based on previous results. As marketing is a large chunk of every consumer-facing product, being more efficient here can quickly increase the margin the business earns on new customers.&lt;/p&gt;

&lt;p&gt;As you click on the link you found on Google, a new webpage opens up. It loads in the blink of an eye, welcoming you to the shop in a frictionless manner. All of the images pop onto the screen seamlessly, and ... the site is in Dutch, my language! Well, of course the programmer thought of internationalisation, as most e-commerce companies aren't active in a single country. Looking behind the scenes, the number of daily webpage loads runs from the hundreds of thousands into the millions. And you thought scaling and optimising a landing page was easy?&lt;/p&gt;

&lt;p&gt;You sign up and enter the application. Here things get even more complex. We are no longer in a static website; no, we are looking at dynamic data &lt;strong&gt;specific to you&lt;/strong&gt;! Your data, along with that of thousands of other clients, is powered by a database running relentlessly behind the scenes. Especially in more complex shops like Just Russel, where we handle subscriptions and our food algorithms compute the best food for your dog &lt;strong&gt;individually&lt;/strong&gt;, it gets very complex. Well, we wouldn't be programmers if complex didn't mean a challenge to be taken on.&lt;/p&gt;

&lt;p&gt;Now, you decide it's time to buy something from this shop. You go through the food computation and customise your bags. You fill in your address and now it's time to pay. The engineers behind the scenes created (and maintain) the internal plumbing to get your payment handled and validated. Once you have paid, a request is sent to a parcel service so that you, the client living tens of kilometres away, receive the parcel in a few days.&lt;/p&gt;

&lt;p&gt;If you think about e-commerce at a high level, it isn't rocket science. You buy something, the warehouse picks your order and the delivery people ship it to you. How hard can it be, right? I hope I've shown you in the last few paragraphs that it can still be darn hard. All the processes taking place, plus the sheer scale of consumers using the application, make for all kinds of interesting topics to focus engineering efforts on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges and risks
&lt;/h2&gt;

&lt;p&gt;After the last chapter, it shouldn't come as a surprise that there are quite a few challenges and risks to engineering an e-commerce platform &lt;strong&gt;for scale&lt;/strong&gt;. The "for scale" is very important here, as it is indeed quite easy to build a simple platform that handles up to 100 users. Supporting thousands of concurrently active users means thinking on a whole other level. In the following section we will go over a few of these key challenges, how you can spot them before they become an issue and what to focus on to alleviate the consequences they bring.&lt;/p&gt;

&lt;h3&gt;
  
  
  Integrations
&lt;/h3&gt;

&lt;p&gt;A first challenge that is easy to forget is how many integrations the core e-commerce platform has. Let me give a definitely non-exhaustive list to get the point across:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Parcel services&lt;/li&gt;
&lt;li&gt;Payment services&lt;/li&gt;
&lt;li&gt;ERP services&lt;/li&gt;
&lt;li&gt;CRM services&lt;/li&gt;
&lt;li&gt;Analytics connections&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each of these services extracts data from the system and uses it for its own purposes. As you can imagine, keeping a rein on this range of different integrations can be very hard. Good systems design and separation of concerns are a necessity.&lt;/p&gt;

&lt;p&gt;Grouping your integrations and standardising them early on is very important as well. Imagine your company wants to switch to a new parcel service. That means another connection to maintain. If you didn't group your interfaces and standardise them at your end, you will have a hard time when a new feature needs to be introduced into &lt;strong&gt;both&lt;/strong&gt; integrations. If you just switch, that is not the biggest issue, but imagine you need a new parcel service for the US while keeping the old one for the EU. This means maintaining both and making sure both keep functioning consistently.&lt;/p&gt;

&lt;p&gt;A few tips when working with this many integrations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Make sure the integrations are &lt;strong&gt;standardised&lt;/strong&gt;. Each type of integration should expose &lt;strong&gt;only&lt;/strong&gt; the functions you need from it, wrapped in a standardised interface.&lt;/li&gt;
&lt;li&gt;Make sure concerns between integrations are neatly separated. Are your parcel service and payment service provided by the same company? Nice, but keep them in separate implementations. Looking at the code, you should not be able to tell that the same company delivers both.&lt;/li&gt;
&lt;li&gt;Document, document, document. Where and when you use these interfaces should be documented clearly, as this will increase understanding for both consumers &amp;amp; implementers of the interfaces. In time, this will make change management across the organisation easier.&lt;/li&gt;
&lt;/ol&gt;
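
&lt;p&gt;As a minimal sketch of tips 1 and 2, here is what such a standardised wrapper could look like in Python. The &lt;code&gt;ParcelService&lt;/code&gt; interface and both carriers are hypothetical names, not our actual integrations:&lt;/p&gt;

```python
from abc import ABC, abstractmethod

# Hypothetical standardised interface: only the functions we need,
# regardless of which parcel company sits behind it.
class ParcelService(ABC):
    @abstractmethod
    def create_shipment(self, address: str, weight_kg: float) -> str:
        """Register a parcel and return a tracking code."""

class EUParcelService(ParcelService):
    def create_shipment(self, address: str, weight_kg: float) -> str:
        return f"EU-{hash((address, weight_kg)) % 10_000:04d}"

class USParcelService(ParcelService):
    def create_shipment(self, address: str, weight_kg: float) -> str:
        return f"US-{hash((address, weight_kg)) % 10_000:04d}"

def ship(service: ParcelService, address: str, weight_kg: float) -> str:
    # Callers depend only on the interface, so swapping or adding
    # a carrier never touches this code.
    return service.create_shipment(address, weight_kg)
```

&lt;p&gt;Adding a third carrier is then one new subclass, and nothing in the calling code needs to know which company is behind it.&lt;/p&gt;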

</description>
      <category>integrations</category>
      <category>webdev</category>
      <category>architecture</category>
      <category>api</category>
    </item>
    <item>
      <title>Packages for your company</title>
      <dc:creator>Karel Vanden Bussche</dc:creator>
      <pubDate>Mon, 04 Dec 2023 18:31:56 +0000</pubDate>
      <link>https://dev.to/karelvandenbussche/packages-for-your-company-5f4n</link>
      <guid>https://dev.to/karelvandenbussche/packages-for-your-company-5f4n</guid>
      <description>&lt;p&gt;Every programmer has used packages in some way or another during his/hers professional career. Think of famous frameworks and languages like Vue, React, NodeJS, Python. Even though they are so ingrained in the developer infrastructure, most people haven't though too deeply on how they can benefit from introducing them in their own codebases.&lt;/p&gt;

&lt;p&gt;This article dives a bit deeper into why you would do this, what you should keep an eye on and how packages can help your use cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why packages
&lt;/h2&gt;

&lt;p&gt;In my previous article (which you can read &lt;a href="https://dev.to/karelvandenbussche/decoupling-with-packages-2784"&gt;here&lt;/a&gt;), I explained why packages can be a good way of decoupling your code from each other. This makes your logic more easily swappable with new behaviours without needing to update all consumers in one go.&lt;/p&gt;

&lt;p&gt;Now, of course this is not the only upside of packages. More abstractly, they are a great way of defining code to be used across multiple repositories. Imagine a large repository of models that is used across your entire organisation. Re-implementing these models in &lt;strong&gt;all&lt;/strong&gt; consumers is practically very tiring. Keeping them all in sync is going to be a living nightmare.&lt;br&gt;
Packages easily fill the role of mediator here. They contain certain pieces of logic &amp;amp; are static based on their version number. As such, if at some point a new entity is released or a change happens to a pre-existing entity, it's as easy as releasing an incremented version of the package &amp;amp; upgrading your consumers one by one. You could even upgrade them ad hoc, when they start using the new entity.&lt;/p&gt;

&lt;p&gt;In the wild, we see this behaviour a lot. Think of any library you have been using; I'll use Vue as an example, as I have some history with it. Vue 3 was released at the end of 2020. Within a few months of writing, Vue 2 will be fully deprecated. Even though Vue 3 was out in the wild, consumers still had the option of using Vue 2 until maintenance of the second version didn't make sense anymore. I'm sure, if you think on it for a moment, you will be able to collect a few cases in your own domain. Most frameworks, and even programming languages, use the same versioning behaviour.&lt;/p&gt;

&lt;p&gt;Lastly, a major upside of using packages is that the consumer stays in full control. Want to overwrite a certain behaviour? That's completely possible by extracting the function &amp;amp; overwriting it. Want to debug a certain issue? Well, as you are running this code locally, you can put breakpoints wherever you want. Compare this to microservices, which are inherently black boxes: it is much easier from a developer's perspective to work with packages than to understand someone else's microservice.&lt;/p&gt;
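
&lt;p&gt;A minimal sketch of that consumer control, using a stand-in for a third-party package (the &lt;code&gt;pricing&lt;/code&gt; module and its &lt;code&gt;discount&lt;/code&gt; function are hypothetical names):&lt;/p&gt;

```python
from types import SimpleNamespace

# "pricing" stands in for an imported third-party package module,
# faked here with SimpleNamespace so the sketch is self-contained.
pricing = SimpleNamespace(discount=lambda total: total - 5.0)

order_total = pricing.discount(100.0)  # behaviour shipped by the package

# Because the code runs inside our own process, we can simply swap the
# function out (and put breakpoints inside the replacement while debugging).
pricing.discount = lambda total: total - 10.0
custom_total = pricing.discount(100.0)
```

&lt;p&gt;Try doing that to a function living inside someone else's microservice.&lt;/p&gt;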

&lt;p&gt;All of this is very well, but using packages the right way isn't a given. Especially on the operational side, it can be a hassle.&lt;/p&gt;

&lt;h2&gt;
  
  
  Operating a package infrastructure
&lt;/h2&gt;

&lt;p&gt;Introducing packages in your organisation isn't as easy as you would think. Sure, packaging &amp;amp; hosting them is definitely doable with a few lines of CI/CD definitions and a private package repository. Sadly, that's not where the lifecycle of a package stops. Keeping it up to date &amp;amp; notifying people of updates can quickly become cumbersome. Your organisation (or at least your team) needs to think about how to introduce packages smoothly into the organisational processes. In the following paragraphs, we'll discuss a few things to keep in mind to keep your processes in sync after the introduction of packages.&lt;br&gt;
We'll dive a bit deeper into how you can easily deploy &amp;amp; keep your packages up to date with your code repository. I will explore a few ways of dealing with ownership over packages and link that to strategies that lower the workload for either consumers or producers of packages, or somewhere in between.&lt;/p&gt;

&lt;h3&gt;
  
  
  Maintaining versions
&lt;/h3&gt;

&lt;p&gt;Keeping things up to date is hard. Most things that need constant maintenance are dreaded by even the most organised people. Think of documentation: the value is obviously clear, but still most programmers prefer doing other things over writing documentation.&lt;/p&gt;

&lt;p&gt;The same holds for packages. If you spread your logic over more consumers, all of those consumers will need to have their versions upgraded and will need to follow your patches, bug fixes and deprecations. Especially the last two points are the hardest to get right, as they might require a lot of well-executed communication to keep all consumers in sync.&lt;/p&gt;

&lt;p&gt;There are a few strategies to deal with this, which we will explore in the next few paragraphs. Obviously, no single solution will be the golden nugget for your organisation. If you decide on a strategy, make sure you have cross-team buy-in as well, as misusing the strategy might spread like a contagious disease across your engineering department.&lt;/p&gt;

&lt;h4&gt;
  
  
  Broadcasting Strategy
&lt;/h4&gt;

&lt;p&gt;The first strategy is to do as most open-source projects do: communicate about upgrades and deprecations in advance. Each upcoming upgrade can be tracked on a certain branch and changes can be included in this unpackaged version. This helps communication, as consumers can follow along with the changes for the new version.&lt;br&gt;
Secondly, as the version gets released from this branch to a new version on master, the changelogs are communicated in a uniform format: changes, fixes, deprecations and potentially a migration guide for major releases. This helps consumers understand whether the changes impact them, either in a positive way (a bug got fixed) or a negative way (a function got deprecated).&lt;br&gt;
Finally, for major increments with complex migrations and multiple deprecations, both versions stay maintained during a certain time window, which gives consumers more time to do the migrations before the prior version becomes unmaintained.&lt;/p&gt;

&lt;p&gt;One can already spot a few inefficiencies in this model, depending on the use case. Maintaining two versions of the same framework just to keep old consumers running might be a drastic waste of time for your developers. Both doing the communicating and maintaining both versions take focus away from the team.&lt;br&gt;
Secondly, communication is hard, especially if it is needed cross-team or even cross-department. The risk stays present that a certain team didn't receive, or ignored, your deprecation warnings.&lt;/p&gt;

&lt;p&gt;In general, the open communication style lends itself well to fully decoupled, key projects. The massive energy drain of double maintenance and broadcast communication might need organisational changes that are not worth the effort. If, on the other hand, your company is already fairly well structured with regards to cross-team communication, you might find the low touchpoint of the broadcasting strategy interesting for your use case.&lt;/p&gt;

&lt;h4&gt;
  
  
  Subscription strategy
&lt;/h4&gt;

&lt;p&gt;The next strategy can be seen as a watered-down version of the broadcast strategy. If your package will only be used by a select number of consumers, maybe you don't need to broadcast your changes to all consumers. If you have people subscribe to your package, listing the functions they are delegating to your code, you can compile a mapping of teams to functionalities. If your new version then changes a functionality, you can let those people know and define maintenance timelines and deprecation windows together with them.&lt;/p&gt;
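
&lt;p&gt;Such a mapping of teams to functionalities can stay very lightweight. A minimal sketch, with hypothetical team and function names:&lt;/p&gt;

```python
# Hypothetical subscription registry: each consuming team lists the
# package functions it delegates to.
subscriptions = {
    "checkout-team": {"compute_price", "validate_cart"},
    "logistics-team": {"compute_price", "format_label"},
}

def teams_to_notify(changed_functions: set[str]) -> set[str]:
    """Return the teams whose subscribed functions are touched by a release."""
    return {
        team
        for team, functions in subscriptions.items()
        if not functions.isdisjoint(changed_functions)
    }
```

&lt;p&gt;A release that only touches &lt;code&gt;format_label&lt;/code&gt; then only involves the logistics team, instead of a broadcast to everyone.&lt;/p&gt;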

&lt;p&gt;Reading between the lines, this means that consuming teams &lt;strong&gt;must&lt;/strong&gt; have the diligence to subscribe and, a bit harder to ask, keep the functionalities they consume up to date. If this is not the case, you might forget to involve a consumer regarding the windows outlined above.&lt;/p&gt;

&lt;p&gt;Summarising, this is a great low-touchpoint way to allow reusability across teams, while not spamming people with updates on the state of the package. If you are not up for endless communication, maybe this strategy is for you. It does put a lot of responsibility on the consumers' shoulders, and maybe you don't want that. Especially in a single-team context, this wide communication might not fully make sense.&lt;/p&gt;

&lt;h4&gt;
  
  
  Push strategy
&lt;/h4&gt;

&lt;p&gt;A final strategy to explore is the complete opposite of the broadcast strategy. Only major version upgrades, in most cases with deprecations, are communicated, if necessary. Only those cases can actually cause issues for consumers, so why bother communicating every small detail?&lt;/p&gt;

&lt;p&gt;There is another way to go about this, which works quite well in small teams or simple bounded contexts. Why not, as part of your release process, plug the new version of your package into all consumers? Especially if you, for example, added an improvement to a certain functionality, why update the version numbers one by one in each consumer? Why not put that power into the hands of the person merging with master?&lt;/p&gt;

&lt;p&gt;This can easily be done with a CI/CD pipeline. Imagine your second-to-last step is the packaging and deployment of a certain version. Then you can make the last step of your pipeline check out the consuming repos, make the change to upgrade the version and, finally, push the changes back to each consuming repo. After this step has run, the normal CI/CD pipeline of the consuming repository will be triggered by the code change and will eventually deploy with the new version.&lt;/p&gt;
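
&lt;p&gt;The version-bump step itself can be tiny. A sketch, assuming each consumer pins the package in a &lt;code&gt;requirements.txt&lt;/code&gt; (the package name is hypothetical); the surrounding checkout and push would be plain git commands in the pipeline:&lt;/p&gt;

```python
import re

def bump_pin(requirements_text: str, package: str, new_version: str) -> str:
    """Rewrite a 'package==x.y.z' pin to the freshly released version."""
    pattern = rf"^{re.escape(package)}==\S+$"
    return re.sub(pattern, f"{package}=={new_version}", requirements_text, flags=re.M)

# In the real pipeline, the rewritten file is committed and pushed back to
# each consuming repo, which then triggers that repo's own CI/CD run.
updated = bump_pin("shared-models==1.2.0\nrequests==2.31.0", "shared-models", "1.3.0")
```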

&lt;p&gt;This strategy assumes a few things, making it not viable for larger contexts. First, it assumes you know exactly what the consumers use and the impact a rollout of a new version has. Secondly, you should have access to the consuming repositories, as you need to upgrade and potentially roll back a version if necessary. Finally, you still need a lightweight form of either of the above strategies within your bounded context.&lt;/p&gt;

&lt;p&gt;If you do know your context is small, for example a single team, it should be fairly straightforward to recycle common definitions used across several of the repos you own by putting them in a shared package. You could then, after a definition changes, update all of your downstream repos. In general, this works quite well, as the entire team knows the impact of changing a certain piece of functionality. If not, an MR review should catch such issues.&lt;/p&gt;

&lt;p&gt;To summarise, if you are working in a small context, where information of impact is already quite shared, it might be an option to put the power over the consumers in the hands of the producers of the package. Deprecations still follow their own path, but in most shared packages, this happens only sporadically and in a well organised way.&lt;/p&gt;

&lt;h4&gt;
  
  
  Conclusion on strategies
&lt;/h4&gt;

&lt;p&gt;In the paragraphs above, we saw three possible strategies you can employ in your own team or company to allow for easier sharing of code by means of packages. Each strategy exists on a spectrum, from a lot of power to consumers to more power to producers. Each came with its own benefits and downsides, possibly shifting your ideal solution somewhere along the spectrum.&lt;/p&gt;

&lt;p&gt;All of the above strategies require some form of buy-in from your team members, the first strategy a lot more so than the last. This means some change management is required, but starting small, with the simplest practicalities, helps you build a more complex strategy in the future.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>learning</category>
      <category>programming</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Decoupling with Packages</title>
      <dc:creator>Karel Vanden Bussche</dc:creator>
      <pubDate>Thu, 23 Nov 2023 09:16:25 +0000</pubDate>
      <link>https://dev.to/karelvandenbussche/decoupling-with-packages-2784</link>
      <guid>https://dev.to/karelvandenbussche/decoupling-with-packages-2784</guid>
      <description>&lt;h2&gt;
  
  
  Decoupling
&lt;/h2&gt;

&lt;p&gt;A lot is being said about decoupling systems from each other in software development. Nobody likes to be blocked by another team that you have limited control over. Decoupling helps with this issue by sharing only the interaction between multiple components, while keeping the internals hidden from the outside world.&lt;/p&gt;

&lt;p&gt;Developers also found out that decoupling isn't only handy when dealing with overlap between multiple teams; even in a single team, overlap might make things more complex. Heavy dependencies on a certain component in your business logic might make it harder to swap that component for something else later on.&lt;/p&gt;

&lt;p&gt;Both are valid use cases for decoupling. Both define cases where you do not want your components to be a singular monolith, where no single brick can be swapped for another. Decoupling helps with making the bricks hot-swappable, without having the entire structure tumble down.&lt;/p&gt;

&lt;h2&gt;
  
  
  Microservices for decoupling
&lt;/h2&gt;

&lt;p&gt;I've talked about decoupling in the last paragraph &amp;amp; why it can be useful. I have not talked about how to implement it. Most programmers, especially in the backend space, will instantly shout "Microservices are the best way to decouple your components!". To a certain level I agree, but let me first introduce what they are on an abstract level, before I go over reasons not to use them.&lt;/p&gt;

&lt;p&gt;Microservices are small nodes of computation, each node having its own endpoints as interfaces &amp;amp; acting as a micro-webserver. A microservice on its own has no value, as it only defines very basic operations in, in most cases, a single business domain. Obviously, with a single business domain, you cannot build an application.&lt;br&gt;
Most microservices are built into a kind of mesh, where one microservice can communicate with another. Another type of connection is to the API gateway, which connects the outside world with your inner mesh of microservices. This gives you the ability to connect your external interfaces with a reaction chain internal to the microservice mesh.&lt;/p&gt;

&lt;p&gt;Taking this into account, we immediately can list a few benefits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Very simple components that can be easily connected into a chain of operations.&lt;/li&gt;
&lt;li&gt;Easier ownership of each business domain, as they are linked to one or more microservices.&lt;/li&gt;
&lt;li&gt;API gateway hides the inner complexity of your microservice chain &amp;amp; mesh.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There are thus quite a few upsides of using this architecture, especially in larger teams. Of course, we also need to highlight some downsides:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The chain of operations can become quite complex if undocumented. The fact that you use a loosely coupled system, means that the history of the operation chain is lost if not actively kept.&lt;/li&gt;
&lt;li&gt;Communication between servers is faulty by nature. gRPC tries to solve most of these issues, but can't hide all of them.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this article, I want to propose another solution to the second issue, mainly to be used for sharing logic in team-internal systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Package
&lt;/h2&gt;

&lt;p&gt;Let me take a step back before we dive deeper into the other proposed solution and why it works. I'll first explain what a package is; later in this article, I'll make the jump back to our initial problem of decoupling.&lt;/p&gt;

&lt;p&gt;A package is in most cases something that contains a few different items, but encapsulates the entirety of those items in a single whole. In distribution, this can be a carton box. In sales, this can be a collection of features to be sold for a certain price. In software engineering, this is in most cases a collection of functions that are collected and boxed to be used somewhere else.&lt;/p&gt;

&lt;p&gt;In practically every software project, you depend on things other people implemented &amp;amp; exposed to the public. Github is full of open source projects, some of which you have undoubtedly used in your own projects. Think of React.JS, Gunicorn, FastAPI, Lombok, ... Each of these projects needs to package the functions &amp;amp; utilities of their implementation for use by others.&lt;/p&gt;

&lt;p&gt;A few benefits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Code running in this way is an integrated part of your own logic, making it easier to debug, jump around in the code and possibly overwrite certain behaviour.&lt;/li&gt;
&lt;li&gt;New versions can be released without consumers needing to upgrade immediately, thanks to semantic versioning.&lt;/li&gt;
&lt;li&gt;Most programming languages have packages as a first class citizen. You cannot, after all, implement everything in your own codebase.&lt;/li&gt;
&lt;/ol&gt;
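
&lt;p&gt;The second benefit leans on the semantic versioning contract: in &lt;code&gt;MAJOR.MINOR.PATCH&lt;/code&gt;, only a major bump may break consumers. A minimal sketch of the upgrade decision this enables:&lt;/p&gt;

```python
def is_safe_upgrade(current: str, candidate: str) -> bool:
    """Under semantic versioning, minor and patch bumps keep backwards
    compatibility, so only the major number decides whether an upgrade
    can be taken without migration work."""
    current_major = int(current.split(".")[0])
    candidate_major = int(candidate.split(".")[0])
    return candidate_major == current_major
```

&lt;p&gt;Dependency resolvers apply exactly this logic when you pin a package with a compatible-release range.&lt;/p&gt;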

&lt;h2&gt;
  
  
  Packages for decoupling
&lt;/h2&gt;

&lt;p&gt;Notice that in the last paragraph we listed large open source libraries. Note how you have probably never touched the internals of any of them, yet still use them frequently and without issues? Well, this is exactly a way of decoupling logic. In a strict sense, you are still running the given logic on the same machine as the rest of your code. From a business perspective, though, you are fully decoupled from the people implementing these packages. None of them (excluding rare cases) work in your team or even your company.&lt;/p&gt;

&lt;p&gt;As such, I propose building &amp;amp; reusing packages as another way of decoupling. Thinking of the benefits, you can see that they make it possible to decouple and organise the implementation of certain business domains with little effort. So let's take a look at the benefits &amp;amp; downsides of a microservice architecture versus a package:&lt;/p&gt;

&lt;p&gt;Microservice benefits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It's seamless to upgrade your logic without other teams' intervention. For packages, you need to manually update the version to follow the upstream.&lt;/li&gt;
&lt;li&gt;The microservice architecture has more frameworks &amp;amp; buy-in from the development community.&lt;/li&gt;
&lt;li&gt;Each microservice can choose different machine resources for its workload. Some services might need more CPU/Memory/Scaling than others.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Package benefits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You are running your code integrated in your larger code base. This lowers complexity of the solution, as operation chains and dependencies are easier to follow.&lt;/li&gt;
&lt;li&gt;You are running your code on the same machine, meaning you do not need to think about cross-server communication and request-response flows.&lt;/li&gt;
&lt;li&gt;You can choose when to upgrade your service to a new version, giving you time to upgrade your code at your own pace.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As you can see, the benefits for each are fairly balanced. It is thus a matter of which benefits outweigh the other for your specific application. If your package doesn't add a big load to your machine's resources, I think they are a valid way to increase decoupling in a low effort way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We started by explaining decoupling and why it is such a big topic in software engineering, especially in larger corporations. Keeping your business domains cleanly separated, and upholding the possibility to work on each in isolation, helps businesses move rapidly.&lt;/p&gt;

&lt;p&gt;In most cases, microservice architectures are used to lower the coupling between components, as only the communication is shared between the different services. This means only interfaces are shared &amp;amp; internals hidden. Of course, this is not the holy grail and there are quite some downsides as well, such as less visibility in both the operation chain &amp;amp; the effects of downstream calls. Complexities of API calls are also not to be underestimated.&lt;/p&gt;

&lt;p&gt;We explored the definition of a package and raised that it might be a good alternative to microservices to decouple certain parts of your code. The characteristics of both solutions were put side-to-side to show in which cases packages might make sense for your solution.&lt;/p&gt;

&lt;p&gt;Hopefully, this article inspired you to not blindly follow microservice or monolith architectures as the holy grails of development. The landscape of software engineering is as wide as it is deep, meaning there are practically always more solutions than the single one your focus falls on now. Keep an eye out for those as well, as the holy grail is &lt;strong&gt;always&lt;/strong&gt; dependent on the application.&lt;/p&gt;

</description>
      <category>python</category>
      <category>softwaredevelopment</category>
      <category>programming</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Flexibility in Integration Engineering</title>
      <dc:creator>Karel Vanden Bussche</dc:creator>
      <pubDate>Fri, 17 Nov 2023 09:59:13 +0000</pubDate>
      <link>https://dev.to/karelvandenbussche/flexibility-in-integration-engineering-43hd</link>
      <guid>https://dev.to/karelvandenbussche/flexibility-in-integration-engineering-43hd</guid>
      <description>&lt;h2&gt;
  
  
  What are integrations
&lt;/h2&gt;

&lt;p&gt;The word integration is used in different contexts. In most cases, it describes something that unifies two things into a single whole. One country can integrate another, as happened often in history when empires swallowed smaller states. A recipe can integrate a fresh ingredient to create an entirely new experience.&lt;/p&gt;

&lt;p&gt;Finally, there are data integrations, the topic of this article. These integrations combine multiple data sources into a new whole. Each integration adds to this unified data model, leveraging the scale of the data to bring forward new ideas, tools and operations.&lt;/p&gt;

&lt;p&gt;In general, most data integrations act on data streams. Each stream goes from a "chaotic" specific representation to a unified format that is shared across all integrations. We call this a unified data model.&lt;/p&gt;

&lt;p&gt;Such a data model can power multiple applications while hiding the complexities of the underlying systems. If you, for example, query Google, you can find links to PDFs, webpages, images... All of these are presented in a uniform representation: the link. Only by following the link, and thus diving deeper into the data, can you find the actual underlying source.&lt;/p&gt;

&lt;h2&gt;
  
  
  The demons of a unified data model
&lt;/h2&gt;

&lt;p&gt;As discussed before, each integration has a specific data model. Of course, the data model is not the only thing to take into account. Below we will discuss a few important dimensions and attributes of integrations.&lt;/p&gt;

&lt;p&gt;Integrations can be made in multiple ways. Focusing mainly on data ingestion, you can ingest data via:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;APIs&lt;/li&gt;
&lt;li&gt;Webhooks&lt;/li&gt;
&lt;li&gt;SFTP (files)&lt;/li&gt;
&lt;li&gt;Database queries&lt;/li&gt;
&lt;li&gt;...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Secondly, the format plays a big role in how you will extract the data. A few examples are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JSON&lt;/li&gt;
&lt;li&gt;XML&lt;/li&gt;
&lt;li&gt;CSV&lt;/li&gt;
&lt;li&gt;Binary (Excel sheets, gzipped data, proprietary formats, ...)&lt;/li&gt;
&lt;li&gt;...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Finally, the last dimension I'll expand on in this article is something that might not be important for your application, but if it is, largely defines the capabilities of your system: liveness.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Realtime&lt;/li&gt;
&lt;li&gt;Query-able&lt;/li&gt;
&lt;li&gt;Scheduled&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of the dimensions above adds complexity to your uniform data model. The downsides of all of the above attributes need to be incorporated in your uniform model, as otherwise, you don't have a unified whole.&lt;/p&gt;
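
&lt;p&gt;To make the unification concrete, here is a minimal sketch mapping two hypothetical sources, a JSON API and a CSV file drop, onto one unified record (all field names are made up):&lt;/p&gt;

```python
import csv
import io
import json
from dataclasses import dataclass

@dataclass
class Order:
    """The unified data model every integration maps onto."""
    order_id: str
    amount: float

def from_json(payload: str) -> Order:
    # Source A exposes a JSON API with its own field names.
    data = json.loads(payload)
    return Order(order_id=data["id"], amount=float(data["total"]))

def from_csv(payload: str) -> list[Order]:
    # Source B drops CSV files over SFTP, again with its own names.
    reader = csv.DictReader(io.StringIO(payload))
    return [Order(order_id=row["ref"], amount=float(row["value"])) for row in reader]
```

&lt;p&gt;Everything downstream of these adapters only ever sees &lt;code&gt;Order&lt;/code&gt;, no matter how chaotic the source side is.&lt;/p&gt;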

&lt;h2&gt;
  
  
  Why flexibility is key
&lt;/h2&gt;

&lt;p&gt;Understanding the complexities of the huge landscape of possible integrations helps with mapping your route through it. Sadly, even if you understood the entire landscape, that still would not be enough to architect a perfect solution. Imagine a new integration comes on the market after you have done your analysis &amp;amp; it uses an entirely new way of working. Then you will be scrambling to fit it into your well-defined framework.&lt;/p&gt;

&lt;p&gt;As such, I keep a simple mantra when building integrations: "Flexibility is key".&lt;/p&gt;

&lt;p&gt;Flexibility is hard to focus on, because it essentially means "being able to cope with chaos". In this article I will nonetheless give a few pointers on how to keep an eye on your code's flexibility. Flexibility exists on a scale, as staying flexible takes effort too. In most cases a balance between effort and flexibility must be struck: a business should not invest effort in flexibility that tries to rein in chaos that does not exist yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to deal with flexibility
&lt;/h2&gt;

&lt;p&gt;Following the last paragraph, the first pointer I can give is: try to map the chaos you are dealing with. This is inherently a difficult task.&lt;br&gt;
In most cases you are not working in a system without limits, so you can usually put some restrictions on the possibilities you are trying to map. Each possibility has a probability of occurring, even if you need to estimate it.&lt;br&gt;
The first step in scoping where your flexibility must lie is thus understanding how likely each case is. This helps you decide the priority with which to deal with each.&lt;/p&gt;

&lt;p&gt;Secondly, a good way to make it easier to add new possibilities to your code is to separate your responsibilities cleanly. If you work in engineering, this statement will not be new to you: separation of concerns is a big topic in software engineering, and rightfully so. If your system is split into logical components, adding an extra consumer or an extra interface will not be the end of the world. Building your software as logically split steps, with components that have well-defined responsibilities, will help you extend your implementation more easily.&lt;/p&gt;
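&lt;p&gt;As a sketch of what this separation can look like in code (Python, with illustrative names): transports only fetch bytes and parsers only decode them, so adding a new source means adding one function on one axis, never touching the other:&lt;/p&gt;

```python
import csv
import io
import json
from typing import Callable, Iterable

# Two independent axes of variation, kept apart on purpose.
Transport = Callable[[], bytes]             # how data arrives (API, SFTP, ...)
Parser = Callable[[bytes], Iterable[dict]]  # how data is decoded (JSON, CSV, ...)

def json_parser(raw: bytes) -> Iterable[dict]:
    return json.loads(raw.decode("utf-8"))

def csv_parser(raw: bytes) -> Iterable[dict]:
    return list(csv.DictReader(io.StringIO(raw.decode("utf-8"))))

def ingest(transport: Transport, parser: Parser) -> list:
    # The pipeline only composes the two responsibilities.
    return list(parser(transport()))

# A stub transport stands in for an API call or SFTP download here.
records = ingest(lambda: b'[{"id": 1}]', json_parser)
```

&lt;p&gt;A real system would add retries, authentication and schema mapping around this, but the shape stays the same: one new integration touches one axis.&lt;/p&gt;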

&lt;p&gt;Lastly, taking the time to fix the assumptions that have been invalidated by reality will help keep your solution robust in the long run. Accepting that your analysis wasn't truly wrong, just not 100% accurate, is important to stay flexible yourself. Taking the time to rewrite your code slightly, instead of building a quick and dirty solution on top of the previous one, might be the best gift to future you.&lt;br&gt;
Here I would reuse the prioritisation we explored before: if only a single one of your integrations uses a certain way of communicating, go for the one-off quick and dirty solution; if you see more and more occurrences, revisit your analysis from the first point and check whether it still matches reality.&lt;/p&gt;

&lt;p&gt;With these tips, I hope you have a few more tools to rein in chaos and stay flexible in the ever-changing world of software. You never know what next year will bring...&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article we spoke about the nature of integrations: how they form the uniform mesh holding multiple heterogeneous systems together. Due to the non-uniform nature of the underlying parts, the mesh needs to be able to deal with complexity.&lt;/p&gt;

&lt;p&gt;We looked into a few dimensions along which integrations can differ, and how each strains your mesh a little more.&lt;/p&gt;

&lt;p&gt;To deal with this, we focused on why flexibility is key in such an environment, why it is so important, and how to get a good overview of it.&lt;/p&gt;

&lt;p&gt;Lastly, we discussed ways to lower the cost of flexibility and thus lower the hurdles to integrating with new systems. I gave a few tips on how I personally keep flexibility high while keeping implementations simple.&lt;/p&gt;

&lt;p&gt;My hope is that this article opens your eyes to the complexity of staying flexible in ever more complex environments, and that the tools I've used in the past aid you in increasing flexibility, so that you remain able to integrate new and innovative solutions into your applications.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>programming</category>
      <category>development</category>
      <category>softwaredevelopment</category>
    </item>
    <item>
      <title>Forgiveness as an Engineer</title>
      <dc:creator>Karel Vanden Bussche</dc:creator>
      <pubDate>Tue, 07 Nov 2023 18:11:57 +0000</pubDate>
      <link>https://dev.to/karelvandenbussche/forgiveness-as-an-engineer-4lbg</link>
      <guid>https://dev.to/karelvandenbussche/forgiveness-as-an-engineer-4lbg</guid>
      <description>&lt;h2&gt;
  
  
  What is forgiveness
&lt;/h2&gt;

&lt;p&gt;According to the dictionary, forgiveness is "To give up resentment against or stop wanting to punish (someone) for an offense or fault; pardon."&lt;/p&gt;

&lt;p&gt;This definition is very broad and can thus be applied in multiple contexts. It can be used to remove blame from a friend who offended you with a joke, or to stop resenting yourself for a mistake you made five years back. All of these uses serve to distance yourself from anger or other negative feelings attached to someone or something.&lt;/p&gt;

&lt;p&gt;The number of contexts you can use this skill in is endless. In this article, I would like to focus on how forgiveness can help you as an engineer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Forgiveness in a team
&lt;/h2&gt;

&lt;p&gt;It might sound like a weird place to apply this, but if you think about it, it really isn't. Anger and other negative emotions seep into interpersonal relationships and make it harder to work with others. No company exists that is not made up of different individuals, and your relationships with them define how you work together.&lt;/p&gt;

&lt;p&gt;As this is the easiest translation, we'll start by discussing the myriad ways forgiveness can help you work better with your colleagues.&lt;/p&gt;

&lt;p&gt;Let's say John and Mary are both part of a company, where John is the main contact person for team J and Mary for team M. Mary was overruled by John during a discussion and now holds quite a grudge. She resents him for it, and this creates tension between them. This is quite a predicament, as Mary and John connect the two teams. Communication deteriorates rapidly and the teams become isolated.&lt;/p&gt;

&lt;p&gt;Another example is the following. Andrew leads a team with Bethany as a direct report. Bethany is new to the team and doesn't really know how things work yet. On one development, she messed up big time. Andrew takes most of the blow and resents Bethany for the mistake.&lt;/p&gt;

&lt;p&gt;In both cases, negative feelings rapidly degrade the performance of team members or entire teams. In the first case, a communication line that was critical for the company stalled, as Mary procrastinated on working with John to avoid feeding her frustration. In the second case, innovation by Bethany is reined in, as making mistakes now means fewer chances of promotion by her superior.&lt;/p&gt;

&lt;p&gt;In both cases, if each person forgave the mistake, the problem would slowly fade into the background until neither even remembered it, or until they laughed about it at a team event.&lt;/p&gt;

&lt;p&gt;I am not proposing that you need to forgive everything, but as in all things in life, it's a matter of gravity and balance. The stakes are high, not only for you personally, but for the company and your role as well.&lt;/p&gt;

&lt;h2&gt;
  
  
  Forgiveness of yourself
&lt;/h2&gt;

&lt;p&gt;When talking about forgiveness, people tend to focus on the external world and the relationships they have with other people. The most important relationship you have though, is with yourself.&lt;/p&gt;

&lt;p&gt;I identify a few cases where the art of forgiving is important. The first is when you resent yourself for a mistake, a stupid decision or an awkward interaction you had. In each of these, you have the choice between resenting your past self or forgiving them, so that you can look at the behaviour objectively.&lt;/p&gt;

&lt;p&gt;All engineers have at some point thought about their past work: "Well, why did I write it like this?". Especially when dealing with legacy systems and processes, this thought recurs often. There are two ways to deal with it: either you blame your past self for making your job harder, or you look at the problem objectively, with forgiveness. Your past self could never have known the full scope of complexities the system currently deals with. So can you really blame them?&lt;/p&gt;

&lt;p&gt;When you think about it, it is surprising how often this is linked to Imposter Syndrome. When we make mistakes, we blame ourselves and see ourselves as less adept at the task at hand. Even if we solved our mistake rapidly, we keep a nagging feeling about our past actions. Letting these feelings go can help you, instead of playing the victim, find the root cause of the mistake and fix it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This short article focused on a few places where forgiveness, and a stoic approach to engineering in general, can help you in multiple ways. It can help you fit into a business where interpersonal relationships drive most of the value. It can help you mentor people without attached judgment. Lastly, it might help you be a bit less hard on yourself the next time you make a mistake in your life.&lt;/p&gt;

&lt;p&gt;In the end, even though they are important cues, emotions pass. This thought might help you the next time you're fretting about the stupid decisions you made many years back.&lt;/p&gt;

</description>
      <category>softwaredevelopment</category>
      <category>softwareengineering</category>
      <category>watercooler</category>
    </item>
    <item>
      <title>Why Queues for Streaming?</title>
      <dc:creator>Karel Vanden Bussche</dc:creator>
      <pubDate>Wed, 12 Apr 2023 17:24:30 +0000</pubDate>
      <link>https://dev.to/karelvandenbussche/why-queues-for-streaming-2g6i</link>
      <guid>https://dev.to/karelvandenbussche/why-queues-for-streaming-2g6i</guid>
      <description>&lt;h1&gt;
  
  
  Big Data brings Big Headaches
&lt;/h1&gt;

&lt;p&gt;In the Big Data space, a lot can go wrong, in multiple places, at multiple times during the day. Everyone wants to avoid the midnight automated alerting phone calls, the quick production fixes on a weekend or, even worse, the monster under the bed: &lt;em&gt;data loss&lt;/em&gt;. Data-intensive applications in general require a lot of resilience to different types of issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Network issues (API requests timing out, ...)&lt;/li&gt;
&lt;li&gt;Code issues&lt;/li&gt;
&lt;li&gt;Transient issues (Database being overloaded, ...)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Issues like the ones above make it hard to keep a consistent state within applications. When data quality is a hard requirement, this becomes a big problem during processing. Add the &lt;strong&gt;big&lt;/strong&gt; back to this data and costs quickly explode from unnecessary (re)processing.&lt;/p&gt;

&lt;p&gt;To make sure we don't lose any progress between units of computation, persisting the partial result is a common strategy within data pipelines. This helps us store the previously computed data, even though we might be hitting certain issues. Work can then be quickly picked up from the last successful step.&lt;/p&gt;

&lt;p&gt;This principle works for both batch pipelines &amp;amp; streaming pipelines. In batch pipelines we can store most of the data in bulk in a database, blob store or other persistent storage. In streaming pipelines, we mostly use queues for lightweight messages or events.&lt;/p&gt;

&lt;p&gt;In the following article, I'll go deeper into the benefits and downsides of using queues to increase the resilience of your pipelines. First, we'll define some terms used throughout the article regarding Tasks and Workloads. Next, the paradigm of queues is explained with regard to streaming Workloads, and a short example shows how you could use queues to their full potential in your own project. Finally, we'll briefly discuss a few common implementations of queues: bucket queues, priority queues and Google Cloud PubSub.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tasks and Workloads
&lt;/h2&gt;

&lt;p&gt;Within the data engineering field, we can split most workloads into distinct steps. Each step represents a unit of computation that can't be split into any further meaningful subtasks.&lt;/p&gt;

&lt;p&gt;A great example of such a "meaningful" subtask is the ETL definition. Each step in the classic Extract-Transform-Load pipeline cannot be split into smaller subtasks without significant overhead. You could split it into more granular computations, but persisting the in-memory values to persistent storage at each extra boundary would outweigh the benefit in cost.&lt;/p&gt;

&lt;p&gt;Each of these tasks defines its own input &amp;amp; output. The implementation of the task defines the way the input is transformed into the output.&lt;br&gt;
The input will in most cases be a value or file. The output can be a value or file as well. Lastly, one or more side-effects can happen in a task. This can be a write to a database, a request to an external service, ...&lt;/p&gt;

&lt;p&gt;A chain of such tasks combined is called a workload. A workload defines a directed, acyclic graph of tasks to be run after each other. For the sake of this article, we'll also call this a pipeline: a chain of individual tasks.&lt;/p&gt;

&lt;p&gt;I borrowed this terminology from &lt;a href="https://airflow.apache.org/" rel="noopener noreferrer"&gt;Apache Airflow&lt;/a&gt;, to use something that is known by most data engineers &amp;amp; translates well to the following paragraphs on queues.&lt;/p&gt;

&lt;p&gt;Below, a Directed Acyclic Graph is shown. All tasks trigger a subsequent task and no cycles are present in the graph. The directed nature is required to know which task's input depends on which task's output. Cycles are not allowed, as otherwise your workflow would never end. Use cases for cycles do exist, for example game loops, but they are not within the scope of this article.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjqjdrns1cce7gn7t0go.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjqjdrns1cce7gn7t0go.png" alt="A Direct-Acyclic Graph; All tasks trigger a subsequent task; No cycles are present in the graph" width="642" height="642"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Of course, each task can go horrendously wrong. Say B fails in the above example; D is then never computed. The challenge is: how do we fix the &lt;strong&gt;entire&lt;/strong&gt; pipeline as efficiently as possible?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz9va3tt8xvms286di8zz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz9va3tt8xvms286di8zz.png" alt="If node B fails, we cannot run node D anymore &amp;amp; we need to fix node B first" width="642" height="642"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Queues, the underrated time manipulators
&lt;/h1&gt;

&lt;p&gt;Let's take a small step back. We know about persisting between steps, and we know we don't want data loss. But how do we apply this in practice?&lt;/p&gt;

&lt;p&gt;For streaming applications, we mainly use queues for this. These simple constructs are like a dynamic window over a theoretically infinite stream of incoming messages, pushed by publishers. They dilate time until a consumer starts the clock again, each tick being a message coming in.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6bsj5fsplxofqd26vixp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6bsj5fsplxofqd26vixp.png" alt="A description of a simple FIFO queue data structure" width="800" height="204"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most people understand the core concepts of queues, but there are some implicit practicalities that are interesting to explore. If we look at this figure, a few benefits can be inferred:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data isn't lost until the consumer dequeues a message. As such, a queue is a type of persistent storage, even if "persistent" is only a short amount of time.&lt;/li&gt;
&lt;li&gt;Data ordering is kept. This can be necessary when you get events that need to follow a logical ordering or older messages depend on newer messages. An example could be writes to a DB, if we write &lt;code&gt;A = 10&lt;/code&gt; &amp;amp; &lt;code&gt;A = 15&lt;/code&gt; afterwards, we want the latest value to be written last &amp;amp; be the final value.&lt;/li&gt;
&lt;li&gt;Queues can be used to decouple consumers and producers, leaving them loosely connected. As a practical example, imagine that your publisher is implemented, but your consumer is still in review. With a queue between them, this is not a problem: your data is persisted in the queue &lt;strong&gt;as long as it doesn't exceed the window size&lt;/strong&gt;. This alleviates most transient issues; for example, when your consumer's machine is suddenly evicted, it no longer blocks your publisher from pushing messages.&lt;/li&gt;
&lt;li&gt;Resilience against code issues in the consumer can be achieved by wrapping your logic in a try-except. When you hit an error, you act as a publisher and enqueue the message you just dequeued back onto the queue. The main downside is that the in-order property of the FIFO queue is lost.&lt;/li&gt;
&lt;/ol&gt;
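&lt;p&gt;Benefit 4 can be sketched in a few lines of Python, with the standard library's &lt;code&gt;queue.Queue&lt;/code&gt; standing in for a real message queue (the failing message and handler are made up for the example). Note how the retried message ends up last, illustrating the lost ordering:&lt;/p&gt;

```python
import queue

q = queue.Queue()
for msg in ["ok-1", "boom", "ok-2"]:
    q.put(msg)

processed = []
attempts = {}

def process(msg):
    # Fails the first time it sees "boom", succeeds on retry.
    attempts[msg] = attempts.get(msg, 0) + 1
    if msg == "boom" and attempts[msg] == 1:
        raise RuntimeError("transient failure")
    processed.append(msg)

while not q.empty():
    msg = q.get()
    try:
        process(msg)
    except RuntimeError:
        q.put(msg)  # re-enqueue: no data loss, but FIFO ordering is lost
```

&lt;p&gt;After the run, &lt;code&gt;processed&lt;/code&gt; contains all three messages, with the retried one at the end instead of in its original position.&lt;/p&gt;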

&lt;p&gt;Be aware that no data structure is truly infinite, and the same limitation applies to queues. The window size is the upper limit on the number of entries (or the maximum number of bytes, depending on the implementation) a queue can hold. The decoupling thus only holds until the queue overflows, at which point data loss still happens. Make sure you build for this and slightly overscale your consumers, so that they can actually keep up with everything in the queue.&lt;/p&gt;

&lt;h1&gt;
  
  
  Benefits of the Task model
&lt;/h1&gt;

&lt;p&gt;Now that we know some attributes of queues, we can link back to the Task and Workload model. Remember that a workload is a chain of Tasks linked to each other by directed edges. For batch operations, it's easy to persist your data and trigger the next step. This is possible because a defining attribute of batch is that the input is &lt;strong&gt;finite&lt;/strong&gt;, so we know when to trigger the next step. For streaming, this attribute doesn't hold: all streams are theoretically infinite.&lt;/p&gt;

&lt;p&gt;We can reframe our way of thinking regarding streams. Each element of a stream can be seen as a mini-batch. If we take this a step further, we can translate the tasks in the batch example to the same task, but done on a mini-batch. Now we actually do know when a Task is done for streaming &amp;amp; we can trigger the next step.&lt;/p&gt;

&lt;p&gt;To put this in practice, queues are the ideal trigger. A few things make it perfect for this use case:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Decoupling makes it easy to alleviate data loss when one of the multiple consumer machines is suddenly unresponsive.&lt;/li&gt;
&lt;li&gt;If the consumers cannot keep up, no data loss is happening (until the window size is hit).&lt;/li&gt;
&lt;li&gt;The queue size and oldest-message age expose a lot of information on issues that might be happening regarding scaling or evicted machines. This helps with the observability of problems, which I wrote about &lt;a href="https://dev.to/karelvandenbussche/the-pyramid-of-alerting-1g48"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When applying queues &amp;amp; mini-batches to the Tasks &amp;amp; Workload model, we get the figure below. Each task works on a mini-batch, represented as a box in the queue. For brevity, if the queue was fully successful, no boxes are drawn on the figure below. Note that individual tasks can still fail, impacting the final workload. Data engineers should make sure the fallback implementation follows the application's needs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3w6ucztpez2i04fir5gv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3w6ucztpez2i04fir5gv.png" alt="Workload &amp;amp; Task example for streaming workloads" width="800" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Streaming ETL
&lt;/h1&gt;

&lt;p&gt;Every data engineer has heard about &lt;a href="https://nl.wikipedia.org/wiki/Extraction,_Transformation_and_Load" rel="noopener noreferrer"&gt;ETL&lt;/a&gt;. When thinking about ETL, we mainly think about large data sets we need to transform and persist in some database. Batch processing works splendidly for these types of behaviour and business needs.&lt;/p&gt;

&lt;p&gt;Sometimes, though, realtime data is required. An example is a weather application receiving data from multiple sensors all over the country. People will not want to wait an hour to learn that it's raining one village over. In this case, realtime or mini-batch processing makes more sense.&lt;/p&gt;

&lt;p&gt;Realtime requires us to work not in batches, but in mini-batches that align with our update-frequency needs. In the case of the weather app, updates every 5 minutes would be good enough, while for certain applications realtime means sub-second. As previously explained, mini-batches and streaming share a lot of characteristics.&lt;/p&gt;

&lt;p&gt;A possible implementation of the weather app could look like the figure given below. Each of the steps in E-T-L has its own queue, for their own type of resiliency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rn4y038xnr7b3jgddvi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rn4y038xnr7b3jgddvi.png" alt="Definition of the weather app use case" width="800" height="322"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First of all, data needs to be collected. To keep the example practical, we're going to collect data for the weather app described above. We have a lot of sensors spread over the entire country, and each sensor outputs a steady amount of data points.&lt;/p&gt;

&lt;p&gt;The first step in our ETL pipeline is the Extract step. A possible solution is a small cloud function that takes the data, inserts it into some kind of database or blob storage (to prevent data loss) and sends the location of the persisted blob of data as a message to the output queue of the Extract step. Let's call this queue the E-T queue.&lt;/p&gt;

&lt;p&gt;Secondly, the raw data goes to the Transform step. This step adds some extra computed columns, for example the volume of the precipitation. If we want to apply (pre)aggregations in any way, this is also the place to do so. The output is a message containing the single processed value. For future reference, this queue is the T-L queue.&lt;/p&gt;

&lt;p&gt;Finally, the Load step will persist the value that came in. As this step is very dependent on the database load &amp;amp; locking behaviour, some messages might go faster than others. After this step, the data is persisted &amp;amp; can be used by the application.&lt;/p&gt;

&lt;p&gt;Note that the faster this pipeline completes, the faster data is available in the application. As such, we prefer regular updates and thus a low latency between the Extract and Load steps.&lt;/p&gt;
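&lt;p&gt;Condensed to its skeleton (plain Python, with &lt;code&gt;queue.Queue&lt;/code&gt; standing in for real message queues and a list standing in for the database), the pipeline could look like this. The sensor names and the computed column are made up for the example:&lt;/p&gt;

```python
import queue

# The E-T and T-L queues decouple the three ETL steps from each other.
e_t, t_l = queue.Queue(), queue.Queue()
db = []  # stands in for the application's database

def extract(reading):
    e_t.put(reading)  # in reality: persist the raw blob, publish its location

def transform():
    while not e_t.empty():
        r = e_t.get()
        r["volume_l_per_m2"] = r["mm_rain"]  # 1 mm of rain is 1 litre per m2
        t_l.put(r)

def load():
    while not t_l.empty():
        db.append(t_l.get())

extract({"sensor": "ghent-1", "mm_rain": 2.5})
extract({"sensor": "bruges-3", "mm_rain": 0.0})
transform()
load()
```

&lt;p&gt;In production each step would run as its own scaled-out consumer, but the shape is the same: steps only talk to queues, never to each other directly.&lt;/p&gt;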

&lt;p&gt;The two queues we discussed make use of different benefits of queues. First, the E-T queue has the following benefits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;As transform steps are the most code-heavy implementations, the previously discussed retry mechanism can be applied here to build resilience against implementation issues. Do note that if we use this mechanism, we need to have idempotent implementations. If ordering is required, this retry mechanism can't be used in its simplest form.&lt;/li&gt;
&lt;li&gt;Secondly, observability is key for transformation steps. Having a good idea of queue size and oldest-message age gives us the information needed to throw alerts and create monitoring dashboards. We can also use this information for scaling behaviour, as most transform steps are stateless processes and can be scaled based on current workloads.&lt;/li&gt;
&lt;li&gt;The decoupled nature of publisher &amp;amp; consumer is very important here. We don't want to drop messages if the consumer is down. This would otherwise mean data loss for our application, which has direct business impact.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The second queue we introduced, is the T-L queue. This queue is mainly constrained by the database load. The queue helps with a few things here:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We are unsure whether the producer outputs messages faster than the consumer can persist them. Here the decoupling makes sure this does not block future messages from eventually being persisted.&lt;/li&gt;
&lt;li&gt;As we might hit transaction aborts or database connectivity issues, we want to be able to still insert these values. As such, the retry mechanism discussed above is a must-have for this behaviour.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I hope this example made it clear that queues can be used for a number of benefits. Mainly the persistent nature, the decoupling &amp;amp; ordering (in some cases) are the attributes which just make sense in the workloads/pipelines &amp;amp; task context.&lt;/p&gt;

&lt;h1&gt;
  
  
  Other interesting queues
&lt;/h1&gt;

&lt;p&gt;As the FIFO queue is not the only kind, let me give you a few other examples of interesting queues and what they can be used for.&lt;/p&gt;

&lt;p&gt;The first to tackle is the bucketed queue, shown below. It is basically multiple queues in one. This is handy when you know you need, for example, one value of each type. In the case of the weather app, this might be a sensor value for pressure, temperature and humidity, to be combined into a single data point later on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy7bnrlarzuygwte3mf3j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy7bnrlarzuygwte3mf3j.png" alt="Image description" width="800" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A more specialised queue, based on the bucket example, is the priority queue. This queue is made for when certain messages should be processed with a higher priority than others. Imagine that in the weather app case we have extreme weather events coming in. These events should be visible in the app immediately, without delay. A priority queue can give them the highest priority and push them out of the queue to consumers first.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvajbkwvr6kc02ir5adli.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvajbkwvr6kc02ir5adli.png" alt="Image description" width="800" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, a message queue I use a lot, is &lt;a href="https://cloud.google.com/pubsub" rel="noopener noreferrer"&gt;Google Cloud PubSub&lt;/a&gt;. This queue has a few interesting attributes.&lt;/p&gt;

&lt;p&gt;First, it doesn't simply send messages to the subscriber, but "leases" them. What is special about leasing is that, if a message is not acknowledged, it is returned to the queue. Acknowledgement happens when the subscriber, at the end of its processing time, tells the queue that the message has been processed correctly. This helps implement the retry mechanism discussed in the previous paragraphs. Observe that here, as with the simple retry mechanism, operations need to be idempotent or transactional, as otherwise you might get duplicate entries or other data anomalies.&lt;/p&gt;
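&lt;p&gt;The lease/acknowledge cycle can be simulated in a few lines of plain Python (this is a sketch of the semantics, not the actual PubSub client API): a pulled message is only removed once it is acked, and an expired lease makes it deliverable again:&lt;/p&gt;

```python
from collections import deque

class LeasingQueue:
    def __init__(self, lease_seconds):
        self.lease_seconds = lease_seconds
        self.ready = deque()
        self.leased = {}  # message id mapped to (message, lease deadline)

    def publish(self, msg_id, msg):
        self.ready.append((msg_id, msg))

    def pull(self, now):
        # Return expired leases to the queue before handing out a message.
        for msg_id in [m for m, (_, dl) in self.leased.items() if now > dl]:
            msg, _ = self.leased.pop(msg_id)
            self.ready.append((msg_id, msg))
        if not self.ready:
            return None
        msg_id, msg = self.ready.popleft()
        self.leased[msg_id] = (msg, now + self.lease_seconds)
        return msg_id, msg

    def ack(self, msg_id):
        self.leased.pop(msg_id, None)  # acked messages are gone for good

q = LeasingQueue(lease_seconds=10)
q.publish("m1", "sensor reading")
first = q.pull(now=0)         # leased until t=10, but never acked
redelivered = q.pull(now=11)  # lease expired, same message comes back
q.ack("m1")
```

&lt;p&gt;This redelivery-on-missing-ack is exactly why the processing on the subscriber side has to be idempotent.&lt;/p&gt;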

&lt;p&gt;Note that this also means ordering is harder to implement within these types of queues. To actually implement it, bucket queues are used, with locks making sure you don't receive the second value in a bucket before the first is acknowledged.&lt;/p&gt;

&lt;p&gt;Secondly, it is not a 1-to-1 queue, but a broadcast queue. Each message on a topic is sent to the n subscriptions linked to that topic. This makes it easy to have two outgoing edges in your DAG without pushing two messages yourself, and thus possibly failing on one or the other. In our example application, this could be persisting the transformed messages both to the database used by the app and to an analytical database used by researchers.&lt;/p&gt;

&lt;p&gt;Lastly, the PubSub queue is &lt;strong&gt;elastic&lt;/strong&gt;. Where previously I mentioned that the window size was a hard limit before data loss, this is no longer the case: capacity scales horizontally and dynamically with the needs of the queue. This is particularly useful when large spikes of messages suddenly come in; PubSub will scale to the necessary capacity, making sure you don't lose data. Do note that even here there are limits, as PubSub expires messages based on their time on the queue. And you don't want to keep billions of messages on the queue for too long anyway, as they have a cost associated with them.&lt;/p&gt;

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;In this article, we started with a few types of issues we might encounter during our processing steps. Transient issues like network problems, as well as code bugs, can have a lasting impact on your data. We've shown that data, especially big data, can suffer multiple types of problems.&lt;/p&gt;

&lt;p&gt;First, we tried to define what a Task is, to help us split larger processes into smaller steps. We defined Workloads or Pipelines as a chain of Tasks to be run one by one. This made it easy to spot the issues that arise when a node in the chain fails.&lt;/p&gt;

&lt;p&gt;We've seen that streaming has slightly different definitions &amp;amp; requirements than batch, but that we can still think of streams as mini-batches. The benefits of queues in this context were shown: the decoupling, persistence and ordering they provide help keep the issues between Tasks in check.&lt;/p&gt;

&lt;p&gt;Next, we put this into the context of a simple weather application, showing the use cases of queues in a streaming ETL workload.&lt;/p&gt;

&lt;p&gt;Finally, we explored a few other interesting queues: the bucketed queue, which defines buckets of messages to be pulled; the priority queue, which hides the buckets &amp;amp; always returns the highest-priority messages; and finally Google's managed queue, PubSub.&lt;/p&gt;

&lt;p&gt;I hope this article gave some insight into why queues make sense for different behaviours. The practical application hopefully showed you actual cases where queues shine. Lastly, I hope it made you interested in learning more about this data structure, as queues are amazing tools in the repertoire of a data engineer.&lt;/p&gt;

&lt;p&gt;Thanks for reading!&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>learning</category>
      <category>database</category>
      <category>algorithms</category>
    </item>
    <item>
      <title>The Pyramid of Alerting</title>
      <dc:creator>Karel Vanden Bussche</dc:creator>
      <pubDate>Tue, 04 Apr 2023 10:56:57 +0000</pubDate>
      <link>https://dev.to/karelvandenbussche/the-pyramid-of-alerting-1g48</link>
      <guid>https://dev.to/karelvandenbussche/the-pyramid-of-alerting-1g48</guid>
      <description>&lt;p&gt;We all came across it before. Your company is processing thousands of data points. All goes well on most days, but today, the pipelines are saying "not today". You start digging and find out that the issue is due to a single invalid or malformed message. You check the message for issues, identify the breaking issue (it's a new date format, who would have guessed) and implement a fail-safe for this specific issue.&lt;/p&gt;

&lt;p&gt;Some people might say "Well, you lost quite a bit of time digging for the root cause, why didn't you program the fail-safe in the first place?". As data engineers, we are well-trained in the art of balancing our efforts to stay efficient &amp;amp; still provide business value. There are hundreds of integrations to be made, but little time, and competitors don't wait. Secondly, data is so chaotic that it is impossible to create guards against every mutation. Lastly, each hour of programming has a business cost. As such, new integrations have priority over a few edge cases, such as faulty files or wrong assumptions. That is completely fine if you weave your net around the weakest points. It's only when certain assumptions weren't validated that the issues start.&lt;/p&gt;

&lt;p&gt;This balance between efficiency and robustness is an important equilibrium: postponing certain complex fail-safe developments brings more efficiency to the team, even if the issue bites back later. The 2 cards that keep everything from falling down in this precarious card stack are monitoring and alerting. It sounds boring to most people, but monitoring can really help identify silently failing processes in your infrastructure. These failures might grind your pipeline to a halt or could silently destroy data integrity.&lt;/p&gt;

&lt;p&gt;At OTA Insight, we actively work on moving issues from the silent category to the alerting category. Over the years, we have had both types of impact arise, though we quickly worked to resolve them permanently.&lt;br&gt;
Sometimes our messages fail to transform and, due to our aggressive retry strategy, we keep retrying these invalid or corrupt messages. When this issue is not resolved, the pipeline has significantly lower throughput and we lose precious CPU time.&lt;br&gt;
Other times, our integrity checks fail, meaning we did not know we had not been loading any data for multiple days.&lt;br&gt;
Thinking about what impact a new flow might have and brainstorming alerting at scope-time helps resolve the biggest issues that are low-hanging fruit.&lt;/p&gt;

&lt;h2&gt;
  
  
  The levels of monitoring &amp;amp; alerts
&lt;/h2&gt;

&lt;p&gt;Over the years, our team has learned valuable lessons on monitoring. When we now implement a new solution, we have a few key levels of monitoring we plan for. The image below shows the hierarchy of alerts we currently look at for new flows or business logic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskjd8w4mx7n8s5wdmpwa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskjd8w4mx7n8s5wdmpwa.png" alt="Image description" width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the following sections, we'll go over each of them, bottom to top, and show what they entail and how they help us recover from certain disasters.&lt;/p&gt;

&lt;h3&gt;
  
  
  Operational alerts
&lt;/h3&gt;

&lt;p&gt;The first type of alerts are the operational alerts. This category focuses on making sure that everything keeps running smoothly at the service/bare-metal level. These alerts monitor, for example, how many messages are in a certain queue and what the age of the oldest message is. This gives us a good idea of whether our processes are healthy.&lt;/p&gt;

&lt;p&gt;Operational alerting should have an internal hierarchy. Each operation will encounter spikes at multiple points during its lifetime, which means certain alerts might be raised falsely or their importance might be inflated. As such, we deem it worthwhile to define multiple threshold steps, each with its own importance. High-importance alerts should be acted on as soon as possible, while low-importance alerts should be looked into, but not necessarily right away. In most cases, a low-importance alert can evolve into a high-importance one. Because of this, it's valuable to look into low-importance alerts before they evolve and cripple the rest of the data pipelines.&lt;/p&gt;
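&lt;p&gt;As a minimal sketch, such a two-level hierarchy could map raw queue metrics to an alert level. The threshold numbers below are illustrative, not our production values:&lt;/p&gt;

```python
def classify_alert(queue_depth, oldest_age_minutes,
                   depth_limits=(10_000, 100_000),
                   age_limits=(30, 120)):
    """Map raw queue metrics to an alert level.

    Two thresholds per metric give the low/high hierarchy described
    above; the concrete numbers are made up for illustration."""
    low_depth, high_depth = depth_limits
    low_age, high_age = age_limits
    if queue_depth >= high_depth or oldest_age_minutes >= high_age:
        return "high"   # page someone now
    if queue_depth >= low_depth or oldest_age_minutes >= low_age:
        return "low"    # look into it before it escalates
    return "ok"
```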

&lt;p&gt;Say we did not have these alerts. Then, when our processes became unhealthy, no alarm bells would start ringing. This would be a disaster, as an invisible issue will not be actioned rapidly and might go on for days or weeks without anyone knowing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data validation
&lt;/h3&gt;

&lt;p&gt;Data validation has a lot of definitions, so let me first explain what I mean by it. The previous alerts gave an overview of operational load &amp;amp; throughput. What they did not show is whether certain values were received and whether the amount of data was (more or less) correct. Metrics that fall under data validation include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Did we receive (enough) files for a given date?&lt;/li&gt;
&lt;li&gt;Was the amount of data relatively close to the average amount of ingested rows of previous days?&lt;/li&gt;
&lt;li&gt;Can the incoming files be unzipped/loaded correctly?&lt;/li&gt;
&lt;li&gt;Did we receive the file at the correct timestamp?&lt;/li&gt;
&lt;li&gt;...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These validations are slightly more complex, as the implementer needs some prior knowledge of the dataset; for operational alerts, no prior knowledge was needed. Still, that knowledge is quite superficial, as we're only looking at the &lt;strong&gt;amount&lt;/strong&gt; of data and not the semantics of the data itself.&lt;/p&gt;

&lt;p&gt;Data validations might save your skin in cases where there is an issue with the ingestion or messages are silently dropped in your infrastructure. As such, these validations are very valuable, because, what is a data business without data?&lt;/p&gt;
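&lt;p&gt;As a minimal sketch, the volume check from the list above could look like this; the function name and the tolerance are illustrative assumptions:&lt;/p&gt;

```python
from statistics import mean


def validate_daily_volume(today_rows, history, tolerance=0.5):
    """Check today's row count against the trailing average.

    Returns a list of human-readable findings; an empty list means the
    batch passes. The 50% tolerance is an illustrative default."""
    findings = []
    if today_rows == 0:
        findings.append("no rows ingested for today")
        return findings
    if history:
        baseline = mean(history)
        deviation = abs(today_rows - baseline) / baseline
        if deviation > tolerance:
            findings.append(
                f"row count {today_rows} deviates {deviation:.0%} "
                f"from trailing average {baseline:.0f}"
            )
    return findings
```

&lt;p&gt;Note that only counts are inspected, never field values, which is what keeps this level of validation cheap to implement.&lt;/p&gt;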

&lt;h3&gt;
  
  
  Business Assumptions
&lt;/h3&gt;

&lt;p&gt;Now that we're fairly certain the data is in our pipeline and flowing through correctly, we can start asking whether the data is correct. As this requires more in-depth knowledge of the data's semantics, this piece of alerting should be owned by the team that owns the data. At OTA Insight, that's the ingesting team in most cases, but sometimes it's the consuming team that knows much better how the data should be handled.&lt;/p&gt;

&lt;p&gt;Business assumptions come in all flavours. To give an idea, I'll give a few examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We assume the header of the file always has a certain date format, but we're not sure if this will transfer to newly received files.&lt;/li&gt;
&lt;li&gt;We assume 2 values always add up to a third value. If this is incorrect, our next calculations cascade the faultiness.&lt;/li&gt;
&lt;li&gt;Are the values in a certain row correct according to our initial research?&lt;/li&gt;
&lt;li&gt;Are values within a certain range without too many outliers?&lt;/li&gt;
&lt;li&gt;...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These alerts are a bit harder to implement, as they can happen at different levels of your processing. This means that each data source should have a way of surfacing these invalidated assumptions.&lt;/p&gt;

&lt;p&gt;Preferably, these fallbacks start very broad, warning on any assumption that is broken. As time goes by, certain assumptions can be (in)validated and removed from our net, or added as hard fails if the assumption signals an actual error. After a while, our assumptions will converge to a realistic mirror of the actual underlying semantics.&lt;/p&gt;

&lt;p&gt;Implementing this is sometimes hard, as a decision needs to be made between a warning and an error. Throwing too many or too few errors might skew the data or impact data quality. On the other hand, having too many warnings makes it hard to separate truth from error. As such, this should be implemented by someone with a good understanding of both the underlying data and the data source.&lt;/p&gt;
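&lt;p&gt;A minimal sketch of this warning-versus-error split, using two of the example assumptions above; the field names, the price range and the tolerance are hypothetical:&lt;/p&gt;

```python
def check_assumptions(row, abs_tol=0.01):
    """Check two example assumptions on a single record.

    A broken hard assumption becomes an error, a broken soft one a
    warning. Field names, range and tolerance are hypothetical."""
    errors, warnings = [], []

    # Hard assumption: the two parts must add up to the total (within a
    # tolerance); if not, later calculations cascade the faultiness.
    if abs(row["part_a"] + row["part_b"] - row["total"]) > abs_tol:
        errors.append("part_a + part_b does not equal total")

    # Soft assumption: price should stay in the range seen during research.
    if not (row["price"] >= 0 and 10_000 >= row["price"]):
        warnings.append(f"price {row['price']} outside expected range")

    return errors, warnings
```

&lt;p&gt;A check typically starts life in the warnings list and is only promoted to the errors list once we're confident it signals real corruption.&lt;/p&gt;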

&lt;p&gt;When implemented, these alerts give your pipelines a safety net against invalid data. They also provide a looking glass into your assumptions, which can help you root out the invalid assumptions and reinforce the valid ones.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;People who work with Big Data know that it contains chaos. Without some kind of insight, finding &amp;amp; reducing edge cases is practically impossible. Adding different kinds of alerts helps us engineers cluster this chaos into smaller packets that can be solved independently. The result is more manageable errors and targeted fail-safes, which ultimately increases the robustness of both processors and data.&lt;/p&gt;

&lt;p&gt;The hierarchy in alerts makes it easy to delegate certain responsibilities to different teams. It also gives an idea of what knowledge is required at each level of fallback, and of the implementation complexity at each depth. Depending on your data source, you might want to weigh the costs against the boons of adding a certain validation.&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Why we don’t use Spark</title>
      <dc:creator>Karel Vanden Bussche</dc:creator>
      <pubDate>Wed, 07 Sep 2022 09:03:59 +0000</pubDate>
      <link>https://dev.to/lighthouse-intelligence/why-we-dont-use-spark-4ihh</link>
      <guid>https://dev.to/lighthouse-intelligence/why-we-dont-use-spark-4ihh</guid>
      <description>&lt;h2&gt;
  
  
  Big Data &amp;amp; Spark
&lt;/h2&gt;

&lt;p&gt;Most people working in big data know Spark (if you don't, check out &lt;a href="https://spark.apache.org/" rel="noopener noreferrer"&gt;their website&lt;/a&gt;) as the standard tool to Extract, Transform &amp;amp; Load (ETL) their heaps of data. Spark, the successor of Hadoop &amp;amp; MapReduce, works a lot like Pandas, a data science package where you run operators over collections of data. These operators then return new data collections, which allows the chaining of operators in a functional way while keeping scalability in mind.&lt;/p&gt;

&lt;p&gt;For most data engineers, Spark is the go-to when requiring massive scale due to the multi-language nature, the ease of distributed computing or the possibility to stream and batch. The many integrations with different persistent storages, infrastructure definitions and analytics tools make it a great solution for most companies.&lt;/p&gt;

&lt;p&gt;Even though it has all these benefits, it is still not the holy grail. Especially if your business is built upon crunching data 24/7.&lt;/p&gt;

&lt;p&gt;At OTA Insight, our critical view on infrastructure made us choose a different route, focused on our needs as a company, from a technical perspective, a people perspective and a long-term vision.&lt;/p&gt;

&lt;h2&gt;
  
  
  Humble beginnings
&lt;/h2&gt;

&lt;p&gt;Early on you have only 1 focus: building a product that solves a need and that people want to pay for, as quickly as possible. This means that spending money on things that accelerate you towards this goal is a good thing.&lt;/p&gt;

&lt;p&gt;In the context of this article this means: you don’t want to spend time managing your own servers, or fine-tuning your data pipeline’s efficiency. You want to focus on making it work.&lt;/p&gt;

&lt;p&gt;Specifically, we heavily rely on managed services from our cloud provider, Google Cloud Platform (GCP), for hosting our data in managed databases like BigTable and Spanner. For data transformations, we initially heavily relied on &lt;a href="https://cloud.google.com/dataproc/" rel="noopener noreferrer"&gt;DataProc&lt;/a&gt; - a managed service from Google to manage a Spark cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Managing managed services
&lt;/h2&gt;

&lt;p&gt;Our first implementation was a self-hosted Spark setup, paired with a Kafka service containing our job queue. This had clear downsides and in hindsight we don't consider it managed. A lot of side development had to be done to cover all edge cases of the deployment &amp;amp; its scaling needs: things like networking, node failures and concurrency had to be investigated, mitigated and modelled. This put a heavy strain on our development efficiency. Secondly, the price of running a full Spark cluster with 100% uptime was quite high, and creating auto-scaling strategies for it was quite hard. Our second implementation kept the same Kafka event stream, but streamed workload messages into Spark DataProc instances instead of the initially self-hosted Spark instance.&lt;/p&gt;

&lt;p&gt;The Kafka-DataProc combination served us well for some time, until GCP released its own message queue implementation: &lt;a href="https://cloud.google.com/pubsub/" rel="noopener noreferrer"&gt;Google Cloud Pub/Sub&lt;/a&gt;. At the time, we investigated the value of switching. There is always an inherent switching cost, but what we had underestimated with Kafka is that there is substantial overhead in maintaining the system, especially when the ingested data volume increases rapidly. As an example: Kafka requires you to manually shard the data streams, while a managed service like Pub/Sub does the (re)sharding behind the scenes. Pub/Sub also had some downsides, e.g. it didn't allow for longer-term data retention, but this was easily worked around by storing the data on Cloud Storage after processing. Persisting the data and keeping logs on the interesting messages made Kafka obsolete for our use case.&lt;/p&gt;

&lt;p&gt;Now that we had no Kafka service anymore, we found that DataProc was also less effective when paired with Pub/Sub, relative to the alternatives. After researching our options for our types of workloads, we chose to go a different route. It's not that DataProc was bad for our use cases, but it had some clear downsides, and some analysis taught us that there were better options.&lt;/p&gt;

&lt;p&gt;First, DataProc, at the time, had scaling issues, as it was mainly focused on batch jobs while our main pipelines were all running on streaming data. The introduction of Spark Streaming alleviated this a bit, though not fully for our case. Spark Streaming still works in a (micro-)batched way under the hood, which is required to conform to the exactly-once delivery pattern. This causes issues for workloads that do not have uniform running times. Our processors require fully real-time streaming, without exactly-once delivery, thanks to the idempotency of our services.&lt;br&gt;
Secondly, the product was not very stable at the time, meaning we had to monitor quite closely what was happening and spent quite some time on alerts. Lastly, most of our orchestration &amp;amp; scheduling was done by custom-written components, making it hard to maintain and hard to update to newer versions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building for the future
&lt;/h2&gt;

&lt;p&gt;It was clear we needed something built specifically for our big-data SaaS requirements. &lt;a href="https://cloud.google.com/dataflow/" rel="noopener noreferrer"&gt;Dataflow&lt;/a&gt; was our first idea, as the service is fully managed, highly scalable, fairly reliable and has a unified model for streaming &amp;amp; batch workloads. Sadly, the cost of this service was quite high. Secondly, at that moment in time, the service only accepted Java implementations, of which we had little knowledge within the team. This would have been a major bottleneck in developing new types of jobs, as we would either need to hire the right people or invest the effort to dive deeper into Java. Finally, our data-point processing happens mainly in our API, so many of the benefits did not outweigh the disadvantages. Small spoiler: we didn't choose Dataflow as our main processor. We still use Dataflow within the company, but for fairly specific and limited jobs that require very high scalability.&lt;/p&gt;

&lt;p&gt;None of the services we researched was an exact match: each lacked properties that were hard requirements for scaling the engineering effort at the pace the company was, and keeps, growing. At this point, we had reached product-market fit and were ready to invest in building the pipelines of the future. Our requirements were mainly keeping development efficiency high and keeping the structure open enough for new flows to be added, while also keeping running costs low.&lt;/p&gt;

&lt;p&gt;As our core business is software, keeping an eye on how many resources this software burns through is definitely a necessity. Taking into account the cost of running your software on servers can make the difference between a profit and a loss, and this balance can change very quickly. We have processes in place to keep our bare-metal waste as low as possible without hindering new developments, which in turn gives us ways to optimise our bottom line. Being good custodians of resources helps us keep our profit margins high on the software we provide.&lt;/p&gt;

&lt;p&gt;After investigating pricing of different services and machine types, we had a fairly good idea of how we could combine different services such that we had the perfect balance between maintainability and running costs. At this point, we made the decision to, for the majority of our pipelines, combine Cloud Pub/Sub &amp;amp; Kubernetes containers. Sometimes, the best solution is the simplest.&lt;/p&gt;

&lt;p&gt;The reasoning behind using Kubernetes was quite simple. Kubernetes had been around for a couple of years and was already hosting most of our backend microservices as well as our frontend apps. As such, we had extensive knowledge of how to automate most manual management away from the engineers and into Kubernetes and our CI/CD. Secondly, as we already had other services using Kubernetes, this knowledge was quickly transferable to the pipelines, which made for a unified landscape across our different workloads. The ease of scaling is Kubernetes' main selling point; pair this with the managed autoscaling that Kubernetes Engine provides and you have a match made in heaven.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fidwdybwhe68w0s4yuidj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fidwdybwhe68w0s4yuidj.png" alt="Cost-effectiveness of the same machine type on Kubernetes versus a more managed service" width="800" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It might come as a surprise, but bare-metal Kubernetes containers are quite cheap on most cloud platforms, especially if your nodes can be pre-emptible. As all our data was stored in persistent storages or in message queues in between pipeline steps, our workloads could be exited at any time and we would still keep our consistent state. Combine the cost of Kubernetes with the very low costs of Pub/Sub as a message bus and we have our winner.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building around simplicity
&lt;/h2&gt;

&lt;p&gt;Both Kubernetes and Pub/Sub are quite barebones, without a lot of bells &amp;amp; whistles empowering developers. As such, we needed a simple framework to build new pipelines &lt;strong&gt;fast&lt;/strong&gt;. We dedicated some engineering effort to building this pipeline framework at the right level of abstraction, where a pipeline has an input, a processor and an output. With this simple framework, we've been able to build the entire OTA Insight platform at a rapid pace, without constricting ourselves to the boundaries of certain services or frameworks.&lt;/p&gt;
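&lt;p&gt;Our actual framework is internal, but the input/processor/output abstraction can be sketched roughly like this; the names and the toy wiring are illustrative, not our real code:&lt;/p&gt;

```python
from typing import Any, Callable, Iterable


class Pipeline:
    """Minimal input -> processor -> output abstraction, in the spirit
    of the framework described above (a sketch, not the real thing)."""

    def __init__(self, source: Callable[[], Iterable[Any]],
                 processor: Callable[[Any], Any],
                 sink: Callable[[Any], None]):
        self.source = source
        self.processor = processor
        self.sink = sink

    def run(self) -> None:
        # Pull each message from the input, transform it, hand it to the
        # output. In production the source would be a Pub/Sub subscription
        # and the sink a database writer.
        for message in self.source():
            self.sink(self.processor(message))


# Wiring a toy pipeline: celsius readings in, fahrenheit out.
results = []
pipeline = Pipeline(
    source=lambda: [{"temp_c": 0}, {"temp_c": 100}],
    processor=lambda m: {"temp_f": m["temp_c"] * 9 / 5 + 32},
    sink=results.append,
)
pipeline.run()
```

&lt;p&gt;Because the three parts are plain callables, swapping a local list for a real queue or database only touches the wiring, not the framework.&lt;/p&gt;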

&lt;p&gt;Secondly, as most of our product-level aggregations are done in our Go APIs, which are optimised for speed and concurrency, we can replace Spark with our own business logic, calculated on the fly. This helps us move fast within this business logic and keeps our ingestion simple. Together, the framework and the aggregations in our APIs create an environment where Spark becomes unnecessary and the complexity of business logic is spread evenly across teams.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;During our growth path, from our initial Spark environment (DataProc) to our own custom pipelines, we've learned a lot about costs, engineering effort, engineering experience and growth &amp;amp; scale limitations.&lt;br&gt;
Spark is a great tool for many big data applications and deserves to be the most common name in data engineering, but we found it limiting in our day-to-day development as well as financially. Secondly, it did not fit entirely into the architecture we envisioned for our business.&lt;br&gt;
Currently, we know and own our pipelines in a way no framework could ever provide. This has led to rapid growth in new pipelines, new integrations and more data ingestion than ever before, without having to lie awake at night pondering whether this new integration would be one too many.&lt;/p&gt;

&lt;p&gt;All in all, we are glad we took the time to investigate the entire domain of services and we encourage others to be critical in choosing their infrastructure and aligning it with their business requirements, as it can make or break your software solutions, either now, or when scaling.&lt;/p&gt;

&lt;p&gt;Want to know more? &lt;a href="https://www.otainsight.com/company/careers" rel="noopener noreferrer"&gt;Come talk to us&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>python</category>
      <category>spark</category>
      <category>googlecloud</category>
      <category>bigdata</category>
    </item>
  </channel>
</rss>
