<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: David Schmitz</title>
    <description>The latest articles on DEV Community by David Schmitz (@koenighotze).</description>
    <link>https://dev.to/koenighotze</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3776%2F72db2906-3e2c-4b0d-9aac-ba0705023e3d.png</url>
      <title>DEV Community: David Schmitz</title>
      <link>https://dev.to/koenighotze</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/koenighotze"/>
    <language>en</language>
    <item>
      <title>The fallacy of quality vs. speed vs. budget</title>
      <dc:creator>David Schmitz</dc:creator>
      <pubDate>Wed, 27 Mar 2024 12:10:10 +0000</pubDate>
      <link>https://dev.to/koenighotze/the-fallacy-of-sacrificing-quality-navigating-the-trilemma-of-budget-speed-and-quality-in-software-development-ilp</link>
      <guid>https://dev.to/koenighotze/the-fallacy-of-sacrificing-quality-navigating-the-trilemma-of-budget-speed-and-quality-in-software-development-ilp</guid>
      <description>&lt;p&gt;Software Engineering folklore says "Budget, quality, or speed - choose two." But, in the context of software development, I take issue with this rhetoric, as it concerns quality. In my experience, compromising quality is not an option; it's a foundational pillar upon which effective software development stands.&lt;/p&gt;

&lt;h2&gt;The Iron Triangle of Project Management&lt;/h2&gt;

&lt;p&gt;In management, we face difficult decisions. We have a challenging timeline. We have to deliver a whole mountain of scope. We work with inexperienced people. When faced with such a trilemma, we tend to apply the old mantra "Budget, quality, or speed - choose two", illustrated next:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7arry6uxr5nfqde2uf2e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7arry6uxr5nfqde2uf2e.png" alt="Time, Budget, quality"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The illustration shows the typical dimensions impacting the software development process.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Speed&lt;/em&gt;: Time is invariably of the essence. Whether it's contractual deadlines, board expectations, or market entry strategies, time constraints are omnipresent.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Budget&lt;/em&gt;: we always have to balance our spending against what we earn in return - &lt;a href="https://en.wikipedia.org/wiki/Return_on_investment" rel="noopener noreferrer"&gt;ROI&lt;/a&gt;. Overspending without commensurate returns is futile.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Quality&lt;/em&gt;: this is the tricky one. It refers to both the visible and the invisible quality of the product, e.g., the features users see and the quality of the code beneath. Consider response times and memory consumption, but also aspects like the ability to ship updates often.&lt;/p&gt;

&lt;p&gt;Now let's consider our options. We have three dimensions - speed, budget, and quality. What happens if we turn the dial on each? If we run this experiment for development, we realize one thing: Quality cannot be negotiated!&lt;/p&gt;

&lt;p&gt;Allow me to argue.&lt;/p&gt;

&lt;h2&gt;Disregarding Quality&lt;/h2&gt;

&lt;p&gt;During my time as a software engineer, I've come to understand that there is a direct linkage between the three dimensions. The keystone in this relationship? Quality. Reduced quality affects speed, which, in turn, impacts the budget.&lt;/p&gt;

&lt;p&gt;Consider a Java engineering project I was involved in. The team members were more junior than expected, so development speed did not meet management's expectations. That went on for a couple of weeks until one manager came up with a suggestion:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Why don't we halt this obsession with unit testing? It's slowing down progress. Let's rely solely on end-of-sprint QA.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Ignoring quality in the initial development phase might give an illusion of speed. Development might appear to take off - see the next illustration. Initial reports may even look great. Client demos run smoothly. Your team hits every milestone (phase I in the illustration).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3nk8eb3bc1l36gb75gb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3nk8eb3bc1l36gb75gb.png" alt="Quality over time"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, in that project the illusion of initial speed soon gave way to a cascade of issues. Technical debt mounted, manifesting in bug-ridden, inefficient code. As time progressed, the cumulative effects became apparent: increased debugging effort, catastrophic failures, and a compromised timeline.&lt;/p&gt;

&lt;h2&gt;Take Back Quality&lt;/h2&gt;

&lt;p&gt;The consequences of neglecting quality are well-documented. In the late 1990s, NASA's $125 million &lt;a href="https://science.nasa.gov/mission/mars-climate-orbiter/" rel="noopener noreferrer"&gt;Mars Climate Orbiter&lt;/a&gt; was lost due to a simple software error.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;An investigation indicated that the failure resulted from a navigational error due to commands from Earth being sent in English units (in this case, pound-seconds) without being converted into the metric standard (Newton-seconds.)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Sounds like a good case for a test.&lt;/p&gt;
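
&lt;p&gt;A single automated test around that unit conversion would have caught the mismatch before launch. Here is a minimal sketch in Python; the function and constant names are hypothetical and not taken from the actual mission software:&lt;/p&gt;

```python
# Hypothetical sketch of the missing guard: convert thruster impulse data
# explicitly and pin the conversion down with a unit test.
# Names and values are illustrative, not from the mission software.
import math

LBF_S_TO_N_S = 4.4482216152605  # 1 pound-force-second in newton-seconds

def impulse_in_newton_seconds(impulse_lbf_s):
    """Convert a thruster impulse from pound-seconds to newton-seconds."""
    return impulse_lbf_s * LBF_S_TO_N_S

def test_impulse_conversion():
    # A raw pass-through (the Orbiter's actual bug) would return 10.0 here
    # and fail the assertion.
    assert math.isclose(impulse_in_newton_seconds(10.0), 44.482216152605)

test_impulse_conversion()
```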

&lt;p&gt;Look at the transformation practices from &lt;a href="https://itrevolution.com/product/how-to-transform/" rel="noopener noreferrer"&gt;Accelerate&lt;/a&gt;. Continuous delivery is at the heart of everything, and it depends on having good engineering practices in place, from test automation to deployment automation.&lt;/p&gt;

&lt;p&gt;But how can we get there if we disregard quality? We can't.&lt;/p&gt;

&lt;p&gt;Consider a proven approach for measuring engineering excellence, the &lt;a href="https://cloud.google.com/blog/products/devops-sre/using-the-four-keys-to-measure-your-devops-performance" rel="noopener noreferrer"&gt;DORA&lt;/a&gt; metrics.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Deployment Frequency&lt;/em&gt;: How often an organization successfully releases to production&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Lead Time for Changes&lt;/em&gt;: The amount of time it takes a commit to get into production&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Change Failure Rate&lt;/em&gt;: The percentage of deployments causing a failure in production&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Time to Restore Service&lt;/em&gt;: How long it takes an organization to recover from a failure in production&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At a high level, &lt;em&gt;Deployment Frequency&lt;/em&gt; and &lt;em&gt;Lead Time for Changes&lt;/em&gt; measure velocity. Velocity, in this case, describes how fast we can move in a certain business direction. Can ideas be experimented on in production? Can we iterate on ideas fast? &lt;em&gt;Change Failure Rate&lt;/em&gt; and &lt;em&gt;Time to Restore Service&lt;/em&gt; measure stability. Stability describes how well our product withstands adverse events, e.g., outages, internal errors, and transient system failures. Is our system self-healing, or does it require manual intervention?&lt;/p&gt;
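
&lt;p&gt;To make the four metrics concrete, here is a small sketch that computes them from simple deployment records. The record format is made up for illustration; in practice this data would come from CI/CD and incident-tracking systems:&lt;/p&gt;

```python
# Sketch: computing the four DORA metrics from simple deployment records.
# The record format is invented for illustration only.
from datetime import datetime, timedelta

deployments = [
    # (commit_time, deploy_time, caused_failure, time_to_restore)
    (datetime(2024, 3, 1, 9),  datetime(2024, 3, 1, 15), False, None),
    (datetime(2024, 3, 2, 10), datetime(2024, 3, 3, 11), True,  timedelta(hours=2)),
    (datetime(2024, 3, 4, 8),  datetime(2024, 3, 4, 12), False, None),
]
days_observed = 7

# Velocity: how often we ship, and how long a commit takes to reach production.
deployment_frequency = len(deployments) / days_observed
lead_times = [deploy - commit for commit, deploy, _, _ in deployments]
lead_time_for_changes = sum(lead_times, timedelta()) / len(lead_times)

# Stability: how often deployments fail, and how fast we recover.
restores = [restore for _, _, failed, restore in deployments if failed]
change_failure_rate = len(restores) / len(deployments)
time_to_restore = sum(restores, timedelta()) / len(restores)
```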

&lt;p&gt;The fastest way to build something implies ignoring stability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjqf952v88805m481sxgx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjqf952v88805m481sxgx.png" alt="Fast and good?"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The stable way to build something implies never changing anything.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8u468886ucsiyvy0a4mh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8u468886ucsiyvy0a4mh.png" alt="Stability vs. flexibility"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The sweet spot is a stable system that is also able to change with high velocity.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It should be clear that each DORA metric depends on having a high-quality code base. You cannot move fast if bugs haunt your code. You cannot sustain a high velocity if your architecture is one big ball of mud. And your system is not stable if it trips over its own feet at every opportunity.&lt;/p&gt;

&lt;p&gt;Investing in good coding practices and rigorous testing is vital. It ensures the longevity and maintainability of the software. It enhances customer satisfaction. It saves a considerable amount of time and resources in the long run.&lt;/p&gt;

&lt;h2&gt;Going forward - Learning from the past&lt;/h2&gt;

&lt;p&gt;The worst part is that this is not a new insight. Back in 1975, Brooks argued in the seminal book &lt;a href="https://en.wikipedia.org/wiki/The_Mythical_Man-Month" rel="noopener noreferrer"&gt;The Mythical Man-Month&lt;/a&gt;: &lt;em&gt;rushing to meet deadlines frequently results in compromising the quality of software&lt;/em&gt;. It is crucial to prioritize quality over speed, since low quality ultimately causes more serious problems and delays, e.g., higher expenses for maintenance and system failures. Let me repeat: 1975!&lt;/p&gt;

&lt;p&gt;The same insight is offered in Jones' &lt;a href="https://openlibrary.org/books/OL1425786M/Assessment_and_control_of_software_risks" rel="noopener noreferrer"&gt;Assessment and Control of Software Risks&lt;/a&gt; back in 1994.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Focus on quality, and productivity will follow.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;While speed and budget are certainly significant, they should never be prioritized at the expense of software quality. The overall success of a project hinges on the balance and interplay of all three aspects.&lt;/p&gt;

&lt;p&gt;As engineering managers and lead developers, adopting a quality-first approach should always be our mantra. Our role is to create value and deliver stellar software solutions that not only meet but exceed expectations. In the end, our reputation - and our work's standing - rests on the quality we deliver.&lt;/p&gt;

</description>
      <category>softwareengineering</category>
      <category>codequality</category>
      <category>management</category>
    </item>
    <item>
      <title>AI - Hype, reality, and the call for more diversity</title>
      <dc:creator>David Schmitz</dc:creator>
      <pubDate>Fri, 28 Apr 2023 09:14:13 +0000</pubDate>
      <link>https://dev.to/koenighotze/ai-hype-reality-and-the-call-for-more-diversity-1j45</link>
      <guid>https://dev.to/koenighotze/ai-hype-reality-and-the-call-for-more-diversity-1j45</guid>
      <description>&lt;p&gt;&lt;em&gt;Let me start by stating: I neither dismiss AI as a fad nor the astonishing progress the field and the researchers have made. Compare the old Google Translate to today's version and one sees the advances in no time. This text is only about good and healthy skepticism and being aware of the current limitations.&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;AI is everywhere.&lt;/p&gt;

&lt;p&gt;Every blog post seems to be about AI.&lt;br&gt;
Every video seems to be about AI.&lt;br&gt;
Every new tool seems to be using AI.&lt;/p&gt;

&lt;p&gt;With &lt;a href="https://openai.com/blog/chatgpt"&gt;ChatGPT&lt;/a&gt;'s public release in November 2022, things moved even faster.&lt;/p&gt;

&lt;p&gt;The goal of this article is to take a step back and try to separate fiction from reality, and hype from actual real-world business cases.&lt;/p&gt;

&lt;p&gt;One note: my work focuses on regulated industries in the EU, e.g., banks and insurance companies. So, some concerns that I raise may not apply to you.&lt;/p&gt;

&lt;h2&gt;Topics we ignore&lt;/h2&gt;

&lt;p&gt;First things first. AI is a complex and vast topic. Its proponents and critics drive the conversation from "silver bullet and panacea" to "existential risk".&lt;/p&gt;

&lt;p&gt;We exclude the following lines of argument from the rest of the article:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ethics: AI is &lt;a href="https://en.wikipedia.org/wiki/Ethics_of_artificial_intelligence"&gt;tricky&lt;/a&gt; from an ethical point of view. Consider face recognition: innocent enough in a photo application, but problematic when used to find dissidents in a crowd.&lt;/li&gt;
&lt;li&gt;Control problem: building an AI is one thing. Making sure it benefits humanity and helps us &lt;a href="https://en.wikipedia.org/wiki/The_Alignment_Problem"&gt;prosper&lt;/a&gt; is &lt;a href="https://encyclopedia.pub/entry/35791"&gt;another&lt;/a&gt;. &lt;/li&gt;
&lt;li&gt;Artificial General Intelligence: current AIs are rather narrow in application and scope. They can play Go. They can generate funny images. But they cannot do everything. An &lt;a href="https://en.wikipedia.org/wiki/Artificial_general_intelligence"&gt;Artificial General Intelligence&lt;/a&gt; could tackle a wide range of tasks, surpassing human capabilities. The implications would be significant and are ignored in this article.&lt;/li&gt;
&lt;li&gt;Existential risk: The &lt;a href="https://en.wikipedia.org/wiki/Existential_risk_from_artificial_general_intelligence"&gt;fear&lt;/a&gt; that "we might be to an AI, as an ant is to us" runs deep. Let's just point to Bostrom's book &lt;a href="https://en.wikipedia.org/wiki/Superintelligence:_Paths,_Dangers,_Strategies"&gt;Superintelligence: Paths, Dangers, Strategies&lt;/a&gt; and to Harris' podcast &lt;a href="https://www.samharris.org/podcasts/making-sense-episodes/280-the-future-of-artificial-intelligence"&gt;Making Sense&lt;/a&gt; that explores this in depth.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These issues are not dismissed easily. They are urgent and important. Here we want to focus on a different angle, however. The links above should serve as a good starting point for exploring the implications.&lt;/p&gt;

&lt;h2&gt;The AI hype-train&lt;/h2&gt;

&lt;p&gt;AI has seen ups and &lt;a href="https://en.wikipedia.org/wiki/AI_winter"&gt;downs&lt;/a&gt; since its inception in computer science research. &lt;/p&gt;

&lt;p&gt;With the availability of giant data sets and computing power of unprecedented scale since the 2010s, AI has seen an incredible boost. Methods like deep learning and deep reinforcement learning promise things undreamed of before. &lt;a href="https://de.wikipedia.org/wiki/Andrew_Ng"&gt;Andrew Ng&lt;/a&gt;, known for his extensive work in the field and for co-founding &lt;a href="https://www.coursera.org/"&gt;Coursera&lt;/a&gt;, &lt;a href="https://hbr.org/2016/11/what-artificial-intelligence-can-and-cant-do-right-now"&gt;wrote&lt;/a&gt; in 2016 in the Harvard Business Review:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If a typical person can do a mental task with less than one second of thought, we can probably automate it using AI either now or in the near future. (Andrew Ng)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;With the release of ChatGPT back in November 2022 the general public started to jump on the hype train. The Google trend for "&lt;em&gt;Künstliche Intelligenz&lt;/em&gt;" (German, "&lt;em&gt;artificial intelligence&lt;/em&gt;") shows a dramatic spike in interest since the release of ChatGPT.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9kx013yK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/csky2xr96c5dwyt3dls5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9kx013yK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/csky2xr96c5dwyt3dls5.png" alt="Google trend for Künstliche Intelligenz" width="800" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And the promise just keeps growing. At the beginning of April 2023, the German IT news portal &lt;a href="https://www.golem.de/"&gt;Golem&lt;/a&gt; published an &lt;a href="https://www.golem.de/news/stimmungsanalyse-chatgpt-kann-wohl-aktienbewegungen-vorhersagen-2304-173393.html"&gt;article&lt;/a&gt; citing a research paper. The claim: the author of the paper was able to predict stock market movements just using ChatGPT.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Qynhu0iq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y1j2dinx6vwp91jkd4mi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Qynhu0iq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y1j2dinx6vwp91jkd4mi.png" alt="The Golem article - ChatGPT can predict stock markets" width="800" height="209"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;According to this article, Goldman Sachs estimates that 35% of all jobs in the financial industry are jeopardized by AI.&lt;/p&gt;

&lt;p&gt;The consulting company McKinsey &lt;a href="https://www.mckinsey.com/~/media/McKinsey/Featured%20Insights/Artificial%20Intelligence/Notes%20from%20the%20frontier%20Modeling%20the%20impact%20of%20AI%20on%20the%20world%20economy/MGI-Notes-from-the-AI-frontier-Modeling-the-impact-of-AI-on-the-world-economy-September-2018.ashx"&gt;compared&lt;/a&gt; the economic impact of AI to the invention of the steam engine.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Together these elements may produce an annual average net contribution of about 1.2 percent of activity growth between now and 2030. The impact on economies would be significant if this scenario were to materialize. In the case of steam engines, it has been estimated that, between 1850 and 1910, they enabled productivity growth of 0.3 percent per year.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Just imagine. The impact of AI might be four times the impact of the steam engine, which revolutionized the worldwide economy.&lt;/p&gt;

&lt;p&gt;Last, but not least, &lt;a href="https://arxiv.org/abs/2303.12712"&gt;research&lt;/a&gt; using the newest generation of GPT, GPT-4, claims to have found "sparks of artificial general intelligence".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mm5Hv_Lx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/41o8e5t4sgs91626zkra.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mm5Hv_Lx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/41o8e5t4sgs91626zkra.png" alt="GPT-4 and AGI" width="618" height="485"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, this is it. AI is finally on the verge of the dominance it promised for so long. Do data and compute power equal the singularity we dreamed of (or feared)?&lt;/p&gt;

&lt;h2&gt;Over-promise and under-deliver&lt;/h2&gt;

&lt;p&gt;Herb Simon, a pioneer in AI research, predicted:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Machines will be capable, within twenty years, of doing any work man can do. (Herb Simon)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This was in &lt;a href="https://en.wikipedia.org/wiki/History_of_artificial_intelligence#CITEREFCrevier1993"&gt;1965&lt;/a&gt;. We must not ridicule Simon for a bold prediction. The intention is to show that we tend to overestimate (and underestimate) the impact of the trends we are working on.&lt;/p&gt;

&lt;p&gt;Going back to ChatGPT, it is easy to see its limitations. Asking the AI to multiply two numbers results in a puzzling answer. The result of 1234 times 4567, according to ChatGPT, is 5611878. Close, but wrong.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cIFP_6sY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/oixnb1dt0iyuvrhjr4jq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cIFP_6sY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/oixnb1dt0iyuvrhjr4jq.png" alt="ChatGPT fails at basic math" width="536" height="159"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Underlying ChatGPT is a language model, not understanding. ChatGPT does not understand multiplication, or text, or even language. It has a complex internal model that it applies to prompts to generate answers. Basically like an idiot savant.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;ChatGPT is kind of this idiot savant but it really doesn't understand about truth. It's been trained on inconsistent data. It's trying to predict what they'll say next on the web, and people have different opinions. (Geoffrey Hinton)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The AI does not comprehend context, history, or abstract relations. A fun experiment is to drop in any legal text - say, the GPL license - and then ask questions based on the license.&lt;/p&gt;

&lt;p&gt;Again, not to ridicule, but to point out the limitations. &lt;/p&gt;

&lt;p&gt;So, why the hype? Why is every second post on LinkedIn about the amazing powers of ChatGPT, and AI in general?&lt;/p&gt;

&lt;p&gt;There are many factors. But predominantly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Marketing: If your product or service is &lt;em&gt;driven by AI&lt;/em&gt; then it gets attention. The same happened with the terms &lt;em&gt;Agile&lt;/em&gt; and &lt;em&gt;Cloud&lt;/em&gt;. &lt;em&gt;Agile Cloud Foobar&lt;/em&gt; means more customers than just &lt;em&gt;Foobar&lt;/em&gt;.
&lt;/li&gt;
&lt;li&gt;Lack of understanding: AI is complex, as are GANs, as is machine learning, as are LLMs. Judging their capabilities and limitations is difficult, if not impossible, without understanding how these methods work.&lt;/li&gt;
&lt;li&gt;Media and the news cycle: Media works by grabbing attention. Every tiny step of research results in a news report blown out of proportion.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's go back to the news headline from Golem above. We find some interesting details if we read the underlying research and references carefully:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;He used ChatGPT to parse news headlines for whether they’re good or bad for a stock, and found that ChatGPT’s ability to predict the direction of the next day’s returns were much better than random, he said in a recent unreviewed paper. (&lt;a href="https://www.cnbc.com/2023/04/12/chatgpt-may-be-able-to-predict-stock-movements-finance-professor-says.html"&gt;CNBC&lt;/a&gt;)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The paper has not been peer-reviewed. The results had not been challenged at the time the article was released. This is like reporting on an arbitrary blog post: the report lacks diligence in favor of attention-grabbing. ChatGPT is &lt;em&gt;the&lt;/em&gt; hot topic. Predicting stocks with it sounds magical. So journalistic duties take a step back.&lt;/p&gt;

&lt;p&gt;The point is: we must look behind the ads. We need to read and analyze the references and the actual data.&lt;/p&gt;

&lt;h2&gt;"Economic"-AI&lt;/h2&gt;

&lt;p&gt;Instead of falling for false and overblown promises, we could go a different route, which may be called &lt;em&gt;economic AI&lt;/em&gt;. The following illustration visualizes the underlying idea.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WM5xigoD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6jgkdup97zplb0z1536n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WM5xigoD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6jgkdup97zplb0z1536n.png" alt="Economic AI" width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We use a narrow and focused process that starts with a good business-driven reason for adopting AI methods and tools (&lt;em&gt;primary business strategy&lt;/em&gt;). But there is one essential step even before that: getting your data in order and accessible, i.e., &lt;em&gt;data Kung-Fu&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Data Kung-Fu&lt;/em&gt; relates to the idea of making business data flow between stances, being accessible, agile, and bendable. We often see organizations building undiscoverable data silos without any APIs or easy methods of access. Or organizations go the opposite way and start building expensive, ungoverned, and ultimately useless data lakes. Both tend to go nowhere. Instead, a "data-as-a-product" approach leads to more promising results. Whatever we do: getting the data in shape is the foundation of any AI strategy.&lt;/p&gt;

&lt;p&gt;Now we can up-skill the people in the organization: become familiar with data methods like deep learning, with tooling, and with frameworks. This is essential for judging the applicability and results of different AI approaches.&lt;/p&gt;

&lt;p&gt;Next, we build a &lt;em&gt;lighthouse&lt;/em&gt; solution. This has to be a real business-driven and impactful first vertical slice. The goal is to prove AI is an asset for the organization. Everybody can build an image-labeling application today. But building a machine learning-supported loan application tool is a different animal. Governance, security, and compliance, all come into play here and need to be solved. Only then do we know what we are dealing with.&lt;/p&gt;

&lt;p&gt;Finally, we are in the improvement and growth loop. We &lt;em&gt;understand and measure&lt;/em&gt; the impact of our solution. We &lt;em&gt;learn and improve&lt;/em&gt; our setup and we &lt;em&gt;mature and scale&lt;/em&gt; the usage of AI within the organization.&lt;/p&gt;

&lt;p&gt;In essence, this leads to narrow and targeted applications of AI to a concrete business domain. This can range from automatically identifying customers in &lt;a href="https://en.wikipedia.org/wiki/Know_your_customer"&gt;KYC processes&lt;/a&gt; to automatically processing incoming documents (&lt;a href="https://powerautomate.microsoft.com/en-us/intelligent-document-processing/"&gt;IDP&lt;/a&gt;). None of these use cases will get you a page one headline. But each will have a significant ROI.&lt;/p&gt;

&lt;h2&gt;Challenges: Trust and bias&lt;/h2&gt;

&lt;p&gt;Although we have found a replacement for the elevated promises of AI in the form of &lt;em&gt;economic AI&lt;/em&gt;, two central concerns remain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the trust in the results of an AI solution&lt;/li&gt;
&lt;li&gt;the bias sometimes found in an AI solution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both are related (how can we trust a biased AI?), but need to be addressed differently.&lt;/p&gt;

&lt;h3&gt;Human-centric-AI&lt;/h3&gt;

&lt;p&gt;The core of the idea is to move towards a &lt;em&gt;&lt;a href="https://research.ibm.com/blog/what-is-human-centered-ai"&gt;human-centric-ai&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;If we want an AI solution to be accepted, it has to meet three base &lt;a href="https://arxiv.org/ftp/arxiv/papers/2002/2002.04087.pdf"&gt;requirements&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;we need the solution to be reliable&lt;/li&gt;
&lt;li&gt;we need the solution to be safe&lt;/li&gt;
&lt;li&gt;we need the solution to be trustworthy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This idea is visualized in the next diagram.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HU_6GoWH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/23js4zm0k5jow2ybphyr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HU_6GoWH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/23js4zm0k5jow2ybphyr.png" alt="Human-centric-AI" width="800" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The human-centric-AI supports full automation, but in the end, leaves control and verification to the human. The human always stays in charge. &lt;/p&gt;

&lt;p&gt;This is what we call the &lt;em&gt;AI as a valet&lt;/em&gt; analogy. Instead of seeing eye to eye with an AI, or even regarding an AI as a human partner, we reduce AI to what it is meant to be: a very powerful tool, but a tool in the end.&lt;/p&gt;

&lt;p&gt;Imagine our phones equipped with a multitude of narrow, specialized AIs, each excelling at a special purpose: maybe monitoring our health, optimizing our calendar and mail, and improving the photos we take.&lt;/p&gt;

&lt;p&gt;Maybe this sounds banal or boring. But at least it seems realistic for now.&lt;/p&gt;

&lt;h3&gt;Bias and diversity as a solution&lt;/h3&gt;

&lt;p&gt;This leaves the second, related, challenge: Bias. This may turn out to be the most important near-term challenge for AI and computer science as a field.&lt;/p&gt;

&lt;p&gt;There have been many shocking reports of biased AIs leading to terrible results. Let's remember the &lt;a href="https://www.openaccessgovernment.org/image-search-gender-bias/130353/"&gt;gender bias&lt;/a&gt; in Google's picture search, or the &lt;a href="https://www.bloomberg.com/news/articles/2021-10-19/google-quietly-tweaks-image-search-for-racially-diverse-results#xj4y7vzkg"&gt;racial bias&lt;/a&gt; in the same engine. Both have since been addressed, kind of, but rather haphazardly instead of fixing the underlying core problem: &lt;em&gt;the data you feed into an AI determines the results&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;So, what can be done?&lt;/p&gt;

&lt;p&gt;A &lt;a href="https://www.wired.com/story/artificial-intelligence-researchers-gender-imbalance/"&gt;Wired article&lt;/a&gt; from 2018 states that less than 12% of AI researchers are not male. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--a40cVoDJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/phowp2wmt696ba9jpu52.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--a40cVoDJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/phowp2wmt696ba9jpu52.png" alt="Male dominated AI research" width="800" height="746"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Whether or not that exact ratio is correct is not the point: researchers and practitioners in the AI field do not reflect society as a whole. If AI transforms society as profoundly as we all expect, then this is a very problematic situation.&lt;/p&gt;

&lt;p&gt;But this is not limited to AI. A &lt;a href="https://gi.de/fileadmin/GI/Allgemein/PDF/Studie_Maedchen_in_der_Informatik_2022-06.pdf"&gt;study&lt;/a&gt; from the German "&lt;a href="https://gi.de/"&gt;Gesellschaft für Informatik&lt;/a&gt;" ("&lt;em&gt;Society for Computer Science&lt;/em&gt;") came up with distressing results.&lt;/p&gt;

&lt;p&gt;The following graph illustrates the interest of boys and girls in computer science over time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fsp5DbkH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bg7jxbj061n6elak456f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fsp5DbkH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bg7jxbj061n6elak456f.png" alt="Interest in computer science" width="800" height="531"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Both groups start with a similar level of interest and curiosity. But over time this changes dramatically. The study suggests many reasons, for example the stereotype of the "IT nerd". If we watch any IT-related show, we immediately see the problem.&lt;/p&gt;

&lt;p&gt;And let's not get started on the &lt;a href="https://www.theguardian.com/technology/2019/apr/16/artificial-intelligence-lack-diversity-new-york-university-study"&gt;under-representation&lt;/a&gt; of groups such as trans people.&lt;/p&gt;

&lt;p&gt;The solutions to the lack of diversity in AI and computer science are difficult and are being worked on. A short-term fix seems unlikely, however. This implies that we have to be very aware of bias in our data sets and AI solutions. If we build an AI-driven product (or any data-driven product, for that matter), we must set up tests to discover unintentional bias before using the solution on real-world problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Let's be clear. AI and its sub-disciplines &lt;em&gt;will&lt;/em&gt; transform society. New jobs will appear. Old jobs will disappear. We will work and live differently. &lt;/p&gt;

&lt;p&gt;But not overnight. &lt;/p&gt;

&lt;p&gt;Things are rarely revolutionized overnight. More often than not, progress happens through evolution instead of revolution.&lt;/p&gt;

&lt;p&gt;We need to educate ourselves on the tools and methods around AI to understand their impact. Without understanding, we are left to believe the snake-oil vendors and the fearmongers. Let's look behind the media outbursts, dissect the actual research, and not fall blindly for cheap scams.&lt;/p&gt;

&lt;p&gt;If AI is one of the things changing the world in a fundamental way over the next years (or decades), then we need to pay attention. Diversity in research and data is of utmost importance. Bias must be avoided. Trust must be built. Otherwise, AI won't be embraced and won't live up to its promise.&lt;/p&gt;




&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Title picture: Photo by &lt;a href="https://unsplash.com/de/@askkell?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Andy Kelly&lt;/a&gt; on &lt;a href="https://unsplash.com/photos/0E_vhMVqL9g?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>chatgpt</category>
      <category>ai</category>
      <category>diversity</category>
    </item>
    <item>
      <title>The cloud-agnostic-architecture illusion</title>
      <dc:creator>David Schmitz</dc:creator>
      <pubDate>Fri, 19 Aug 2022 16:39:51 +0000</pubDate>
      <link>https://dev.to/koenighotze/cloud-agnostic-architecture-2ojo</link>
      <guid>https://dev.to/koenighotze/cloud-agnostic-architecture-2ojo</guid>
      <description>&lt;p&gt;Whenever I speak with clients about their IT strategy, vendor lock-in seems to be a very urgent and important topic. Especially regulated organizations like banks and insurance companies try to avoid lock-in at all costs. This short article explores this topic and offers some suggestions for dealing with it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Vendor lock-in
&lt;/h2&gt;

&lt;p&gt;First, we need to define what we mean by &lt;em&gt;"locking into a vendor"&lt;/em&gt;. Vendor lock-in usually means that our systems depend strongly on a vendor's capabilities. Switching vendors becomes difficult or prohibitively expensive. Lock-in leads to one-sided advantages for the vendor; pricing conditions and strategies, for example, can no longer be freely negotiated. In addition, some industries, e.g., financial institutions, are required to handle lock-in explicitly. They need a plan in case a vendor shuts down &lt;a href="https://steve-yegge.medium.com/dear-google-cloud-your-deprecation-policy-is-killing-you-ee7525dc05dc" rel="noopener noreferrer"&gt;or deprecates products&lt;/a&gt; - some form of &lt;em&gt;exit-strategy&lt;/em&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The &lt;a href="https://eba.europa.eu/sites/default/documents/files/documents/10180/2551996/38c80601-f5d7-4855-8ba3-702423665479/EBA%20revised%20Guidelines%20on%20outsourcing%20arrangements.pdf?retry=1" rel="noopener noreferrer"&gt;EBA Guidelines&lt;/a&gt; require institutions to have a comprehensive, documented and sufficiently tested exit strategy (including a mandatory exit plan) when they outsource critical or important functions."&lt;br&gt;
-- &lt;cite&gt;&lt;a href="https://www.stibbe.com/en/news/2020/june/european-banking-federation-guidance-on-testing-of-cloud-exit-strategy" rel="noopener noreferrer"&gt;Source&lt;/a&gt;&lt;/cite&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let's consider an example: using a specific database like PostgreSQL.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.postgresql.org/" rel="noopener noreferrer"&gt;PostgreSQL&lt;/a&gt; is a SQL-based, relational database, similar to &lt;a href="https://mariadb.org/" rel="noopener noreferrer"&gt;MariaDB&lt;/a&gt; or &lt;a href="https://www.microsoft.com/en-us/sql-server/sql-server-downloads" rel="noopener noreferrer"&gt;Microsoft SQL Server&lt;/a&gt;. Each product offers different capabilities. PostgreSQL, e.g., offers strong support for handling CSV data and extended support for regular expressions. Microsoft SQL Server on the other hand offers &lt;a href="https://docs.microsoft.com/en-us/sql/relational-databases/views/views?view=sql-server-ver16" rel="noopener noreferrer"&gt;Views&lt;/a&gt; that update automatically.&lt;/p&gt;

&lt;p&gt;We can either use these exclusive features or not. Either we depend on PostgreSQL's handling of CSV data or we don't. In the first case, we are locked into this feature. In the second case, we need to build the equivalent functionality ourselves.&lt;/p&gt;

&lt;p&gt;Many developers should be familiar with this situation. One standard tactic to reduce lock-in like above is to use a &lt;a href="https://en.wikipedia.org/wiki/Facade_pattern" rel="noopener noreferrer"&gt;facade&lt;/a&gt;. A quick look at the Java stack shows solutions like &lt;a href="https://docs.oracle.com/javaee/7/tutorial/jms-concepts005.htm" rel="noopener noreferrer"&gt;JMS&lt;/a&gt; for messaging or &lt;a href="https://docs.oracle.com/javaee/6/tutorial/doc/bnbpz.html" rel="noopener noreferrer"&gt;JPA&lt;/a&gt; for persistence. Each with the intent to abstract the actual underlying technology away. This would, in principle, allow developers to switch between databases without changing code. The next illustration shows the idea.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyzhovsn2df93c763ro8t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyzhovsn2df93c763ro8t.png" alt="JPA as the abstraction"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A service uses JPA to access a database. It uses SpringBoot with PostgreSQL by applying &lt;a href="https://docs.spring.io/spring-data/jpa/docs/current/reference/html/" rel="noopener noreferrer"&gt;config&lt;/a&gt; A:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

spring.jpa.database-platform=org.hibernate.dialect.PostgreSQLDialect
spring.datasource.url=jdbc:postgresql://localhost:5432/test


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Switching to H2, for example for testing, requires no code changes. The service applies config B.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

spring.jpa.database-platform=org.hibernate.dialect.H2Dialect
spring.datasource.url=jdbc:h2:mem:testdb


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We do not need to touch the application code. We point to a different SQL dialect (&lt;code&gt;spring.jpa.database-platform&lt;/code&gt;) and to a different database instance (&lt;code&gt;spring.datasource.url&lt;/code&gt;), and we are good to go.&lt;/p&gt;

&lt;p&gt;In a nutshell, we insulate our code from the vendor-specific capabilities by putting a facade in between, e.g., JPA. Again, we need to stress this. &lt;em&gt;We give up on any specific database advantage if we use a facade and commit to using only the common features&lt;/em&gt;. We cannot rely on PostgreSQL-exclusive features if we want to be able to switch by configuration only.&lt;/p&gt;

&lt;p&gt;Let's apply this concept to the cloud!&lt;/p&gt;

&lt;h2&gt;
  
  
  The cloud-agnostic architecture
&lt;/h2&gt;

&lt;p&gt;Today, most IT companies have some form of cloud strategy. Choosing a provider is not trivial, especially in regulated industries (banking, insurance, healthcare). &lt;a href="https://www.law.cornell.edu/wex/sarbanes-oxley_act" rel="noopener noreferrer"&gt;SOX&lt;/a&gt;, &lt;a href="https://gdpr-info.eu/" rel="noopener noreferrer"&gt;GDPR&lt;/a&gt;, &lt;a href="https://www.hhs.gov/hipaa/index.html" rel="noopener noreferrer"&gt;HIPAA&lt;/a&gt;,... all must be considered when choosing a vendor.&lt;/p&gt;

&lt;p&gt;But work does not stop there. Companies need to handle the topic of vendor lock-in. Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure (Azure) offer comparable services and capabilities. But if we look closer, differences emerge. For example, each provider offers options for running containerized workloads. But each does this a little differently, with slightly different ways to configure the underlying runtimes and, especially, different SLAs. Compare &lt;a href="https://cloud.google.com/run/sla" rel="noopener noreferrer"&gt;Google's Cloud Run SLA&lt;/a&gt; to &lt;a href="https://www.azure.cn/en-us/support/sla/container-instances/v1_0/index.html" rel="noopener noreferrer"&gt;Azure's Container Instances&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We might wonder: &lt;em&gt;"What happens if Google raises the prices for their cloud offering? What happens if Azure deprecates a database product we are using?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A common proposal seems to be: &lt;em&gt;"Let's build our platform cloud-agnostic! Let's use Kubernetes as the insulation layer"&lt;/em&gt;. The idea seems easy enough. We use Kubernetes as the abstraction layer and depend on no provider-specific tooling.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F272px3b28wdhozeya534.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F272px3b28wdhozeya534.png" alt="The promise"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We build all services as containerized workloads, i.e., &lt;a href="https://github.com/opencontainers/image-spec/blob/main/spec.md" rel="noopener noreferrer"&gt;OCI&lt;/a&gt; images - sometimes called Docker images. We deploy these to the Kubernetes product offered by the cloud vendor. Whenever we need some capability, containers are the answer. This insulates our applications from the vendor. In principle, we could switch providers as long as Kubernetes is available.&lt;/p&gt;

&lt;p&gt;Easy? Maybe not.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Kubernetes Datacenter
&lt;/h3&gt;

&lt;p&gt;We use OCI images for everything and rely on nothing else.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;need a database? PostgreSQL in containers&lt;/li&gt;
&lt;li&gt;need object storage? Minio in containers&lt;/li&gt;
&lt;li&gt;need monitoring? ELK stack in containers&lt;/li&gt;
&lt;li&gt;need messaging? RabbitMQ in containers&lt;/li&gt;
&lt;li&gt;need XYZ? XYZ in containers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This leads to something illustrated by the next diagram.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flyjvyhhrcwg89w5rl9y7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flyjvyhhrcwg89w5rl9y7.png" alt="Realities"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We need to host many infrastructure components ourselves. If we cannot depend on the cloud provider's IAM solution, then we have to run our own, e.g., &lt;a href="https://www.keycloak.org/" rel="noopener noreferrer"&gt;Keycloak&lt;/a&gt;.&lt;/p&gt;
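&lt;p&gt;To make the "everything in containers" idea concrete, here is a minimal sketch of what self-hosting just the database implies. The manifest below is illustrative, not production-ready; a real setup also needs credentials, backups, tuning, monitoring, and an upgrade path - all of which we now own.&lt;/p&gt;

```yaml
# Illustrative only: "PostgreSQL in containers" means we own this
# manifest - plus storage, backups, upgrades, and high availability.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:14
          ports:
            - containerPort: 5432
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```

Multiply this by every item on the list above - object storage, monitoring, messaging - and the size of the self-built datacenter becomes apparent.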

&lt;h3&gt;
  
  
  Switching platforms
&lt;/h3&gt;

&lt;p&gt;Switching from AWS to Azure means switching from AWS' &lt;a href="https://aws.amazon.com/de/eks/" rel="noopener noreferrer"&gt;EKS&lt;/a&gt; to Azure's &lt;a href="https://azure.microsoft.com/en-us/services/kubernetes-service/" rel="noopener noreferrer"&gt;AKS&lt;/a&gt;. The rest can be moved as is, from AWS to Azure without any big impact. How hard can this be? We are only changing one letter, after all ;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8yzekzlekhww7ymsugyb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8yzekzlekhww7ymsugyb.png" alt="Moving between platforms"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One could even say that we could switch providers as long as we can spin up a Kubernetes cluster ourselves. A couple of VMs should be enough, and this should be possible on every platform.&lt;/p&gt;

&lt;p&gt;So, all is well, right? As we'll see things may be more complex than that.&lt;/p&gt;

&lt;h2&gt;
  
  
  The illusion of being cloud-agnostic
&lt;/h2&gt;

&lt;p&gt;The idea is so alluring. And if it worked as described, then this article would not exist.&lt;/p&gt;

&lt;h3&gt;
  
  
  Revisiting the cloud-sales-pitch
&lt;/h3&gt;

&lt;p&gt;One reason for companies to move to the cloud is to reduce the engineering effort. Using a &lt;a href="https://azure.microsoft.com/en-us/resources/cloud-computing-dictionary/what-is-saas/" rel="noopener noreferrer"&gt;SaaS&lt;/a&gt; database or a SaaS message broker or a SaaS Kubernetes is great for various reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We can reduce our operational effort. The vendor takes care of patching, updating, and so on.&lt;/li&gt;
&lt;li&gt;We can focus on our product instead of building an internal engineering effort. Or, rephrasing, &lt;em&gt;how does maintaining a load-balancer help our business?&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;We can move faster and more efficiently. The provider scales up and down, new products can be used by triggering the cloud vendor's APIs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxjwr4lgqnneglac80shv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxjwr4lgqnneglac80shv.png" alt="Reduce the effort"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In a nutshell, we refocus our efforts and concentrate on the things that are competitive advantages. Our products. Not our self-written HTTP load-balancer.&lt;/p&gt;

&lt;p&gt;If we examine the cloud-agnostic approach, the implications show. We build a custom datacenter, instead of leveraging the cloud provider. Instead of using the SaaS capabilities offered by the cloud vendor, we create our own - often worse - imitations. Instead of reducing the engineering effort, we increase it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fds06l752m84gvompa2j2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fds06l752m84gvompa2j2.png" alt="Additional complexity"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All the highlighted components must be deployed, patched, and maintained by our engineers. They work on patching a database instead of building the next great feature.&lt;/p&gt;

&lt;p&gt;But, let's say that we are fine with all this extra effort. Let's say that building our development landscape on GCP using only VMs and Kubernetes is acceptable, including hardening and securing Kubernetes, which are major tasks in themselves.&lt;/p&gt;

&lt;p&gt;Even after investing all this effort, we are still not cloud-agnostic. Let's discuss some examples, starting with the obvious and then moving to the maybe-not-so-obvious.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bare capabilities
&lt;/h3&gt;

&lt;p&gt;Looking at the distribution of datacenters, the &lt;em&gt;Global&lt;/em&gt; cloud does not seem so &lt;em&gt;Global&lt;/em&gt; after all. Depending on our business, this may or may not be an issue. For example, if we are building a system for a German bank, then we have to meet GDPR requirements. That means we are not free to use any capability worldwide, just like that.&lt;/p&gt;

&lt;p&gt;If, e.g., we want to serve clients across Africa, choosing a provider is not easy. &lt;a href="https://aws.amazon.com/de/about-aws/global-infrastructure/regions_az/" rel="noopener noreferrer"&gt;AWS&lt;/a&gt; and &lt;a href="https://azure.microsoft.com/de-de/global-infrastructure/geographies/#geographies" rel="noopener noreferrer"&gt;Azure&lt;/a&gt; have some presence in South Africa, &lt;a href="https://cloud.google.com/about/locations" rel="noopener noreferrer"&gt;GCP&lt;/a&gt; offers some CDN capabilities, but that's about it.&lt;/p&gt;

&lt;p&gt;So, building an architecture around the available datacenters is a leaky abstraction. Failover, resilience, and latency - all depend on the location of the datacenters. If one provider offers fewer locations than another, then we are locked in. We need to be aware of this fact and consider its impact when moving from one cloud to another.&lt;/p&gt;

&lt;p&gt;If we require special hardware, e.g., dedicated servers, we will find out pretty quickly, that limitless scale may be a problem, too.&lt;/p&gt;

&lt;h3&gt;
  
  
  Networking
&lt;/h3&gt;

&lt;p&gt;When it comes to networking, the different capabilities of the cloud vendors are often overlooked. Again, on a high level, the vendors seem similar if not identical.&lt;/p&gt;

&lt;p&gt;Take for example the concept of a &lt;a href="https://en.wikipedia.org/wiki/Virtual_private_cloud" rel="noopener noreferrer"&gt;Virtual Private Cloud&lt;/a&gt;, short VPC. Unlike AWS and Azure, GCP's &lt;a href="https://cloud.google.com/vpc" rel="noopener noreferrer"&gt;Virtual Private Cloud&lt;/a&gt; resources are not tied to any specific region. They are considered to be global resources. However, a VPC is part of a &lt;a href="https://cloud.google.com/storage/docs/projects" rel="noopener noreferrer"&gt;GCP project&lt;/a&gt;. A project is used on GCP to organize related resources, e.g., everything an application would need. All subnets within a VPC can communicate unless forbidden by firewall rules. If we want to control communication centrally, we can introduce a so-called &lt;a href="https://cloud.google.com/vpc/docs/shared-vpc" rel="noopener noreferrer"&gt;Shared VPC&lt;/a&gt;, where multiple projects leverage the same VPC. This architecture is not easily transferred to other providers.&lt;/p&gt;

&lt;h3&gt;
  
  
  The data gravity-well
&lt;/h3&gt;

&lt;p&gt;There are at least two aspects to consider when talking about data and the cloud: cost and data-affinity. Let's discuss cost first.&lt;/p&gt;

&lt;p&gt;Take for example Azure and their &lt;a href="https://azure.microsoft.com/en-us/pricing/details/bandwidth/" rel="noopener noreferrer"&gt;data transfer pricing&lt;/a&gt;. Moving data into (&lt;em&gt;ingress&lt;/em&gt;) an Azure data center is usually free of charge. We can upload terabytes of data without paying a single cent.&lt;/p&gt;

&lt;p&gt;Getting data out of a data center (&lt;em&gt;egress&lt;/em&gt;), on the other hand, can be expensive. Suppose we have one petabyte of data stored in an Azure datacenter and want to move that data someplace else. Using the data transfer pricing page, we end up with a five-digit figure (&lt;em&gt;Full disclosure: this is a simplified calculation. Cloud vendors offer better and more cost-effective ways to move data of this size. Pricing calculation on clouds is super-complicated and requires a Math Ph.D.&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;The point is: once data is in a specific data center, getting it out can become expensive quickly, and this should be planned for.&lt;/p&gt;
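&lt;p&gt;As a back-of-the-envelope sketch, assuming a hypothetical flat rate of $0.05 per GB (real egress pricing is tiered and vendor-specific, so treat the rate as an illustration only):&lt;/p&gt;

```java
// Ballpark egress cost. The per-GB rate is an assumption for
// illustration, NOT a vendor's actual (tiered) pricing.
public class EgressCost {
    static final double ASSUMED_RATE_PER_GB = 0.05; // hypothetical flat rate in USD

    // terabytes -> rough cost in USD
    static double egressCostUsd(double terabytes) {
        double gigabytes = terabytes * 1024.0;
        return gigabytes * ASSUMED_RATE_PER_GB;
    }

    public static void main(String[] args) {
        double petabyteInTb = 1024.0;
        // ~ $52,429 - the "five-digit figure" from the text
        System.out.printf("Moving 1 PB out costs roughly $%.0f%n",
                egressCostUsd(petabyteInTb));
    }
}
```

Even under this simplified model, a single full migration of a petabyte-scale dataset lands in the tens of thousands of dollars.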

&lt;p&gt;The second aspect we want to discuss is the data-gravity or data-affinity. The idea is that applications tend to move close to the data they need. If the customer data is in a GCP Spanner instance, then chances are that the applications will run on GCP, too. Sure, we &lt;em&gt;could&lt;/em&gt; store data in GCP and have our applications hosted on AWS. But such an architecture is loaded with downsides. Security, cost, latency, and so on may make such an approach undesirable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security
&lt;/h3&gt;

&lt;p&gt;Security is the backbone of any non-trivial architecture, especially in the cloud. Every major vendor takes this seriously. AWS, Azure, and GCP are all in the same boat here. A security breach at either vendor negatively impacts trust in all the other vendors, too. This understanding is what Google calls &lt;a href="https://cloud.google.com/blog/products/identity-security/delivering-the-industrys-most-trusted-cloud" rel="noopener noreferrer"&gt;shared fate&lt;/a&gt;. It is the cloud provider's most important job: to stay secure and help customers build secure systems.&lt;/p&gt;

&lt;p&gt;However, every vendor approaches security slightly differently. Sure, they all have an IAM approach. They all support encryption. They all allow some form of confidential computing.&lt;/p&gt;

&lt;p&gt;But if we look closely, we find differences. A secure cloud architecture uses the same concepts everywhere (access control, perimeters, ...), but how we implement them differs completely from vendor to vendor. Let's consider two examples: key management and Privileged Identity Management.&lt;/p&gt;

&lt;p&gt;All providers support some form of &lt;a href="https://en.wikipedia.org/wiki/Bring_your_own_encryption" rel="noopener noreferrer"&gt;bring-your-own-key&lt;/a&gt;. We can use our own keys for encryption and decryption. These keys are usually stored in the provider's data center. But what if we want to maintain control of the keys and &lt;em&gt;not&lt;/em&gt; store them in an external datacenter? That is where &lt;em&gt;External Key Management&lt;/em&gt; comes into play. Using this approach, we can control the location and distribution of the keys. At the time of this writing, the GCP &lt;a href="https://cloud.google.com/kms/docs/ekm" rel="noopener noreferrer"&gt;offering&lt;/a&gt; is superior to those of AWS and Azure, with capabilities like &lt;a href="https://cloud.google.com/blog/products/identity-security/control-access-to-gcp-data-with-key-access-justifications" rel="noopener noreferrer"&gt;Key Access Justifications&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Privileged Identity Management, short PIM, allows a user to carry out privileged operations without having &lt;em&gt;"root"&lt;/em&gt; or &lt;em&gt;"admin"&lt;/em&gt; privileges all the time. In a nutshell, we elevate the user's privileges for some timeframe. &lt;em&gt;"Ok, you can dig through the database for the next 15 min to debug this issue"&lt;/em&gt;. Again, at the time of this writing, only &lt;a href="https://docs.microsoft.com/en-us/azure/active-directory/privileged-identity-management/pim-configure" rel="noopener noreferrer"&gt;Azure&lt;/a&gt; offers this as part of their cloud. AWS and GCP require additional tooling.&lt;/p&gt;

&lt;p&gt;These capabilities are subject to change; vendors tend to fill gaps in their portfolios. Nevertheless, we need to be aware that each cloud vendor has different capabilities, especially when it comes to cross-cutting concerns like security.&lt;/p&gt;

&lt;p&gt;In the end, the specific gaps matter less than the general point: security capabilities leak into all other aspects of our system landscape.&lt;/p&gt;

&lt;h3&gt;
  
  
  Infrastructure-as-Code
&lt;/h3&gt;

&lt;p&gt;It is considered good practice to automate infrastructure environments using tools like Terraform or CDK. This helps reduce configuration drift and increases the overall quality of our infrastructure. However, the capabilities of the underlying cloud provider tend to get baked into the infrastructure code.&lt;/p&gt;

&lt;p&gt;Moving infrastructure code from GCP to Azure effectively means rewriting &lt;em&gt;everything&lt;/em&gt;. Sure, the concepts and the high-level architecture may be similar. But for the code, this is like moving an application from Java to Go. In Terraform, switching from the GCP provider to the Azure provider means throwing &lt;em&gt;everything&lt;/em&gt; away.&lt;/p&gt;
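&lt;p&gt;A small illustration of why the rewrite is unavoidable: the "same" piece of object storage, expressed against the GCP and Azure Terraform providers, shares no code. The resource types below are real provider resources, but the concrete names and values are made up, and attribute details vary between provider versions.&lt;/p&gt;

```hcl
# Sketch only: one logical resource ("a bucket for uploads")
# against two providers. Values are illustrative.

# GCP: a single resource suffices.
resource "google_storage_bucket" "uploads" {
  name     = "example-uploads"
  location = "EU"
}

# Azure: no one-to-one mapping - a container lives inside a storage account.
resource "azurerm_storage_account" "uploads" {
  name                     = "exampleuploads"
  resource_group_name      = "example-rg"
  location                 = "westeurope"
  account_tier             = "Standard"
  account_replication_type = "LRS"
}

resource "azurerm_storage_container" "uploads" {
  name                 = "uploads"
  storage_account_name = azurerm_storage_account.uploads.name
}
```

Not only do the resource types differ; the resource &lt;em&gt;model&lt;/em&gt; differs (resource groups, account tiers, replication settings), so none of the GCP code survives the move.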

&lt;h3&gt;
  
  
  The illusion
&lt;/h3&gt;

&lt;p&gt;A true cloud-agnostic approach is an illusion at best. Sure, we can move our OCI-compliant images (read &lt;em&gt;"Docker images"&lt;/em&gt;) from one Kubernetes environment to the next. But this is only one tiny piece of a system architecture. And let me stress, that the capabilities of Google's Kubernetes Engine and Azure's Kubernetes Service are not the same.&lt;/p&gt;

&lt;p&gt;Remember &lt;a href="https://www.oracle.com/java/technologies/java-ee-glance.html" rel="noopener noreferrer"&gt;JEE&lt;/a&gt;? Same promise. The sales pitch was: &lt;em&gt;We can build an enterprise Java application, package it as an Enterprise Archive (EAR) and then run it on JBoss, WebSphere, or WebLogic.&lt;/em&gt; Only, this was an empty promise with lots of technical challenges. Build once, run everywhere? More like build once, debug everywhere.&lt;/p&gt;

&lt;p&gt;So what can be done?&lt;/p&gt;

&lt;h2&gt;
  
  
  Strategies for dealing with lock-in
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Know the scope of the issue
&lt;/h3&gt;

&lt;p&gt;The first and most important thing is keeping an inventory of what we are using. We need to know which products are in use and why, if we want to make informed decisions and judge trade-offs. This must be automated and enforced right from the start; cleaning up after the fact is very difficult and will probably not lead to a complete inventory.&lt;/p&gt;

&lt;p&gt;Labels, tags, and comprehensive automation are key.&lt;/p&gt;

&lt;p&gt;By the way, keeping inventory is important regardless of the lock-in discussion. Losing track of what we are using is super easy in the public cloud.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prefer loosely coupled architectures
&lt;/h3&gt;

&lt;p&gt;A few architectural guidelines can lessen the pain of vendor lock-in.&lt;/p&gt;

&lt;p&gt;We can mitigate lock-in by following first principles. Loosely coupled architectures are popular for a reason: they allow us to replace single components. E.g., if we isolate our dependency on GCP's &lt;a href="https://cloud.google.com/pubsub" rel="noopener noreferrer"&gt;Pub/Sub&lt;/a&gt; in a dedicated component, moving to AWS only requires replacing that component with an &lt;a href="https://aws.amazon.com/de/sns/" rel="noopener noreferrer"&gt;AWS SNS&lt;/a&gt; version.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxvszfn41c3c2toxt0c00.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxvszfn41c3c2toxt0c00.png" alt="Using a facade to protect against vendor specifics"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;Messaging Facade&lt;/em&gt; is the only direct dependency of our &lt;em&gt;Vendor agnostic service&lt;/em&gt;. Moving from GCP to AWS means building and using a different adapter. The service itself need not change.&lt;/p&gt;
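&lt;p&gt;A minimal sketch of such a messaging facade in Java. All names are illustrative, not a real SDK; real adapters would wrap the vendor client libraries instead of printing.&lt;/p&gt;

```java
// The service depends only on this interface; provider-specific
// adapters live behind it.
interface MessagePublisher {
    void publish(String topic, String payload);
}

// One adapter per provider. In reality each would wrap the vendor SDK;
// here they just print, to keep the sketch self-contained.
class PubSubPublisher implements MessagePublisher {
    public void publish(String topic, String payload) {
        System.out.println("[gcp pub/sub] " + topic + ": " + payload);
    }
}

class SnsPublisher implements MessagePublisher {
    public void publish(String topic, String payload) {
        System.out.println("[aws sns] " + topic + ": " + payload);
    }
}

// The service never mentions a vendor. Switching clouds means wiring
// in a different adapter, not touching this class.
public class OrderService {
    private final MessagePublisher publisher;

    OrderService(MessagePublisher publisher) {
        this.publisher = publisher;
    }

    void placeOrder(String orderId) {
        publisher.publish("orders", "placed:" + orderId);
    }

    public static void main(String[] args) {
        new OrderService(new PubSubPublisher()).placeOrder("42");
        new OrderService(new SnsPublisher()).placeOrder("42");
    }
}
```

The price of the facade is the same as in the JPA example earlier: we can only rely on the behavior common to both brokers, not on Pub/Sub- or SNS-exclusive features.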

&lt;p&gt;We build &lt;a href="https://refactoring.guru/design-patterns/facade" rel="noopener noreferrer"&gt;facades&lt;/a&gt; whenever a hard dependency on a specific product is needed. This keeps the effects to a minimum when moving between providers. The ideas outlined by the &lt;a href="https://12factor.net/" rel="noopener noreferrer"&gt;twelve-factor methodology&lt;/a&gt; are very useful in this case. But, and this is a big one, we need to be aware of the underlying SLAs, which will impact our design. There is a big difference between 99.9% and 99.95% uptime. Again, a potential lock-in that is not visible in the code.&lt;/p&gt;

&lt;p&gt;We can apply the same ideas to the data architecture. The providers offer mostly comparable solutions for basics like SQL databases, document stores, or key-value use cases. Risk is reduced if we use standard databases without fringe technology. We can be pretty sure that we will find something like MongoDB, something like PostgreSQL, and something like Kafka on every major cloud.&lt;/p&gt;

&lt;p&gt;This is even possible for networks. We can design our network architecture in a way that allows it to be transferred between providers. We need to be aware of the differences, e.g., the way VPCs work. But the other building blocks, like gateways or subnets, are quite similar. As outlined above, the specific capabilities will leak, but the impact will be reduced if we follow these suggestions.&lt;/p&gt;

&lt;p&gt;Again, we need to read the SLAs. We need to be aware of the restrictions and advantages of the products. Not every SQL database supports 100 TBs of data. Not all load-balancers are equal.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strategic and tactical lock-in
&lt;/h3&gt;

&lt;p&gt;If we want to leverage the cloud provider's capabilities, then there is no way around lock-in. &lt;/p&gt;

&lt;p&gt;We want a reduced engineering effort.&lt;br&gt;
We want higher quality and security.&lt;br&gt;
We want innovation.&lt;/p&gt;

&lt;p&gt;So, we approach this strategically.&lt;/p&gt;

&lt;p&gt;We identify the areas where lock-in must be kept to a minimum. We use the facade approach outlined above. We only use products that have corresponding counterparts on the other platforms.&lt;/p&gt;

&lt;p&gt;If we need a SQL database, we choose GCP's Cloud SQL instead of Cloud Spanner. Cloud SQL is managed PostgreSQL or MySQL, similar to Azure Database for PostgreSQL.&lt;/p&gt;

&lt;p&gt;The same holds for the runtime. If we use containers, we could use Google's GKE because a comparable product exists on AWS with Amazon EKS. We could even use the serverless Google Cloud Run, because it is based on &lt;a href="https://knative.dev/docs/" rel="noopener noreferrer"&gt;Knative&lt;/a&gt;, which we could port to Azure or AWS, too.&lt;/p&gt;

&lt;p&gt;On the other hand, maybe we could fulfill requirements by using and depending on a product that is not available on other platforms. Take, for example, Google's Bigtable. It is comparable to Amazon's DynamoDB. Comparable, but not identical. Should we avoid using Bigtable in this case? Maybe not. We can ask how difficult it would be to rebuild the component using Bigtable to work with DynamoDB. We should try to put a number on this: &lt;em&gt;"We need one sprint to move from Bigtable to DynamoDB"&lt;/em&gt;. This objective number allows us to make a well-informed call on whether or not locking into Bigtable makes economic sense.&lt;/p&gt;

&lt;p&gt;Again, having a good inventory is key. We cannot make these judgment calls if we are not fully aware of who is using what.&lt;/p&gt;

&lt;h3&gt;
  
  
  Create the exit-strategy up-front
&lt;/h3&gt;

&lt;p&gt;This is the most relevant point, even if it appears trivial. We need to build our exit-strategy first. Most of the points raised above must be tackled right from the start. Otherwise, the suggestions may prove very expensive or even impossible.&lt;/p&gt;

&lt;p&gt;Building an exit-strategy with our backs to the wall is not a good place to be. We need to consider our options carefully. We must start by asking &lt;em&gt;"What happens if Google shuts down its cloud business within the next 6 months?"&lt;/em&gt; and define the necessary actions. Having a ballpark estimate of the lock-in is better than running with no idea at all.&lt;/p&gt;

&lt;p&gt;And we must not forget to revisit and update our exit strategy regularly. Maybe we are using new products, or the vendor introduced new capabilities. These must also be handled by our strategy. Otherwise, the exit-strategy will be shelf-ware, outdated the moment it is written.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;As we saw, building a fully cloud-agnostic system is difficult, if not impossible. And even if it were possible, the economics would be questionable. Dealing with lock-in strategically makes more sense. We need to be aware of lock-ins and the associated costs. A good and automated inventory system is a prerequisite. Nothing is worse than finding out by accident that an application cannot move from AWS to Azure because of surprising dependencies.&lt;/p&gt;

&lt;p&gt;There are ways to build systems that support moving between providers. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Build a comprehensive inventory&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;First and foremost, we need to know what we are talking about. Having an automated inventory is essential. We need to get this right before anything else.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Focus on common capabilities&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If we do not need some special feature of Cloud Spanner, then let's not use it. If the capabilities of PostgreSQL meet our requirements, then maybe we should go with PostgreSQL.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Prefer loosely coupled architectures&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We mentioned facades and adapters. Nothing to write home about, but mature, proven patterns that lead to a good architecture.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Run experiments&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We may not know what moving data means. We may not know what switching from Pub/Sub to SNS implies. Running experiments and trying things out is the only way to clarify these questions. Ideas are fine, but running code is the way to go.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Be strategic&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Think lock-in through before you get to it. Understand the options before you stand with your back to the wall. Choose lock-in based on numbers. What is the cost of lock-in? What is the opportunity cost of not locking in and building a custom solution?&lt;/p&gt;

&lt;p&gt;None of these suggestions is a silver bullet. None will avoid lock-in. We want to use the cloud provider's products. But we want to be in a flexible position. We want to be partners that see eye-to-eye.&lt;/p&gt;

&lt;p&gt;Being realistic about lock-in and its implications is more useful than falling into the snake-oil pit of easy cloud-agnostic solutions. &lt;/p&gt;

&lt;p&gt;There is more to this than &lt;em&gt;"use Kubernetes and you are done"&lt;/em&gt;. &lt;/p&gt;

</description>
      <category>cloud</category>
      <category>architecture</category>
      <category>database</category>
      <category>devops</category>
    </item>
    <item>
      <title>Celebrate quitting</title>
      <dc:creator>David Schmitz</dc:creator>
      <pubDate>Tue, 05 Apr 2022 13:17:53 +0000</pubDate>
      <link>https://dev.to/koenighotze/celebrate-quitting-2j61</link>
      <guid>https://dev.to/koenighotze/celebrate-quitting-2j61</guid>
      <description>&lt;p&gt;Companies celebrate when people join. They post on LinkedIn. They post photos of great swag. They send introductory mails to co-workers.&lt;/p&gt;

&lt;p&gt;All that is missing is the fireworks.&lt;/p&gt;

&lt;p&gt;But, sometimes co-workers quit.&lt;/p&gt;

&lt;p&gt;People quit their company for various reasons. Some bad, like terrible projects or leaders. Some quite good, like a new, exciting opportunity in a different part of the economy. Maybe the person wants to try something new.&lt;/p&gt;

&lt;p&gt;Who knows?&lt;/p&gt;

&lt;p&gt;Regardless of the reasons, we - as leaders - should treat quitting the same way we treat joining.&lt;/p&gt;

&lt;p&gt;In an open, honest, and transparent way.&lt;/p&gt;

&lt;p&gt;Otherwise, people will fill this information vacuum with rumors. And we can be sure that these rumors will be worse than the true story.&lt;/p&gt;

&lt;p&gt;My suggestion is to - at least - write a short mail to the co-workers.&lt;/p&gt;

&lt;p&gt;Talk about the person's growth and their impact on other people. Remember the successes, projects, and challenges. Recall some funny anecdotes. But be careful that these are indeed funny and not embarrassing.&lt;/p&gt;

&lt;p&gt;And do not forget to thank the leaving person. Don't resent that you are losing a colleague. If you are sad that a person quit, then this is a testament to the quitter.&lt;/p&gt;

&lt;p&gt;Don't be happy because you are losing a valuable colleague. But be happy for that person, because her journey continues. And remember: it is a small world. Leavers may come back later. &lt;/p&gt;

&lt;p&gt;So, let's make sure the door stays open.&lt;/p&gt;

</description>
      <category>culture</category>
      <category>transparency</category>
      <category>leadership</category>
    </item>
    <item>
      <title>Dealing with data in microservice architectures - part 4 - Event-driven architectures</title>
      <dc:creator>David Schmitz</dc:creator>
      <pubDate>Sat, 02 Apr 2022 19:47:47 +0000</pubDate>
      <link>https://dev.to/koenighotze/dealing-with-data-in-microservice-architectures-part-4-event-driven-architectures-jf0</link>
      <guid>https://dev.to/koenighotze/dealing-with-data-in-microservice-architectures-part-4-event-driven-architectures-jf0</guid>
      <description>&lt;p&gt;Microservices are a popular and widespread architectural style for building non-trivial applications. They offer immense advantages but also some challenges and traps. Some obvious, some of a more insidious nature. &lt;/p&gt;

&lt;p&gt;This series describes and compares common patterns for dealing with data and dependencies in microservice architectures. Keep in mind that no approach is a silver bullet. As always, experience and context matter.&lt;/p&gt;

&lt;p&gt;Four different parts focus on different patterns.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/koenighotze/dealing-with-data-in-microservice-architectures-part-1-nka"&gt;Sharing a database&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/koenighotze/dealing-with-data-in-microservice-architectures-part-2-14e3"&gt;Synchronous calls&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/koenighotze/dealing-with-data-in-microservice-architectures-part-3-replication-4h7b"&gt;Replication&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Event-driven architectures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;a href="https://dev.to/koenighotze/dealing-with-data-in-microservice-architectures-part-3-replication-4h7b"&gt;previous article&lt;/a&gt; looked at integrating microservices using change-data-capture and related approaches.&lt;/p&gt;

&lt;p&gt;This final piece introduces event-driven architectures (EDA), a technique for integrating microservices and handling data in general. Keep in mind that entire books have been written on this topic. So, we will only scratch the surface and try to convey the general ideas. You can find a set of resources that will help you dig deeper into these topics at the end of this post.&lt;/p&gt;

&lt;h2&gt;
  
  
  Event-driven architectures - in a nutshell
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft6wxaj37awxb3o6f3mwt.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft6wxaj37awxb3o6f3mwt.jpg" alt="A coffee shop"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s use a motivating example to drive the discussion. This time we do not use some financial context. We use an example familiar to most engineers: the coffee shop. Consider the exchange illustrated by the next diagram.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp7fm48i2kbm5mv6connt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp7fm48i2kbm5mv6connt.png" alt="The synchronous coffee shop"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A customer orders a coffee (I). The cashier receives the order and asks the barista to prepare this coffee (II). Now the barista brews the ordered coffee (III). After brewing, the barista returns the coffee to the cashier (IV). The cashier in turn hands it over to the customer (V). The cashier can serve the next customer (VI) and the customer can finally sit down and enjoy the coffee (VII).&lt;/p&gt;

&lt;p&gt;This sounds like a straightforward approach. But looking closer, the downsides become clear. Let’s examine two examples.&lt;/p&gt;

&lt;p&gt;Many people enjoy their coffee in the morning. They fetch a hot brew on their way to the office for example. This in itself is not a problem. But, picking up new coffee orders is fast, maybe a couple of seconds. Brewing coffee is slow, alas. This can take minutes. So, every day around 8 a.m. our coffee shop gets swamped with coffee orders. The barista struggles to keep up. We cannot meet our customer’s demands because of the hard coupling between ordering and brewing coffee. The disgruntled customers will probably go to another coffee shop in the future.&lt;/p&gt;

&lt;p&gt;Or, what happens if the barista burns herself and cannot brew coffee for the time being? Well, since all orders sit in a line the customers have to wait until the issue is resolved. The cashier cannot pick up new orders until the barista starts working again.&lt;/p&gt;

&lt;h3&gt;
  
  
  The asynchronous coffee shop
&lt;/h3&gt;

&lt;p&gt;Now, what happens if we do things differently, as illustrated by the next diagram?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx3i6iljlejbq2e9dfgul.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx3i6iljlejbq2e9dfgul.png" alt="Microservices!"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Again, the customer orders some coffee (I). The cashier picks up the order and asks the barista to brew the ordered coffee (II). Instead of waiting for the beverage, the cashier asks the customer to take a seat (III). This allows the cashier to serve the next customer immediately (IV). In the meantime, the barista brews the ordered coffee (V). When finished, she asks the waiter to serve the beverage (VI) and starts working on the next order. Finally, the waiter serves the coffee to a happy customer (VII).&lt;/p&gt;

&lt;p&gt;The important point is that at every step, each actor (customer, cashier, barista, waiter) does not wait longer than necessary. Each actor hands off to another one as soon as possible. This allows the actors to pick up new work (orders, beverages) as soon as possible.&lt;/p&gt;

&lt;h3&gt;
  
  
  Event-driven microservices
&lt;/h3&gt;

&lt;p&gt;Let’s see how this relates to microservices. Consider the following diagram.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkft5ur04wqafasynob2p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkft5ur04wqafasynob2p.png" alt="Microservices"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Following the advice given in the &lt;a href="https://dev.to/koenighotze/dealing-with-data-in-microservice-architectures-part-1-nka"&gt;first article&lt;/a&gt; of this series, the two microservices, A and B, each use their own database. Each service offers an independent API and can run autonomously.&lt;/p&gt;

&lt;p&gt;What happens if a caller orders a coffee by &lt;code&gt;POST&lt;/code&gt;-ing to &lt;code&gt;/orders&lt;/code&gt; on microservice A (I) - as in the following illustration? After processing this order, microservice A emits a &lt;em&gt;Coffee Ordered&lt;/em&gt; (II) event by publishing it (III) to a queue (Q).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1pez3xyem1shuazok50j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1pez3xyem1shuazok50j.png" alt="Publishing events"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Message_queue" rel="noopener noreferrer"&gt;Queues&lt;/a&gt; are part of a middleware product used to store and forward events that senders &lt;em&gt;publish&lt;/em&gt;. Clients interested in these events &lt;em&gt;subscribe&lt;/em&gt; to the queues. The middleware forwards the events to the subscriber. &lt;/p&gt;

&lt;p&gt;Example tools are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.rabbitmq.com/" rel="noopener noreferrer"&gt;RabbitMQ&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://activemq.apache.org/" rel="noopener noreferrer"&gt;ActiveMQ&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cloud.google.com/pubsub/docs/overview" rel="noopener noreferrer"&gt;Google’s Pub/Sub&lt;/a&gt;. &lt;/li&gt;
&lt;/ul&gt;
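&lt;p&gt;The publish/subscribe mechanics can be sketched with a toy in-memory broker (purely illustrative; products like RabbitMQ add persistence, acknowledgments, and delivery guarantees on top):&lt;/p&gt;

```python
from collections import defaultdict


class InMemoryBroker:
    """Toy broker: forwards published events to interested subscribers."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self.subscribers[event_type].append(handler)

    def publish(self, event):
        # Forward the event to every subscriber of its type.
        for handler in self.subscribers[event["eventType"]]:
            handler(event)


broker = InMemoryBroker()
brew_queue = []

# Microservice B subscribes to the events it cares about ...
broker.subscribe("Coffee Ordered", lambda e: brew_queue.append(e["payload"]["product"]))

# ... and microservice A publishes facts without knowing who listens.
broker.publish({"eventType": "Coffee Ordered", "payload": {"product": "Flat White"}})
```

&lt;p&gt;Note that the publisher never references the subscriber; the broker sits in between, which is exactly what decouples the services.&lt;/p&gt;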

&lt;p&gt;In essence, this design informs any other interested service of the fact that another coffee was ordered. &lt;/p&gt;

&lt;p&gt;Note the details here. We are talking about an event that happened in the past - an &lt;em&gt;unchangeable fact&lt;/em&gt;. The coffee was ordered, that’s that. Nothing can be done to change this. We can cancel that order. But that would not change the original fact. Rather a new &lt;em&gt;Coffee Order Canceled&lt;/em&gt; event would be needed.&lt;/p&gt;

&lt;p&gt;Events encapsulate state changes. In our case, the event might look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;“eventId”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="mi"&gt;31&lt;/span&gt;&lt;span class="err"&gt;da&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="err"&gt;a&lt;/span&gt;&lt;span class="mi"&gt;50-06&lt;/span&gt;&lt;span class="err"&gt;a&lt;/span&gt;&lt;span class="mi"&gt;5-4&lt;/span&gt;&lt;span class="err"&gt;dec&lt;/span&gt;&lt;span class="mi"&gt;-81&lt;/span&gt;&lt;span class="err"&gt;b&lt;/span&gt;&lt;span class="mi"&gt;2-9390862&lt;/span&gt;&lt;span class="err"&gt;bd&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="err"&gt;d&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;“eventType”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;“Coffee&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Ordered”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
  &lt;/span&gt;&lt;span class="err"&gt;“payload”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="err"&gt;“product”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;“Flat&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;White”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="err"&gt;“servingSize”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;“Large”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;


&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Implementation details aside, most approaches define events similarly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;eventId&lt;/code&gt;: a unique identifier for the event&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;eventType&lt;/code&gt;: the kind of event represented. The event type defines the semantics of the event. We expect different behavior on &lt;em&gt;Coffee Ordered&lt;/em&gt; in contrast to a &lt;em&gt;User Onboarded&lt;/em&gt; event.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;payload&lt;/code&gt;: the actual data that this event encapsulates. The concrete data depends on the event type. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An event may have additional attributes, like the type of &lt;a href="https://martinfowler.com/bliki/DDD_Aggregate.html" rel="noopener noreferrer"&gt;aggregate&lt;/a&gt; or business object it refers to. But since we only want to skim the topic of EDAs, we’ll not go into those aspects. Also, we do not go into topics like versioning, proper validation, schemas, and so on. Refer to the extended references below to learn more. Let’s agree that event design is at least as complex as API design - because that’s what it is, an API. &lt;/p&gt;

&lt;p&gt;Going back to our example. Microservice A has published the fact that a customer ordered a coffee. The next diagram shows how microservice B receives and reacts to this fact. B subscribes to events of the type &lt;em&gt;Coffee Ordered&lt;/em&gt; (I). When A publishes the event, B receives the event and can react to it. In our case, B could start brewing the ordered coffee and store this information in its private database (II).  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbuvj4nguk56rmevsd160.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbuvj4nguk56rmevsd160.png" alt="Async example"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After B has finished brewing the coffee, it publishes a new event &lt;em&gt;Coffee Brewed&lt;/em&gt;. The process is the same as for microservice A. The following illustration shows the queue, which now contains both events: the &lt;em&gt;Coffee Ordered&lt;/em&gt; and the &lt;em&gt;Coffee Brewed&lt;/em&gt;. Other services could use these events to trigger processes like serving the coffee or payment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftkbevrypmfuj4sglvq82.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftkbevrypmfuj4sglvq82.png" alt="Async example continued"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that the services must remember which events they have already received and processed. This is also out of scope for this article but should be mentioned nevertheless. The whole area of &lt;a href="https://www.lightbend.com/blog/how-akka-works-at-least-once-message-delivery" rel="noopener noreferrer"&gt;at-least-once delivery&lt;/a&gt; is covered by books and articles.&lt;/p&gt;

&lt;p&gt;With this basic example in place, let’s turn to some of the implications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implications of Event-Driven-Architectures
&lt;/h2&gt;

&lt;p&gt;To understand the system-level implications of adopting an EDA, let’s refer to the example above and see what an EDA means for our coffee shop. We’ll start with more or less obvious technical implications. Afterward, we’ll look at implications that impact the business side, too. &lt;/p&gt;

&lt;h3&gt;
  
  
  Keeping cracks from spreading
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fissnn0pjrmbwt6o51wc2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fissnn0pjrmbwt6o51wc2.jpg" alt="Cracks..."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We have two services in our example. One for ordering a coffee (A) and one for brewing coffees (B). Suppose service A fails. We cannot submit new coffee orders. In a classic synchronous design, this would mean a stand-still. No new orders, no coffee brewed.&lt;/p&gt;

&lt;p&gt;But in our event-driven case service A and service B are independent. Service B can continue to brew coffee for all already submitted orders, even if A fails.&lt;/p&gt;

&lt;p&gt;This means that customers already expecting coffee will get coffee. A tremendous win! While we serve customers, service A can be fixed without impacting any other business area.&lt;/p&gt;

&lt;p&gt;This approach improves the resilience of our system. We can survive partial failures and allow failing parts to heal and come back into service. &lt;/p&gt;

&lt;h3&gt;
  
  
  Elasticity
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fizhz93pq58qx9563xsux.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fizhz93pq58qx9563xsux.jpg" alt="A shopping queue"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we decouple systems using events, we are also able to design elastic systems. We call this elasticity: a service scaling more or less transparently with the amount of work. &lt;/p&gt;

&lt;p&gt;Imagine a shopping mall. If many customers shop at the same time, then we need more cashiers. But if only a few people are shopping, we might get away with a single one. &lt;/p&gt;

&lt;p&gt;The same is possible for our microservice. &lt;/p&gt;

&lt;p&gt;Picking up a coffee order takes a couple of seconds, but preparing coffee might take a minute or two. So, we can use a single instance of our microservice A, which accepts the orders. If there are only a few orders in the order queue, we can rely on a single instance of microservice B, responsible for brewing coffee.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvofl2bn29sc592vzysyz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvofl2bn29sc592vzysyz.png" alt="No load"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But as the number of orders increases, we are able to scale microservice B, as illustrated by the following diagram.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F840d33bgcihmbgisfmk7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F840d33bgcihmbgisfmk7.png" alt="Yes load"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is called &lt;em&gt;horizontal scaling&lt;/em&gt;. We increase our scale by adding more workers instead of more CPU to a single worker - vertical scaling. See &lt;a href="https://www.section.io/blog/scaling-horizontally-vs-vertically/" rel="noopener noreferrer"&gt;this&lt;/a&gt; article for more details.&lt;/p&gt;
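&lt;p&gt;A minimal sketch of horizontal scaling, using Python threads as stand-ins for instances of microservice B (the worker count and names are made up for the example):&lt;/p&gt;

```python
import queue
import threading

orders: queue.Queue = queue.Queue()


def barista(name: str, served: list) -> None:
    # Each thread plays one instance of microservice B, pulling work at its own pace.
    while True:
        order = orders.get()
        if order is None:  # shutdown signal
            break
        served.append((name, order))
        orders.task_done()


served: list = []
# Scale out: three "instances" of microservice B instead of one.
workers = [threading.Thread(target=barista, args=(f"B{i}", served)) for i in range(3)]
for w in workers:
    w.start()

for n in range(9):
    orders.put(f"order-{n}")

orders.join()  # wait until every order has been brewed
for _ in workers:
    orders.put(None)
for w in workers:
    w.join()
```

&lt;p&gt;Because the workers pull from the queue instead of being called directly, adding or removing instances changes throughput without touching microservice A.&lt;/p&gt;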

&lt;p&gt;Besides, the queue-approach protects microservice B from getting overwhelmed by the workload. Since the architecture allows microservice B to indicate whether or not it is ready for more work, we get &lt;a href="https://docs.microsoft.com/en-us/azure/architecture/patterns/throttling" rel="noopener noreferrer"&gt;throttling&lt;/a&gt; for free. Service B decides if it is ready to pick up new orders, or not. &lt;/p&gt;

&lt;p&gt;Being able to scale elastically must be designed into the services. We don’t go into further details here but refer to &lt;a href="https://www.redhat.com/en/topics/cloud-native-apps/stateful-vs-stateless" rel="noopener noreferrer"&gt;this&lt;/a&gt; document for more information.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dealing with late-comers
&lt;/h3&gt;

&lt;p&gt;Things are clear if our system landscape is up and running. If a coffee is ordered, we inform the world about this fact and things keep moving.&lt;/p&gt;

&lt;p&gt;But what if a new service starts or an existing one re-starts? How would that service know which coffees were ordered? How would that service know about the state of the world?&lt;/p&gt;

&lt;p&gt;One way could be to keep all events that were ever emitted. If a new service starts, it could read all events from start to finish, and finally it would know the state of the world.&lt;/p&gt;

&lt;p&gt;Another approach could be to allow newcomers to query the other systems. &lt;/p&gt;

&lt;p&gt;"Order system, please tell me about all orders". Or in tech-speak: &lt;code&gt;GET /orders&lt;/code&gt;. Adopting a protocol like &lt;a href="https://en.wikipedia.org/wiki/Atom_(Web_standard)" rel="noopener noreferrer"&gt;AtomPub&lt;/a&gt; may be a good idea in this case. Whichever approach we choose, the matter is more complicated than it might seem at first. &lt;/p&gt;

&lt;p&gt;The newcomer cannot "just" consume the events. &lt;/p&gt;

&lt;p&gt;It must remember (somehow) which events have already led to side effects, e.g., which orders were already served. And depending on the number of events, re-reading every single event might not be feasible.&lt;/p&gt;

&lt;p&gt;The point is that we have to deal with this question right from the start. Designing this into a system after the fact can be a very daunting task.&lt;/p&gt;
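&lt;p&gt;A sketch of the replay idea (the event shapes and the &lt;code&gt;orderEventId&lt;/code&gt; field are invented for illustration): the newcomer reads the full history and derives its current state, here the still-open orders.&lt;/p&gt;

```python
class NewcomerService:
    """A late-comer that rebuilds its state by replaying the event history."""

    def __init__(self):
        self.open_orders = {}

    def replay(self, history):
        # Read every past event, from the first to the last.
        for event in history:
            self.apply(event)

    def apply(self, event):
        if event["eventType"] == "Coffee Ordered":
            self.open_orders[event["eventId"]] = event["payload"]["product"]
        elif event["eventType"] == "Coffee Brewed":
            self.open_orders.pop(event["payload"]["orderEventId"], None)


history = [
    {"eventId": "e1", "eventType": "Coffee Ordered", "payload": {"product": "Flat White"}},
    {"eventId": "e2", "eventType": "Coffee Ordered", "payload": {"product": "Espresso"}},
    {"eventId": "e3", "eventType": "Coffee Brewed", "payload": {"orderEventId": "e1"}},
]

newcomer = NewcomerService()
newcomer.replay(history)
# After the replay, only the espresso order is still open.
```

&lt;p&gt;With millions of events, a full replay becomes expensive; that is where snapshots or querying the other systems come in.&lt;/p&gt;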

&lt;h3&gt;
  
  
  Complexity and error handling
&lt;/h3&gt;

&lt;p&gt;We already got an idea that EDAs, with all their advantages, introduce complexity and new error scenarios. In non-EDA systems, errors might result in exceptions. These can be caught and handled.&lt;/p&gt;

&lt;p&gt;Not so if we rely on events. We have a system without a complete &lt;a href="https://www.enterpriseintegrationpatterns.com/docs/EDA.pdf" rel="noopener noreferrer"&gt;call stack&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Suppose service A submitted a &lt;em&gt;Coffee Ordered&lt;/em&gt; event. But service B never received it - maybe service B is broken, who knows. Service A has no direct way to check what happened with the order. We could monitor the queues and check for lagging events. We could set up a “Business Probe” which checks if every order is picked up and served within 5 minutes. And so on.&lt;/p&gt;

&lt;p&gt;The point is: this has to be designed into the services. Increasing complexity. Making debugging harder.&lt;/p&gt;

&lt;p&gt;Let’s look at another scenario. What if service B received the same event twice?&lt;/p&gt;

&lt;p&gt;Most messaging systems offer &lt;a href="https://www.cloudcomputingpatterns.org/at_least_once_delivery/" rel="noopener noreferrer"&gt;at-least-once delivery&lt;/a&gt;. In a nutshell, this means the messaging system guarantees the delivery of every message. But due to some complex details that I will gladly skim over here, most systems do not guarantee &lt;a href="https://www.cloudcomputingpatterns.org/exactly_once_delivery/" rel="noopener noreferrer"&gt;exactly-once delivery&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This means that our services must be able to deal with this. One way is &lt;a href="https://en.wikipedia.org/wiki/Idempotence" rel="noopener noreferrer"&gt;idempotency&lt;/a&gt;. If a service is called with the same request, then the result is the same. We could send the same order 10 times, but still only serve a single coffee.&lt;/p&gt;
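&lt;p&gt;An idempotent consumer can be sketched by remembering processed event ids (illustrative only; a production service would persist the ids, e.g., in its private database):&lt;/p&gt;

```python
class BrewService:
    """Idempotent consumer: a redelivered event does not brew a second coffee."""

    def __init__(self):
        self.processed_ids = set()
        self.brewed = []

    def handle(self, event):
        if event["eventId"] in self.processed_ids:
            return  # duplicate delivery, already handled
        self.processed_ids.add(event["eventId"])
        self.brewed.append(event["payload"]["product"])


service = BrewService()
event = {
    "eventId": "31da4a50-06a5-4dec-81b2-9390862bd8d5",
    "eventType": "Coffee Ordered",
    "payload": {"product": "Flat White"},
}
service.handle(event)
service.handle(event)  # at-least-once delivery: the same event arrives twice
```

&lt;p&gt;The unique &lt;code&gt;eventId&lt;/code&gt; is what makes this possible, another reason to treat event design as API design.&lt;/p&gt;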

&lt;p&gt;Again, our systems must be designed to deal with this. Again increasing complexity.&lt;/p&gt;

&lt;p&gt;However, one could argue that EDAs force us to deal with these cases explicitly. What do I mean by that? What does it mean if we call another service using, e.g., HTTP POST and we do not get any response?&lt;/p&gt;

&lt;p&gt;Some questions could be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;did the other service get our request, but fail to answer?&lt;/li&gt;
&lt;li&gt;was our request processed or not?&lt;/li&gt;
&lt;li&gt;do we have to resend the request?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But these cases are "hidden" in the code. They are not part of the architecture. Events allow us to make this hidden behavior explicit.&lt;/p&gt;
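&lt;p&gt;For comparison, here is how synchronous callers typically paper over that ambiguity: attach a client-generated idempotency key and retry. The &lt;code&gt;post&lt;/code&gt; function below is a hypothetical stand-in for a real HTTP client.&lt;/p&gt;

```python
import uuid

def submit_order(post, order, retries=3):
    # The same key is sent on every retry, so a server that deduplicates
    # by idempotency key will process the order at most once.
    key = str(uuid.uuid4())
    for _ in range(retries):
        try:
            return post("/orders", body=order, headers={"Idempotency-Key": key})
        except TimeoutError:
            continue  # was it processed or not? We cannot tell, so retry
    raise RuntimeError("order may or may not have been processed")

# Simulate a flaky downstream: the first call times out, the second succeeds.
calls = []
def flaky_post(path, body, headers):
    calls.append(headers["Idempotency-Key"])
    if len(calls) == 1:
        raise TimeoutError("no response")
    return {"status": 201}

result = submit_order(flaky_post, {"coffee": "flat white"})
print(result, len(set(calls)))  # → {'status': 201} 1  (same key on both attempts)
```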

&lt;h2&gt;
  
  
  Behavior as a first-class citizen
&lt;/h2&gt;

&lt;p&gt;Finally, let’s look at the impact on design and especially on non-technical people. If we rely on events and architect our systems using events, then we make the processes explicit and understandable.&lt;/p&gt;

&lt;p&gt;You do not have to be an engineer to understand what a &lt;em&gt;Coffee Ordered&lt;/em&gt; event means. &lt;/p&gt;

&lt;p&gt;This allows us to discuss our system's architecture with people who are not engineers. We pull the usually hidden dynamics of our systems to the surface. This is a huge advantage when discussing changes to processes or use cases.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Whenever a coffee finishes brewing, a sound should play informing the staff”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We can see the events flying around. &lt;/p&gt;

&lt;p&gt;We can extend our systems without major interruptions. We can add new consumers to the systems landscape that react to events and trigger new business processes. These are some of the reasons that methods like &lt;a href="https://www.eventstorming.com/" rel="noopener noreferrer"&gt;Event storming&lt;/a&gt; and &lt;a href="https://eventmodeling.org/" rel="noopener noreferrer"&gt;Event modeling&lt;/a&gt; are pretty successful and popular.&lt;/p&gt;
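&lt;p&gt;The "new consumer" idea can be sketched with a toy in-process event bus. A real landscape would use a broker such as RabbitMQ or Kafka, but the principle is the same: the producer does not change when a subscriber is added.&lt;/p&gt;

```python
# A toy event bus: the producer publishes "CoffeeBrewed" without knowing
# who listens, and the new "play a sound" requirement becomes just
# another subscriber.
from collections import defaultdict

class EventBus:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self.subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in self.subscribers[event_type]:
            handler(payload)

bus = EventBus()
log = []
bus.subscribe("CoffeeBrewed", lambda e: log.append(f"update board: {e}"))
# New business requirement: notify the staff. No producer changes needed.
bus.subscribe("CoffeeBrewed", lambda e: log.append("play sound"))
bus.publish("CoffeeBrewed", "flat white")
print(log)  # → ['update board: flat white', 'play sound']
```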

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;An event-driven architecture is a very powerful way to design a system landscape. It allows us to design truly independent systems that are resilient and elastic.&lt;/p&gt;

&lt;p&gt;Designing processes to fit this asynchronous approach is often very beneficial and - using a method like Event Modeling - not as hard as it might seem.&lt;/p&gt;

&lt;p&gt;But we have to be aware of the consequences. New patterns emerge. Dealing with errors is different. And we did not even discuss &lt;a href="https://www.allthingsdistributed.com/2008/12/eventually_consistent.html" rel="noopener noreferrer"&gt;eventual consistency&lt;/a&gt;... maybe I will share my thoughts on this at a later time.&lt;/p&gt;

&lt;p&gt;Keep in mind that refactoring a landscape into an event-driven one is possible. And it need not be a big bang. We can go one system at a time.&lt;/p&gt;

&lt;h2&gt;
  
  
  End of the series - finally
&lt;/h2&gt;

&lt;p&gt;This took longer than expected. But here we are. Four typical approaches for dealing with data and service dependencies. There are many more, for sure. But we have to start somewhere.&lt;/p&gt;

&lt;p&gt;I started the series back in the day based on an actual conversation with a client's architect. There was so much confusion. Everything was either super-easy ("let's just do replication") or impossible ("we cannot use events, everything must be super-consistent everywhere").&lt;/p&gt;

&lt;p&gt;Turns out, nuance matters. Context matters. Nothing is always a good solution.&lt;/p&gt;

&lt;p&gt;Here is a TL;DR of the approaches for the lazy reader:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Sharing a database&lt;/em&gt;: easy to start with but can lead to complex and opaque inter-dependencies between teams and services.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Synchronous calls&lt;/em&gt;: easy to start with and familiar to most engineers, but it can lead to a fragile web of services without any resilience or possibility of graceful degradation.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Replication&lt;/em&gt;: a good approach for refactoring a landscape into autonomous systems. Data governance can be a challenge, as can the volume of replicated data.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Event-driven architecture&lt;/em&gt;: a proven and flexible architecture for microservices. This can lead to resilient and elastic landscapes that capture business processes effectively. You must be willing to learn new patterns and rethink your design. Error handling in particular requires some thought.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I cannot declare a clear winner. As mentioned above, context matters. What works for one project might not work for another.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further material and things to check out
&lt;/h2&gt;

&lt;p&gt;More on event-sourcing, but also relevant for EDAs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.eventstore.com/" rel="noopener noreferrer"&gt;Eventstore&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://leanpub.com/esversioning" rel="noopener noreferrer"&gt;Versioning in an Event Sourced System&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On Event-driven architectures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.enterpriseintegrationpatterns.com/" rel="noopener noreferrer"&gt;Enterprise Integration Patterns&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Enterprise Integration Patterns: Designing, Building, and Deploying Messaging Solutions, by Hohpe and Woolf&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On system resilience and design:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://pragprog.com/titles/mnee2/release-it-second-edition/" rel="noopener noreferrer"&gt;Release IT&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.computer.org/csdl/magazine/so/2005/02/s2064/13rRUxBrGeD" rel="noopener noreferrer"&gt;Your Coffee Shop Doesn't Use Two-Phase Commit&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want to go down the rabbit hole of exactly-once-delivery: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.rabbitmq.com/confirms.html" rel="noopener noreferrer"&gt;Consumer Acknowledgements and Publisher Confirms &lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/" rel="noopener noreferrer"&gt;Exactly-Once Semantics Are Possible: Here’s How Kafka Does It&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And if you have time, I ranted about events and event-sourcing at Devoxx a while back. Watch it &lt;a href="https://www.youtube.com/watch?v=GzrZworHpIk" rel="noopener noreferrer"&gt;here&lt;/a&gt; at your own risk.&lt;/p&gt;

&lt;p&gt;Please share your experiences and approaches. I am keen to learn about different ways to tackle these problems.&lt;/p&gt;

</description>
      <category>eventdriven</category>
      <category>microservices</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Going Cloud Native - The platform team</title>
      <dc:creator>David Schmitz</dc:creator>
      <pubDate>Tue, 14 Dec 2021 09:24:09 +0000</pubDate>
      <link>https://dev.to/koenighotze/going-cloud-native-the-platform-team-165k</link>
      <guid>https://dev.to/koenighotze/going-cloud-native-the-platform-team-165k</guid>
      <description>&lt;p&gt;At the beginning of November a client asked me to join their Cloud Panel and talk about cloud transformation. This article is based on that presentation. You can find the slides on &lt;a href="https://www.slideshare.net/koenighotze/going-cloud-native"&gt;Slideshare&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;&lt;iframe src="https://www.slideshare.net/slideshow/embed_code/key/2hcs5MLvQoNEWx" alt="2hcs5MLvQoNEWx on slideshare.net" width="100%" height="487"&gt;
&lt;/iframe&gt;
 &lt;/p&gt;

&lt;p&gt;This is the second of a two-part series. The &lt;a href="https://dev.to/koenighotze/going-cloud-native-the-problem-with-two-speed-it-1k0e"&gt;first&lt;/a&gt; one probes the questions around organisation and effective collaboration. Here we explore how we can handle the rising complexity and manifest learnings as internal products using a platform team.&lt;/p&gt;

&lt;h2&gt;
  
  
  Remember: Efficiency
&lt;/h2&gt;

&lt;p&gt;Previously, we argued that the reason to move into the public cloud is to reduce complexity. We want to focus on the things that make us stand out as a company. We want to innovate fast and build better products of higher quality. We want to experiment.&lt;/p&gt;

&lt;p&gt;We cannot do that, if we are stuck maintaining and patching our own Kubernetes cluster all day.&lt;/p&gt;

&lt;p&gt;The obvious option is to reduce complexity by replacing such non-essentials with products. E.g., we use &lt;a href="https://cloud.google.com/kubernetes-engine"&gt;Google's Kubernetes Engine&lt;/a&gt; instead of maintaining our own Kubernetes.&lt;/p&gt;

&lt;p&gt;This frees up resources and allows us to be more efficient.&lt;/p&gt;

&lt;h2&gt;
  
  
  The DevSecOps-Full-Stack-Rockstar-Team
&lt;/h2&gt;

&lt;p&gt;Ok, so we are running on the cloud. We are building efficient engineering teams. Practices like Scrum suggest building &lt;a href="https://www.scrum.org/resources/blog/how-much-autonomy-should-teams-get-their-agile-leader"&gt;autonomous teams&lt;/a&gt; - teams that can own their respective area or product.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kDDBQfcW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a4uuycynn4epr3x0xk6j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kDDBQfcW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a4uuycynn4epr3x0xk6j.png" alt="The team" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Being self-reliant implies that we need a lot of skills in our teams. Let's create a shopping list for a REST-ful service running on the &lt;a href="https://cloud.google.com"&gt;Google Cloud Platform&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We are building an API. Thus, Kotlin, Ktor and related backend technology&lt;/li&gt;
&lt;li&gt;Containers? Sure! Kubernetes, Docker, Istio&lt;/li&gt;
&lt;li&gt;"DevOps", off course. Gitlab Workflows for CI/CD, Terraform, Terratest, monitoring, tracing&lt;/li&gt;
&lt;li&gt;Storage: BigTable and Redis for caching&lt;/li&gt;
&lt;li&gt;Testing: Gatling or K6&lt;/li&gt;
&lt;li&gt;and let's not forget security on every level&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We could go on and on. Instead of focusing on the essentials, i.e., building the API, we end up knee-deep in side-projects. &lt;/p&gt;

&lt;p&gt;"Uh, AWS released a new feature for CloudWatch. Let's check that out..."&lt;/p&gt;

&lt;h2&gt;
  
  
  Autonomous teams are great, but...
&lt;/h2&gt;

&lt;p&gt;As much as I encourage having autonomous teams, they come at a price. If we do not pay attention, we end up with the setup illustrated by the following diagram.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zS3vkcj8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nplakjnpi0tsnmt6v59c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zS3vkcj8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nplakjnpi0tsnmt6v59c.png" alt="Redundant work" width="800" height="407"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We see three teams: A, B and C. Each is staffed with end-to-end experts. Backend engineers, cloud engineers, security experts - all can be found in each team. &lt;em&gt;And as a side-note: try finding all these experts in the current market!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;What happens next should not come as a surprise.&lt;/p&gt;

&lt;p&gt;Each team faces the same challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;code needs to be built&lt;/li&gt;
&lt;li&gt;services need to be run&lt;/li&gt;
&lt;li&gt;data needs to be stored&lt;/li&gt;
&lt;li&gt;and so on.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since each team is able to work on its own, they each come up with their own solution to each problem. One team uses MySQL to store their data, the other prefers PostgreSQL. One team uses Github Actions, while the other sets up Gitlab for CI/CD.&lt;/p&gt;

&lt;p&gt;The challenge is right there. Different solutions to the same problem. Knowledge is not shared between the teams. Maintenance becomes a big concern. In the end, efficiency drops. Teams spend too much time tinkering with aspects that are not related to their product. The next diagram illustrates this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kLGfuwoy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y6m6jzo93ezyb23jbdy3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kLGfuwoy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y6m6jzo93ezyb23jbdy3.png" alt="Who wants to work on products anyway?" width="800" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each team works on its product, but spends a significant amount of time on other things.&lt;/p&gt;

&lt;p&gt;What can we do?&lt;/p&gt;

&lt;h2&gt;
  
  
  Enter the platform team
&lt;/h2&gt;

&lt;p&gt;Most organisations arrive at this point. Initially everything is fine: we have a single team of experts. But then we try to scale out to multiple teams, and we need some way to reduce the complexity for everybody. This is usually where a platform team is introduced.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ilDpE9fY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/km8w7v6t4m1td5va3pr0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ilDpE9fY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/km8w7v6t4m1td5va3pr0.png" alt="The platform team" width="800" height="465"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Platform teams consist of experts in the supporting technologies. For example, a platform team may include Github Actions experts or security engineers. The goal is to take the proven solutions from the feature teams and offer these as &lt;em&gt;platform products&lt;/em&gt;. The next illustration visualizes the idea.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--86HULAZM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/13i5apoxf8wwdxwrbar0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--86HULAZM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/13i5apoxf8wwdxwrbar0.png" alt="Infrastructure as products" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's explain using an example. The platform team offers mature &lt;a href="https://buildpacks.io/"&gt;buildpacks&lt;/a&gt; for the teams. In addition, the platform team provides a Gitlab instance and a set of well-maintained &lt;a href="https://docs.gitlab.com/ee/development/cicd/templates.html"&gt;Gitlab pipeline templates&lt;/a&gt;. Each team relies on the CI/CD product provided by the platform team. We do not re-invent building software.&lt;/p&gt;

&lt;p&gt;The same approach works for persistence, too. The platform team creates a set of secure &lt;a href="https://www.terraform.io/docs/language/modules/develop/index.html"&gt;Terraform modules&lt;/a&gt; for PostgreSQL and MongoDB. The teams reference these modules, reusing the knowledge baked into the modules.&lt;/p&gt;

&lt;p&gt;We reduce complexity for every team, as illustrated by the following image.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mqAr_dZR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fkhm9n3zvu9sy50d4wag.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mqAr_dZR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fkhm9n3zvu9sy50d4wag.png" alt="Complexity reduced" width="800" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Teams can focus on building great products. They no longer spend a significant amount of time on non-product tasks.&lt;/p&gt;

&lt;p&gt;But the non-product part is not reduced to zero. There are a couple of reasons for this. First, the teams still have to use, set up, and integrate against the platform products. This does not go away magically. Second, we actually want the teams to work on non-product tasks. No, this is not a contradiction, as we will see below.&lt;/p&gt;

&lt;h2&gt;
  
  
  Treating feature teams as customers
&lt;/h2&gt;

&lt;p&gt;This sounds easy, but it is actually hard to get right.  We should be aware of two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Platform teams are not &lt;em&gt;fix-it-fast tiger-teams&lt;/em&gt; &lt;/li&gt;
&lt;li&gt;and the feature teams are the platform team's &lt;em&gt;customers&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The tiger-team trap
&lt;/h3&gt;

&lt;p&gt;The members of the platform team are experts in their fields. It is tempting to reach out to the platform team whenever needed. This can be a political or power issue, depending on the organisation. If the platform team members are hijacked for non-platform work, then platform development will suffer - and so, consequently, will the feature teams that are waiting for new platform products.&lt;/p&gt;

&lt;p&gt;Clear ownership can help here. The responsibilities of the platform team must be clear to everybody. In that sense, the platform team is like any other feature team. The only difference is that its customers are internal: they sit within the organisation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Building for the customers
&lt;/h3&gt;

&lt;p&gt;The platform team has a clear customer-producer relation. Feature teams use the platform products. Consider the next illustration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--x_ds51S2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uwn2ud3ac5sgaumik7lt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--x_ds51S2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uwn2ud3ac5sgaumik7lt.png" alt="Feature teams use products" width="800" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The feature teams may have feature requests. They may have change requests. However, it is up to the platform team to plan, prioritize and implement its very own backlog. Stability in planning is as important to the platform team as it is to any other feature team. Introducing a platform team makes little sense if the organisation cannot guarantee this working mode.&lt;/p&gt;

&lt;p&gt;But there is also the fact that the platform team builds products &lt;em&gt;for&lt;/em&gt; its customers!&lt;/p&gt;

&lt;p&gt;Didn't we already discuss this? No. The platform team builds internal products for the feature teams. It must not be an ivory tower building abstractions and products that nobody wants or needs. If the feature teams don't like the platform products, avoid using them, or work around them, then the platform team must go back to the drawing board. They must include the feature teams in their planning and product design.&lt;/p&gt;

&lt;p&gt;Again, the platform team treats the feature teams as we would treat &lt;em&gt;any other&lt;/em&gt; customer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The hidden 2-speed-IT
&lt;/h2&gt;

&lt;p&gt;Now that we have established the way platform teams can work efficiently, let's discuss a problem often encountered when "special" teams are introduced.&lt;/p&gt;

&lt;p&gt;The goal of a platform team is to build efficient platform products. &lt;/p&gt;

&lt;p&gt;The technology around these products is usually modern and associated with "DevOps" culture and mentality. E.g., the platform team works on GitHub Actions and Serverless deployment pipelines. These tools tend to be in the spotlight of developer attention. &lt;/p&gt;

&lt;p&gt;Compare this to a feature team. They might be using SpringBoot and React. Great frameworks, but nothing that will break the Twitter timeline - at least not at the time of this writing. &lt;a href="https://en.wikipedia.org/wiki/Fear_of_missing_out"&gt;FOMO&lt;/a&gt; is a thing. &lt;/p&gt;

&lt;p&gt;Everybody wants to work with Kubernetes, because having that on a CV is a career booster at the moment. But these exciting tools are owned by the platform team. Again, we end up with a form of 2-speed-IT.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2eqYR7Q2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tfeetxqqzfgkv35tjl9o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2eqYR7Q2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tfeetxqqzfgkv35tjl9o.png" alt="2-speed again" width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The platform team owns the cool new technology and the feature team is trapped in "only" delivering business functionality.&lt;/p&gt;

&lt;h2&gt;
  
  
  Innovation on all levels
&lt;/h2&gt;

&lt;p&gt;We need not end up with a toxic 2-speed-IT setup. First of all, the platform team does not arise out of nothing. Let's consider the following illustration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cQolPXzL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zfqykg1qev83v36auodm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cQolPXzL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zfqykg1qev83v36auodm.png" alt="Hidden 2-speed-IT" width="800" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When we start our cloud journey, we only have feature teams. We do not know what &lt;strong&gt;our&lt;/strong&gt; best practices will be. We have to try different approaches to the same challenges. Only after some time, a couple of months or so, do we know what our approach is, and &lt;em&gt;then&lt;/em&gt; we introduce a platform team. So, the platform team is not an alien part of our organisation. It arises as part of our development.&lt;/p&gt;

&lt;p&gt;The other thing that will help avoid a 2-speed-IT is to allow innovation on all levels. Again, let's consider an illustration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QILd-ygN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/41jehpaaqoje7n9adpp6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QILd-ygN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/41jehpaaqoje7n9adpp6.png" alt="Inner source" width="800" height="494"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Suppose, the platform team offers "CI/CD products". These could be buildpacks and Gitlab templates for Node and Kotlin. &lt;/p&gt;

&lt;p&gt;Team A wants and needs Go-lang for their development.&lt;/p&gt;

&lt;p&gt;Instead of waiting for the platform team, team A goes ahead and builds what they need. They create a buildpack and a CI/CD template and continue developing. Once they are content with their solution, they can offer it to the platform team. Finally, the platform team decides if they want to offer Go-lang tooling as part of their platform. They decide if they want to take ownership.&lt;/p&gt;

&lt;p&gt;The same inner-source approach can be applied to all platform products. If a team needs a change or extension, then they are allowed to drop a pull-request to the platform team. Everybody is allowed to innovate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Transforming an IT organisation onto the public cloud can be a daunting task. It involves architectural and technological changes. But even more critical: changes to culture, organisation and processes.&lt;/p&gt;

&lt;p&gt;The good news is that we do not have to transform in a &lt;a href="https://en.wikipedia.org/wiki/Big_bang_adoption"&gt;Big Bang&lt;/a&gt;. We can adopt the cloud step-by-step, as illustrated by the next diagram.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2zToU5ki--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/arl2g94by1av66hwqxba.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2zToU5ki--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/arl2g94by1av66hwqxba.png" alt="Road to the cloud" width="800" height="547"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can discuss at every level whether we need further transformation, and only then execute the transformation, step by step.&lt;/p&gt;

&lt;p&gt;In the end, everything we do is about efficiency. And that means we need to keep complexity in check. We need mechanisms like a platform team to reduce accidental, superfluous complexity. &lt;/p&gt;

&lt;p&gt;As we have discussed in these two short articles: people are key. &lt;/p&gt;

&lt;p&gt;If we want change, then we need to include everybody. We should be open to ideas and insights. Only then will we improve and succeed. Adopting an open or hidden 2-speed-IT approach will prove to be a bottleneck and should be avoided. If we are transparent and let the best ideas win, then everybody is engaged. &lt;/p&gt;

&lt;p&gt;We end up with a better organisation on every level.&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>platform</category>
      <category>organisation</category>
    </item>
    <item>
      <title>Going Cloud Native - The problem with two-speed-IT</title>
      <dc:creator>David Schmitz</dc:creator>
      <pubDate>Tue, 16 Nov 2021 21:06:18 +0000</pubDate>
      <link>https://dev.to/koenighotze/going-cloud-native-the-problem-with-two-speed-it-1k0e</link>
      <guid>https://dev.to/koenighotze/going-cloud-native-the-problem-with-two-speed-it-1k0e</guid>
      <description>&lt;p&gt;At the beginning of November a client asked me to join their Cloud Panel and talk on the topic of cloud transformation. You can find the slides on Slideshare. &lt;/p&gt;

&lt;p&gt;&lt;iframe src="https://www.slideshare.net/slideshow/embed_code/key/2hcs5MLvQoNEWx" alt="2hcs5MLvQoNEWx on slideshare.net" width="100%" height="487"&gt;
&lt;/iframe&gt;
 &lt;/p&gt;

&lt;p&gt;This article is based on that presentation. So, let's talk about cloud native development and its impact on organisations. Most of my clients are large insurance or financial companies. They are considering a migration to the cloud, and they are asking for help. The discussions are either based on FUD (fear, uncertainty, and doubt) or snake-oil.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwedxxs0taa97v5sbaz0u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwedxxs0taa97v5sbaz0u.png" alt="Snake-oil vendor"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is not surprising that the truth lies in between.&lt;/p&gt;

&lt;p&gt;This is the first of a two-part series. Each part introduces things we learnt while moving companies into the public cloud. I focus on culture and organisation - not because tech is boring, but because most discussions focus on technology and architecture without ever touching the more social aspects.&lt;/p&gt;

&lt;p&gt;Keep in mind that what worked for me and for my clients may not work for you. Context matters.&lt;/p&gt;

&lt;p&gt;This first article probes the questions around the organisation and effective collaboration. The follow up text looks at the mythical platform team and its implications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Efficiency - the reason to adopt the cloud
&lt;/h2&gt;

&lt;p&gt;Before we go into the details let's talk about why we migrate into the cloud.&lt;/p&gt;

&lt;p&gt;In a nutshell, it's all about &lt;em&gt;efficiency&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fslnoyz5m1r6li5ogg1hy.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fslnoyz5m1r6li5ogg1hy.jpg" alt="Photo by Thomas Kelley on Unsplash"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We need to be efficient because we &lt;em&gt;do not&lt;/em&gt; know what our customers want. Nobody can specify in detail what is needed. Nobody can predict the future - especially not our clients. They don't even know what they want until they see it.&lt;/p&gt;

&lt;p&gt;That means the only way to build the correct products is to implement our ideas as fast as possible and to iterate on them, improving our products step by step. &lt;/p&gt;

&lt;p&gt;This leads to the conclusion that our businesses are only as efficient as our IT is. No longer can we treat our IT as a cost-centre. We have to move IT into the heart of our organisation, if we want to be and stay competitive.&lt;/p&gt;

&lt;p&gt;And this is where the public cloud enters the game. &lt;/p&gt;

&lt;p&gt;The cloud allows us to focus on the essentials. We use &lt;a href="https://azure.microsoft.com/en-us/overview/what-is-saas/" rel="noopener noreferrer"&gt;SaaS&lt;/a&gt; where possible. We do not build our own load-balancers or start hosting a SQL database ourselves. We replace hand-crafted assets with cloud products. E.g., use &lt;a href="https://cloud.google.com/sql" rel="noopener noreferrer"&gt;Google’s Cloud SQL&lt;/a&gt; instead of our own &lt;a href="https://www.postgresql.org/" rel="noopener noreferrer"&gt;PostgreSQL&lt;/a&gt; instance. This reduces complexity and allows us to put more energy into our products. We are more efficient.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;em&gt;Marie Kondo&lt;/em&gt; your IT
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://konmari.com/" rel="noopener noreferrer"&gt;Marie Kondo&lt;/a&gt; is a Japanese organising consultant. She specializes in tidying up and reducing superfluous clutter. We can do the same to our IT. There are many strategies for transforming our IT to the cloud. The following four approaches are pretty common:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpnalzem459ifxxoob7zx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpnalzem459ifxxoob7zx.png" alt="Options for cloud transformation"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lift-and-Shift&lt;/strong&gt;: we take an asset and host it more or less 1:1 in the cloud. E.g., we take a monolithic &lt;a href="https://www.oracle.com/java/technologies/java-ee-glance.html" rel="noopener noreferrer"&gt;JEE&lt;/a&gt; application and move it to &lt;a href="https://cloud.google.com/compute" rel="noopener noreferrer"&gt;Google Compute Engine VMs&lt;/a&gt;. We get rid of the underlying operations components and machines, but we do not benefit from other cloud capabilities.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhfzx2eyf7325whoncr81.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhfzx2eyf7325whoncr81.png" alt="Lift-and-Shift"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Re-architect&lt;/strong&gt;: The prime example in every microservice book. We take an existing asset, such as a monolithic &lt;a href="https://www.oracle.com/java/technologies/java-ee-glance.html" rel="noopener noreferrer"&gt;JEE&lt;/a&gt; application, and redesign it from the ground up, effectively replacing it with, for example, a set of new cloud-native microservices. Because we rebuild and redesign everything, we can use the full range of cloud capabilities.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4uj8bakcaksw5k33yk1b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4uj8bakcaksw5k33yk1b.png" alt="Re-architect"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retire&lt;/strong&gt;: My favourite. We identify assets and processes that we and our customers no longer need. We remove these assets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8fbzl7sje7rmhvjahedb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8fbzl7sje7rmhvjahedb.png" alt="Retire"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Replace&lt;/strong&gt;: Remember efficiency? "Replace" is all about efficiency. We replace something we operated ourselves with a SaaS offering. One example is replacing a self-hosted Kafka with a managed version, e.g., &lt;a href="https://aws.amazon.com/de/msk/" rel="noopener noreferrer"&gt;Amazon MSK&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn78007zsehwn1ywjhvof.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn78007zsehwn1ywjhvof.png" alt="Replace"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The effort and efficiency of each approach depend on the strategy for moving into the cloud. “Lift-and-Shift” might be the best approach if the goal is replacing a datacenter. If we want to reduce complexity and use SaaS as much as possible, then “Replace” is the appropriate approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  The hybrid landscape
&lt;/h2&gt;

&lt;p&gt;In the end, we will have a hybrid architecture. We build some assets for the cloud, while other assets stay on-premise, at least for some time. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4boaoj483gbk0c1oaeq2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4boaoj483gbk0c1oaeq2.png" alt="Hybrid approach"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can draw two conclusions from this fact:&lt;/p&gt;

&lt;p&gt;Firstly, we will have more complexity, at least temporarily. The original datacenter is still around. Maybe smaller and with fewer assets, but still a burden. Operations has to support both the original environment and the new cloud environment. This increases effort and cost, and we must take it into account from the start.&lt;/p&gt;

&lt;p&gt;Secondly, the cloud-hosted assets usually depend on the on-premise assets. More often than not, the cloud-hosted assets need changes to the existing on-premise assets. Firewalls need to be changed, APIs need to be exposed or extended. And so on. This dependency leads to the first potential cultural and organisational trap.&lt;/p&gt;

&lt;p&gt;The fact that we have two areas that can move at different speeds led to a concept called two-speed-IT, which we'll discuss next.&lt;/p&gt;

&lt;h2&gt;
  
  
  The two-speed-IT trap
&lt;/h2&gt;

&lt;p&gt;The idea of a two-speed-IT is not new. It has been around since circa 2014.&lt;/p&gt;

&lt;p&gt;McKinsey &lt;a href="https://www.mckinsey.com/business-functions/mckinsey-digital/our-insights/a-two-speed-it-architecture-for-the-digital-enterprise" rel="noopener noreferrer"&gt;describes&lt;/a&gt; the goal of a two-speed-IT as &lt;em&gt;"A two-speed IT architecture will help companies develop their customer-facing capabilities at high speed while decoupling legacy systems for which release cycles of new functionality stay at a slower pace."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The underlying premise is that you can run your organisation in two different ways. One shiny, great and new. The other rusty, dusty and old. I will not delve into all the reasons why this is problematic; I will concentrate on the organisational part. But to give you a picture: two-speed-IT is like attaching extra rooms to your house because you cannot be bothered to clean up. Not a very sustainable approach, in my eyes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Language is the smoking gun
&lt;/h2&gt;

&lt;p&gt;Let's go back to the softer, non-technical aspects. With two-speed-IT, the language around the transformation changes in an interesting way.&lt;/p&gt;

&lt;p&gt;The cloud assets are usually associated with a modern and lean technology stack. We use Go, Node.js and Docker. Development follows an agile process, such as &lt;a href="https://basecamp.com/shapeup" rel="noopener noreferrer"&gt;Shape Up&lt;/a&gt;. We speak of forward-leaning teams. We use "speed boats" as metaphors for the teams working on these cloud products.&lt;/p&gt;

&lt;p&gt;On the other side of the fence lies the on-premise country. Here are the technologies of days gone by: CORBA, COBOL, SOAP and EBCDIC. The process is heavyweight, maybe even a waterfall with one or two releases per year. We speak of slow-moving tankers, with no ability to change or react quickly. We even call this "legacy".&lt;/p&gt;

&lt;p&gt;Why is this problematic?&lt;/p&gt;

&lt;p&gt;As we have seen, the cloud products usually need access or even changes to the existing assets. That means we need collaboration between the different areas of engineering. Let’s also not forget the expertise of the people working on these systems. Documentation is outdated the moment it is written. The only way to understand such systems is to have the human experts available.&lt;/p&gt;

&lt;p&gt;Things become difficult if the "on-premise people" are not part of the cloud transformation.&lt;/p&gt;

&lt;p&gt;If people feel left behind and sidelined, then we don’t get collaboration. Instead we get resentment. People may not be willing to help as much as we need them to. Or - in the worst case - people may end up sabotaging the cloud transformation, either knowingly or, more often, through negligence. Why should someone support our efforts if that person is going to be replaced by our project?&lt;/p&gt;

&lt;h2&gt;
  
  
  Participation brings collaboration
&lt;/h2&gt;

&lt;p&gt;The solution to this dilemma is pretty straightforward. First we need to realise that nobody actually means to do harm or a bad job. &lt;em&gt;Assume Best Intent&lt;/em&gt; is often the best way to operate. With this in place, we see that the root of our problem lies in fear. &lt;/p&gt;

&lt;p&gt;Fear of being obsolete.&lt;br&gt;
Fear of being left behind.&lt;br&gt;
Fear of losing a job or importance.&lt;/p&gt;

&lt;p&gt;We have to get rid of that unfounded fear. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transparency and communication are key to removing fear.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Bring everybody on board. Mix cloud-product teams and on-premise experts into one end-to-end team. Retrain the staff and offer courses for people willing to learn. Create new roles and positions for the new engineering culture. Offer people a perspective for growth.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk6gsycy7jg00v08tbzk3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk6gsycy7jg00v08tbzk3.png" alt="Team work is key"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And we need to be transparent. We should communicate our rationale for the cloud transformation in clear terms. If we want to get rid of our self-hosted datacenter, then what is the plan for the people operating that datacenter now? How will they be retrained and up-skilled? Who hires the new skills we need? And so on. If we tackle these difficult topics openly, then we stop fear and gossip in their tracks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Jumping the mountain?
&lt;/h2&gt;

&lt;p&gt;If you want to silence the doubters and fear-mongers, delivery is the only option. Only working software in production will prove that the cloud journey is possible. But, one may ask, even if we bring everybody together and work on this, how can we bring an entire company into the cloud?&lt;/p&gt;

&lt;p&gt;Well, one takes one step at a time.&lt;/p&gt;

&lt;p&gt;Instead of trying to jump onto the mountain in one leap, we take the scenic route and enjoy the journey. We do not need to go all-in on serverless in the first couple of months. We can decide step by step what our realistic target actually is. Consider the following illustration. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ikatog870dri1ua3qvn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ikatog870dri1ua3qvn.png" alt="A step-by-step approach"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We want to be opportunistic in some areas but fully cloud-native in others. Again, transparency is key. Everybody should understand why some areas move to the cloud while others do not.&lt;/p&gt;

&lt;p&gt;I cannot stress this enough. We must find a thin slice of the business that proves the technology and, especially, the new way of collaborating. The people involved will form a band of trust and cooperation that acts as a radiator in our organisation. The thin slice should add something to our business. Not a technical spike, not a proof of concept. Rather something essential. Only then will people feel committed and get involved.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Moving into the cloud involves architecture and technology, but also organisation and culture. The concrete approach does not change the implications. Whether we lift-and-shift, re-architect, retire or replace, we will end up with a hybrid landscape of new and pre-existing assets.&lt;/p&gt;

&lt;p&gt;Two-speed-IT was brought up as a concept around 2014 but has lost its footing in the last couple of years. Reality has caught up with the ideas. Organisations have seen the downsides and implications, some of which I mentioned in this text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;People are key.&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Engineers will learn new technology and new architectures anyway. But learning to trust, to work together, to collaborate is so much harder. Especially during a pandemic, when you cannot go around the corner and grab a cup of tea.&lt;/p&gt;

&lt;p&gt;Allowing people to take part and get involved helps build bridges. We do not want any walls in our organisation, neither on a social level nor on a communication level. Software development is a team effort, and teams need to trust each other.&lt;/p&gt;

&lt;p&gt;The next article examines the concrete teams: which skills are needed and how we can scale them in a reasonable way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Image references
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://unsplash.com/photos/xVptEZzgVfo" rel="noopener noreferrer"&gt;https://unsplash.com/photos/xVptEZzgVfo&lt;/a&gt;&lt;br&gt;
&lt;a href="https://unsplash.com/photos/IM8ZyYaSW6g" rel="noopener noreferrer"&gt;https://unsplash.com/photos/IM8ZyYaSW6g&lt;/a&gt;&lt;br&gt;
&lt;a href="https://de.wikipedia.org/wiki/Schlangen%C3%B6l" rel="noopener noreferrer"&gt;https://de.wikipedia.org/wiki/Schlangen%C3%B6l&lt;/a&gt;&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>team</category>
      <category>organisation</category>
    </item>
    <item>
      <title>Loaded terms and why you should avoid them</title>
      <dc:creator>David Schmitz</dc:creator>
      <pubDate>Sat, 21 Aug 2021 08:10:24 +0000</pubDate>
      <link>https://dev.to/koenighotze/loaded-terms-and-why-you-should-avoid-them-40f6</link>
      <guid>https://dev.to/koenighotze/loaded-terms-and-why-you-should-avoid-them-40f6</guid>
      <description>&lt;p&gt;"Just install Istio..."&lt;/p&gt;

&lt;p&gt;"The customer always presses the ‘Submit’ button..."&lt;/p&gt;

&lt;p&gt;"You only need to set up Kubernetes correctly..."&lt;/p&gt;

&lt;p&gt;"Someone should fix this bug..."&lt;/p&gt;

&lt;p&gt;Have you heard sentences like these before? As a consultant and software architect I keep hearing these or similar phrases. And each time this triggers my attention and curiosity.&lt;/p&gt;

&lt;p&gt;Let me try to explain why these phrases are problematic and how I try to deal with them. And before I get angry mails: I mean no disrespect to the authors when I quote their text. I use the examples to drive home my points.&lt;/p&gt;

&lt;h2&gt;
  
  
  Don’t condescend to the audience
&lt;/h2&gt;

&lt;p&gt;Most of us read manuals, tutorials, books, and articles as part of our daily business.&lt;/p&gt;

&lt;p&gt;Here are some examples.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"With this operator you just have to create and deploy a simple Custom Resource (CR) with your desired rate limit configuration."&lt;/em&gt; (see &lt;a href="https://events.istio.io/istiocon-2021/sessions/kubernetes-operator-to-manage-rate-limit-istio-configurations/"&gt;https://events.istio.io/istiocon-2021/sessions/kubernetes-operator-to-manage-rate-limit-istio-configurations/&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Using PromCat.io is the fastest way to create the dashboard, you just have to execute one command to get your dashboard with all metrics at once."&lt;/em&gt; (see &lt;a href="https://sysdig.com/blog/monitor-istio/"&gt;https://sysdig.com/blog/monitor-istio/&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Both cases are well intended. Both want to suggest that following the instructions is easy. One command and you are good to go. One piece of code and ready to roll.&lt;/p&gt;

&lt;p&gt;But is it that easy?&lt;/p&gt;

&lt;p&gt;What if it "just" does not work on my system? Should I ask a question? Well, it should be super easy, right? So asking a question might open me up to ridicule.&lt;/p&gt;

&lt;p&gt;I am aware that this may sound like an exaggeration. Nevertheless, phrases like "...you just have to execute..." can be patronizing to our audience. We lose nothing by dropping that single word.&lt;/p&gt;

&lt;p&gt;"Using PromCat.io is the fastest way to create the dashboard, execute one command to get your dashboard with all metrics at once."&lt;/p&gt;

&lt;h2&gt;
  
  
  Loaded terms lack accountability
&lt;/h2&gt;

&lt;p&gt;Someone is no one. If we are that unspecific and general, we mean everybody and nobody at the same time.&lt;/p&gt;

&lt;p&gt;"Someone help me". We mean everybody around us. But we have to wait until somebody actually feels addressed. Everybody could shrug this off and do something else.&lt;/p&gt;

&lt;p&gt;"Jill, can you please help me?" This is clear and direct. Jill knows what is up. She will either help, or point us to someone who will.&lt;/p&gt;

&lt;p&gt;The same holds for decisions. "We don’t think adopting Rust is a good idea at this point". Whom should I talk to if I disagree? Who is this "we" - only you, you and Bob, or you and twenty other people?&lt;/p&gt;

&lt;p&gt;"Wang, Bob and I don’t think adopting...." is clear and cannot be misunderstood. If I want to introduce Rust at a later time, I know who wants to join the discussion.&lt;/p&gt;

&lt;p&gt;Being explicit leads to accountability. That is why we assign tickets to specific engineers. There is no option to assign a ticket to "somebody".&lt;/p&gt;

&lt;h2&gt;
  
  
  Loaded terms hide details
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;"Deployment — describes the desired state and makes sure to change the actual state to the desired state if needed. A deployment manages Pods and ReplicaSets so you don’t have to. Just like magic!"&lt;/em&gt; (see &lt;a href="https://blog.sourcerer.io/a-kubernetes-quick-start-for-people-who-know-just-enough-about-docker-to-get-by-71c5933b4633"&gt;https://blog.sourcerer.io/a-kubernetes-quick-start-for-people-who-know-just-enough-about-docker-to-get-by-71c5933b4633&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;"Just like magic". Is it though? Or is it well designed, with the underlying details omitted for brevity? Terms like "magic" deter readers from understanding what is going on. A learning opportunity lost. A better approach is to add a link to further documentation: "Read the instructions at &lt;a href="https://someaddress/foo.html"&gt;https://someaddress/foo.html&lt;/a&gt; if you want to dig deeper". We do not patronize the audience. They can decide for themselves.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Obviously, the service needs to call the authentication service to get a valid token"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;"Obviously" is a similar case. If something is obvious, then why mention it at all? Or is it only obvious to us, and the audience is not smart enough? Again, condescending.&lt;/p&gt;

&lt;p&gt;We lose nothing by dropping that term. "The service needs to call the authentication service to get a valid token". It may be obvious or not. Who are we to say?&lt;/p&gt;

&lt;p&gt;A final example is "always".&lt;/p&gt;

&lt;p&gt;"The customer always authenticates using the mobile device."&lt;/p&gt;

&lt;p&gt;Really always? There is truth in the saying that the exception proves the rule.&lt;/p&gt;

&lt;p&gt;If we use "always", we block the road to thinking about exceptions. We simplify reality. What about customers without a mobile device? Are they excluded? Why? These are questions we cannot ask if "always" were always true.&lt;/p&gt;
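&lt;p&gt;As a playful illustration (entirely hypothetical, not a real tool from any of the quoted articles), a few lines of code can flag loaded terms in a draft before we hit publish:&lt;/p&gt;

```python
import re

# A toy list of the loaded terms discussed here; deliberately not exhaustive.
LOADED_TERMS = ["just", "obviously", "simply", "always"]

def flag_loaded_terms(text):
    """Return (term, sentence) pairs for every loaded term found in the text."""
    findings = []
    # Naive sentence split on ., ! or ? - good enough for a quick draft check.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for term in LOADED_TERMS:
            if re.search(rf"\b{term}\b", sentence, re.IGNORECASE):
                findings.append((term, sentence))
    return findings

draft = "You just have to execute one command. Obviously, the service calls the authentication service."
for term, sentence in flag_loaded_terms(draft):
    print(f'loaded term "{term}" in: {sentence}')
```

&lt;p&gt;Running it on the draft above flags "just" and "obviously" - a small nudge to reconsider the wording.&lt;/p&gt;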

&lt;h2&gt;
  
  
  Loaded terms are an unspoken invitation for dialogue
&lt;/h2&gt;

&lt;p&gt;Although loaded terms trigger me, not all is bad. Whenever someone uses terms like "just" or "obviously", I try to use them as a doorway to further conversation. Most people do not intend to be condescending to their audience. So, if I notice someone using loaded terms, I start asking questions.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Just enter a single command and the system is up".&lt;/em&gt; Ok, but what if that command fails? What should I do?&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"We need to set up the Kubernetes cluster by tomorrow".&lt;/em&gt; Ok, who takes care of this? Is it Jill or Bob?&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"The users are always onboarded using a Jira workflow".&lt;/em&gt; Ok, but what if Jira is down? Is onboarding stopped then? Or is there a secondary workflow?&lt;/p&gt;

&lt;p&gt;Communication is key. So, let’s take these loaded terms as an invitation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;This very short text is me venting. I do not want to exaggerate the impact of loaded terms. But our texts and our conversations will improve if we avoid them. The audience will be more engaged and invited to join the discussion.&lt;/p&gt;

&lt;p&gt;Keep in mind: &lt;em&gt;Obviously, we only need to just drop loaded terms to always write better texts&lt;/em&gt; :D&lt;/p&gt;

&lt;p&gt;What are your loaded terms - how do you handle them? Let me know. I am curious about your experiences.&lt;/p&gt;

</description>
      <category>writing</category>
      <category>leadership</category>
      <category>empathy</category>
    </item>
    <item>
      <title>The value of time</title>
      <dc:creator>David Schmitz</dc:creator>
      <pubDate>Tue, 20 Jul 2021 14:40:34 +0000</pubDate>
      <link>https://dev.to/koenighotze/the-value-of-time-25mi</link>
      <guid>https://dev.to/koenighotze/the-value-of-time-25mi</guid>
      <description>&lt;p&gt;At the end of a long work day you sit at your table. You enjoy a nice cup of hot tea and wonder: &lt;em&gt;"What did I achieve today? Where did all the time go? I started working at 8am, worked until 6pm with very little to show for it"&lt;/em&gt;. The next day, you start again. Work keeps piling up. Your day is full of stress. Yet the pile grows and grows. And the question remains: Where did all the time go?&lt;/p&gt;

&lt;p&gt;Does this sound familiar to you?&lt;/p&gt;

&lt;p&gt;In this short text, we explore the root of the mess around wasted time. Based on my own experience, we will look at some time-drains and how to improve the situation. Things that will make our life at work calmer and less stressful.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wake up and realize what is going on
&lt;/h2&gt;

&lt;p&gt;Most of us do not manage our time. Our intuitive sense of our activity is often wrong. We misjudge where we invest time and for what. This happens to both leaders and individual contributors. The details may vary, but the general problems are the same. Let’s look at some examples. &lt;/p&gt;

&lt;h3&gt;
  
  
  Meeting Tetris
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Tetris"&gt;Tetris&lt;/a&gt; is one of the most well-known video games ever published, if not &lt;em&gt;the&lt;/em&gt; most well-known. In essence, you fill lines with blocks of different shapes. Full lines disappear and you get points.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kkWtVtpC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://upload.wikimedia.org/wikipedia/commons/thumb/9/9c/Typical_Tetris_Game.svg/440px-Typical_Tetris_Game.svg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kkWtVtpC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://upload.wikimedia.org/wikipedia/commons/thumb/9/9c/Typical_Tetris_Game.svg/440px-Typical_Tetris_Game.svg.png" alt="Tetris - the classic game" width="440" height="807"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The same happens to our calendar if we do not pay close attention. The calendar is still free from 9:00am to 9:30am? It will take less than 5 minutes for someone to put a "quick status call" into the free slot. Yeah! As in Tetris, we filled a line with blocks. But there are no points for us, and the line does not disappear either. Rather, more stress awaits.&lt;/p&gt;

&lt;p&gt;Open your calendar now. Do it. Then check how many appointments you set up yourself. Check how many of the appointments you did not schedule you still find relevant. Often the answer to both questions is a rather small number. People swamp our calendars with demands on our attention. This reduces the time we have for our own work items, which keep piling up.&lt;/p&gt;

&lt;p&gt;For every appointment, consider whether we can add to the discussion or only need the result.&lt;br&gt;
If we can leave the topic to other people, then we should do so. Our colleagues are competent. They will find a good initial result. We can suggest improvements or additions afterwards.&lt;/p&gt;

&lt;p&gt;We need to ensure that we have time for actual work. Even if that means blocking time in our calendar. We use the calendar application to liberate us. We block time before and after each meeting to create notes, grab a coffee or relax. Tools help here, for example Office 365 flows can automate this.&lt;/p&gt;
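&lt;p&gt;As a toy sketch of that idea (the buffer length and meeting times are invented for illustration), padding each meeting with a buffer can be computed like this:&lt;/p&gt;

```python
from datetime import datetime, timedelta

def pad_meetings(meetings, buffer_minutes=10):
    """Given (start, end) tuples, return the intervals to block in the calendar,
    adding a buffer before and after each meeting for notes, coffee or a break."""
    pad = timedelta(minutes=buffer_minutes)
    return [(start - pad, end + pad) for start, end in meetings]

day = datetime(2021, 7, 20)
meetings = [
    (day.replace(hour=9), day.replace(hour=9, minute=30)),   # the "quick status call"
    (day.replace(hour=13), day.replace(hour=14)),            # a design review
]
for start, end in pad_meetings(meetings):
    print(start.strftime("%H:%M"), "-", end.strftime("%H:%M"))
```

&lt;p&gt;A calendar automation would apply the same logic for us; the point is that the buffers are reserved deliberately, not left to chance.&lt;/p&gt;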

&lt;p&gt;And pro tip: we can decline meetings or propose a different time.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Today is bad for me, but on Wednesday I am available"&lt;/em&gt;. Most unplanned things can and should be postponed. This is beneficial for other reasons, too. The person requiring us in some appointment has more time to think. New insights or information may make the appointment obsolete. Everybody wins.&lt;/p&gt;

&lt;p&gt;People tend to show off with their super full calendars. Let's stop this insanity. The fewer meetings the better.&lt;/p&gt;
&lt;h3&gt;
  
  
  Boundless altruism
&lt;/h3&gt;

&lt;p&gt;We all like helping others. Let's say we are expert cloud engineers. A team member needs support while debugging a Kubernetes cluster. We play first responder. If something is broken, then we fix it. If the new intern has a question, she is free to approach us and ask away. Our door is always open. For questions. For feedback. For our colleagues.&lt;/p&gt;

&lt;p&gt;We help team members progress.&lt;/p&gt;

&lt;p&gt;But what about &lt;em&gt;our&lt;/em&gt; focus? What about &lt;em&gt;our&lt;/em&gt; progress? What about &lt;em&gt;our&lt;/em&gt; time?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SkXLxGZx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4vepmanxv0z1fqkl73h4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SkXLxGZx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4vepmanxv0z1fqkl73h4.jpg" alt="Altruistic and totally stressed person" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Having an open door is a noble gesture. Yet interruptions split our focus time into useless fragments. Time does not follow simple mathematical rules. 30 + 30 = 60, okay. But having two 30-minute blocks is different from having a full hour of uninterrupted work.&lt;/p&gt;

&lt;p&gt;Instead, close the door.&lt;/p&gt;

&lt;p&gt;I know from personal experience that this may sound harsh, even rude. But why? You cannot walk into your doctor’s office whenever you feel like it, can you? How about having office hours instead? Something like this: &lt;em&gt;"You can reach out to me with ad hoc topics every Tuesday and Thursday from 1pm to 2pm. Everything else, please, only via email."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Again, most ad hoc things can be postponed. Most things are not life-and-death issues. Most things disappear if people have an extra day to think or to read up on the issues themselves.&lt;/p&gt;

&lt;p&gt;Think of this as a learning opportunity. If people still need our support, then we can be sure that they are better prepared to ask the real questions. A more efficient and focused discussion is the result.&lt;/p&gt;
&lt;h3&gt;
  
  
  "Let’s have a party" meetings
&lt;/h3&gt;

&lt;p&gt;I work with many customers from large organisations. I keep seeing "workshops" with more than twenty people scheduled for over four hours. Even before we went remote due to the pandemic, these workshops were often a farce. Remote collaboration with Zoom or MS Teams made it worse.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rNkW8S_p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2ux5r5nzsc4trcflk50h.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rNkW8S_p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2ux5r5nzsc4trcflk50h.jpg" alt="A typical enterprise meeting" width="800" height="534"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We invite people because we can, even if we only need them for two minutes of "expert judgement". People show up because they are invited, not because they want to or can add to the discussion. Often the workshop ends up with a couple of people actually talking and driving the topics. The rest listen, waste time and surf the internet.&lt;/p&gt;

&lt;p&gt;Stop having meetings with too many people.&lt;/p&gt;

&lt;p&gt;Invite two or three people; I’d recommend three, to break any ties in discussions. Small rounds can solve most topics more efficiently than larger rounds. You distribute the initial decisions and collect other people’s insights afterwards. It is much more effective to start from a concrete idea and improve on it than to start with a blank slate in a large round. This is not to say that brainstorming does not work, but it can be an inefficient time-hog.&lt;/p&gt;
&lt;h3&gt;
  
  
  The impatient leader
&lt;/h3&gt;

&lt;p&gt;Who has not received a call, where one of the following phrases was uttered?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;"Do you have a minute?"&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"Please jump onto a quick call!"&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"Please join the workshop tomorrow, it is from 8am to 3pm!"&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We all know these phrases one way or the other. And I do not want to shame every leader who has used variants of them. Sometimes things seem urgent. Even so, these phrases show a disregard for other people’s time. Intentional or not, the negative impact is the same.&lt;/p&gt;

&lt;p&gt;Because we are the leader, people interrupt what they are working on. We reprioritize the person we are talking to. They drop the thing they are working on. They lose focus. They need more time to get back into their work later - to get back into the &lt;a href="https://stackoverflow.blog/2018/09/10/developer-flow-state-and-its-impact-on-productivity/"&gt;flow&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;All because we needed something &lt;em&gt;RIGHT NOW&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;We should step back and check whether we need it &lt;em&gt;RIGHT NOW&lt;/em&gt;. Again, we can postpone most things. Most things are not so urgent that we need them &lt;em&gt;RIGHT NOW&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;We can give that freedom to the people working for us. Instead of interrupting their work, we send an email: &lt;em&gt;"I need your insights on the loan system. Can you please answer the following questions in the next couple of days, or find time for a meeting where we go over them?"&lt;/em&gt; In the meantime, WE work on other topics. Usually, we do not have only one single thing we could work on, right? So we are not blocked because we did not get the answers we were seeking &lt;em&gt;RIGHT NOW&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Patience is a virtue. Demanding something &lt;em&gt;RIGHT NOW&lt;/em&gt; is not patient.&lt;/p&gt;
&lt;h3&gt;
  
  
  The status junkie
&lt;/h3&gt;

&lt;p&gt;I cannot stress it enough. Leaders irritate people when they ask for information they could look up themselves. If we ask people to do busy work, gathering data we could easily fetch ourselves, then we waste their time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--il4tqjvz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://wac-cdn.atlassian.com/dam/jcr:858144d1-e857-4ab8-8861-2a71112e7a37/JSW-tour-board.png%3FcdnVersion%3D1723" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--il4tqjvz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://wac-cdn.atlassian.com/dam/jcr:858144d1-e857-4ab8-8861-2a71112e7a37/JSW-tour-board.png%3FcdnVersion%3D1723" alt="Just use tools" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We should refrain from status questions that tools like Jira can answer. If we cannot find the information we are looking for, then we improve our tooling and dashboards. But we should not rely on tooling alone. Getting real information from people is essential. It shows that the team's work matters. But that is a topic for another article, &lt;em&gt;"The value of one-on-ones"&lt;/em&gt;...&lt;/p&gt;
&lt;h3&gt;
  
  
  Misguided appreciation
&lt;/h3&gt;

&lt;p&gt;As leaders we want to show that we value the opinions and expertise of our colleagues. When designing a new API for the loans system, we invite all backend and frontend engineers to a workshop. We want to get everyone involved and hear everyone’s opinion. And because we are leaders, everybody accepts the invitation and joins the workshop. We end up with a discussion where two or three people talk. Everybody else just listens or, more likely, surfs the internet without paying attention. Despite our good intention of showing that we value the experts in the team, we end up wasting most of their time.&lt;/p&gt;

&lt;p&gt;This is only a single example. Think about all the times you pulled too many people into a workshop when only a handful were actually needed.&lt;/p&gt;

&lt;p&gt;So, why not do exactly the opposite?&lt;/p&gt;

&lt;p&gt;We invite only the people we absolutely require. The rest, we invite as optional. In the invite we write: &lt;em&gt;"We design the new API for the loan system. We need Erik, Denise and Wang for the initial design. If you feel you can add to the discussion, then please join. Otherwise, you will all receive a written summary explaining the decisions made. We invite you to comment on the summary. Your comments will not be dismissed but will be carefully considered."&lt;/em&gt; Each invited person decides whether joining the workshop makes sense. And nobody misses out on information. They get the condensed result of the workshop afterwards. They can digest the information in their own time. We did not mess up their day.&lt;/p&gt;

&lt;p&gt;We respected their time.&lt;/p&gt;

&lt;p&gt;And a final suggestion. During one-on-ones, we should ask people if they feel like we are wasting their time. If we have a good, trustful relationship, then we can get good insights into becoming a better leader. If we are not conducting one-on-ones as a leader, then we should start doing so.&lt;/p&gt;
&lt;h2&gt;
  
  
  Getting time under control
&lt;/h2&gt;

&lt;p&gt;If we want to get back in control, we need to know where our time is spent. This is the starting point - there is no substitute. I recommend maintaining a detailed time log. Start taking notes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;8:30am - 9:00am Status call with product development
9:00am - 9:30am One-on-one with Wang Smith
…
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Maintain these log entries in near real-time. Do not work from memory. Working from memory distorts your log.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kaJgKdVC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k221umoz04hnjsyrhv4r.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kaJgKdVC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k221umoz04hnjsyrhv4r.jpg" alt="Maintain a good old diary" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I realize that this sounds like a lot of work. Trust me, it will be worth it. Maintain a time log for a couple of weeks. Then sit down and analyse that log. Ask these questions for every log entry:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Was this time well spent?&lt;/li&gt;
&lt;li&gt;Was this a priority?&lt;/li&gt;
&lt;li&gt;Was I the best person for this work?&lt;/li&gt;
&lt;li&gt;Could I have delegated this to someone else?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Get rid of the non-essentials. Delegate to other people if they are better suited for the work in question. Delegate to other people if they may grow by doing that work.&lt;/p&gt;

&lt;p&gt;Let me be explicit. This is not a call to be lazy. The lazy leader is a bad leader. This is a call to focus on the essentials. Only work on the top priority at all times. If we cannot work on that, then we use the time log as a helper to see where we spent our time instead.&lt;/p&gt;

&lt;p&gt;Since we are all bad at changing behaviours, we should maintain a time log regularly - two to three weeks every four months or so.&lt;/p&gt;
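&lt;p&gt;To make the analysis step concrete, here is a minimal sketch that totals the minutes per activity, assuming the log format shown above; the parsing rules and activity names are only illustrative:&lt;/p&gt;

```python
import re
from collections import defaultdict

# Parse lines like "8:30am - 9:00am Status call with product development"
# and total the minutes spent per activity. Assumed format: 12-hour times
# with am/pm, a hyphen between start and end, activity text afterwards.
LINE = re.compile(r"(\d{1,2}):(\d{2})(am|pm) - (\d{1,2}):(\d{2})(am|pm) (.+)")

def to_minutes(hour, minute, suffix):
    # Convert a 12-hour clock reading to minutes since midnight.
    hour = hour % 12 + (12 if suffix == "pm" else 0)
    return hour * 60 + minute

def summarize(log):
    totals = defaultdict(int)
    for line in log.strip().splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue  # skip malformed entries instead of failing
        h1, m1, s1, h2, m2, s2, activity = m.groups()
        start = to_minutes(int(h1), int(m1), s1)
        end = to_minutes(int(h2), int(m2), s2)
        totals[activity] += end - start
    return dict(totals)

log = """
8:30am - 9:00am Status call with product development
9:00am - 9:30am One-on-one with Wang Smith
9:30am - 10:00am Status call with product development
"""
print(summarize(log))
# {'Status call with product development': 60, 'One-on-one with Wang Smith': 30}
```

&lt;p&gt;Grouping by activity makes the delegation questions above much easier to answer: two minuted entries already show where an hour went.&lt;/p&gt;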

&lt;h2&gt;
  
  
  In conclusion
&lt;/h2&gt;

&lt;p&gt;Time is the most valuable resource for knowledge workers like software engineers. It is the only resource we cannot scale or buy. It is finite. It is the most important factor for results, productivity, and efficiency.&lt;/p&gt;

&lt;p&gt;In the end we cannot control how other people manage their time. But we can control our own time and we can control how we treat other people’s time. Realising this is the first step to less chaos and a more sustainable pace.&lt;/p&gt;

&lt;p&gt;I pointed out a couple of methods you can use to manage time as a leader or as an individual contributor.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Join meetings if it makes sense&lt;/li&gt;
&lt;li&gt;Offer office hours to reduce interruptions&lt;/li&gt;
&lt;li&gt;Postpone non-blocking, non-essential ad hoc topics&lt;/li&gt;
&lt;li&gt;Set up small, focused, and well-prepared meetings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Be aware of the value of your time and respect other people’s time. They will thank you for it.&lt;/p&gt;

&lt;p&gt;I did not come up with this myself. Let me point to two primary resources that I recommend to any leader or to any person growing into a lead position:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Drucker, Peter F. The Effective Executive. &lt;em&gt;(The language is sometimes dated and cringy to a modern ear. But the content is as relevant as it was 50 years ago, when it was first published)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Fournier, Camille. The Manager's Path: A Guide for Tech Leaders Navigating Growth&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Please reach out if you have other means of managing time or if you disagree with my suggestions.&lt;/p&gt;

</description>
      <category>stress</category>
      <category>leadership</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Dealing with data in microservice architectures - part 3 - replication</title>
      <dc:creator>David Schmitz</dc:creator>
      <pubDate>Sat, 01 May 2021 13:15:56 +0000</pubDate>
      <link>https://dev.to/koenighotze/dealing-with-data-in-microservice-architectures-part-3-replication-4h7b</link>
      <guid>https://dev.to/koenighotze/dealing-with-data-in-microservice-architectures-part-3-replication-4h7b</guid>
      <description>&lt;p&gt;&lt;a href="https://martinfowler.com/articles/microservices.html" rel="noopener noreferrer"&gt;Microservices&lt;/a&gt; is a popular and widespread architectural style for building non-trivial applications. They offer immense advantages but also some &lt;a href="https://www.youtube.com/watch?v=X0tjziAQfNQ" rel="noopener noreferrer"&gt;challenges and traps&lt;/a&gt;. Some obvious, some of a more insidious nature. In this brief article, I want to focus on how to integrate microservices.&lt;/p&gt;

&lt;p&gt;This overview explains and compares common patterns for dealing with data in microservice architectures. I neither claim to be complete regarding approaches nor do I cover every pro and con of each pattern. As always, experience and context matter.&lt;/p&gt;

&lt;p&gt;The series consists of four parts, each focusing on a different pattern.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://blog.koenighotze.de/2020/09/27/2020-09-27-data-distribution-part1/" rel="noopener noreferrer"&gt;Sharing a database&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.koenighotze.de/2020/12/13/2020-12-13-data-distribution-part2/" rel="noopener noreferrer"&gt;Synchronous calls&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Replication&lt;/li&gt;
&lt;li&gt;Event-driven architectures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the &lt;a href="https://blog.koenighotze.de/2020/12/13/2020-12-13-data-distribution-part2/" rel="noopener noreferrer"&gt;last&lt;/a&gt; article we discussed synchronous calls. The resulting challenges on a technological and organisational level led to surprising insights.&lt;/p&gt;

&lt;p&gt;This article introduces replication as a pattern for data integration in microservice landscapes. We will look at the basic concepts and especially at their use in hybrid landscapes - landscapes where we want to integrate pre-existing datastores with microservices.&lt;/p&gt;

&lt;h2&gt;
  
  
  Classical replication at 10 km
&lt;/h2&gt;

&lt;p&gt;First of all, let’s discuss replication itself. We use replication to increase reliability, allow for fail-over and improve performance. The following sketch illustrates a very simplified view of replicating data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4exsn2v0v90p5imo066f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4exsn2v0v90p5imo066f.png" alt="Replication in a Nutshell"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;One primary and two secondary instances of a database are set up for replication. Each change to the primary instance is also sent to the secondary instances. If the primary instance fails, an actor can switch to one of the secondary instances.&lt;br&gt;
The primary instance is the only one capable of processing modifications. It handles all creation, deletion, and modification requests, processes them, and then forwards the changes to the secondary instances. This setup works best for read-mostly use cases, as data can be read from any instance.&lt;/p&gt;
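&lt;p&gt;The write-forwarding idea can be sketched in a few lines of Python. This is a toy model of the diagram above, not a real replication protocol - there are no acknowledgements, no failure handling, and no replication lag:&lt;/p&gt;

```python
class Replica:
    """A secondary instance: applies forwarded changes and serves reads."""

    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)


class Primary(Replica):
    """The only instance that accepts writes; it commits each change
    locally and then forwards it to every secondary."""

    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries

    def write(self, key, value):
        self.apply(key, value)       # commit locally first
        for s in self.secondaries:   # then replicate the change
            s.apply(key, value)


s1, s2 = Replica(), Replica()
primary = Primary([s1, s2])
primary.write("balance:42", 100)

# Reads can be served by any instance once the change has propagated.
print(s1.read("balance:42"), s2.read("balance:42"))  # 100 100
```

&lt;p&gt;Even this toy version hints at the questions below: what if a secondary is unreachable when the primary forwards a change, or an actor reads from it before the change arrives?&lt;/p&gt;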

&lt;p&gt;This design looks trivial at first sight. But the implications warrant some discussion.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How is consistency ensured? &lt;/li&gt;
&lt;li&gt;What happens if actors read from the secondary instances and the instance is not up to date?&lt;/li&gt;
&lt;li&gt;What happens if the network breaks between the instances?&lt;/li&gt;
&lt;li&gt;And so on.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We discuss these issues in detail in the following sections.&lt;/p&gt;

&lt;h2&gt;
  
  
  Replication in microservice landscapes
&lt;/h2&gt;

&lt;p&gt;Let’s take a concrete example to drive the discussion. Suppose a bank wants to modernize its IT and move towards a microservice landscape. One large SQL database stores the financial transactions of customers. This database is of high value to the enterprise and is considered to be the Golden Source. This means that this database contains the "truth" about all data stored within. If the database says you transferred some money from A to B, then this is a fact.&lt;/p&gt;

&lt;p&gt;The microservices could access the database directly, as the following illustration shows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F15z9puz122akk3zhtmo8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F15z9puz122akk3zhtmo8.png" alt="Microservices with a single database"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;As we saw in &lt;a href="https://blog.koenighotze.de/2020/09/27/2020-09-27-data-distribution-part1/" rel="noopener noreferrer"&gt;part one&lt;/a&gt;, sharing a database in this way has its own downsides, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Classical databases often only &lt;a href="https://www.section.io/blog/scaling-horizontally-vs-vertically/" rel="noopener noreferrer"&gt;scale vertically&lt;/a&gt;. There is an upper limit to the number of clients such a database can serve at the same time.&lt;/li&gt;
&lt;li&gt;The classical database may not be available 24/7. It might have some scheduled, regular downtime. &lt;/li&gt;
&lt;li&gt;More often than not, the data model of the database does not fit the use cases of the microservices.&lt;/li&gt;
&lt;li&gt;The ownership of the data may be unclear.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, how does one get from the architecture above to something like the following?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwgjxgo0i7e3bjquj2gye.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwgjxgo0i7e3bjquj2gye.png" alt="Microservices with a single database"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;This is where advanced replication tactics enter the scene. There are many ways to tackle this problem. We focus on Change-Data-Capture and complex transformation pipelines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Change-Data-Capture
&lt;/h3&gt;

&lt;p&gt;The Change-Data-Capture (CDC) framework hooks into a source database. The framework captures all changes to the data - hence the name. Afterwards, the CDC framework transforms and writes the data to a target database. One example technology for this use case is &lt;a href="https://kafka.apache.org/" rel="noopener noreferrer"&gt;Kafka&lt;/a&gt;. The following illustration visualizes this approach.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F83mpuor3vm5g6oe45etz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F83mpuor3vm5g6oe45etz.png" alt="Change-Data-Capture"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;We hook into our source database for example with &lt;a href="https://docs.confluent.io/platform/current/connect/index.html" rel="noopener noreferrer"&gt;Kafka Connect&lt;/a&gt; and &lt;a href="https://debezium.io/" rel="noopener noreferrer"&gt;Debezium&lt;/a&gt;. Debezium reads the database’s &lt;a href="https://docs.microsoft.com/en-us/sql/relational-databases/logs/the-transaction-log-sql-server" rel="noopener noreferrer"&gt;transaction log&lt;/a&gt; (TX Log). Debezium forwards changes to the transaction log to Kafka topics. The microservices (MS) consume the data from the topics and fill their databases (DB) as needed. We can optimize the microservice-databases for the respective use case. For example, one microservice might need a &lt;a href="https://www.postgresql.org/" rel="noopener noreferrer"&gt;PostgreSQL&lt;/a&gt; whereas another needs a &lt;a href="https://redis.io/" rel="noopener noreferrer"&gt;Redis&lt;/a&gt;.&lt;/p&gt;
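&lt;p&gt;For a rough idea of what such a hook looks like, a Debezium source connector is registered with Kafka Connect via a JSON configuration along these lines. The hostnames, credentials, and table names here are placeholders, and some property names vary between Debezium versions (older releases use &lt;code&gt;database.server.name&lt;/code&gt; instead of &lt;code&gt;topic.prefix&lt;/code&gt;):&lt;/p&gt;

```json
{
  "name": "goldensource-cdc-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "goldensource-db.internal",
    "database.port": "5432",
    "database.user": "cdc_reader",
    "database.password": "********",
    "database.dbname": "transactions",
    "topic.prefix": "bank",
    "table.include.list": "public.financial_transaction"
  }
}
```

&lt;p&gt;Once registered, the connector streams each committed row change of the listed tables into Kafka topics, where the microservices pick them up.&lt;/p&gt;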

&lt;p&gt;The initial load can take some time. The framework exports the complete source database, and the consumers must manifest or reconstruct their destination databases. But once this is finished, subsequent changes are fast and small. The next diagrams illustrate this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm6vm1t6agi4qs82u9o0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm6vm1t6agi4qs82u9o0.png" alt="Initial load"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The CDC pipeline (again, Kafka) processes all entries of the transaction log (TX Log). It stores each entry in Kafka-topics and forwards it to the receiving services. These services in turn manifest their local view of those entries.&lt;/p&gt;

&lt;p&gt;The services are operational after the initial run. The CDC pipeline processes only new entries to the transaction log. The next illustration shows this step.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fienat0afja92swjvy7ni.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fienat0afja92swjvy7ni.png" alt="Delta load"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The transaction log contains new entries. These new entries result in new Kafka events. Note that the top-most topic does not contain new events. Kafka forwards the new events of the bottom two topics to the services.&lt;/p&gt;

&lt;p&gt;Creating such a streaming platform is beyond the scope of this article. We only scratched the surface and omitted many relevant details. This &lt;a href="https://www.confluent.io/blog/no-more-silos-how-to-integrate-your-databases-with-apache-kafka-and-cdc/" rel="noopener noreferrer"&gt;article&lt;/a&gt; describes the approach using Kafka tooling.&lt;/p&gt;

&lt;h3&gt;
  
  
  Complex transformation pipelines
&lt;/h3&gt;

&lt;p&gt;We can also transform and enrich the data as part of the data replication process. Let’s use financial transactions as a trivial example again. The next illustration depicts such a pipeline.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgqooirma6aqiv7k1c8lf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgqooirma6aqiv7k1c8lf.png" alt="Transformation pipeline"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The source database stores financial transactions. We use CDC to extract data from the source database and to push the data into raw databases (TX-DB). The raw databases contain copies of the original data. &lt;/p&gt;

&lt;p&gt;In our example, some machine learning tool-set (ML-Magic) analyses the raw data. The result of the analysis is a categorization of the financial transaction. The ML-Magic combines the analysis and the financial transaction data. Finally, the ML-Magic stores this result in a separate enhanced business database. In the example, this is a &lt;a href="https://www.mongodb.com/" rel="noopener noreferrer"&gt;MongoDB&lt;/a&gt; database. &lt;/p&gt;

&lt;p&gt;Microservices use only the business databases. These are derived from the raw data and are optimized for specific use cases. The business database could for example be optimized and contain a denormalized view of the data. New business databases can be added as new use cases arise. &lt;/p&gt;

&lt;h2&gt;
  
  
  Implications
&lt;/h2&gt;

&lt;p&gt;Change-data-capture and transformation pipelines are both valid approaches. Both help to move from existing system landscapes towards a more flexible architecture. We can adopt a microservice landscape without any modification to the existing assets. The microservices each end up with their optimized data-store. This decouples the development teams and increases agility.  &lt;/p&gt;

&lt;p&gt;However, introducing Kafka and similar frameworks increases development complexity. Even so, this may be a valid investment. The resulting architecture may enable the business side to move and grow faster.&lt;/p&gt;

&lt;p&gt;But nothing is a silver bullet. We identify at least the following questions that are worth further investigation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Golden Source remains. What should happen if microservices create new data or change existing data?&lt;/li&gt;
&lt;li&gt;CDC and transformation pipelines take time. How should we deal with data in different states in different parts of our system?&lt;/li&gt;
&lt;li&gt;How can we ensure that data is only used by systems allowed to use said data?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Again, let’s discuss a concrete example. We have talked about financial transactions. Our current system looks as illustrated in the following diagram.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxp9fmvjk6uqvg2afs2t6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxp9fmvjk6uqvg2afs2t6.png" alt="Initial microservice setup"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We hook into the source database (Golden Source) again with Kafka Connect and Debezium (I). Kafka topics store transaction log entries as events. The microservice consumes the topics it needs (II). Afterwards, it manifests a local view in its local business database (III). &lt;/p&gt;

&lt;p&gt;If we want to read financial transactions, we need to query the local business database. The microservice owns the business database. In the following illustration, a caller sends a GET request to the microservice (I). The microservice queries the optimized local database (II) and answers the GET request.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvd09e485754jysyhiono.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvd09e485754jysyhiono.png" alt="A GET request"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But what happens if a client asks the microservice to make a new transfer? A caller sends a POST request to the microservice. The microservice adds the new transaction to its local database. Remember that the Golden Source is the pre-existing source database. It contains the truth - especially the truth about any financial transactions. So we need to send the information about the new transaction to this database as well.&lt;/p&gt;

&lt;p&gt;How do we approach this?&lt;/p&gt;

&lt;p&gt;We could update the local database and then call the API to update the Golden Source. But what happens if the API call fails? Then we need to clean up the local database and also report the error to the caller.&lt;/p&gt;

&lt;p&gt;We could call the API first and only update the local database if the call was successful. Again, this is not as simple as it seems. The problem is the remote call to the API. There are error cases, such as timeouts, that leave us clueless. We do not know whether the API call booked the transfer at all.&lt;/p&gt;

&lt;p&gt;In the end, it doesn't matter. We cannot span a transactional context across an HTTP API call and a local database in a meaningful way. Consider the documentation of the good old &lt;a href="https://docs.jboss.org/jbossas/docs/Server_Configuration_Guide/4/html/TransactionJTA_Overview-Heuristic_exceptions.html" rel="noopener noreferrer"&gt;HeuristicCommitException&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;In a distributed environment communications failures can happen. If communication between the transaction manager and a recoverable resource is not possible for an extended period of time, the recoverable resource may decide to unilaterally commit or rollback changes done in the context of a transaction. Such a decision is called a heuristic decision. It is one of the worst errors that may happen in a transaction system, as it can lead to parts of the transaction being committed while other parts are rolled back, thus violating the atomicity property of transaction and possibly leading to data integrity corruption.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There is a pattern that can help us with this scenario: the outbox. We introduce a message log table (ML). A so-called outbox handler forwards all data of the message log to the Golden Source. See the following illustration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F91r5ov4859gi81gikui2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F91r5ov4859gi81gikui2.png" alt="A POST request"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Updating the message log and the transaction table (TX) happens as one transaction (II and III). Both tables are part of the same database, so a single local transaction is enough.&lt;br&gt;
The microservice can return the result to the caller and finish the request.&lt;/p&gt;
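&lt;p&gt;A sketch of this step, using Python's sqlite3 as a stand-in for the service's local business database; the table and column names are invented for illustration:&lt;/p&gt;

```python
import json
import sqlite3

# The business table (tx) and the message log live in the same local
# database, so a single ordinary transaction covers both writes.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tx (id TEXT PRIMARY KEY, amount INTEGER)")
db.execute(
    "CREATE TABLE message_log (id INTEGER PRIMARY KEY, payload TEXT, status TEXT)"
)

def book_transfer(tx_id, amount):
    # "with db" opens a transaction and commits on success;
    # both inserts roll back together if either one fails.
    with db:
        db.execute("INSERT INTO tx VALUES (?, ?)", (tx_id, amount))
        db.execute(
            "INSERT INTO message_log (payload, status) VALUES (?, 'pending')",
            (json.dumps({"tx_id": tx_id, "amount": amount}),),
        )

book_transfer("tx-1", 100)
print(db.execute("SELECT status FROM message_log").fetchone())  # ('pending',)
```

&lt;p&gt;The key point is that no remote call happens inside the transaction; the pending message-log entry is the durable promise that the Golden Source will be updated later.&lt;/p&gt;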

&lt;p&gt;Now we get to the tricky part: handling the message log. Often the API triggers side-effects beyond updating the Golden Source, for example calling out to other APIs or sending messages downstream.&lt;/p&gt;

&lt;p&gt;The following diagram explores the communication flow. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3g433nth3a1schlyl3h3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3g433nth3a1schlyl3h3.png" alt="Outbox flow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Outbox Handler polls the message log table or subscribes to changes to it (I). It reads the data and calls the API (II). If calling the API was successful, then the handler marks the message log entry as &lt;em&gt;done&lt;/em&gt;. Otherwise, the Outbox Handler retries the operation. If all fails, the handler marks the entry as &lt;em&gt;not processable&lt;/em&gt;. &lt;/p&gt;
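&lt;p&gt;The handler's loop can be sketched as follows. Again sqlite3 stands in for the local database, and the retry limit, status values, and the shape of the API client are assumptions made for this sketch:&lt;/p&gt;

```python
import sqlite3

# Minimal message-log table; in the real service this is the same local
# database the business tables live in.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE message_log (id INTEGER PRIMARY KEY, payload TEXT, status TEXT)"
)
db.execute(
    "INSERT INTO message_log (payload, status) VALUES (?, 'pending')",
    ('{"tx_id": "tx-1", "amount": 100}',),
)
db.commit()

MAX_RETRIES = 3

def process_outbox(db, call_api):
    # Poll pending entries, forward each to the API, and record the outcome.
    # call_api stands in for whatever client updates the Golden Source; it
    # is expected to raise on failure.
    rows = db.execute(
        "SELECT id, payload FROM message_log WHERE status = 'pending'"
    ).fetchall()
    for entry_id, payload in rows:
        for _ in range(MAX_RETRIES):
            try:
                call_api(payload)
                db.execute(
                    "UPDATE message_log SET status = 'done' WHERE id = ?",
                    (entry_id,),
                )
                break
            except Exception:
                continue  # transient failure: retry
        else:
            # All retries failed: park the entry for manual mitigation.
            db.execute(
                "UPDATE message_log SET status = 'not_processable' WHERE id = ?",
                (entry_id,),
            )
        db.commit()

sent = []
process_outbox(db, sent.append)  # a list's append stands in for the API client
print(db.execute("SELECT status FROM message_log").fetchone())  # ('done',)
```

&lt;p&gt;Note that this gives at-least-once delivery: if the handler crashes between the API call and marking the entry as done, the call is repeated, so the receiving API should be idempotent.&lt;/p&gt;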

&lt;p&gt;In such cases, other mitigation strategies come into place. But this is outside of our discussion.&lt;/p&gt;

&lt;p&gt;Suppose the API call was successful. Next, among other things, the API call updates the Golden Source (III). This triggers the CDC pipeline. The CDC component captures the new data added to the Golden Source’s transaction log. Afterwards this data ends up in the Kafka topics (IV). The consuming microservice receives that data (V). Finally, the microservice updates the business database. The database now reflects the state of the Golden Source, too (VI). &lt;/p&gt;

&lt;p&gt;We have omitted many technical details. Still, the complexity of this pattern should stand out. Many things could go wrong at any point. A solid solution must find mitigation strategies for each error case. &lt;/p&gt;

&lt;p&gt;Even so, the eventually consistent character of this architecture does not go away. The new data stored in the business database does not reflect the Golden Source data right away. The time delay may or may not be an issue for the concrete use case. But we need to be aware of it and should analyse its impact.&lt;/p&gt;

&lt;p&gt;The same holds for the topic of data governance. The patterns of this article lead to data replication, i.e. to storing the same data in many places. Depending on the regulatory requirements, we need to control which parts of the system landscape can use which data. This has to be set in place right from the beginning. Refactoring data governance controls into an existing landscape can be very challenging.&lt;/p&gt;

&lt;p&gt;Last but not least, let’s not forget that CDC leads to technical events. Real &lt;a href="https://docs.microsoft.com/en-us/dotnet/architecture/microservices/microservice-ddd-cqrs-patterns/domain-events-design-implementation" rel="noopener noreferrer"&gt;domain events&lt;/a&gt; representing business-level processes are not captured. &lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;All things considered, this can be a good option to grow from a static large datastore to a &lt;a href="https://www.reactivemanifesto.org/" rel="noopener noreferrer"&gt;reactive&lt;/a&gt; and distributed multi-datastore landscape. &lt;/p&gt;

&lt;p&gt;Moving towards modern architectures without any major refactoring of existing systems is possible. We can leverage so-called legacy systems without any direct extra cost.&lt;/p&gt;

&lt;p&gt;We do not change existing systems. So we end up with the "old", "legacy" system landscape, and the new microservice landscape. Complexity and cost increase. We need more engineers. We need more infrastructure. And so on. &lt;/p&gt;

&lt;p&gt;But, we must not confuse this with &lt;a href="https://martinfowler.com/eaaDev/EventSourcing.html" rel="noopener noreferrer"&gt;event-sourcing&lt;/a&gt; or an &lt;a href="https://en.wikipedia.org/wiki/Event-driven_architecture" rel="noopener noreferrer"&gt;event-driven architecture&lt;/a&gt;. It can be the first step into those areas, but only the first. We are dealing with technical events à la &lt;em&gt;"Row x in Table y has changed in values A, D, F"&lt;/em&gt;. This is different from saying &lt;em&gt;"SEPA Transaction executed"&lt;/em&gt;. And we have to deal with eventual consistency. There is no way of avoiding this.&lt;/p&gt;

&lt;p&gt;In conclusion, we need to weigh the advantages and implications of these approaches. There is no single best answer. We need to consider the concrete requirements and use cases. These determine whether this approach is a good fit for our challenge and strategy.&lt;/p&gt;

&lt;p&gt;Here are some references for more in-depth information on related topics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.confluent.io/resources/kafka-summit-2020/change-data-capture-pipelines-with-debezium-and-kafka-streams/" rel="noopener noreferrer"&gt;Change Data Capture Pipelines with Debezium and Kafka Streams&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.confluent.io/blog/no-more-silos-how-to-integrate-your-databases-with-apache-kafka-and-cdc/" rel="noopener noreferrer"&gt;No More Silos: How to Integrate Your Databases with Apache Kafka and CDC&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Reactive Design Patterns, Roland Kuhn et al.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://debezium.io/blog/2019/02/19/reliable-microservices-data-exchange-with-the-outbox-pattern/" rel="noopener noreferrer"&gt;Reliable Microservices Data Exchange With the Outbox Pattern&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Outlook
&lt;/h2&gt;

&lt;p&gt;The next and final installment of this series tackles event-sourcing and event-driven architectures - two powerful and related concepts. We will look at their implementation and advantages, but, as always, also at their implications for design and architecture.&lt;/p&gt;

</description>
      <category>microservices</category>
      <category>changedatacapture</category>
      <category>kafka</category>
    </item>
    <item>
      <title>Dealing with data in microservice architectures - part 2 - synchronous calls</title>
      <dc:creator>David Schmitz</dc:creator>
      <pubDate>Sun, 13 Dec 2020 10:26:17 +0000</pubDate>
      <link>https://dev.to/koenighotze/dealing-with-data-in-microservice-architectures-part-2-14e3</link>
      <guid>https://dev.to/koenighotze/dealing-with-data-in-microservice-architectures-part-2-14e3</guid>
      <description>&lt;p&gt;&lt;a href="https://martinfowler.com/articles/microservices.html" rel="noopener noreferrer"&gt;Microservices&lt;/a&gt; are a popular and widespread architectural style for building non-trivial applications. They offer huge advantages but also some &lt;a href="https://www.youtube.com/watch?v=X0tjziAQfNQ" rel="noopener noreferrer"&gt;challenges and traps&lt;/a&gt; - some obvious, some of a more insidious nature. In this short article, I want to focus on how to integrate microservices.&lt;/p&gt;

&lt;p&gt;This overview explains and compares common patterns for dealing with data in microservice architectures. I neither claim to be complete with regards to available approaches nor do I cover every pro and con of each pattern. As always, experience and context matter.&lt;/p&gt;

&lt;p&gt;Four different parts focus on different patterns.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/koenighotze/dealing-with-data-in-microservice-architectures-part-1-nka"&gt;Sharing a database&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Synchronous calls&lt;/li&gt;
&lt;li&gt;Replication&lt;/li&gt;
&lt;li&gt;Event-driven architectures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;a href="https://dev.to/koenighotze/dealing-with-data-in-microservice-architectures-part-1-nka"&gt;previous article&lt;/a&gt; looked at integrating microservices using one shared database. Sharing a database seems to be a straightforward approach. Even so, it led to architectural and organisational challenges.&lt;/p&gt;

&lt;p&gt;In this part, we'll look at coupling microservices with synchronous calls. We'll start by explaining the pattern itself, then analyse technological and architectural aspects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Synchronous Calls
&lt;/h2&gt;

&lt;p&gt;Integrating microservices through synchronous calls is one of the more straightforward patterns. If a service A needs data owned by another service B, then A uses B's API to fetch whatever it needs.&lt;/p&gt;
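&lt;p&gt;A minimal sketch of this pattern in Python. The transport is injected so the example stays self-contained; in a real service this would be an HTTP client call, and the URL below is hypothetical:&lt;/p&gt;

```python
# Service A fetches the data it needs via service B's API.
def get_account_overview(account_id, http_get):
    # Service A owns no access-privilege data; it asks service B instead.
    response = http_get(f"https://access-privileges.internal/check/{account_id}")
    if response["status"] != 200:
        raise RuntimeError("privilege check failed")
    return {"account": account_id, "allowed": response["body"]["allowed"]}

# Fake transport standing in for the access-privilege service:
def fake_http_get(url):
    return {"status": 200, "body": {"allowed": True}}

print(get_account_overview("DE-123", fake_http_get))
```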

&lt;p&gt;The following image illustrates this pattern.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fzrr6tc7tbfzlwt89s57x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fzrr6tc7tbfzlwt89s57x.png" alt="PNG-Bild-4FFB4AAC6C5F-1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two services serve as examples: one managing bank-accounts and one managing access-privileges. Let’s say a web application needs to fetch the bank-account overview. The web application sends a GET request to the bank-account service. The latter service replies with the overview data.&lt;/p&gt;

&lt;p&gt;The bank-account service checks if the caller can view the requested data. The bank-account service sends a GET request to the access-privilege service. The access-privilege service checks the validity of the request and answers.&lt;/p&gt;

&lt;p&gt;The actual communication protocol does not impact this discussion much. The arguments do not change whether we use REST, gRPC, or even SOAP. &lt;/p&gt;

&lt;p&gt;Even so, be aware that the actual protocol may worsen some implications. For example by increasing the communication overhead. Concrete requirements and use cases should drive the selection of the protocol.&lt;/p&gt;

&lt;p&gt;The advantages of reusing assets via API calls are clear. &lt;/p&gt;

&lt;p&gt;The API shields its users from internal implementation details. Whether a service uses a SQL database or a graph database does not leak to its users.&lt;br&gt;
Changes become easier, too. Changes to the internal data structure and logic do not impact users of the API. This allows for a more nuanced release strategy for new features. Finally, API reuse does not require special middleware or infrastructure. It does not get easier than an HTTPS call.&lt;/p&gt;

&lt;p&gt;I do not want to go into a detailed discussion of the advantages of great APIs. The &lt;a href="https://www.thoughtworks.com/radar/techniques/apis-as-a-product" rel="noopener noreferrer"&gt;internet&lt;/a&gt; provides lots of documentation on this topic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implications
&lt;/h2&gt;

&lt;p&gt;Let’s have a look at the implications of coupling microservices in this way. We start with technical issues like availability and move to organisational aspects towards the end.&lt;/p&gt;

&lt;h3&gt;
  
  
  Testing
&lt;/h3&gt;

&lt;p&gt;One less visible implication is the test setup. The tester needs to satisfy the dependency on the access-privilege service.&lt;/p&gt;

&lt;p&gt;It does not matter whether we are running the service on a local machine or in an integration test. We either need a complex setup running all services, e.g. with docker-compose, or we create stubs or mocks for the downstream dependencies, e.g. using &lt;a href="http://www.mbtest.org/" rel="noopener noreferrer"&gt;Mountebank&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Both approaches increase the risk that bugs and issues surface only in later stages.&lt;/p&gt;
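&lt;p&gt;To illustrate the stubbing option, here is a sketch of a Mountebank imposter configuration that stands in for the access-privilege service during tests. The structure loosely follows Mountebank's imposter format; the port number and path are examples, not part of any real service:&lt;/p&gt;

```python
import json

# Sketch of a Mountebank imposter stubbing the access-privilege service.
imposter = {
    "port": 4545,
    "protocol": "http",
    "stubs": [
        {
            "predicates": [{"equals": {"method": "GET", "path": "/privileges/42"}}],
            "responses": [
                {
                    "is": {
                        "statusCode": 200,
                        "headers": {"Content-Type": "application/json"},
                        "body": json.dumps({"allowed": True}),
                    }
                }
            ],
        }
    ],
}

# POSTing this document to Mountebank's admin API (default port 2525)
# would create the stub; here we only print the configuration.
print(json.dumps(imposter, indent=2))
```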

&lt;h3&gt;
  
  
  Availability
&lt;/h3&gt;

&lt;p&gt;As shown, the bank-account service depends on the access-privilege service at runtime. Downtimes of the access-privilege service impact the bank-account service.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://developer.mozilla.org/en-US/docs/Glossary/Graceful_degradation" rel="noopener noreferrer"&gt;Graceful degradation&lt;/a&gt; is essential. Everything is better from a customer-perspective than a 503 error-page.&lt;/p&gt;

&lt;p&gt;Consider the access-privilege service being up and running, but not responding fast enough. The reasons can be many: a database hiccup, or network congestion. The following image illustrates this case: when sending a GET request, the bank-account service runs into a timeout (TOUT). &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Flwnugnqu2mdsp17od4sv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Flwnugnqu2mdsp17od4sv.png" alt="PNG-Bild-6F4F4FD0B7BE-1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is no general best approach for dealing with these scenarios. For &lt;a href="https://www.restapitutorial.com/lessons/idempotency.html" rel="noopener noreferrer"&gt;idempotent requests&lt;/a&gt;, e.g. GET, retrying the request may be an option. But even that is not always safe. &lt;/p&gt;

&lt;p&gt;The access-privilege service may be under extraordinary stress. Retrying in this scenario will make things even worse. Google’s &lt;a href="https://sre.google/sre-book/handling-overload/" rel="noopener noreferrer"&gt;SRE book&lt;/a&gt; explains the different implications in detail. &lt;/p&gt;
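&lt;p&gt;For the idempotent case, a common technique is a bounded retry with exponential backoff and jitter, so that retries do not pile onto an already stressed service. A minimal sketch (the function names and parameters are illustrative):&lt;/p&gt;

```python
import random

# Retry an idempotent GET with exponential backoff and full jitter.
# Retrying non-idempotent calls, or retrying without backoff against an
# overloaded service, makes the situation worse.
def get_with_retries(do_get, attempts=3, base_delay=0.1, sleep=lambda s: None):
    for attempt in range(attempts):
        try:
            return do_get()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # budget exhausted, propagate the failure
            # Full jitter spreads retries out to avoid retry storms.
            delay = random.uniform(0, base_delay * 2 ** attempt)
            sleep(delay)

calls = []
def flaky_get():
    calls.append(1)
    if len(calls) == 1:
        raise TimeoutError("TOUT")
    return {"status": 200}

print(get_with_retries(flaky_get))  # succeeds on the second attempt
```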

&lt;p&gt;Things get even more complicated if we take non-idempotent requests into account. Let’s look at a different use case. An actor wants to transfer money to some bank account, illustrated by the following image.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F71aellkd5ioq6hailpzo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F71aellkd5ioq6hailpzo.png" alt="PNG-Bild-F3D050E988B7-1"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;The actor uses the transaction (TRX) service (I) to execute a money transfer. The TRX service relies on a third party API for the actual money transfer (II). The third-party API replies with a successful response (III). Finally, the TRX service replies to the actor (IV).&lt;/p&gt;

&lt;p&gt;But what happens if things do not work as expected?&lt;/p&gt;

&lt;p&gt;Suppose the connection from the actor to the TRX service gets dropped. The calling service has no idea whether the money transfer succeeded or not. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Did the transfer service execute the request? Did it only fail to return a response to the calling service? &lt;/li&gt;
&lt;li&gt;Can we retry the money transfer without the risk of transferring the money twice?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This situation requires implementing extra orchestration and compensation logic. Business request ids can determine if a money transfer request was already served.&lt;/p&gt;
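&lt;p&gt;A sketch of deduplicating money transfers with a business request id (an "idempotency key"). The class and field names are illustrative, not a real TRX service API:&lt;/p&gt;

```python
# Remember the outcome of each business request id, so a retried
# request returns the original result instead of moving money twice.
class TransferService:
    def __init__(self):
        self._processed = {}  # request_id maps to the first execution's result

    def transfer(self, request_id, amount, target_iban):
        if request_id in self._processed:
            # Already served: return the original outcome.
            return self._processed[request_id]
        result = {"status": "EXECUTED", "amount": amount, "target": target_iban}
        self._processed[request_id] = result
        return result

svc = TransferService()
first = svc.transfer("req-4711", 100, "DE89 3704 0044 0532 0130 00")
retry = svc.transfer("req-4711", 100, "DE89 3704 0044 0532 0130 00")
assert first is retry  # the duplicate did not trigger a second transfer
```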

&lt;h3&gt;
  
  
  Latency
&lt;/h3&gt;

&lt;p&gt;The following image illustrates the impact on latency. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fxnw7phcmyoqwzp2gxe2u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fxnw7phcmyoqwzp2gxe2u.png" alt="index"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;The bank-account service calls the access privilege service. The access privilege service calls the business partner service. The call between access and business partner takes 1 second. Bank-account sends the final reply after 2 seconds.&lt;/p&gt;

&lt;p&gt;Now assume that the bank-account service should respond within at most 1.5 seconds. In this case, the end-to-end example above will not meet that rule. &lt;/p&gt;

&lt;p&gt;The access privilege service could skip the call to the business partner service and return an error to the bank-account service instead. Passing a deadline from service to service may be one solution: the bank-account service passes an extra deadline parameter that says, "Hey, access privilege service, you have 1 second to reply to my request. Otherwise, I don't need an answer and won't wait for one".&lt;/p&gt;

&lt;p&gt;The details don't matter: performance will suffer because of the communication overhead, and patterns or workarounds that deal with deadlines complicate the service implementation. &lt;/p&gt;
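&lt;p&gt;Such deadline propagation can be sketched as follows. Each hop checks the remaining budget before doing work, refuses calls it can no longer answer in time, and forwards the remaining budget downstream. The parameter names are illustrative:&lt;/p&gt;

```python
import time

def call_downstream(handler, deadline, now=time.monotonic):
    remaining = deadline - now()
    if not remaining > 0:
        # No budget left: fail fast instead of producing a late answer.
        raise TimeoutError("deadline exceeded before the call was made")
    # Forward the *remaining* budget so the callee can decide the same way.
    return handler(remaining)

def access_privilege_service(budget):
    return {"allowed": True, "budget_seen": round(budget, 1)}

deadline = time.monotonic() + 1.0  # bank-account allows 1 second end to end
print(call_downstream(access_privilege_service, deadline))
```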

&lt;h3&gt;
  
  
  Team Interlock
&lt;/h3&gt;

&lt;p&gt;This implication is less technical but rather organisational. Let’s consider an extended scenario, illustrated by the following image.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fo9jmokwri0sfc3ytx4ud.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fo9jmokwri0sfc3ytx4ud.png" alt="PNG-Bild-80CF60D4E262-1"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;The bank-account and the access privilege services depend on a third service. This service provides business partner information, for example first and last name, mail address, and so on. In principle, this setup may work, keeping in mind the implications outlined above.&lt;/p&gt;

&lt;p&gt;The more severe problem lies on the organisational level. Suppose different teams own each service, see the following illustration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fw5oh05ub1vgbxzogjjo7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fw5oh05ub1vgbxzogjjo7.png" alt="PNG-Bild-7B2B9A1A5181-1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Team A and team B depend on team C. Looking at the relationship between the teams, the underlying challenge becomes obvious.&lt;/p&gt;

&lt;p&gt;First, let’s consider a "customer/supplier" relationship. Team C provides a service for the other teams and both team A and B can pass feature requests to team C.&lt;/p&gt;

&lt;p&gt;In this scenario, team C may face a prioritisation problem. Should A or B get their requested feature first? What about conflicting requirements? What about versioning?&lt;/p&gt;

&lt;p&gt;This can lead to very complicated management discussions and change management processes. Note that politics can and will play a role here. If the owner of C is more incentivised to support A than B, this may become problematic for team B. Often this boils down to bonuses or career moves.&lt;/p&gt;

&lt;p&gt;Another interesting relationship is the "conformist", where A and B have to take team C’s services as-is. This means both are at the mercy of team C. If team C changes the API for whatever reason, then team A and B have to conform to the new version. This introduces unplanned engineering effort into A and B. Furthermore, the risk for issues when deploying new versions into production increases.&lt;/p&gt;

&lt;p&gt;These are only two examples; the relationships can be far more complex. Vernon goes into great detail in his book "Implementing Domain-Driven Design".&lt;/p&gt;

&lt;h3&gt;
  
  
  Release cascade
&lt;/h3&gt;

&lt;p&gt;As a last implication, let’s consider the example with three services and three teams again.&lt;/p&gt;

&lt;p&gt;Team A is in a problematic planning situation. The bank-account service depends on both of the other services. &lt;br&gt;
This means that team B and C must release before team A can release its service.&lt;/p&gt;

&lt;p&gt;The following image illustrates this situation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F17kmxfnjafl6s405s3mq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F17kmxfnjafl6s405s3mq.png" alt="PNG-Bild-7263F4DCC491-1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Team A finished their implementation in February. Yet, they have to wait until May before they can continue with deployment. Team B also needs to wait for team C to finish its implementation. The release cascade becomes clear. This requires a high degree of planning alignment between the teams. Finger-pointing due to missed release dates can be one result.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.scaledagileframework.com/agile-release-train/" rel="noopener noreferrer"&gt;Release trains&lt;/a&gt; are one method to cope with such temporal dependencies. Although this approach can work, it can also lead to a decrease in quality. If team C has to meet the deadline of April, they take short-cuts and skip testing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Using synchronous calls to integrate microservices is a straightforward implementation pattern. Easy to put in place, debug, and analyse.&lt;/p&gt;

&lt;p&gt;Yet, as we have seen, it comes with technical and organisational challenges - some obvious, some less so. &lt;/p&gt;

&lt;p&gt;In any case, reliability depends on appropriate timeout configurations and circuit breakers.&lt;br&gt;
Effective monitoring, alerting, and clear &lt;a href="https://sre.google/sre-book/service-level-objectives/" rel="noopener noreferrer"&gt;service level objectives&lt;/a&gt; make life easier for everybody.&lt;/p&gt;

&lt;p&gt;Graceful degradation can mitigate the business impact, e.g. by falling back to a locally cached variant or some default behaviour. The solution space depends on the business domain. The service owner must decide on the proper strategy.&lt;/p&gt;
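&lt;p&gt;A minimal sketch of such a fallback, assuming a simple in-process cache. Whether a stale answer or a safe default is acceptable is a business decision per use case:&lt;/p&gt;

```python
# If the downstream call fails, fall back to a locally cached value
# instead of surfacing a 503 to the customer.
def get_privileges(account_id, fetch, cache):
    try:
        fresh = fetch(account_id)
        cache[account_id] = fresh  # refresh the local fallback copy
        return fresh
    except TimeoutError:
        if account_id in cache:
            return cache[account_id]  # degraded, possibly stale answer
        return {"allowed": False, "degraded": True}  # safe default

cache = {"acc-1": {"allowed": True}}

def broken_fetch(account_id):
    raise TimeoutError("downstream unavailable")

print(get_privileges("acc-1", broken_fetch, cache))  # served from cache
print(get_privileges("acc-2", broken_fetch, cache))  # safe default
```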

&lt;p&gt;The organisational implications are harder to tackle. Personal bias, politics, and money may impact the level of cooperation between teams, especially if the teams cross project boundaries - for example, one team works on a new and shiny cloud service while the other maintains a not-so-shiny legacy backend service. &lt;/p&gt;

&lt;p&gt;We should try to make these dependencies transparent. Knowing the kind of relationship (conformist etc.) is especially helpful for dealing with such situations. Context maps from strategic Domain-Driven Design are a great tool to visualise this.&lt;/p&gt;

&lt;p&gt;It is worth mentioning that the organisational challenges are the same for code and library reuse. If different teams own reused libraries, the same questions need an answer.&lt;/p&gt;

&lt;p&gt;If you want to dig deeper into these topics, then the following books are worth checking out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Site Reliability Engineering: How Google Runs Production Systems, by B. Beyer et al.&lt;/li&gt;
&lt;li&gt;Release It!: Design and Deploy Production-Ready Software, by M. Nygard&lt;/li&gt;
&lt;li&gt;Implementing Domain-Driven Design, by V. Vernon&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;easy to implement&lt;/li&gt;
&lt;li&gt;no direct dependency to persistence technology&lt;/li&gt;
&lt;li&gt;debugging and end-to-end monitoring possible&lt;/li&gt;
&lt;li&gt;dependencies are often explicit&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Latency and availability suffer&lt;/li&gt;
&lt;li&gt;Testing requires a more complex setup&lt;/li&gt;
&lt;li&gt;Release coordination and change management required&lt;/li&gt;
&lt;li&gt;Politics may make things harder &lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Outlook
&lt;/h2&gt;

&lt;p&gt;The next article will look at data replication: autonomous services each relying on a local database, with data replicated between the databases or via some intermediary mechanism. This is the precursor to the final article, which tackles asynchronous events.&lt;/p&gt;

</description>
      <category>domaindrivendesign</category>
      <category>microservices</category>
      <category>rest</category>
    </item>
    <item>
      <title>Dealing with data in microservice architectures - part 1 - share the data?</title>
      <dc:creator>David Schmitz</dc:creator>
      <pubDate>Sun, 27 Sep 2020 09:58:05 +0000</pubDate>
      <link>https://dev.to/koenighotze/dealing-with-data-in-microservice-architectures-part-1-nka</link>
      <guid>https://dev.to/koenighotze/dealing-with-data-in-microservice-architectures-part-1-nka</guid>
      <description>&lt;p&gt;This is my first article after a long project and COVID induced hiatus. I'll present different ways to deal with data and dependencies in microservice architectures.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://martinfowler.com/articles/microservices.html" rel="noopener noreferrer"&gt;Microservices&lt;/a&gt; are a wide-spread architectural style for building distributed applications. They offer huge advantages but also some &lt;a href="https://www.youtube.com/watch?v=X0tjziAQfNQ" rel="noopener noreferrer"&gt;challenges and traps&lt;/a&gt;. Some obvious, some of a more insidious nature. In this short article, I want to focus on how to deal with data, when building microservices.&lt;/p&gt;

&lt;p&gt;Dealing with data and dependencies in a microservice architecture is difficult. There is no one-size-fits-all solution. The trade-offs can be the difference between success and utter disaster. The typical "every microservice has its own database" seems like good advice. But as we will see below, it has its challenges.&lt;/p&gt;

&lt;p&gt;This overview compares popular patterns for dealing with data in microservice architectures. I'll focus on only four, which in my experience are the most common ones. As always, experience and context matter. There are many ways to tackle this problem domain.&lt;/p&gt;

&lt;p&gt;Each of the four parts focuses on one specific approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sharing a database&lt;/li&gt;
&lt;li&gt;Synchronous calls&lt;/li&gt;
&lt;li&gt;Replication&lt;/li&gt;
&lt;li&gt;Event-driven architectures&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Sharing a database
&lt;/h2&gt;

&lt;p&gt;The first pattern is one of the more common approaches to dealing with data. See the following illustration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F59k7k8as9sle5koez0ld.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F59k7k8as9sle5koez0ld.png" alt="Sharing a database"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As shown, two services A and B use and access the same database. There is no real separation on a business or technical level. As indicated by the colour-coding, the database holds data (schemas, tables, ...) that belongs to the domain of service A and of service B, plus extra data that seems to belong to neither A nor B.&lt;/p&gt;

&lt;p&gt;This approach may be a starting point for &lt;a href="https://en.wikipedia.org/wiki/Brownfield_(software_development)" rel="noopener noreferrer"&gt;brownfield implementations&lt;/a&gt;. Services must often use a preexisting database as-is in such environments. &lt;/p&gt;

&lt;p&gt;But even greenfield implementations adopt this style. It is straightforward to use and most familiar to engineers. &lt;br&gt;
Looking at maintenance and knowledge distribution, the advantage is clear. Knowledge sharing and reuse are far easier if all engineers focus on a single technology. In a polyglot environment, engineers must maintain many different database technologies.&lt;/p&gt;

&lt;p&gt;Which leads us to operations.&lt;/p&gt;

&lt;p&gt;This approach is also the most familiar from an operations point of view. Operations must only cope with a single database infrastructure. Monitoring, backup, and security become easier. Ask yourself: "How many databases do you consider yourself an expert in?" &lt;br&gt;
Many engineers are at most an expert in one or maybe two databases. Knowing how to connect to a database and issue queries does not make one an expert in that database.&lt;/p&gt;

&lt;p&gt;But sharing one database has some more or less severe, and not always obvious, implications.&lt;/p&gt;

&lt;p&gt;First of all, let's consider the technical implications.&lt;/p&gt;

&lt;h3&gt;
  
  
  Opaque schema coupling
&lt;/h3&gt;

&lt;p&gt;Let's go back to the diagram above. We can see that the database contains data from at least three different services. If designed according to &lt;a href="https://www.dddcommunity.org/learning-ddd/what_is_ddd/" rel="noopener noreferrer"&gt;DDD&lt;/a&gt;, one can presume three different domains. As an example, service A handles users. It may have a table like the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fzttcspgzh6be0inuye4t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fzttcspgzh6be0inuye4t.png" alt="User table"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Service B also requires some user-related data, e.g. for generating invoices. So, it relies on the name and the address columns of the user database.&lt;/p&gt;

&lt;p&gt;Now, the product owner of the user administration requires a change to the user data. For example, the &lt;code&gt;STREET_AND_NUMBER&lt;/code&gt; column is split into &lt;code&gt;STREET&lt;/code&gt; and &lt;code&gt;NUMBER&lt;/code&gt; columns. The team maintaining service A knows about that change. They implement it, illustrated by the following image.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F48e1ir4ovj2rwt9nbmzx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F48e1ir4ovj2rwt9nbmzx.png" alt="The street_and_number column is split"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But what about the team owning service B?&lt;/p&gt;

&lt;p&gt;There are two cases of interest here: either they do not know about the change, or they do.&lt;/p&gt;

&lt;h4&gt;
  
  
  Scenario 1: The change surprises the team maintaining service B
&lt;/h4&gt;

&lt;p&gt;Team A changes the table as required by their product owner. They apply any necessary change to their code. All tests pass and they deploy the service A and the table changes to an integration test stage.&lt;/p&gt;

&lt;p&gt;Only then does team B discover breaking integration tests. They notice the table change. Now they have to plan extra effort for migrating data and adopting the change to the user table. This delays the implementation of features they had planned instead.&lt;/p&gt;

&lt;p&gt;Be aware that this is the best case in this scenario. Imagine discovering such a problem in production.&lt;/p&gt;

&lt;h4&gt;
  
  
  Scenario 2: The teams communicate the schema changes
&lt;/h4&gt;

&lt;p&gt;Team A plans the required change. Knowing that team B relies on the user data, they approach team B and align on the changes. They come up with a mitigation strategy: they plan to maintain both the previous and the new schema for some time. This allows team B to catch up and work around the disruption.&lt;/p&gt;

&lt;p&gt;The implications are the same as in scenario 1. Team B has to conform to the change of team A. Again this leads to a delay of essential business features they had planned.&lt;/p&gt;

&lt;p&gt;Also, note that this requires team A to be aware of all consumers of &lt;em&gt;"their"&lt;/em&gt; data. Why the quotes around &lt;em&gt;"their"&lt;/em&gt;? One could argue that team A does not really own the user data, since they have consumers relying on it. Depending on the organisational power structure, even team A may not be able to proceed as they see fit.&lt;/p&gt;

&lt;p&gt;What about a new team C that is unaware of team A? And what about technical processes like backups and reports? The change impacts all downstream consumers of the user data.&lt;/p&gt;

&lt;p&gt;In the worst case, you may end up with an organisational power struggle.&lt;/p&gt;

&lt;h3&gt;
  
  
  Runtime coupling
&lt;/h3&gt;

&lt;p&gt;But there are other challenges, too, that are not as obvious as dependency management. Multiple services relying on the same database share the underlying technical resources: Connection pools, CPU, memory,...&lt;/p&gt;

&lt;p&gt;If one service submits a very expensive query, this may impact other services. Unless monitoring is set up, debugging sessions become a game of hunting in the dark. Discovering such cases of service-spanning runtime coupling is no easy feat.&lt;/p&gt;

&lt;p&gt;The same holds for locks, which may even lead to deadlocks. If service A locks a table row and service B needs that data, you are in for some ugly analysis. This is like debugging race conditions in a JVM, only in a distributed scenario.&lt;/p&gt;

&lt;p&gt;Finally, most SQL databases struggle with horizontal scalability. This means there may be an upper limit to how many services can use a database in a performant way. There are notable exceptions like Google's Cloud Spanner, and the impact depends on the database technology (NoSQL databases, for example, scale horizontally). But even those need a close look at the issues pointed out in this section.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mitigating the downsides
&lt;/h2&gt;

&lt;p&gt;There are some ways to mitigate the implications of sharing one database.&lt;br&gt;
For example, the engineers could structure the database itself. Schemas and clear table ownership are a good starting point. The following diagram illustrates this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F0nt5pqkyxtlxzijocuy9.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F0nt5pqkyxtlxzijocuy9.jpeg" alt="Clean schema split between domains"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Service A owns its schema and the tables in that schema. If another service needs that data, then it is clear who is in charge of that data. &lt;/p&gt;

&lt;p&gt;This relation is called &lt;a href="https://www.infoq.com/articles/ddd-contextmapping/" rel="noopener noreferrer"&gt;Conformist&lt;/a&gt;. Downstream consumers have no say with regard to the schema; they need to conform to whatever team A decides.&lt;/p&gt;

&lt;p&gt;This approach is sometimes the first step in migrating to cleaner data approaches, and it is a sensible strategy especially for brownfield environments. You start by refactoring the components of a monolith towards clean schema ownership. Then you can migrate step by step to the approaches described in the following articles.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;It should be clear that sharing data on this level requires extra coordination. Development teams need processes to align releases and planning. Teams are no longer autonomous but rather locked into a distributed data monolith. In general, I recommend this only as a starting point for brownfield projects. If possible, I would rather recommend one of the following patterns instead.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Easy to understand and operate&lt;/li&gt;
&lt;li&gt;Knowledge sharing and setting up teams is easier&lt;/li&gt;
&lt;li&gt;Often a starting point for brownfield scenarios&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Services and thus teams are coupled organisationally and on a technology level&lt;/li&gt;
&lt;li&gt;Coupling is more or less opaque&lt;/li&gt;
&lt;li&gt;Difficult to orchestrate release dependencies&lt;/li&gt;
&lt;li&gt;Insidious bugs are found once released to production&lt;/li&gt;
&lt;li&gt;Prone to behind-the-doors power struggles&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Outlook
&lt;/h2&gt;

&lt;p&gt;The next article discusses synchronous calls between services. There should be no problems, when services "just" send a GET request to other services, right? Well, as we'll see there are some issues and trade-offs.&lt;/p&gt;

&lt;p&gt;Until then, feel free to leave comments. Please point out any omissions or different points of view.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This post was published as &lt;a href="https://koenighotze.de/microservices-data-patterns/part1.html" rel="noopener noreferrer"&gt;https://koenighotze.de/microservices-data-patterns/part1.html&lt;/a&gt;&lt;/em&gt; &lt;/p&gt;

</description>
      <category>domaindrivendesign</category>
      <category>microservices</category>
      <category>database</category>
    </item>
  </channel>
</rss>
