DEV Community: Ujjwal Raj

A Practical Guide to Staying Ahead in Software Engineering

Ujjwal Raj — Sun, 14 Jun 2026 12:54:22 +0000

I always worry about keeping myself updated in this fast-paced era. But I've noticed that I stay quite up to date compared to a good number of developers around me.

I want to share my journey with the people around me, especially new engineers boarding this fast-moving bullet train called software engineering.

For me, it was a bit of luck in the beginning, but I'll tell you how not to be at the mercy of luck.

In 2023, I was very new to university. I knew a bit of competitive programming (CP) and had only an introduction to building a basic web app in JavaScript. Whether through luck or through being good at CP, I got an internship at a fintech organisation in Bangalore, India. Even more luckily, I got a project involving RAG, LLMs, and search algorithms. That pushed me into understanding how data science and AI are shaping the world.

In 2023, ChatGPT was still relatively new, but I sensed the future (or at least what is happening today) and how it would shape developers' lifestyles and the future of engineering.

I was a bit scared, but more importantly, I got an opportunity to learn more and stay hungry to grasp everything happening in the software world. I will share a roadmap and a few tips so that you can stay sharp in this fast-paced future.

FYI, I am a developer who rarely works on core ML or LLMs, but I am still knowledgeable enough to understand new papers published by big players like DeepSeek and Meta. I am aware of the potential of the latest models, frameworks, and developments.

My day-to-day work does not involve ML or AI (except some vibe coding using Claude and GPT). Rather, it involves distributed systems, backend engineering, infrastructure, and related areas. I will guide you on how you can stay in the race.

Building the Foundation

I am a mechanical engineer by formal education. So most of my university courses were not focused on the foundations of ML, AI, or the mathematics required to understand modern AI systems and deep learning.

I remember I got a free Udemy course (which is currently not free). By the way, I have never spent money on a Udemy course except one that was not related to software. That course covered different machine learning algorithms and techniques.

I did not code along with it, but I made sure I watched almost every video and understood the concepts behind each algorithm. That built my fundamentals for the future. From there, I never looked back.

Here are the resources I recommend each of you follow to build the basics if you are not a data scientist or ML engineer but still want to stay updated.

Codebasics YouTube Channel (2–3 Weeks)

https://www.youtube.com/@codebasics/playlists

Watch an ML tutorial playlist and a DL tutorial playlist from this YouTube channel. You can code along if you want. Understanding the basics of all algorithms is a must. Do not skip the fundamentals. Building a habit of understanding fundamentals will help you keep up.

You can also cover RL or image processing if you are interested.

Learn How Great Software Is Built

I always recommend building the habit of grasping concepts after reading about them and making sense of them. This will help you later when understanding research papers in topics you are interested in.

Read Clean Code by Uncle Bob. That is a must. It can push you a year ahead in experience.

Read at least one book on architecture. Pick Clean Architecture by Uncle Bob or even Domain-Driven Design by Eric Evans. At least one is fine.

You can visit www.ujjwalraj.com to get more reviews on what I have read and recommend to beginners.

If you become more interested in this area, you will automatically start picking up more books, courses, and learning resources in this domain. If not, someone else will pick this path, and you will become an expert in some other domain.

Master Your Domain

If you are a working professional, you should understand everything in and around what you work on.

I got this advice from the CTO of a very popular startup in Bangalore. Trust me, it is worth following.

For example, I know how Python works behind the scenes. I know how Docker's architecture works. I understand Kubernetes. I literally read a book on Docker and another on Kubernetes to satisfy my curiosity.

That said, going through a good YouTube video is also fine if you want quicker results. But be confident about it. Confidence gives you more energy and hunger to learn.

Learn Alongside Others

People recommend reading newsletters. But I find it very difficult to maintain the discipline consistently. Eventually, it can exhaust your hunger to learn more.

So instead, join a good community that has discipline, or maybe a cohort (preferably free).

If that's not possible, build one.

I actively participate in a software reading club at my company, which I joined in my second hour at the firm. A link to the group appeared on one of our public Teams channels, and I joined immediately. I'm glad I did.

We have weekly reading sessions, and I love volunteering as a moderator. We covered three books last year by spending just one hour together every Friday.

Participate in Hackathons If Time Permits

This is also important to keep the spark alive.

Participate in online or offline hackathons with challenging problem statements if you have the time and bandwidth. The goal is simple: build things, don't just consume content.

Stay Current With AI Tools: Watch All Videos on the Claude YouTube Channel (Under 4 Hours in Total)

I am recommending this in 2026. This is a must because Claude is shining right now. Later it could be OpenAI, Gemini, or someone else—we never know.

The goal is to keep yourself up to date with AI-assisted coding tools so you can be more productive, create more time for yourself, and ultimately spend more time on life outside work.

Read Build a Large Language Model (From Scratch) by Sebastian Raschka or Watch a YouTube Tutorial on How an LLM Works

This is not mandatory, but if you want to understand papers released by DeepSeek, Meta, and others, it helps a lot—especially if you are not from an ML or data science background and are more focused on software development.

Maintain a Learning Backlog

I use Todoist, by the way, and have organized it like a Jira board. It's up to you how you maintain a To-Read list. I would love to hear recommendations on this as well.

Maintaining a To-Read or To-Learn list helps you track your progress. I have a lot in my backlog, and this is a glimpse of it.

You don't have to read everything in one day or one week. But you do need to be consistent. So you need to track everything.

Add all your hot topics there and start picking them up when you are bored or when you feel like learning something during a dull weekend.

Time Management

Once you are here, don't stop. Keep doing it because the world is moving fast.

Learn time management. I think engineers are naturally good at it.

Once you become good at learning quickly—which will take around 3–4 months if you follow the advice above—you are going to save a lot of time.

Be patient and learn how to learn fast. The steps above provide a good roadmap if you don't know what to learn or where to start.

In the last year, I got an early promotion at work, learned to swim, started volunteering more in the reading club, started working out 3–4 times a week, and wrote technical blogs as well. Recently I started learning to sketch (the cover image of this blog is sketched by me).

I attended family functions in my hometown and all major festivals. I am very busy at work, but I still have a lot of time left for myself, my friends, and my family.

Here, I am not lucky—I am developing this skill.

I used to struggle a lot, but I learned this from a senior architect at my previous company, where I joined a cohort. (That's why I recommend joining a good community.)

I still need to improve a lot, but I am definitely better than I was a year ago.

Accept Slow Weeks

Earlier, I used to get nervous if I failed to learn anything in a week.

But slowly, I realized that this is fine.

It is normal to have an unproductive week, but don't forget to add things to your backlog or To-Do list; otherwise, you will forget them.

Keep Your Brain Under Load

One piece of advice I would give is this: always keep your brain under some form of productive stress.

Software engineering is not only about knowledge; it is also about thinking. If you stop challenging yourself for long periods, you may notice that your problem-solving speed, curiosity, and ability to grasp new concepts start to decline.

You do not have to participate in coding contests if that is not your thing. But you should regularly do something that pushes your limits.

That could be:

Building side projects
Reading technical books
LeetCode/Codeforces coding contests
Exploring a new technology
Writing blogs
Understanding a research paper
Teaching others

If you do not use your muscles, they weaken over time. The same applies to your ability to learn and think deeply.

Keep your brain under load, and learning new things will become significantly easier throughout your career.

Conclusion

If you take only one thing from this article, let it be this: You do not need to be the smartest engineer in the room. You do not need to read every paper. You do not need to know every framework. You only need to remain curious and consistent. Technology changes quickly, but the ability to learn has remained the most valuable skill throughout every generation of software engineering. Build that skill, and you will adapt to whatever comes next.

I would love to hear your thoughts and how you keep yourself updated in this rapidly changing industry.

Distributed Systems: Raft Algorithm for Leader Election

Ujjwal Raj — Sat, 21 Feb 2026 10:34:27 +0000

Welcome to another article in the distributed systems series. We will discuss the Raft algorithm and how a leader is elected in a replicated system when the leader goes down and followers become candidates.

Safety and Liveness

In a replicated system, safety means that there is at most one leader for any given election term. Liveness means that when the leader fails, a new leader is eventually elected.

States of the Machine

Any process, at a given time, is in one of three states — leader, follower, or candidate. Every election is identified by a term value, which is simply an integer.

When the system starts up, all processes are in the follower state.

In normal operation, every follower periodically receives heartbeats from the current leader, stamped with the current election term. If no heartbeat arrives before a timeout expires, the follower concludes that the leader is dead. It then starts a new election by incrementing the current term and transitioning to the candidate state. It votes for itself and sends a request to all other processes in the system to vote for it, stamping the request with the new election term.

From here, three things can happen:

The process wins the election: This happens if the candidate receives votes from a majority of the processes in the system, counting its own vote. It then transitions into the leader state and starts sending heartbeats to the other processes. Note that each process votes for at most one candidate in a given term, on a first-come, first-served basis.
Another process wins the election: If the candidate receives a heartbeat from another process whose term is greater than or equal to its own current term, it accepts that process as the leader and transitions back to the follower state.
A period of time passes with no winner: This rarely happens, but if the vote is split and no candidate receives a majority, the candidate times out and starts a new election with an incremented term. Raft randomizes the election timeouts so that repeated split votes are unlikely.
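The state transitions above can be sketched in a few lines of Python. This is a minimal, single-process sketch and not a full Raft implementation: the `Node` class, its method names, and `run_election` are hypothetical names chosen for illustration, messages are plain method calls instead of network RPCs, and the log comparison that real Raft performs before granting a vote is left out because this article only covers terms and votes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    node_id: int
    state: str = "follower"          # "follower", "candidate", or "leader"
    current_term: int = 0
    voted_for: Optional[int] = None  # candidate voted for in current_term

    def _observe_term(self, term: int) -> None:
        # Seeing a higher term forces a step down and clears the old vote.
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
            self.state = "follower"

    def start_election(self) -> None:
        # Heartbeat timeout: bump the term, become candidate, vote for self.
        self.current_term += 1
        self.state = "candidate"
        self.voted_for = self.node_id

    def handle_vote_request(self, term: int, candidate_id: int) -> bool:
        if term < self.current_term:
            return False  # stale candidate
        self._observe_term(term)
        # At most one vote per term, first come, first served.
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False

    def handle_heartbeat(self, term: int, leader_id: int) -> bool:
        if term < self.current_term:
            return False  # heartbeat from an outdated leader
        self._observe_term(term)
        self.state = "follower"  # a candidate accepts the other process as leader
        return True


def run_election(candidate: Node, cluster: List[Node]) -> bool:
    """Run one election round; return True if the candidate becomes leader."""
    candidate.start_election()
    votes = 1  # the candidate's own vote
    for peer in cluster:
        if peer is not candidate and peer.handle_vote_request(
            candidate.current_term, candidate.node_id
        ):
            votes += 1
    if votes > len(cluster) // 2:  # strict majority of all processes
        candidate.state = "leader"
        for peer in cluster:
            if peer is not candidate:
                peer.handle_heartbeat(candidate.current_term, candidate.node_id)
        return True
    return False  # split vote: time out and retry with a new term


cluster = [Node(i) for i in range(5)]
print(run_election(cluster[0], cluster), cluster[0].state, cluster[0].current_term)
```

If another node later times out and wins an election in a higher term, the old leader steps down as soon as it sees that term, which is how the sketch keeps at most one leader per term.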

Conclusion

The Raft algorithm ensures reliable leader election in a replicated system by using terms, heartbeats, and majority voting. It guarantees safety by allowing at most one leader per term and ensures liveness by electing a new leader when the current one fails. This makes Raft simple, predictable, and suitable for building fault-tolerant distributed systems.

Follow for more articles on distributed systems and computer science.

Part II : Building My First Large Language Model from Scratch

Ujjwal Raj — Sat, 01 Nov 2025 12:07:49 +0000

Welcome to another article on building an LLM from scratch.

There’s been a little delay in bringing you the second part of this series, as I took some time off for my Diwali vacation, a much-needed break filled with lights. Now I’m ready to dive back into building LLMs from scratch.

If you haven’t read the first one, it’s a quick read - you can go ahead and check it out: Building My First Large Language Model from Scratch

Till this point, we have coded the multi-headed attention mechanism.

Overview of Building LLM Architecture

We will start building the LLM architecture now. After the multi-head attention mechanism, we get a tensor which we will process through different deep learning neural network layers. The weights of these layers are collectively called parameters in the context of LLMs.
You must have heard that GPT-2 Small has 124M parameters - that’s the total number of trainable weights (and biases) across all of its layers.

We use the same configuration values as GPT-2 Small as the basis for our model.

GPT_CONFIG_124M = {
      "vocab_size" : 50257,    # vocab size of tiktoken
      "context_length" : 1024, # maximum input tokens handled by positional embedding layer (see previous blog)
      "emb_dim" : 768,         # each token is converted into a 768-dimensional vector
      "n_heads" : 12,          # number of heads in the attention mechanism (see previous blog)
      "n_layers" : 12,         # number of transformer blocks in the model
      "drop_rates" : 0.1,      # dropout rate (fraction of values randomly zeroed during training)
      "qkv_bias" : False       # Query-Key-Value bias in attention mechanism
}
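To make the 124M figure concrete, here is a rough parameter count derived from the config values. It is a sketch under assumptions that the config alone does not state: I assume an embedding width of 768 (GPT-2 small), no bias on the query/key/value projections, a bias on the attention output projection and on both feedforward layers, a trainable scale and shift in every layer normalization, and an output layer without bias.

```python
# Rough parameter count for the GPT-2 small layout (layer layout is an assumption).
cfg = {"vocab_size": 50257, "context_length": 1024, "emb_dim": 768, "n_layers": 12}

d = cfg["emb_dim"]
token_emb = cfg["vocab_size"] * d            # token embedding table
pos_emb = cfg["context_length"] * d          # positional embedding table

attention = 3 * d * d + (d * d + d)          # W_query, W_key, W_value (no bias) + output projection
feed_forward = (d * 4 * d + 4 * d) + (4 * d * d + d)  # expand to 4x, project back
layer_norms = 2 * (2 * d)                    # two norms per block, scale + shift each
block = attention + feed_forward + layer_norms

final_norm = 2 * d
output_head = d * cfg["vocab_size"]          # linear layer producing the logits

total = token_emb + pos_emb + cfg["n_layers"] * block + final_norm + output_head
tied = total - output_head                   # output layer reuses the token embedding weights

print(f"separate output layer: {total:,}")   # → separate output layer: 163,009,536
print(f"shared (tied) weights: {tied:,}")    # → shared (tied) weights: 124,412,160
```

Under these assumptions, the 124M label corresponds to the count where the output layer shares its weights with the token embedding.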

Using the above config, a stack of 12 transformer blocks is built. Let’s start building a single such block.

Applying Layer Normalization

Training deep learning neural networks with multiple layers can sometimes be challenging due to problems like vanishing or exploding gradients. The learning process may struggle to minimize the loss function during backpropagation.

We use layer normalization to improve the stability and efficiency of neural network training. The idea is to normalize the output of a layer so that it has a mean of 0 and a variance of 1 (unit variance). In GPT-2, layer normalization is applied before the multi-head attention module and again after it, before the feedforward network.

You can learn more about layer normalization and its implementation here: https://medium.com/@sujathamudadla1213/layer-normalization-48ee115a14a4

Just like in GPT-2, layer normalization is chosen over batch normalization for greater flexibility and stability. It is also beneficial in distributed training.
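As a minimal sketch of the idea, the function below normalizes one token vector to zero mean and unit variance. It is written in plain Python so it runs without any dependencies; the function name, the `eps` default, and the optional `scale` and `shift` arguments (standing in for the trainable parameters a real layer would learn) are my own choices for illustration.

```python
import math
from typing import List, Optional


def layer_norm(
    x: List[float],
    scale: Optional[List[float]] = None,
    shift: Optional[List[float]] = None,
    eps: float = 1e-5,
) -> List[float]:
    """Normalize one vector across its feature dimension."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)  # biased variance (divide by n)
    normed = [(v - mean) / math.sqrt(var + eps) for v in x]
    scale = scale or [1.0] * len(x)  # trainable in a real model
    shift = shift or [0.0] * len(x)  # trainable in a real model
    return [s * n + b for s, n, b in zip(scale, normed, shift)]


out = layer_norm([1.0, 2.0, 3.0, 4.0])
mean = sum(out) / len(out)
var = sum((v - mean) ** 2 for v in out) / len(out)
print(round(mean, 6), round(var, 3))
```

Because the statistics are computed per token vector and not across a batch, the result does not depend on batch size, which is one reason layer normalization is more flexible than batch normalization.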

Implementing the Activation Function (GELU) in Feedforward Network

A computationally cheaper approximation of GELU is used as the activation function.

The smoothness of GELU leads to better optimization during training than ReLU (as shown in the figure below).

ReLU has a sharp corner at zero and outputs exactly zero for any negative input, which can make optimization harder during training. In contrast, with GELU, neurons that receive negative input still contribute to the learning process to a small extent.
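Here is a small comparison of the two functions. I am assuming the cheaper approximation meant here is the widely used tanh-based one; the helper names are mine.

```python
import math


def gelu(x: float) -> float:
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def relu(x: float) -> float:
    return max(0.0, x)


for x in (-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):+.4f}")
```

For negative inputs ReLU is flat at zero, while GELU returns a small negative value, so those neurons still pass a gradient during backpropagation.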

A Feedforward Layer Architecture

The above figure shows how a feedforward layer looks. It contains three layers: a linear layer, followed by a non-linear GELU activation layer, and then another linear layer. The first linear layer expands the embedding dimension by a factor of four (768 to 3,072 for GPT-2 small), GELU is applied in that wider space, and the second linear layer projects it back down to the original dimension. Doing this provides a richer representation space.
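The expand-then-contract shape can be sketched with plain Python lists. This is an illustration only: the embedding size is shrunk to 4 so the numbers stay readable, the weights are random instead of trained, and `linear`, `make_linear`, and `feed_forward` are hypothetical helper names.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def gelu(x: float) -> float:
    # Tanh approximation of GELU.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    # weight has shape (out_dim, in_dim); returns a vector of length out_dim.
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def make_linear(in_dim: int, out_dim: int, rng: random.Random) -> Tuple[Matrix, Vector]:
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def feed_forward(x: Vector, w1: Matrix, b1: Vector, w2: Matrix, b2: Vector) -> Vector:
    hidden = [gelu(h) for h in linear(x, w1, b1)]  # expanded to 4 * emb_dim
    return linear(hidden, w2, b2)                  # projected back to emb_dim


emb_dim = 4  # toy size; GPT-2 small uses 768
rng = random.Random(0)
w1, b1 = make_linear(emb_dim, 4 * emb_dim, rng)
w2, b2 = make_linear(4 * emb_dim, emb_dim, rng)

x = [0.5, -1.0, 0.25, 2.0]
out = feed_forward(x, w1, b1, w2, b2)
print(len(x), len(linear(x, w1, b1)), len(out))  # → 4 16 4
```

The input and output have the same size, which is what allows the block to be stacked repeatedly.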

Shortcut Connection

A shortcut (residual) connection provides an alternate path for the data and the gradients that skips one or more layers, such as the feedforward layer. This is achieved by adding the output of one layer to the output of a later layer (as shown in the figure). This helps prevent vanishing-gradient problems.
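A toy calculation shows why this helps. If a layer computes y = f(x), the gradient passing through it is multiplied by f'(x); with a shortcut, y = x + f(x), so the factor becomes 1 + f'(x). The numbers below are assumptions chosen for illustration (20 scalar layers, each with a local derivative of 0.1), not measurements from a real model.

```python
n_layers = 20
local_grad = 0.1  # assumed derivative f'(x) of each layer's transformation

# Without shortcuts the factors multiply and the gradient shrinks towards zero.
plain = local_grad ** n_layers

# With a shortcut each layer contributes (1 + f'(x)), so the signal survives.
with_shortcut = (1.0 + local_grad) ** n_layers

print(f"without shortcut: {plain:.3e}")
print(f"with shortcut:    {with_shortcut:.3f}")
```

The first value is vanishingly small, while the second stays above 1, so the early layers still receive a usable learning signal.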

Assembling the Transformer Block

Now we will assemble one unit of the transformer block. This will be repeated 12 times in the final LLM architecture.

Within a block, the multi-head attention module is followed by a dropout layer. After that dropout layer, a layer normalization layer is added, followed by a feedforward layer and another dropout layer. The shortcut connections are created as shown in the figure.

The idea is that the attention mechanism identifies and analyzes the relationships between elements in the input sequence, while the feedforward network modifies the data individually at each position. This way, the model is enhanced to handle complex patterns.

The dropout layer helps prevent overfitting.
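The wiring of one block can be sketched as a function that takes its sub-layers as arguments. The ordering below (normalize, attend, dropout, add shortcut, then normalize, feedforward, dropout, add shortcut) is my reading of the description above, and the stand-in sub-layers in the usage lines are placeholders, not real attention or feedforward layers.

```python
from typing import Callable, List

Tensor = List[List[float]]           # (sequence length, embedding dim)
Layer = Callable[[Tensor], Tensor]


def add(a: Tensor, b: Tensor) -> Tensor:
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def transformer_block(
    x: Tensor,
    norm1: Layer,
    attention: Layer,
    norm2: Layer,
    feed_forward: Layer,
    dropout: Layer,
) -> Tensor:
    shortcut = x
    x = dropout(attention(norm1(x)))     # relate each position to the others
    x = add(x, shortcut)                 # first shortcut connection

    shortcut = x
    x = dropout(feed_forward(norm2(x)))  # transform each position individually
    return add(x, shortcut)              # second shortcut connection


# Placeholder sub-layers, only to show that shapes are preserved.
identity: Layer = lambda t: t
zeros: Layer = lambda t: [[0.0] * len(row) for row in t]

tokens = [[1.0, 2.0], [3.0, 4.0]]
print(transformer_block(tokens, identity, zeros, identity, zeros, identity))
# → [[1.0, 2.0], [3.0, 4.0]]
```

With sub-layers that output zeros, the input passes through unchanged via the shortcuts, and the output shape always matches the input shape, which is what lets the block be repeated 12 times.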

Generating Text: Greedy Decoding

The LLM is assembled as shown in the figure below.
The final output normalization is done, followed by a linear layer that converts each token vector (768-dim) to a vocabulary-sized vector (~50k-dim). These output vectors are called logits.

Now we’ll understand how the final tensor is used to compute the next token, which forms the LLM’s response.

To compute the next generated token, we generate a logits vector. The logits vector has a dimension equal to the vocabulary size (~50k in our case). Each entry is an unnormalized score for one token ID: the higher the score, the more likely that token is to come next.
For example, if the entry at index 2 becomes 0.12 once the scores are converted to probabilities with softmax, it means the probability of token ID 2 being next is 0.12. So, we simply select the token ID with the highest value as the next generated token.

The following figure from Towards AI helps illustrate the flow:

The softmax function is used to convert the logits into a probability distribution. Since softmax is a monotonic function, it preserves the ordering of the logits, so you can take the maximum of the logits directly. The idea remains the same - pick the most probable next token.
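A tiny example with made-up logits for a four-token vocabulary shows both points: softmax produces probabilities that sum to 1, and the largest logit and the largest probability sit at the same index.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values: List[float]) -> int:
    return max(range(len(values)), key=lambda i: values[i])


logits = [1.0, 3.5, -0.5, 2.0]   # made-up scores, one per token ID
probs = softmax(logits)

print([round(p, 3) for p in probs])
print(argmax(logits), argmax(probs))  # → 1 1
```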

A typical logits tensor will have the shape:
(batch size, sequence size, vocab size)

The above figure shows how each logits vector is generated iteratively, and each token is produced in every iteration.
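The iterative loop can be sketched as follows. A real model returns logits of shape (batch size, sequence size, vocab size) and only the logits at the last position are used; here `toy_model` is a hypothetical stand-in that returns just that last-position vector for a five-token vocabulary, so the loop itself is the only thing being illustrated.

```python
from typing import Callable, List


def generate_greedy(
    next_token_logits: Callable[[List[int]], List[float]],
    token_ids: List[int],
    max_new_tokens: int,
    context_length: int,
) -> List[int]:
    ids = list(token_ids)
    for _ in range(max_new_tokens):
        context = ids[-context_length:]        # crop to the supported context size
        logits = next_token_logits(context)    # scores for the next position only
        next_id = max(range(len(logits)), key=lambda i: logits[i])  # greedy pick
        ids.append(next_id)                    # feed the output back in as input
    return ids


def toy_model(context: List[int]) -> List[float]:
    # Hypothetical model: always favours "previous token ID + 1" (mod 5).
    vocab_size = 5
    target = (context[-1] + 1) % vocab_size
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]


print(generate_greedy(toy_model, [0], max_new_tokens=6, context_length=4))
# → [0, 1, 2, 3, 4, 0, 1]
```

Each generated token is appended to the input before the next step, which is the autoregressive behaviour of the model.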

Conclusion

With this, we’ve laid the foundation for how individual transformer blocks come together to form a complete LLM.

Next, we’ll explore how we label an LLM based on its parameters, how weights can be reused in input and output layers, and the most interesting part - pretraining the whole architecture.

Till then, stay tuned.

Building My First Large Language Model from Scratch

Ujjwal Raj — Fri, 03 Oct 2025 07:39:39 +0000

Welcome everyone. I would like to share my experience of building my own LLM from scratch. In this article, you will come across the details of LLM architecture. I followed a great book: Build a Large Language Model (From Scratch) by Sebastian Raschka.

The whole point is that I built the GPT architecture piece by piece, layer by layer. Once done, it was impossible to train on the CPU/GPU I have locally, so I loaded the weights of GPT-2, which are publicly available from OpenAI. As the last part of the process, I also fine-tuned the model to solve classification problems like spam detection. The series will contain 2-3 articles - from building the GPT architecture to pre-training to fine-tuning.

The whole experience enriched my understanding of the deep inner workings of Large Language Models.

I will be adding some code snippets, which are just a glimpse of some aspects covered. You can skip those if not needed. They are optional and only serve to deepen understanding.

LLMs use a Decoder to predict the next word

Broadly speaking, the GPT architecture predicts the next word, which happens repeatedly. This iterative process generates entire new sentences, paragraphs, and even pages.

GPT-2 is an autoregressive, decoder-only model. Autoregressive models incorporate their previous outputs as inputs for future predictions.
How does it predict? The architecture will be explained step by step in the next sections.

Building the model

Building text tokenisation

Before text can be mapped to points in a continuous vector space, it has to be split into tokens, and each token has to be converted into an integer ID.

The tiktoken library provides the byte-pair-encoding vocabulary used by GPT-2. This was used to create the tokenising layer. Tokenisation is simply converting every token in the text into an ID, which is an integer number. A total of 50,257 tokens are used in GPT-2.

You can learn more about how byte-pair encoding is used to generate token IDs here: https://www.geeksforgeeks.org/nlp/byte-pair-encoding-bpe-in-nlp/
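To show what "text in, integer IDs out" means, here is a toy word-level tokenizer. It is deliberately not byte-pair encoding: it splits on words and punctuation and builds its vocabulary from a single sample sentence, and all function names are hypothetical. GPT-2's real tokenizer works on sub-word units, so it can also encode words it has never seen.

```python
import re
from typing import Dict, List

TOKEN_PATTERN = r"\w+|[^\w\s]"  # words, or single punctuation marks


def build_vocab(text: str) -> Dict[str, int]:
    tokens = sorted(set(re.findall(TOKEN_PATTERN, text)))
    return {token: idx for idx, token in enumerate(tokens)}


def encode(text: str, vocab: Dict[str, int]) -> List[int]:
    return [vocab[token] for token in re.findall(TOKEN_PATTERN, text)]


def decode(ids: List[int], vocab: Dict[str, int]) -> str:
    inverse = {idx: token for token, idx in vocab.items()}
    return " ".join(inverse[i] for i in ids)


vocab = build_vocab("the cat sat on the mat.")
ids = encode("the cat sat.", vocab)
print(ids)                 # → [5, 1, 4, 0]
print(decode(ids, vocab))  # → the cat sat .
```

With the actual library, the equivalent steps are `tiktoken.get_encoding("gpt2")` followed by `encode` and `decode` on the returned object.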

Building text embeddings

An embedding layer is created which converts the tokens into embeddings, each of 768 dimensions. The layer corresponds to two steps.

Step 1: A torch embedding layer is created with the input dimension equal to vocabulary size and output size of 768 (for GPT-2 small). This neural network layer is trained during pre-training (via backpropagation).

Step 2: A positional embedding is calculated and added to the token embedding to get the final vector embedding. The positional embedding has the same dimension as the token embedding. This is calculated by taking values [0,1,2,3…] and embedding them using a torch embedding layer.

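import torch
vocab_size, context_len, output_dim = 50257, 1024, 768  # GPT-2 small values

# Token embedding: one trainable 768-dimensional vector per token ID
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
token_ids = torch.randint(0, vocab_size, (context_len,))  # placeholder token IDs
token_embedding = token_embedding_layer(token_ids)

# Positional embedding: one trainable vector per position 0..context_len-1
positional_embedding_layer = torch.nn.Embedding(context_len, output_dim)
positional_embedding = positional_embedding_layer(torch.arange(context_len))

# Calculating vector embedding (element-wise sum of the two)
input_embedding = token_embedding + positional_embedding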

Building the Attention Mechanism to get the context vector

The attention mechanism is very important in LLMs. It allows each position in the input to consider the relevance of all other positions in a sequence.

You can learn more about attention mechanism basics here: https://www.ibm.com/think/topics/attention-mechanism.

In this implementation, a multi-head attention mechanism is coded. Multiple instances of self-attention are created, each with its own set of weights. The outputs are then combined. Multi-head attention is computationally expensive but very important for recognizing complex patterns.

The operation computed inside each head is called scaled dot-product attention; multi-head attention runs several of these in parallel.

Multi-head attention mechanism implementation

You can read more about multi-head attention here: https://www.geeksforgeeks.org/nlp/multi-head-attention-mechanism/.

In GPT-2 small, there are 12 attention heads. So each output vector of a head has a dimension of output_dim (768) / num_heads (12) = 64.

The three weights—W_query, W_key, and W_value—are trained later via backpropagation.

Queries, keys, and values are obtained by multiplying the input embeddings (which already include positional embeddings) by W_query, W_key, and W_value, and the results are split into multiple heads. Then, the dot product of queries and keys is calculated for each head to get the attention scores. Masking is applied to these scores before softmax turns them into attention weights. Finally, the outputs of all heads are concatenated.

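From the above diagram, you can see that the attention scores for future positions are masked, so each token can only attend to itself and to earlier tokens. This is what allows the model to learn to generate new, accurate, and sensible tokens.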

The following softmax function is used for calculating attention scores:

This is how each attention score is calculated for each head:

Later, all the heads are concatenated to obtain the final context vector.
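For a single head, the computation is softmax(Q·Kᵀ / sqrt(d_k))·V with future positions masked out. Below is a plain-Python sketch of that one head. It assumes the queries, keys, and values have already been produced by the W_query, W_key, and W_value projections, and it implements the mask by simply not looking at later positions, which gives the same result as setting their scores to negative infinity before softmax. The function names are mine.

```python
import math
from typing import List

Matrix = List[List[float]]  # (sequence length, head dim)


def softmax(row: List[float]) -> List[float]:
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    d_k = len(keys[0])  # head dimension, 64 in GPT-2 small
    context = []
    for i, q in enumerate(queries):
        # Scores against positions 0..i only; later positions are masked.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        weights = softmax(scores)  # attention weights, sum to 1
        context.append([
            sum(w * values[j][c] for j, w in enumerate(weights))
            for c in range(len(values[0]))
        ])
    return context


x = [[1.0, 0.0], [0.0, 1.0]]  # toy input reused as Q, K, and V
context = causal_attention(x, x, x)
print(context[0])                          # → [1.0, 0.0]
print([round(v, 3) for v in context[1]])
```

The first position can only see itself, so its context vector equals its own value vector; the second position mixes both values. In the multi-head version, this runs once per head and the resulting context vectors are concatenated.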

Conclusion

We have now created the multi-head attention mechanism. This is a core part of the GPT model.

In the next article, we will implement the transformer block and output layers before attempting to pre-train the model.

Here’s a summary of what we’ve done so far and what’s upcoming:

See you in the next article.

The System Design Series

Ujjwal Raj — Tue, 12 Aug 2025 17:45:30 +0000

Welcome! This series is your gateway to understanding system design through the lens of volatility-based decomposition. Whether you’re just exploring architecture or already designing large-scale systems, you’ll find clear explanations, a practical framework, and examples that make the concepts easy to follow.

Series Overview

This series covers fundamental and advanced concepts in system design using The Method. It covers all the rules and explanations along with the framework and examples. It is based on the book Righting Software by Juval Löwy. It's designed for learners looking to build their foundations as well as professionals deepening their expertise. Some programming background is helpful, but I recommend starting from the beginning if you’re new.

About Me

I’ve worked across the spectrum, from fast-paced startups to Fortune 500 Big Tech companies, building systems that handle millions of users, designing intelligent, data-driven solutions, and tackling challenges in security and distributed systems.

I draw insights not just from my own experiences, but also from collaborating with talented, highly experienced engineers, as well as from the wisdom I gain from books. My writing brings all of these influences together so the community can grow along with me.

Feel free to connect with me at ujjwal.dev.to@gmail.com - I love engaging with my readers and I’m always up for exchanging ideas and learning together.

Explore the Series

How to Navigate

Most articles are standalone, while some build on previous topics. If you’re new, start from the top and progress sequentially.

Have questions or requests? Let me know in the comments or email me at ujjwal.dev.to@gmail.com!

Stay Connected

Bookmark this post for quick access to every article in the series.
Follow me on dev.to for the latest content.
Share with friends or colleagues who could benefit!

Distributed Systems Series: Your One-Stop Guide

Ujjwal Raj — Tue, 12 Aug 2025 10:46:14 +0000

Welcome! This post is your central hub to navigate the series on distributed systems. Whether you’re a beginner or a practicing engineer, you’ll find clear explanations, practical guides, and deep dives into key topics—all in one place.

Series Overview

This series covers fundamental and advanced concepts in distributed systems, including consistency models, networking, partitioning, the practicality of eventual consistency, and much more. It's designed for learners looking to build their foundations as well as professionals deepening their expertise. Some programming and networking background is helpful, but I recommend starting from the beginning if you’re new.

About the Author:

Feel free to connect with me at ujjwal.dev.to@gmail.com - I love engaging with my readers and I’m always up for exchanging ideas and learning together.

Explore the Series

How to Navigate

Most articles are standalone, while some build on previous topics. If you’re new, start from the top and progress sequentially. I sometimes add new articles or update existing ones when I find something missing in the series.

Have questions or requests? Let me know in the comments or email me at ujjwal.dev.to@gmail.com!

Stay Connected

Bookmark this post for quick access to every article in the series.
Follow me on dev.to for the latest content.
Share with friends or colleagues who could benefit!

System Design Example using 'The Method'

Ujjwal Raj — Sun, 20 Jul 2025 07:46:11 +0000

Welcome to another Sunday and another article on system design. Every Sunday we have been discussing how to design a good software system. This is the final article of the series. Don't worry if you have not read the previous ones: we carry forward the rules that we keep at our fingertips.

The aim is to show that system design is not tech but an engineering art. The series is based on the book Righting Software by Juval Löwy. In this article, I use the same example as chapter 5 of the book.

Here are the rules to follow:

Avoid functional decomposition (what we were doing in universities), and remember: a good system design speaks — through how components interact.
The client should not be the core business. Let the client be the client — not the system.
Decompose based on volatility — list the areas of volatility.
There is rarely a one-to-one mapping between a volatility area and a component.
List the requirements, then identify the volatilities using both axes:
- What can change for existing customers over time?
- Keeping time constant, what differs across customers? What are the different use cases?
(Remember: these axes are almost always independent.)
Verify whether a solution/component is masquerading as a requirement. Verify it is not variability. A volatility is not something that can be handled with if-else; that’s variability.
Use the layered approach and proper naming convention:
- Names should be descriptive and avoid atomic business verbs.
- Use <NounOfVolatility>Manager, <Gerund>Engine, <NounOfResource>Access.
- Atomic verbs should only be used for operation names, not service names.
The layers given in the template should correspond to four questions:
- Client: Who interacts with the system?
- Managers: What is required of the system?
- Engines: How does the system perform the business logic?
- ResourceAccess: How does the system access the resources?
- Resource: Where is the system state?
Validate if your design follows the golden ratio (Manager:Engine). Some valid ratios: 1:(0/1), 2:1, 3:2, 5:3.

More than 5 Managers? You’re likely going wrong.
Volatility decreases from top to bottom, while reusability increases from top to bottom.
Slices are subsystems. No more than 3 Managers in a subsystem.
Design iteratively, build incrementally.
Design with the smallest set of reusable components needed to support core use cases. A good architecture integrates ~10–20 components to support them composably. Features are outcomes of integration, not implementation.
Design Don'ts
1. A Client should not call multiple Managers for a single use case.
2. A Client must not call Engines.
3. Managers must not queue calls to more than one Manager in the same use case. The need to have two (or more) Managers respond to a queued call is a strong indication that more Managers (and maybe all of them) would need to respond — so you should use a Pub/Sub Utility service instead.
4. Engines and ResourceAccess services do not receive queued calls.
5. Clients, Engines, ResourceAccess, or Resource components do not publish events.
6. Engines, ResourceAccess, and Resources do not subscribe to events. This must be done in a Client or a Manager.
7. Engines never call each other.
8. ResourceAccess services never call each other.
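Some of these rules are mechanical enough to check automatically. The sketch below validates the call chain of one use case against Design Don'ts 1, 2, 7, and 8 only. The checker and every component name in it (`TradesmanPortal`, `MarketManager`, `MatchingEngine`, and so on) are hypothetical, chosen to follow the naming convention above; they are not taken from the book's design.

```python
from typing import Dict, List, Set, Tuple


def check_use_case(layers: Dict[str, str], calls: List[Tuple[str, str]]) -> List[str]:
    """Return the violations of Design Don'ts 1, 2, 7 and 8 in one use case."""
    violations: List[str] = []
    managers_per_client: Dict[str, Set[str]] = {}
    for caller, callee in calls:
        src, dst = layers[caller], layers[callee]
        if src == "Client" and dst == "Manager":
            managers_per_client.setdefault(caller, set()).add(callee)
        if src == "Client" and dst == "Engine":
            violations.append(f"{caller} -> {callee}: a Client must not call Engines")
        if src == dst == "Engine":
            violations.append(f"{caller} -> {callee}: Engines never call each other")
        if src == dst == "ResourceAccess":
            violations.append(f"{caller} -> {callee}: ResourceAccess services never call each other")
    for client, managers in managers_per_client.items():
        if len(managers) > 1:
            violations.append(f"{client}: calls multiple Managers in a single use case")
    return violations


# Hypothetical components and the layer each one belongs to.
layers = {
    "TradesmanPortal": "Client",
    "MarketManager": "Manager",
    "MembershipManager": "Manager",
    "MatchingEngine": "Engine",
    "RegulatingEngine": "Engine",
    "ProjectsAccess": "ResourceAccess",
}

good = [
    ("TradesmanPortal", "MarketManager"),
    ("MarketManager", "MatchingEngine"),
    ("MatchingEngine", "ProjectsAccess"),
]
bad = good + [
    ("TradesmanPortal", "MembershipManager"),  # second Manager in the same use case
    ("MatchingEngine", "RegulatingEngine"),    # Engine calling Engine
]

print(check_use_case(layers, good))  # → []
for problem in check_use_case(layers, bad):
    print(problem)
```

A check like this could run against a design document or a dependency graph, but it only covers the mechanical rules; identifying the volatilities still takes judgment.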

System Design Example

TradeMe is a platform that connects independent tradesmen (like plumbers, electricians, carpenters, etc.) with contractors who need skilled labor for construction projects. Tradesmen can list their skills, availability, location, and expected rate, while contractors can post project requirements, locations, desired skills, and budgets.

Key points:

Matching System: Facilitates dynamic matching of contractors and tradesmen based on skills, timing, and rates.
Market-Driven Pricing: Rates are influenced by discipline, skill level, experience, location, project type, supply-demand, risk, weather, and certification.
Project Flexibility: Tradesmen may be hired for varying durations, often joining and leaving projects at different times.
Payment & Compliance: TradeMe handles payments, logs hours, manages regulatory compliance, and prevents direct contractor-tradesman arrangements.
Revenue Model: Earns through a rate spread and annual membership fees from both contractors and tradesmen.
Call Centers: Nine localized centers with reps manage project-tradesman assignments based on local laws and codes.
Competition: A rival app focusing on cheaper labor is gaining traction, appealing to cost-focused contractors.

The Core Use Case

There is only one core use case: TradeMe is a system for matching tradesmen to contractors and projects.

The Anti Design Effort

The following figure shows the anti-design, which should be avoided at all costs. It is based purely on functional decomposition.

The Architecture

Who

Tradesmen: Independent, self-employed workers with specialized skills.
Contractors: General contractors who hire tradesmen for specific tasks on construction projects.
TradeMe Reps: Account representatives who manage matching and scheduling via call centers.
Education Centers: Institutions providing certification and continuing education for tradesmen.
Background Processes: Automated system functions like payment scheduling and regulatory reporting.

What

Membership: Both tradesmen and contractors register and pay to use the platform.
Marketplace: A digital platform where contractors post projects and tradesmen offer services.
Certificates & Training: Tracks qualifications and ongoing training requirements for tradesmen.

How

Searching: Tradesmen and contractors use the platform to find each other based on skills, availability, and rates.
Regulatory Compliance: Ensures wages, taxes, certifications, and safety standards are met.
Access to Resources: Provides tools for scheduling, reporting, and payment handling.

Where

Local Database: Stores regional data specific to each call center’s jurisdiction.
Cloud: Hosts centralized application logic, user data, and scheduling algorithms.
Other Systems: Interfaces with regulatory bodies, educational institutions, and financial systems.

Areas of Volatility

The core idea behind decomposition is to identify areas of volatility — parts of the system likely to change — and encapsulate them to reduce the impact of change. The TradeMe design team carefully analyzed these volatilities and mapped them to specific architectural components. Here's a summary:

🧩 Key Observations from Volatility Analysis

Volatility ≠ Variability: For example, tradesman attributes may vary (e.g., adding skills), but they do not affect the architecture significantly, so they are not volatile.
Volatility Requires Clarity: Each candidate must be examined to determine why it's volatile and what risk it introduces.
Some volatilities are external: For instance, payments may change frequently, but since TradeMe doesn’t implement its own payment system, it's treated as an external Resource.

🧱 Identified Areas of Volatility and Their Encapsulation

Volatile Area	Encapsulated In	Notes
Client Applications	Client Application	Different UI tech, devices, access methods; volatile interfaces.
Membership Management	Membership Manager	Adding/removing users, benefits; varies by locale.
Fees and Revenue Models	Market Manager	How the system makes money is changeable.
Project Definitions	Market Manager	Size and scope impact workflows; central to matching logic.
Dispute Resolution	Membership Manager	Must handle fraud and misunderstandings.
Matching and Approvals	Search Engine & Market Manager	Matching logic and search criteria both volatile.
Education Workflow	Education Manager	Scheduling, searching for training; certification required.
Certifications & Regulations	Regulation Engine	Rules change frequently by region or over time.
Reports & Auditing	Regulation Engine	Compliance-driven volatility.
Localization (Language, UI)	Client Applications & Regulation Engine	Can affect UI and regulatory compliance.
External Resources (e.g., Payments)	Resources and ResourceAccess	Store data and connect to systems with volatile access tech.
Deployment Strategies	Message Bus & Subsystems	Geographic, cloud/on-premise, and modular deployment needs.
Authentication & Authorization	Security Utility	Flexible and diverse in real-world scenarios.
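
To make the "rarely one-to-one" point concrete, here is a small sketch that inverts part of the table above and groups the volatile areas by the component that encapsulates them. The rows are copied from the table; the dictionary layout and the function name `areas_by_component` are my own choices for illustration, not part of The Method.

```python
from collections import defaultdict

# Rows copied from the table above: volatile area -> encapsulating component(s).
ENCAPSULATION = {
    "Client Applications": ["Client Application"],
    "Membership Management": ["Membership Manager"],
    "Fees and Revenue Models": ["Market Manager"],
    "Project Definitions": ["Market Manager"],
    "Dispute Resolution": ["Membership Manager"],
    "Matching and Approvals": ["Search Engine", "Market Manager"],
    "Education Workflow": ["Education Manager"],
    "Certifications & Regulations": ["Regulation Engine"],
    "Reports & Auditing": ["Regulation Engine"],
}


def areas_by_component(mapping):
    """Invert the table: component -> list of volatile areas it encapsulates."""
    inverted = defaultdict(list)
    for area, components in mapping.items():
        for component in components:
            inverted[component].append(area)
    return dict(inverted)


grouped = areas_by_component(ENCAPSULATION)
for component, areas in sorted(grouped.items()):
    print(f"{component}: {', '.join(areas)}")
```

Nine volatile areas land in six components here, with the Market Manager alone absorbing three of them.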

📌 Design Insight

Not all volatile aspects deserve their own component; grouping them logically (like in a Manager) avoids architectural complexity.
If a proposed decomposition leads to a tangled system or asymmetry, it’s a sign of poor design.
Volatility-based decomposition aligns the system with business risk and adaptability.

This approach ensures that changes in business needs or external conditions have a localized impact, preserving the system’s stability and maintainability.

Static Architecture

Some component dependency diagrams

The Add Tradesman/Contractor call chain

Request Tradesman call chains (until matching)

Call chains for the Match Tradesman use case

Call chains for the Assign Tradesman use case

Call chains for the Terminate Tradesman use case

Call chains for the Pay Tradesman use case

Call chains for the Create Project use case

Call chains for the Close Project use case

Conclusion

System design is not merely an academic exercise or a diagramming ritual — it's the art of managing change. Through this series, we’ve consistently reinforced one truth: volatility is the compass of good architecture. By identifying and encapsulating the parts of a system most likely to change, we build not only for functionality today, but for adaptability tomorrow.

Using TradeMe as a running example, we’ve explored how to decompose a system based on business-driven change vectors, not arbitrary technical boundaries. We've learned to avoid the traps of functional decomposition and instead think in terms of clients, managers, engines, and resources, each aligned to a specific responsibility in the system.

While this is the end of the series, it should be the beginning of a new lens through which you view system design. Design iteratively, build incrementally, and always measure your architecture against the axis of volatility. That is how software stays simple, stable, and scalable — even in the face of constant change.

Thanks for reading every Sunday.

Here are links to previous articles in case you missed them:

System Design Isn’t About Requirements — It’s About Change

Ujjwal Raj — Sun, 13 Jul 2025 07:33:42 +0000

Welcome to another Sunday Blog on System Design.
This Sunday, we’ll explore how to deal with requirements and changes while designing a system.

Here are the rules we've already summarized. As I often say:
Learn the rules, follow the rules — and only once you're good at it, should you even think about bending them.

I suggest going through these two articles to fully grasp the ideas discussed here: Part 1 | Part 2

Avoid functional decomposition (what we were doing in universities), and remember: a good system design speaks — through how components interact.
The client should not be the core business. Let the client be the client — not the system.
Decompose based on volatility — list the areas of volatility.
There is rarely a one-to-one mapping between a volatility area and a component.
List the requirements, then identify the volatilities using both axes: — What can change for existing customers over time? — Keeping time constant, what differs across customers? What are different use cases? (Remember: Almost always, these axes are independent.)
Verify whether a solution/component is masquerading as a requirement. Verify it is not variability. A volatility is not something that can be handled with if-else; that’s variability.
Use the layered approach and proper naming convention:
- Names should be descriptive and avoid atomic business verbs.
- Use <NounOfVolatility>Manager, <Gerund>Engine, <NounOfResource>Access.
- Atomic verbs should only be used for operation names, not service names.
The layers given in the template should correspond to four questions:
- Client: Who interacts with the system
- Managers: What is required of the system
- Engines: How the system performs the business logic
- ResourceAccess: How the system accesses the resources
- Resource: Where the system state is stored
Validate if your design follows the golden ratio (Manager:Engine). Some valid ratios: 1:(0/1), 2:1, 3:2, 5:3.

More than 5 Managers? You’re likely going wrong.
Volatility decreases from top to bottom, while reusability increases from top to bottom.
Slices are subsystems. No more than 3 Managers in a subsystem.
Design iteratively, build incrementally.
Design Don'ts
1. A Client should not call multiple Managers for a single use case.
2. A Client must not call Engines.
3. Managers must not queue calls to more than one Manager in the same use case. The need to have two (or more) Managers respond to a queued call is a strong indication that more Managers (and maybe all of them) would need to respond — so you should use a Pub/Sub Utility service instead.
4. Engines and ResourceAccess services do not receive queued calls.
5. Clients, Engines, ResourceAccess, or Resource components do not publish events.
6. Engines, ResourceAccess, and Resources do not subscribe to events. This must be done in a Client or a Manager.
7. Engines never call each other.
8. ResourceAccess services never call each other.

Requirements and Changes

Requirements change. Accept it—that’s what requirements do.

Changing requirements drive demand for software professionals, ensuring job security and better compensation.

Designing systems strictly based on initial requirements is a flawed and painful approach. Despite being a common practice, it often leads to failure because requirements are inherently incomplete, inaccurate, or subject to change. Capturing every use case up front is nearly impossible, and even if it were done perfectly, changes are inevitable. Designing against rigid requirements leads to wasted effort, rework, and frustration. Instead, systems should be built to accommodate change, as designing against static requirements is ultimately futile.

Core Use Case

In any system, most use cases are simply variations of a few essential behaviors. These essential behaviors are called core use cases, which represent the fundamental business needs of the system and rarely change. All other use cases—such as error handling, customer-specific adaptations, or incomplete scenarios—are non-core and change frequently.

Despite a system potentially having hundreds of use cases, there are usually only 1 to 5 core use cases. These are not always clearly stated in requirement documents and must be discovered through analysis and abstraction. Identifying core use cases is a key responsibility of the architect, often requiring iteration and collaboration with stakeholders. While you shouldn't design directly against detailed requirements, analyzing them helps reveal what’s truly core and what is volatile.

As an architect, your primary goal is to identify the smallest set of components needed to support all core use cases. Since non-core use cases are just variations of these, they can be handled by different interactions among the same components—not by changing the architecture itself.

This approach is called composable design. It focuses on building flexible, reusable components rather than targeting specific use cases, which are often incomplete, inconsistent, and subject to change. Implementation-level changes (like integration logic inside managers) may occur, but the architecture remains stable.

Composable design makes systems resilient to requirement changes and enables validation by checking if all core use cases can be satisfied through specific component interactions. This can be done with call chain diagrams, which show how components interact to fulfill a use case, offering a practical way to verify the design without needing perfect or complete requirements.

Here is an example of a call chain diagram:

Call chain diagrams are a fast and simple way to validate whether a system design can support a specific use case by showing interactions between components. However, they have limitations—they don’t show the order, duration, or frequency of calls, and can become unclear with complex interactions. Despite this, they are often sufficient for basic validation and are especially useful for communicating with nontechnical stakeholders due to their simplicity.
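
The same validation can be sketched in code. Below, a call chain is modelled as a list of (caller, callee) pairs and checked against a component set. All component names are hypothetical and the two checks are my own simplification of what a diagram shows at a glance.

```python
# Hypothetical component set for illustration; not taken from a real design.
COMPONENTS = {
    "WebClient", "OrderManager", "PricingEngine",
    "OrderAccess", "OrderStore",
}

# A call chain is modelled as an ordered list of (caller, callee) pairs.
PLACE_ORDER_CHAIN = [
    ("WebClient", "OrderManager"),
    ("OrderManager", "PricingEngine"),
    ("OrderManager", "OrderAccess"),
    ("OrderAccess", "OrderStore"),
]


def missing_components(chain, components):
    """Return the components the chain needs that the architecture lacks."""
    used = {name for call in chain for name in call}
    return sorted(used - components)


def is_connected(chain):
    """Check that every caller was already reached by an earlier call."""
    reached = {chain[0][0]}
    for caller, callee in chain:
        if caller not in reached:
            return False
        reached.add(callee)
    return True


print(missing_components(PLACE_ORDER_CHAIN, COMPONENTS))
print(is_connected(PLACE_ORDER_CHAIN))
```

If a core use case needs a component that is not in the set, or the chain has a gap, the design does not yet support that use case.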

Smallest set

As an architect, your mission is to design the smallest possible set of components that can support all core use cases, minimizing complexity and development effort. "Smallest" doesn’t mean a single monolithic component, nor does it mean one component per use case—both extremes are poor designs due to internal complexity or high integration cost.

Instead, aim for an architecture with around 10–20 components, which strikes a balance between simplicity and flexibility. This range, seen across various systems (like the human body or a car), is powerful due to combinatorics: a small number of reusable components can be combined in many ways to support numerous use cases.
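
A rough way to see the combinatorics: count how many distinct groups of components could take part in a use case. This is only an illustration under a simplifying assumption, since the layering rules forbid many of these groups in a real design.

```python
from math import comb


def component_subsets(n):
    """Number of non-empty groups that can be formed from n components (2^n - 1)."""
    return 2 ** n - 1


for n in (10, 20):
    # comb(n, 3) counts the groups made of exactly three components.
    print(n, component_subsets(n), comb(n, 3))
```

Going from 10 to 20 components takes the number of possible groups from about a thousand to about a million, which is why a small set of reusable components can cover so many use cases.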

Good architecture encapsulates volatility and uses logical layers (e.g., Managers, Engines, Resources) to remain adaptable as requirements evolve. Once you’ve reached a component set that can’t be reasonably reduced further without compromising clarity or function, you’ve found your optimal design—your smallest set.

Design Duration

Identifying core use cases and areas of volatility may take weeks or months, but that’s part of requirements analysis, not design. Once this groundwork is done, producing a valid design using composable principles (like The Method) should take a day or a week at most—and with experience, possibly just a few hours. The key idea is that design itself is fast when you're clear on what the system truly needs.

Handling Change

A fundamental rule of system design is:

Features are always aspects of integration, not implementation.

This means features emerge not from isolated code or components, but from how those components are combined. It's a universal and fractal rule—whether it's a car transporting you or a laptop enabling word processing, the feature arises from integration, not any single part.

Trying to implement features directly (as if they were standalone pieces of code) goes against how systems truly work. Functional decomposition, which focuses on coding features in isolation, leads to rigid, fragile systems that are hard to change—since changes affect many areas at once.

Fighting change by deferring it or dismissing user needs kills a system. Customers need immediate solutions, not promises for the next release. A system that can't adapt quickly will be abandoned, even if it's still technically alive. To keep a system alive and relevant, architecture must embrace change—and fast response to evolving requirements is essential.

The key to handling change is not avoiding it—but containing its impact. In a well-architected system using volatility-based decomposition (as defined in The Method), changes typically affect use cases, which are implemented by Managers. While a Manager might need to be rewritten due to a change in behavior, the core components it integrates—Engines, ResourceAccess, Resources, Utilities, and Clients—remain intact.

This structure ensures that:

Managers are expendable and cheap to rewrite.
Most of the system's effort and complexity lies in the reusable components beneath the Manager.
By preserving and reusing these components, you contain the cost and effort of adapting to change.

This approach allows rapid adaptation without major rewrites—true agility. You don’t redesign the whole system when a requirement changes—you just rewire the existing pieces.
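
Here is a minimal sketch of that rewiring. Every class name, the flat fee, and the approval limit are hypothetical. The point is only that the rewritten Manager reuses the same Engine and ResourceAccess without touching them.

```python
class PricingEngine:
    """Reusable business rule (hypothetical): add a flat handling fee."""

    def price(self, amount):
        return amount + 5.0


class OrderAccess:
    """Reusable ResourceAccess: hides where orders are stored (in memory here)."""

    def __init__(self):
        self.saved = []

    def save(self, order):
        self.saved.append(order)


class OrderManager:
    """Original workflow: price the order, then store it."""

    def __init__(self, pricing, access):
        self.pricing, self.access = pricing, access

    def place_order(self, amount):
        order = {"total": self.pricing.price(amount), "status": "placed"}
        self.access.save(order)
        return order


class ApprovalOrderManager:
    """Rewritten workflow after a requirement change: large orders wait for approval."""

    def __init__(self, pricing, access, limit=1000.0):
        self.pricing, self.access, self.limit = pricing, access, limit

    def place_order(self, amount):
        total = self.pricing.price(amount)
        status = "pending_approval" if total > self.limit else "placed"
        order = {"total": total, "status": status}
        self.access.save(order)
        return order


pricing, access = PricingEngine(), OrderAccess()
before = OrderManager(pricing, access).place_order(100.0)
after = ApprovalOrderManager(pricing, access).place_order(2000.0)
print(before, after)
```

Only the Manager changed between the two versions. The Engine and the ResourceAccess class are the same objects in both calls.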

Updated rules

Let's update the set of rules we have learnt. We will keep these at our fingertips while designing systems in future articles.

Avoid functional decomposition (what we were doing in universities), and remember: a good system design speaks — through how components interact.
The client should not be the core business. Let the client be the client — not the system.
Decompose based on volatility — list the areas of volatility.
There is rarely a one-to-one mapping between a volatility area and a component.
List the requirements, then identify the volatilities using both axes: — What can change for existing customers over time? — Keeping time constant, what differs across customers? What are different use cases? (Remember: Almost always, these axes are independent.)
Verify whether a solution/component is masquerading as a requirement. Verify it is not variability. A volatility is not something that can be handled with if-else; that’s variability.
Use the layered approach and proper naming convention:
- Names should be descriptive and avoid atomic business verbs.
- Use <NounOfVolatility>Manager, <Gerund>Engine, <NounOfResource>Access.
- Atomic verbs should only be used for operation names, not service names.
The layers given in the template should correspond to four questions:
- Client: Who interacts with the system
- Managers: What is required of the system
- Engines: How the system performs the business logic
- ResourceAccess: How the system accesses the resources
- Resource: Where the system state is stored
Validate if your design follows the golden ratio (Manager:Engine). Some valid ratios: 1:(0/1), 2:1, 3:2, 5:3.

More than 5 Managers? You’re likely going wrong.
Volatility decreases from top to bottom, while reusability increases from top to bottom.
Slices are subsystems. No more than 3 Managers in a subsystem.
Design iteratively, build incrementally.
Design with the smallest set of reusable components needed to support core use cases. A good architecture integrates ~10–20 components to support them composably. Features are outcomes of integration, not implementation.
Design Don'ts
1. A Client should not call multiple Managers for a single use case.
2. A Client must not call Engines.
3. Managers must not queue calls to more than one Manager in the same use case. The need to have two (or more) Managers respond to a queued call is a strong indication that more Managers (and maybe all of them) would need to respond — so you should use a Pub/Sub Utility service instead.
4. Engines and ResourceAccess services do not receive queued calls.
5. Clients, Engines, ResourceAccess, or Resource components do not publish events.
6. Engines, ResourceAccess, and Resources do not subscribe to events. This must be done in a Client or a Manager.
7. Engines never call each other.
8. ResourceAccess services never call each other.

Conclusion

Starting next week, we’ll begin exploring real-world examples of software design using The Method.
See you next Sunday!

Here are links to previous articles in case you missed them:

Design Don’ts in System Design with ‘The Method’

Ujjwal Raj — Sun, 06 Jul 2025 07:38:22 +0000

Welcome to another Sunday Blog on System Design.
This Sunday, we’ll explore the don’ts to keep in mind when designing a system using The Method.

Here are the rules we've already summarized. As I often say:
Learn the rules, follow the rules — and only once you're good at it, should you even think about bending them.

I suggest going through the last two articles to fully grasp the ideas discussed here: Part 1 | Part 2

Avoid functional decomposition (what we were doing in universities), and remember: a good system design speaks — through how components interact.
The client should not be the core business. Let the client be the client — not the system.
Decompose based on volatility — list the areas of volatility.
There is rarely a one-to-one mapping between a volatility area and a component.
List the requirements, then identify the volatilities using both axes: — What can change for existing customers over time? — Keeping time constant, what differs across customers? What are different use cases? (Remember: Almost always, these axes are independent.)
Verify whether a solution/component is masquerading as a requirement. Verify it is not variability. A volatility is not something that can be handled with if-else; that’s variability.
Use the layered approach and proper naming convention:
- Names should be descriptive and avoid atomic business verbs.
- Use <NounOfVolatility>Manager, <Gerund>Engine, <NounOfResource>Access.
- Atomic verbs should only be used for operation names, not service names.
The layers given in the template should correspond to four questions:
- Client: Who interacts with the system
- Managers: What is required of the system
- Engines: How the system performs the business logic
- ResourceAccess: How the system accesses the resources
- Resource: Where the system state is stored
Validate if your design follows the golden ratio (Manager:Engine). Some valid ratios: 1:(0/1), 2:1, 3:2, 5:3.

More than 5 Managers? You’re likely going wrong.
Volatility decreases from top to bottom, while reusability increases from top to bottom.
Slices are subsystems. No more than 3 Managers in a subsystem.
Design iteratively, build incrementally.

Now let's see where these don'ts come from by relaxing a closed architecture into a semi-closed one.

Open and Closed Architecture

In an open architecture, any component can call any other component in any layer. The flexibility is there, but it's a bad design. There is too much coupling, and encapsulation is heavily sacrificed.

There are only two problems in Software Engineering:

Dependency management
Information hiding

Both problems are intensified when we use an open architecture. Even calling horizontally (intra-layer) creates some amount of coupling, so it should be avoided.

On the other hand, a closed architecture restricts access to only the immediate lower layer, allowing limited adjacent-layer interactions without skipping layers. Closed architecture promotes decoupling by trading flexibility for encapsulation. In general, that is a better trade than the other way around.

Semi-Closed/Semi-Open Architecture

A semi-closed/semi-open architecture allows calling more than one layer down. It still does not allow calling up or sideways.

E.g. A Manager can call a ResourceAccess component if there is nothing to encapsulate in an Engine layer.

This method relaxes the closed architecture with some don'ts and do's, making sure encapsulation is not compromised at the cost of flexibility.
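
The three styles can be compared with a tiny rule checker. This is a sketch under my own assumption that each component type sits at one numbered depth, top to bottom. The `Style` enum and `call_allowed` function are illustrative names, not something The Method defines.

```python
from enum import Enum


class Style(Enum):
    OPEN = "open"
    CLOSED = "closed"
    SEMI_CLOSED = "semi-closed"


# Layer depth, top to bottom. The numbering is an assumption for this sketch.
LAYER = {"Client": 0, "Manager": 1, "Engine": 2, "ResourceAccess": 3, "Resource": 4}


def call_allowed(style, caller, callee):
    """Decide whether a call from one layer to another is permitted."""
    distance = LAYER[callee] - LAYER[caller]
    if style is Style.OPEN:
        return True           # anything may call anything
    if style is Style.CLOSED:
        return distance == 1  # only the layer immediately below
    return distance >= 1      # semi-closed: any lower layer, never up or sideways


print(call_allowed(Style.CLOSED, "Manager", "ResourceAccess"))
print(call_allowed(Style.SEMI_CLOSED, "Manager", "ResourceAccess"))
```

The Manager-to-ResourceAccess call from the example above is rejected in the closed style and accepted in the semi-closed one, while upward and sideways calls stay forbidden in both.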

Flexibilities Allowed in 'The Method'

Calling Utilities

In a closed architecture, placing Utilities like Logging, Security, or Diagnostics is problematic because every component may need them. If Utilities are assigned to a specific layer, access becomes limited. To solve this, The Method introduces a vertical utility bar that cuts across all layers, making Utilities accessible to every component.
E.g. Logging, Authentication, Security

Litmus Test for Utility

To avoid the misuse of utilities by developers, there is a simple litmus test:
Can the component plausibly be used in any other system, such as a smart cappuccino machine?

For example, a smart cappuccino machine could use a Security service to check if the user is authorized to drink coffee. Similarly, it may want to log how much coffee office workers drink, run diagnostics, and publish events (e.g., running low on coffee) using a Pub/Sub service.

Each of these needs justifies encapsulation in a Utility service. In contrast, you’d be hard-pressed to explain why a cappuccino machine needs a mortgage interest calculation service as a Utility.

Calling ResourceAccess by Business Logic

Managers and Engines can both call ResourceAccess services, because Managers and Engines sit in the same business-logic layer. In some cases, a Manager reaches a Resource through ResourceAccess without going through any Engine.

Managers Calling Engines

Managers can call Engines directly because Engines serve as strategies within Manager workflows. This separation is more about design detail than architectural structure. These calls aren't lateral, as Engines function in a different dimension from Managers.

Queued Manager-to-Manager

While Managers should not call other Managers directly (i.e., sideways), a Manager can queue a call to another Manager.

Technically, the queue itself is a Resource, and the publisher acts like a ResourceAccess component. The queue listener is effectively another Client in the system, calling downward to the receiving Manager. No true sideways call actually takes place.
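
A minimal in-process sketch of this idea, using Python's standard `queue` module in place of a real message queue. The Manager names and the `queue_listener` function are hypothetical.

```python
import queue


class BillingManager:
    """Receiving Manager: only ever called from above, by the listener."""

    def __init__(self):
        self.invoiced = []

    def invoice(self, order_id):
        self.invoiced.append(order_id)


class OrderManager:
    """Sending Manager: posts to a queue instead of calling BillingManager."""

    def __init__(self, outbox):
        self.outbox = outbox

    def complete_order(self, order_id):
        self.outbox.put(order_id)  # the queue plays the role of a Resource


def queue_listener(inbox, manager):
    """Acts like a Client: drains the queue and calls down into the Manager."""
    while True:
        try:
            order_id = inbox.get_nowait()
        except queue.Empty:
            break
        manager.invoice(order_id)


q = queue.Queue()
orders = OrderManager(q)
billing = BillingManager()
orders.complete_order("A-1")
orders.complete_order("A-2")
queue_listener(q, billing)
print(billing.invoiced)
```

`OrderManager` never holds a reference to `BillingManager`, so the two stay decoupled even though one triggers work in the other.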

Design Don'ts

Treat any violation of these rules as a red flag and investigate further to see what you might be missing.

A Client should not call multiple Managers for a single use case.
A Client must not call Engines.
Managers must not queue calls to more than one Manager in the same use case. The need to have two (or more) Managers respond to a queued call is a strong indication that more Managers (and maybe all of them) would need to respond — so you should use a Pub/Sub Utility service instead.
Engines and ResourceAccess services do not receive queued calls.
Clients, Engines, ResourceAccess, or Resource components do not publish events.
Engines, ResourceAccess, and Resources do not subscribe to events. This must be done in a Client or a Manager.
Engines never call each other.
ResourceAccess services never call each other.
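
Some of these don'ts are mechanical enough to check automatically. The sketch below covers four of them for a single use case, assuming that a component's type can be read from its name suffix. That assumption and all component names are mine.

```python
def kind(name):
    """Guess the component type from its name suffix (an assumption of this sketch)."""
    for suffix in ("Manager", "Engine", "Access", "Client"):
        if name.endswith(suffix):
            return suffix
    return "Other"


def violations(chain):
    """Check one use case's call chain, given as (caller, callee) pairs."""
    found = []
    managers_called_by_client = set()
    for caller, callee in chain:
        pair = (kind(caller), kind(callee))
        if pair == ("Client", "Engine"):
            found.append(f"{caller} -> {callee}: a Client must not call Engines")
        elif pair == ("Engine", "Engine"):
            found.append(f"{caller} -> {callee}: Engines never call each other")
        elif pair == ("Access", "Access"):
            found.append(f"{caller} -> {callee}: ResourceAccess services never call each other")
        elif pair == ("Client", "Manager"):
            managers_called_by_client.add(callee)
    if len(managers_called_by_client) > 1:
        found.append("a Client calls multiple Managers in a single use case")
    return found


good = [("WebClient", "OrderManager"), ("OrderManager", "PricingEngine"),
        ("OrderManager", "OrderAccess")]
bad = [("WebClient", "PricingEngine"), ("PricingEngine", "TaxingEngine"),
       ("WebClient", "OrderManager"), ("WebClient", "BillingManager")]
print(violations(good))
print(violations(bad))
```

A non-empty result is not a verdict, just the red flag mentioned above: a prompt to look again at the decomposition.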

Updated Rules

Let's update the rules based on what we have discussed today. On top of that, I am adding an additional observation:

There may not be internal symmetry inside a component like a Manager, but strive toward symmetry in the overall design.

Avoid functional decomposition (what we were doing in universities), and remember: a good system design speaks — through how components interact.
The client should not be the core business. Let the client be the client — not the system.
Decompose based on volatility — list the areas of volatility.
There is rarely a one-to-one mapping between a volatility area and a component.
List the requirements, then identify the volatilities using both axes: — What can change for existing customers over time? — Keeping time constant, what differs across customers? What are different use cases? (Remember: Almost always, these axes are independent.)
Verify whether a solution/component is masquerading as a requirement. Verify it is not variability. A volatility is not something that can be handled with if-else; that’s variability.
Use the layered approach and proper naming convention:
- Names should be descriptive and avoid atomic business verbs.
- Use <NounOfVolatility>Manager, <Gerund>Engine, <NounOfResource>Access.
- Atomic verbs should only be used for operation names, not service names.
The layers given in the template should correspond to four questions:
- Client: Who interacts with the system
- Managers: What is required of the system
- Engines: How the system performs the business logic
- ResourceAccess: How the system accesses the resources
- Resource: Where the system state is stored
Validate if your design follows the golden ratio (Manager:Engine). Some valid ratios: 1:(0/1), 2:1, 3:2, 5:3.

More than 5 Managers? You’re likely going wrong.
Volatility decreases from top to bottom, while reusability increases from top to bottom.
Slices are subsystems. No more than 3 Managers in a subsystem.
Design iteratively, build incrementally.
Design Don'ts
1. A Client should not call multiple Managers for a single use case.
2. A Client must not call Engines.
3. Managers must not queue calls to more than one Manager in the same use case. The need to have two (or more) Managers respond to a queued call is a strong indication that more Managers (and maybe all of them) would need to respond — so you should use a Pub/Sub Utility service instead.
4. Engines and ResourceAccess services do not receive queued calls.
5. Clients, Engines, ResourceAccess, or Resource components do not publish events.
6. Engines, ResourceAccess, and Resources do not subscribe to events. This must be done in a Client or a Manager.
7. Engines never call each other.
8. ResourceAccess services never call each other.

Conclusion

Starting next week, we’ll begin exploring use cases and requirements.
See you next Sunday!

Here are links to previous articles in case you missed them:

Template for System Design Using ‘The Method’: Part II

Ujjwal Raj — Sun, 29 Jun 2025 07:33:38 +0000

Welcome to another Sunday article where we discuss System Design.

This is the first article in the series for which I recommend reading the previous article first: Template for System Design Using ‘The Method’.

Here are the rules we've already summarized. I keep repeating: we will learn the rules, follow the rules, and once we are good at it, only then will we bend the rules.

Avoid functional decomposition (what we were doing in universities), and remember: a good system design speaks — through how components interact.
The client should not be the core business. Let the client be the client — not the system.
Decompose based on volatility — list the areas of volatility.
There is rarely a one-to-one mapping between a volatility area and a component.
List the requirements, then identify the volatilities using both axes: — What can change for existing customers over time? — Keeping time constant, what differs across customers? (Remember: these axes are independent.)
Verify whether a solution/component is masquerading as a requirement. Verify it is not variability. A volatility is not something that can be handled with if-else; that’s variability.
Use the layered approach and proper naming convention:
- Names should be descriptive and avoid atomic business verbs.
- Use <NounOfVolatility>Manager, <Gerund>Engine, <NounOfResource>Access.
- Atomic verbs should only be used for operation names, not service names.
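
The naming convention above can be partly checked with a few regular expressions. This is a crude heuristic of my own: it treats any prefix ending in "ing" as a gerund and cannot tell a noun from an atomic business verb, so read it as a reminder of the pattern and not as a real linter.

```python
import re

# One pattern per service suffix. The "ing" test for gerunds is a rough heuristic.
PATTERNS = {
    "Manager": re.compile(r"^[A-Z][A-Za-z]*Manager$"),
    "Engine": re.compile(r"^[A-Z][A-Za-z]*ingEngine$"),
    "Access": re.compile(r"^[A-Z][A-Za-z]*Access$"),
}


def follows_convention(name):
    """Return True if the service name matches the pattern for its suffix."""
    for suffix, pattern in PATTERNS.items():
        if name.endswith(suffix):
            return bool(pattern.match(name))
    return False


for name in ("MembershipManager", "PricingEngine", "PriceEngine", "PaymentsAccess"):
    print(name, follows_convention(name))
```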

Now we will discuss some more aspects of the template and update the rules.

Four Questions

The layers given in the template should correspond to four questions:

Client: Who interacts with the system
Managers: What is required of the system
Engines: How the system performs the business logic
ResourceAccess: How the system accesses the resources
Resource: Where the system state is stored

These questions can be asked both before initiating the design and while validating the design.

At the beginning, asking “what” helps list potential candidates for Managers. But remember, components need not be perfect.

Once done with the design, ask the questions again for validation:
Are all your Clients “who,” with no trace of “what”?
Are all the Managers “what,” without a smidgen of “who” or “where”?

Again, the mapping of questions to layers will not be perfect. Some overlap may occur. However, if you're confident that the encapsulation of volatility is justified, there's no need to doubt that choice. If you're not convinced, the questions may indicate a red flag — a signal to revisit your decomposition.

The Golden Ratio

In the book Righting Software by Juval Löwy, he shares observations from years of experience as a successful architect. I follow the same ratios as a source of truth for validations:

1 Manager → 0 or 1 Engine
2 Managers → 1 Engine
3 Managers → 2 Engines
5 Managers → 3 Engines
8 Managers → A lot! You've likely failed to decompose volatility. Redesign.
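
These ratios are easy to turn into a quick sanity check. The sketch below encodes only the rows listed above. Treating every count of 8 or more Managers as "redesign" and returning "unlisted" for counts the list does not mention are my own assumptions.

```python
# Manager count -> acceptable Engine counts, as listed above.
GOLDEN_RATIOS = {1: {0, 1}, 2: {1}, 3: {2}, 5: {3}}


def check_ratio(managers, engines):
    """Classify a Manager/Engine count against the listed ratios."""
    if managers >= 8:
        return "redesign"    # assumption: 8 or more is treated as too many
    allowed = GOLDEN_RATIOS.get(managers)
    if allowed is None:
        return "unlisted"    # the list above says nothing about this count
    return "ok" if engines in allowed else "suspicious"


print(check_ratio(3, 2))
print(check_ratio(5, 1))
print(check_ratio(8, 5))
```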

Volatility Decreases Top-Down, Reuse Increases

Clients are more volatile than Managers. Engines are less volatile than Managers, and so on. A design in which volatility decreases down the layers is extremely valuable. Components in the lower layers usually have more components depending on them. If the components you depend on the most are also the most volatile, your system will implode.

Reuse of services increases from top to bottom. Clients are hardly reusable. Managers are somewhat reusable — e.g., via both phone and web clients. Engines are even more reusable, as different use cases might share core logic.

Services and Subsystems

Group services by use case and divide them into slices. A slice is a subsystem. You'll get a deeper understanding in the next article, where I'll walk through an example.
Make sure you don't have more than 3 Managers in a subsystem.

Developing a system like this can be done one slice/subsystem at a time.
Design iteratively, build incrementally.
A component inside a subsystem cannot be released independently — it won't cover a complete use case. So release one subsystem at a time during development.
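
The three-Manager limit per subsystem can be checked the same way. The subsystem and component names below are hypothetical, and Managers are recognised purely by their name suffix.

```python
MAX_MANAGERS_PER_SUBSYSTEM = 3


def oversized_subsystems(subsystems):
    """Return the names of subsystems that hold more than three Managers."""
    return sorted(
        name
        for name, components in subsystems.items()
        if sum(c.endswith("Manager") for c in components) > MAX_MANAGERS_PER_SUBSYSTEM
    )


# Hypothetical slices: subsystem name -> components inside it.
SUBSYSTEMS = {
    "Ordering": ["OrderManager", "PricingEngine", "OrderAccess"],
    "Accounts": ["MembershipManager", "BillingManager", "RefundManager",
                 "DisputeManager", "LedgerAccess"],
}

print(oversized_subsystems(SUBSYSTEMS))
```

A subsystem that shows up in the result is a candidate for splitting into two slices.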

Extensibility

By extending the system, we do not mean reopening components and developing them again. If you have designed correctly for extensibility, you can mostly leave existing parts untouched and extend the system holistically — by simply adding more slices or subsystems.

What Are Microservices Here?

Do not assume each slice is a microservice. Creating a service per slice can result in an unnecessary number of microservices, leading to excessive HTTP/TCP communications, unreliability, and complexity.

Internal services like Engines and ResourceAccess should rely on fast, reliable, high-performance communication channels — such as TCP/IP, named pipes, IPC, domain sockets, Service Fabric remoting, custom in-memory interception chains, message queues, etc.

A microservice can be a group of slices if you have many of them.

Updated rules

Let’s update the rules of the framework:

Avoid functional decomposition (what we were doing in universities), and remember: a good system design speaks — through how components interact.
The client should not be the core business. Let the client be the client — not the system.
Decompose based on volatility — list the areas of volatility.
There is rarely a one-to-one mapping between a volatility area and a component.
List the requirements, then identify the volatilities using both axes: — What can change for existing customers over time? — Keeping time constant, what differs across customers? What are different use cases? (Remember: Almost always, these axes are independent.)
Verify whether a solution/component is masquerading as a requirement. Verify it is not variability. A volatility is not something that can be handled with if-else; that’s variability.
Use the layered approach and proper naming convention:
- Names should be descriptive and avoid atomic business verbs.
- Use <NounOfVolatility>Manager, <Gerund>Engine, <NounOfResource>Access.
- Atomic verbs should only be used for operation names, not service names.
The layers given in the template should correspond to four questions (who, what, how, and where):
- Client: Who interacts with the system
- Managers: What is required of the system
- Engines: How the system performs the business logic
- ResourceAccess: How the system accesses the resources
- Resource: Where the system state is
Validate if your design follows the golden ratio (Manager:Engine). Some valid ratios: 1:(0/1), 2:1, 3:2, 5:3.

More than 5 Managers? You’re likely going wrong.
Volatility decreases from top to bottom, while reusability increases from top to bottom.
Slices are subsystems. No more than 3 Managers in a subsystem.
Design iteratively, build incrementally.

Conclusion

In the next article, I will make the framework more solid with some do's and don'ts. Once we have covered those, we will start delving into design examples.

See you next Sunday!
Stay Tuned!

Here are links to previous articles in case you missed them:

Template for System Design Using ‘The Method’

Ujjwal Raj — Sat, 21 Jun 2025 09:41:48 +0000

Welcome to another weekly article on System Design. In this article, I will give you a template for how to design systems using 'The Method'.

As I’ve already stated: we are going to learn the rules, follow the rules, and once you are an expert — bend the rules.

Let’s keep track of what we’ve learned so far, so everything stays at our fingertips even if you haven’t read the previous article:

Avoid functional decomposition (what we were doing in universities), and remember: a good system design speaks — through how components interact.
The client should not be the core business. Let the client be the client — not the system.
Decompose based on volatility — list the areas of volatility.
There is rarely a one-to-one mapping between a volatility area and a component.
List the requirements, then identify the volatilities using both axes: — What can change for existing customers over time? — Keeping time constant, what differs across customers? (Remember: these axes are independent.)
Verify whether a solution/component is masquerading as a requirement. Verify it is not variability. A volatility is not something that can be handled with if-else; that’s variability.

Let’s now start diving into the template.

The Use Case

A use case is an expression of required behavior — how the system is going to accomplish something to add business value. It describes the end-user interaction with the system or the system’s interaction with the user. It also illustrates system-to-system interactions and processing. It is always recommended to capture use cases graphically. Nothing communicates better and more simply than pictures.

An activity diagram showing all possibilities can also be drawn.

The two figures below show how a use case diagram (left) and an activity diagram (right) look:

It should be noted that it is not feasible to draw activity diagrams for all possible cases. So, only important ones should be covered, leaving the simple ones aside — they will be understood naturally through the layered approach we’re about to cover. Get ready for the ultimate framework.

Layered Approach

A layered approach is a better representation of systems and services. The encapsulations are layered:

a) Each layer hides its own volatility, and the volatility of the layers below it, from the layers above.
b) Within a layer, the services encapsulate volatility from each other. (Reminder: by “services,” I do not mean microservices.)

The following image is the same one I used to define a good design decision tree in the very first article. Don’t worry if you haven’t read it — but if you have, you’ll relate to it:

The image below is taken from the book Righting Software by Juval Löwy. I use the same color template while designing systems, and we’ll continue using it throughout the series:

It should be obvious that the number of layers in a practical system is limited. These layers terminate at a resource layer — such as a database, file storage, or third-party API.

The Services

As I’ve already emphasized in previous articles: system design is not just tech — it is engineering.

With service flow diagrams, the entire data flow can be understood very well. This is one of the key advantages of volatility-based decomposition. Designing in this layered manner makes service calls visualizable, giving us insights even before deployment — and certainly after some traffic is observed.

Things like security, scalability, traffic, throughput, responsiveness, consistency concerns, and synchronization can all be examined more clearly.

The General Template Framework for System Design

Here is the general framework of 'The Method'. We will discuss each section.

The Client

These are the entry points to the system. Make sure your client is only a client — not a system. The client should not encapsulate business logic. All clients should use the same entry point to the system. A phone app and a web app should not be calling different backends.

Volatility by client:

Axis 1 (What can change for existing customers?): Different users may demand different accessibility options, dark mode, etc.
Axis 2 (What differs across customers?): It can be a desktop app, Android app, or just an API interface. Technology may vary — React, .NET, etc.

The Business Logic Layer

This layer encapsulates the system’s use cases — i.e., what the system is supposed to do from a business perspective. The components here are Managers and Engines.

Manager: Encapsulates volatility in the sequence.
Engine: Encapsulates volatility in the activity.

A Manager can utilize zero or more Engines to complete a business task.

For example, a car starting system may look like:

MovementManager.Start();

And within the Manager:

PreparingEngine.ConfirmAdjustSeat();
PreparingEngine.AdjustSideGlass();
PreparingEngine.CloseWindow();

DrivingEngine.AdjustBrake();
DrivingEngine.AdjustGear();
DrivingEngine.AdjustAccelerator();
DrivingEngine.AdjustClutch();
DrivingEngine.StartICEngine();

Here, the Manager (MovementManager) is utilizing multiple Engines — PreparingEngine, DrivingEngine.

Ensure that two managers do not use two different engines to do the same job. That’s a symptom of functional decomposition.

Engines can be reused between Managers, as the same activity might appear in different use cases. Design Engines with reuse in mind.
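
The car example can be sketched as runnable Python. This is only an illustration under assumptions: the method bodies are placeholders, the method names are snake_case versions of a few of the calls shown earlier, and `ParkingManager` is a hypothetical second Manager added to show an Engine being reused.

```python
class PreparingEngine:
    """Encapsulates how the cabin is prepared (the activity)."""

    def confirm_adjust_seat(self) -> str:
        return "seat adjusted"

    def close_window(self) -> str:
        return "window closed"


class DrivingEngine:
    """Encapsulates how the drivetrain is operated (the activity)."""

    def adjust_brake(self) -> str:
        return "brake set"

    def adjust_gear(self) -> str:
        return "gear set"

    def start_ic_engine(self) -> str:
        return "engine started"


class MovementManager:
    """Encapsulates the order of steps in the 'start the car' use case."""

    def __init__(self, preparing: PreparingEngine, driving: DrivingEngine) -> None:
        self._preparing = preparing
        self._driving = driving

    def start(self) -> list[str]:
        # Only the sequence lives here; the details of each step live in an Engine.
        return [
            self._preparing.confirm_adjust_seat(),
            self._preparing.close_window(),
            self._driving.adjust_gear(),
            self._driving.start_ic_engine(),
        ]


class ParkingManager:
    """Hypothetical second Manager that reuses the same DrivingEngine."""

    def __init__(self, driving: DrivingEngine) -> None:
        self._driving = driving

    def park(self) -> list[str]:
        return [self._driving.adjust_brake(), self._driving.adjust_gear()]


driving = DrivingEngine()
print(MovementManager(PreparingEngine(), driving).start())
print(ParkingManager(driving).park())
```

If the start-up order changes, only `MovementManager.start` changes. If the way a gear is set changes, only `DrivingEngine` changes, and both Managers pick it up.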

The Resource Access Layer

Components here are called ResourceAccess.

Volatility by ResourceAccess:

This layer defines how to access a resource — but it should not expose contracts like Read(), Write(), Open(), Close().

Instead, it should use business verbs, such as:

CurrentFuelVolume()
GetMusicPlaylists()

Using database-style contracts creates tight coupling — any change in the resource's tech affects all upper layers. Avoid this.

ResourceAccess components may be reused across Engines or Managers needing access to the same resource.
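
Here is a minimal sketch of a ResourceAccess contract phrased in business terms. The two implementations, the percentage-based sensor conversion, and the `can_start` caller are assumptions invented for the example.

```python
from typing import Protocol


class FuelResourceAccess(Protocol):
    """Contract in business terms, with no Read()/Write()/Open()/Close()."""

    def current_fuel_volume(self) -> float: ...


class InMemoryFuelResourceAccess:
    """Hypothetical implementation backed by a plain dict."""

    def __init__(self, readings: dict[str, float]) -> None:
        self._readings = readings

    def current_fuel_volume(self) -> float:
        return self._readings["fuel_litres"]


class SensorFuelResourceAccess:
    """Hypothetical implementation that converts a raw tank level in percent."""

    def __init__(self, raw_level_percent: int, tank_capacity_litres: float) -> None:
        self._raw_level_percent = raw_level_percent
        self._tank_capacity_litres = tank_capacity_litres

    def current_fuel_volume(self) -> float:
        return self._tank_capacity_litres * self._raw_level_percent / 100


def can_start(fuel: FuelResourceAccess, minimum_litres: float = 1.0) -> bool:
    # The caller knows nothing about dicts, sensors, or databases.
    return fuel.current_fuel_volume() >= minimum_litres


print(can_start(InMemoryFuelResourceAccess({"fuel_litres": 12.5})))  # True
print(can_start(SensorFuelResourceAccess(0, 40.0)))  # False
```

Swapping the resource technology means adding another class with the same operation. The callers stay untouched.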

The Resource Layer

This contains the actual physical resources — databases, file systems, caches, third-party APIs, etc. It encapsulates the tech being used. This can also include external systems like payment APIs.

Utility

The Utilities vertical bar (on the right of the diagram) includes common infrastructure services that most systems need:

Authentication
Logging
Messaging Queue
Pub/Sub, etc.

Rechecking the Layer Classification

Once you’ve classified your layers and components, ask these questions. If the answers are “yes,” you’ve done volatility-based decomposition — not functional decomposition.

Are the names descriptive? E.g., SomeManager, DoingSomethingEngine, SomethingResourceAccess
For Managers, the prefix should be a noun tied to the encapsulated volatility (e.g., MovementManager)
For Engines, the prefix should be a noun describing the activity (e.g., DrivingEngine)
For ResourceAccess, the prefix should be a noun related to the resource (e.g., FuelResourceAccess)
Gerunds (nouns ending in -ing) should only be used with Engines. Their use elsewhere usually signals functional decomposition.
Atomic business verbs should not be used in service names — only as operation names in ResourceAccess contracts.

Learn these rules. They ensure we don’t fall into the trap of functional decomposition.
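
The naming questions above lend themselves to a rough automated check. The sketch below is an assumption-heavy heuristic: `split_name` and `naming_issues` are hypothetical helpers, and treating any prefix ending in "ing" as a gerund will misjudge words such as "String".

```python
# Longer suffixes first, so "FuelResourceAccess" yields the prefix "Fuel".
SUFFIXES = ("Manager", "Engine", "ResourceAccess", "Access")


def split_name(name: str) -> tuple[str, str]:
    """Split a service name into (prefix, suffix); suffix is '' if none matches."""
    for suffix in SUFFIXES:
        if name.endswith(suffix) and len(name) > len(suffix):
            return name[: -len(suffix)], suffix
    return name, ""


def naming_issues(name: str) -> list[str]:
    """Return naming problems for one service name (heuristic only)."""
    prefix, suffix = split_name(name)
    if not suffix:
        return ["name should end in Manager, Engine, or Access"]
    is_gerund = prefix.endswith("ing")  # crude gerund test
    if suffix == "Engine" and not is_gerund:
        return ["Engine prefix should be a gerund, e.g. DrivingEngine"]
    if suffix != "Engine" and is_gerund:
        return ["gerund outside an Engine may signal functional decomposition"]
    return []


for service in ["MovementManager", "DrivingEngine", "BillingManager", "PaymentEngine"]:
    print(service, naming_issues(service))
```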

I’m Updating the Rules to Carry Forward

Let’s keep the rules at our fingertips for the next article:

Updated Rules

Avoid functional decomposition (what we were doing in universities), and remember: a good system design speaks — through how components interact.
The client should not be the core business. Let the client be the client — not the system.
Decompose based on volatility — list the areas of volatility.
There is rarely a one-to-one mapping between a volatility area and a component.
List the requirements, then identify the volatilities using both axes: — What can change for existing customers over time? — Keeping time constant, what differs across customers? (Remember: these axes are independent.)
Verify whether a solution/component is masquerading as a requirement. Verify it is not variability. A volatility is not something that can be handled with if-else; that’s variability.
Use the layered approach and proper naming convention:
- Names should be descriptive and avoid atomic business verbs.
- Use <NounOfVolatility>Manager, <Gerund>Engine, <NounOfResource>Access.
- Atomic verbs should only be used for operation names, not service names.

Conclusion

See you next Sunday in another article, where I will dive in with examples. Stay Tuned!

Here are links to previous articles in case you missed them:

Principles of Volatility-Based Decomposition in System Design

Ujjwal Raj — Sat, 14 Jun 2025 19:18:05 +0000

Welcome to another Sunday Blog on System Design. This Sunday, we will look into the principles of doing a volatility-based decomposition. Once we are done understanding and memorizing the rules, I will start walking through the framework for doing System Design as explained by Juval Löwy in his book Righting Software.

As I’ve already stated: we are going to learn the rules, follow the rules, and once you are an expert — bend the rules.

Let’s keep track of what we’ve learned so far, to keep things at our fingertips even if you haven’t read the previous article:

A good system design speaks (how components interact)
Avoid functional decomposition (what we were doing in universities)
The client should not be the core business. Let the client be the client — not the system
Reduce coupling between services as much as possible
Decompose based on volatility — list the areas of volatility
There is rarely a one-to-one mapping between a volatility area and a component
Symmetry is a good sign of good design

Once we start understanding and implementing the framework in the next article, we will verify the design using the memorized rules.

Listing the Volatility — Tools to Identify Them

Once a project starts, the requirements themselves are communicated in terms of functionalities — "the system should do so and so." It's the role of the architect to analyze the requirements and list the volatilities.
Another way to figure out the volatility is through customer interviews.

Volatile vs Variable

Volatility refers to changes that affect several components in the system, while variability refers to changes that can be handled in code using if-else. So it’s necessary to understand the difference between volatility and variability.
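
One possible way to picture the difference in code is sketched below. Everything in it is assumed for illustration: the shipping thresholds, the `PaymentAccess` contract, and both payment classes are hypothetical.

```python
from typing import Protocol


def shipping_fee(order_total: float, is_member: bool) -> float:
    """Variability: a local difference that a conditional handles fine."""
    if is_member or order_total >= 50:
        return 0.0
    return 4.99


class PaymentAccess(Protocol):
    """Volatility: how money is collected may change, so it sits behind one contract."""

    def charge(self, amount: float) -> str: ...


class CardPaymentAccess:
    def charge(self, amount: float) -> str:
        return f"card charged {amount:.2f}"


class WalletPaymentAccess:
    def charge(self, amount: float) -> str:
        return f"wallet debited {amount:.2f}"


def checkout(order_total: float, is_member: bool, payment: PaymentAccess) -> str:
    # The caller never branches on the payment technology.
    return payment.charge(order_total + shipping_fee(order_total, is_member))


print(checkout(20.0, False, CardPaymentAccess()))
print(checkout(60.0, False, WalletPaymentAccess()))
```

The membership rule stays a plain if-else because a change to it touches one function. A change of payment provider would otherwise ripple through every caller, so it gets its own component.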

Axes of Volatility: Technique to Identify Volatility

One axis of change is over time for existing customers — how their requirements evolve. The business or expectations of the software’s consumer may shift over time.

The second axis of change occurs at the same time across different customers. Imagine freezing time and analyzing the second axis: Are all customers using the application in exactly the same way? What are the different use cases, and how can they be accommodated?

If something does not map to either axis of volatility, you should not encapsulate it at all — there should be no building block in your system to represent it. Creating such a block likely indicates functional decomposition.

These axes can serve as references during customer interviews to ask better questions, list down requirements, and later analyze volatility.
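
As a rough sketch, interview notes can be tagged against the two axes, and anything that maps to neither can be filtered out. The `Candidate` type, the `areas_to_encapsulate` helper, and the sample notes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A possible area of change noted during requirements gathering."""
    name: str
    changes_over_time: bool         # axis 1: same customer, later in time
    differs_across_customers: bool  # axis 2: same moment, different customers


def areas_to_encapsulate(candidates: list[Candidate]) -> list[str]:
    """Keep only candidates that map to at least one axis of volatility."""
    return [c.name for c in candidates if c.changes_over_time or c.differs_across_customers]


# Hypothetical notes from interviews about a car system.
notes = [
    Candidate("infotainment features", True, False),
    Candidate("fuel type", False, True),
    Candidate("number of wheels", False, False),
]
print(areas_to_encapsulate(notes))  # ['infotainment features', 'fuel type']
```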

As you can see in the figure below, your design should follow a lifecycle while refactoring. Design iteratively like this:

In Figure A, you come up with one component — but that’s not good. Then, by analyzing along the axes of volatility, you realize that some things can be encapsulated into separate components. The result is Figure B. You keep factoring the design until all volatilities are encapsulated.

Another way to find volatility is by examining how competitors have designed their systems. That can reveal alternative approaches that can be explored across both axes.

Almost always, the axes should be independent

When we say the axes of volatility are independent, we mean that the changes we identify using Axis 1 cannot be the same changes we identify using Axis 2. If they are, it may be a case of functional decomposition.

Solutions Masquerading as Requirements

It is very common for an architect or developer to map a requirement from the spec directly into a component. We should always check whether that requirement is actually a solution masquerading as a requirement.

For example, assume “Cooking” is listed as a requirement in the design spec for a house. Isn’t it actually a solution masquerading as the real requirement, “Feeding”? There are several alternatives to cooking: ordering food, or going out for dinner.

It’s exceedingly common for customers to suggest solutions as if they were requirements. Imagine you create a “Cooking” component, and later the customer demands a new “Pizza” component. Instead, having a “Feeding” component that encapsulates cooking, ordering pizza, or going out is more flexible and future-proof.
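
The “Feeding” idea can be sketched as a component that hides how the household gets fed. The class shape, the `register` and `feed` methods, and the option functions are assumptions made for this example.

```python
from typing import Callable


def cook() -> str:
    return "home-cooked meal"


def order_pizza() -> str:
    return "delivered pizza"


def dine_out() -> str:
    return "restaurant dinner"


class Feeding:
    """Encapsulates how the household gets fed; callers only ask to be fed."""

    def __init__(self) -> None:
        self._options: dict[str, Callable[[], str]] = {}

    def register(self, name: str, option: Callable[[], str]) -> None:
        # A new way of feeding is added here without touching callers.
        self._options[name] = option

    def feed(self, preference: str) -> str:
        return self._options[preference]()


feeding = Feeding()
feeding.register("cook", cook)
feeding.register("pizza", order_pizza)
feeding.register("out", dine_out)
print(feeding.feed("pizza"))  # delivered pizza
```

When the customer later asks for pizza, that becomes one more registered option instead of a new top-level component.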

Volatility Listing

So before decomposing the system into components, the first step should be listing the volatilities after gathering requirements.
To do so, one should think along both the independent axes. Then, see what “solutions” are masquerading as requirements.

Conclusion

We have now seen the principles of volatility-based decomposition. Let’s update the rules we’ll follow when designing real systems in future blogs:

A good system design speaks (how components interact)
Avoid functional decomposition (what we were doing in universities)
The client should not be the core business. Let the client be the client — not the system
Reduce coupling between services as much as possible
Decompose based on volatility — list the areas of volatility
There is rarely a one-to-one mapping between a volatility area and a component
Symmetry is a good sign of good design
The required behavior should be accomplished by the interaction between various encapsulated areas of volatility
List the requirements, then identify the volatilities using both axes:
— What can change for existing customers over time?
— Keeping time constant, what differs across customers? What are the different use cases?
(Remember: these axes are independent.)
Verify whether a solution/component is masquerading as a requirement
Verify it is not variability. A volatility is not something that can be handled with if-else; that’s variability.

In next Sunday’s article, we will start discussing a framework to actually do system design. Stay tuned!

Here are the links to previous articles in case you missed them. It’s highly recommended to read the 4th one in the list, which is a practical example of listing the volatility.