Do You Have a Good Software System? Of course, yes! Everyone says that, until a failure costs them millions.
Before building a good software system, we have to know what a good software system really is. What are the measures of one?
By the end of this article, you will know what a good software system is and why it matters. Or I can tell you right now, in short: money. Money is the why.
With a good system, your application stays up most of the time, in most situations, and for longer, which means more money flows through your accounts.
Why Things Get Complicated
Before diving into these measures, we have to look at data systems and understand why things get complicated even when we have excellent modern tools to work with. We have tools for everything: servers, databases, frontend technologies, and more.
When a user interacts with data stored in a database, it is not a direct operation on data the way it is on a local file system. Even in a simple system, multiple tools work together: caches, indexes, message queues. All of them cooperate to give the end user a fast and efficient response.
For example, consider a data system in which data travels through several tools, and every one of those tools has to be taken care of. If a user's name is changed in the database, the change must also be reflected in the cache.

Notice that caching, storage, and indexing are each handled by a separate component. Even when a modern database offers built-in caching, it is usually not as effective as a dedicated caching tool. So any real software system involves many tools in its data path, which makes it complicated, and every part has to be handled properly.
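The cache-consistency point above can be sketched with a write-through cache: every update goes to the database and the cache together, so a renamed user is never served stale. This is a minimal sketch, assuming two in-memory dicts as stand-ins (in production these might be PostgreSQL and Redis); the names are illustrative, not a real caching API.

```python
# Write-through cache sketch: one dict stands in for the database,
# another for the cache. Illustrative only.
db = {}
cache = {}

def update_user_name(user_id, new_name):
    # Write-through: update the source of truth AND the cache together,
    # so reads never see a stale name.
    record = db.get(user_id, {})
    record["name"] = new_name
    db[user_id] = record
    cache[user_id] = record

def get_user(user_id):
    # Fast path: serve from the cache; fall back to the DB on a miss.
    if user_id in cache:
        return cache[user_id]
    record = db.get(user_id)
    if record is not None:
        cache[user_id] = record  # warm the cache for next time
    return record

update_user_name("u1", "Alice")
print(get_user("u1"))   # {'name': 'Alice'}
update_user_name("u1", "Alicia")
print(get_user("u1"))   # {'name': 'Alicia'}  (cache was not left stale)
```

The alternative, cache-aside invalidation (delete the cache entry on write, repopulate on the next read), trades an extra DB read for simpler write paths.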
Measures of a Good Software System
1. Reliability
Most people are aware of only part of it, and that is the part they work on most of the time. Reliability means the system works according to the user's requirements, but not only that: it should keep working even when some components fault, it should tolerate users sending bad data to the database, and it should never let a user access unauthorized data.
The system should be fault-tolerant, meaning it should work in case of different kinds of faults:
- Hardware faults: machine crashes, power outages, hard disk failures, and more. Physical devices are prone to faults and have a limited lifespan, after which they stop working. To handle this, we use RAID configurations, dual power supplies, and diesel generators for emergencies.
- Software faults: software can behave unpredictably under conditions like heavy CPU or data load, defective input data, legacy code, or platform dependencies. If one component is affected, the others should keep working, which means less coupling in the system.
- Human faults: mistakes are part of being human, and we cannot separate them from ourselves. Human error is present in every software system, but it should not bring the system down. Engineers should work in a separate sandbox with non-production data to reduce the risk.
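One common building block for tolerating the transient faults above is to retry the failing operation with exponential backoff. A minimal sketch, where the `flaky` operation and the delay values are invented for illustration:

```python
import random
import time

def call_with_retries(operation, max_attempts=4, base_delay=0.1):
    """Retry a flaky zero-argument callable with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Back off exponentially, with jitter so that many retrying
            # clients do not all hammer the service at the same instant.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))

# Usage: a fault injector that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient fault")
    return "ok"

print(call_with_retries(flaky))  # "ok" after two retries
```

Retries only help with *transient* faults; a deterministic bug will fail on every attempt, which is why the attempt count is bounded.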
2. Scalability
If the system is reliable at the current load, that doesn't mean it will work at 10x the load. We must make the system able to handle more data as the application grows. Growth is hard to predict, so ignoring scalability is risky (though for an MVP, we can skip it).
We first need load parameters that describe how load will grow. It may seem as simple as counting the number of user requests, but in most cases it is not.
Example: Twitter (2012 data)
- Post Tweet: 4.6k requests/sec on average, over 12k/sec at peak.
- Home Timeline: 300k requests/sec.
Two approaches:
- A global collection of all tweets: every write goes there, and a user's home feed is assembled at read time by looking up everyone they follow and merging those users' tweets.
- A per-user home-timeline cache: when someone posts a tweet, it is pushed into the cache of every one of their followers, which makes reads cheap but multiplies writes.
Twitter switched from the first approach to the second for scalability. The number of followers per user then became a critical load parameter: if someone has 30 million followers, a single tweet fans out into 30 million writes.
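The second approach (fan-out on write) can be sketched in a few lines. This is an in-memory toy, not Twitter's actual implementation; the data structures and names are assumptions for illustration.

```python
from collections import defaultdict

followers = defaultdict(set)   # author -> set of follower ids
timelines = defaultdict(list)  # user -> home-timeline cache, newest first

def follow(follower, author):
    followers[author].add(follower)

def post_tweet(author, text):
    # Write amplification: one post becomes N cache insertions, where N is
    # the author's follower count -- exactly the critical load parameter.
    for user in followers[author]:
        timelines[user].insert(0, (author, text))

def home_timeline(user):
    # Reads are now cheap: just return the precomputed cache.
    return timelines[user]

follow("alice", "bob")
follow("carol", "bob")
post_tweet("bob", "hello")
print(home_timeline("alice"))  # [('bob', 'hello')]
```

For a celebrity with 30 million followers, that `for` loop is the problem, which is why Twitter later moved to a hybrid: fan-out on write for most authors, fan-out on read for the few with huge followings.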
When testing scalability, response time is the key metric, but the *average* response time is a poor one: it hides slow requests. Amazon, for example, might see an average of 200ms while its slowest 5% of requests take over 1 second, possibly for its most valuable users.
Percentiles like p50, p95, p99, and p999 are better metrics: they tell you exactly what fraction of requests is slower than a given threshold. But if improving the extreme percentiles costs too much, it may not be worth it.
Also, slow requests can hold up faster ones queued behind them because of limited parallelism in the CPU or database. And when rendering one page requires a chain of backend calls, the probability of hitting at least one slow call grows with the length of the chain, making the whole chain slow.
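Both effects are easy to see with a little arithmetic. A sketch, with all latency numbers invented for illustration, showing how the average hides the tail and how chained calls amplify it:

```python
def percentile(samples, p):
    """Nearest-rank percentile: the value below which ~p% of samples fall."""
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[rank]

# 95 fast requests and 5 very slow ones (milliseconds, made up).
times_ms = [100] * 95 + [5000] * 5

mean = sum(times_ms) / len(times_ms)
print(mean)                       # 345.0 -> the average looks mediocre
print(percentile(times_ms, 50))   # 100   -> the median user is fine
print(percentile(times_ms, 99))   # 5000  -> the tail is 50x worse

# Tail latency amplification: if each of n backend calls is slow with
# probability p, the whole page is slow with probability 1 - (1 - p)**n.
p, n = 0.01, 100
print(1 - (1 - p) ** n)  # ~0.63: most page loads hit at least one slow call
```

This is why a 1%-slow backend can still ruin most page loads once requests fan out widely enough.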
3. Maintainability
At the start of building a software system, we focus on meeting requirements and implementing features. Over time, as the application grows, every new change means digging through older schemas and code, which must be kept clean and organized to avoid losing productivity.
Key maintainability practices:
- Avoid redundant complexity that doesn’t solve user problems.
- Use abstraction to reduce coupling.
- Keep monitoring systems active.
- Practice Test-Driven Development (TDD).
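The abstraction point in the list above can be made concrete: let business logic depend on a small interface rather than on any particular database. A minimal sketch, assuming hypothetical class names (`UserStore`, `InMemoryUserStore`) invented for illustration:

```python
from abc import ABC, abstractmethod
from typing import Optional

class UserStore(ABC):
    """All the rest of the application is allowed to know about storage."""
    @abstractmethod
    def get(self, user_id: str) -> Optional[dict]: ...
    @abstractmethod
    def put(self, user_id: str, record: dict) -> None: ...

class InMemoryUserStore(UserStore):
    """A test double; a real system might back this with PostgreSQL."""
    def __init__(self):
        self._rows = {}
    def get(self, user_id):
        return self._rows.get(user_id)
    def put(self, user_id, record):
        self._rows[user_id] = record

def rename_user(store: UserStore, user_id: str, new_name: str):
    # Business logic sees only the interface, so swapping the database
    # later does not require touching this function.
    record = store.get(user_id) or {}
    record["name"] = new_name
    store.put(user_id, record)

store = InMemoryUserStore()
rename_user(store, "u1", "Ada")
print(store.get("u1"))  # {'name': 'Ada'}
```

The in-memory implementation also doubles as the "separate sandbox with separate data" from the reliability section: engineers and tests can exercise the logic without touching production.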
Conclusion
Now you know how to judge whether a system is a good software system. Go let your business boom in the market, and don't let technology leave you behind. These measures gauge the goodness of a software system regardless of the language, tool, or framework you use.
Now whenever someone asks — "Do you have a good software system?" You can say: Of course, YES!
References
- Kleppmann, M. (2017). Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. O’Reilly Media.