Getting started with building bigger, faster and scalable systems (Part 1)

#saas #sre #webdev #distributedsystems

It feels awesome to check out this community of devs. I felt like re-posting my post from Medium here to get started with this great platform. Looking forward to more learning and more fun :D

What follows is a series on system design concepts that I feel every beginner-to-intermediate designer or developer of large-scale applications needs. Learning along the way :)

Part 1 -> Introduction to Data Systems and Reliability (this post)
Part 2 -> Scalability and Maintainability (Part 2)

Alan Kay once said,

The Internet was done so well that most people think of it as a natural resource like the Pacific Ocean, rather than something that was man-made. When was the last time a technology with a scale like that was so error-free? The Web, in comparison, is a joke. The Web was done by amateurs.

Lately, "data systems" has become an umbrella term under which databases, queues, caches and other components typically fall.

Why even?
Nowadays, a huge variety of data storage and processing tools has come out, each gradually optimized in its own way for its own use cases.

For Example:

Redis:
Traditional usage -> Datastore (in-memory caching)
Present usage -> Datastore and Message Queue
Apache Kafka:
Traditional usage -> Message Queue
Present usage -> Message Queue with database-like durability guarantees
Thus the boundaries between such categories are becoming blurred.

So application developer == data system engineer?

Every application developer now needs to be a ‘subtle’ data system engineer, because they are responsible for combining various components into a composite data system that works seamlessly together. For example, they write the transactional datastore code as well as the cache invalidation code, and the two have to work together to produce consistent results.
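To make the datastore-plus-cache example concrete, here is a minimal sketch. The `database` and `cache` dictionaries are hypothetical in-memory stand-ins for a real datastore and a real cache, and the function names are made up for illustration. The point is only the ordering: write to the datastore first, then invalidate the cached copy so the next read sees the new value.

```python
# Hypothetical in-memory stand-ins for a real datastore and a real cache.
database = {}
cache = {}


def read_profile(user_id):
    """Serve from the cache if possible, otherwise fall back to the datastore."""
    if user_id in cache:
        return cache[user_id]
    value = database.get(user_id)
    if value is not None:
        cache[user_id] = value  # populate the cache for later reads
    return value


def update_profile(user_id, profile):
    """Write to the datastore first, then invalidate the cached copy."""
    database[user_id] = profile
    cache.pop(user_id, None)  # the next read repopulates from the datastore


update_profile("alice", {"bio": "hello"})
first = read_profile("alice")    # cached after this read
update_profile("alice", {"bio": "updated"})
second = read_profile("alice")   # stale entry was invalidated, so this is fresh
```

If the invalidation step were skipped, the second read would still return the old bio from the cache, which is exactly the kind of inconsistency the application developer is now responsible for preventing.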

Basic important concerns of software design:

Reliability
Scalability (Part 2)
Maintainability (Part 2)

Okay, so what’s Reliability?
Whatever fault may occur, be it hardware, software or even human error, the system should continue working correctly. Period!

Let’s take the example of a social-networking app like Instagram.

Adding a photo should put the photo exactly where the user expects it to appear. In other words, the app does what the user expects it to do.
Even if the user uses the software in the wrong way, the software should tolerate it. Example: if your software supports only one 👍 (like) per person per photo and a user tries to like a photo n times, the software should tolerate that and still count a single like.
Performance shouldn't degrade under the expected load and data volume. Example: if you have designed the system for an expected 1M users with a 1000:1 read-to-write ratio, it should handle that load without degrading.
Unauthorized access should be blocked. Example: a user’s account should be accessible only to that user, and any other attempt to get in should be blocked.

Things will go wrong. Things may fail. They are called faults.
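The one-like-per-person example above can be sketched roughly like this. `PhotoLikes` is a hypothetical class, not anything from a real app; it assumes likes are tracked as a set of user ids per photo, which makes a repeated like a harmless no-op.

```python
class PhotoLikes:
    """Tracks likes so that each user counts at most once per photo."""

    def __init__(self):
        self._likes = {}  # photo_id -> set of user_ids who liked it

    def like(self, photo_id, user_id):
        """Record a like. Returns True only if it was newly counted."""
        users = self._likes.setdefault(photo_id, set())
        already_liked = user_id in users
        users.add(user_id)  # adding to a set twice has no extra effect
        return not already_liked

    def count(self, photo_id):
        return len(self._likes.get(photo_id, set()))


likes = PhotoLikes()
results = [likes.like("photo-1", "alice") for _ in range(5)]  # alice taps 5 times
likes.like("photo-1", "bob")
total = likes.count("photo-1")
```

Alice's four extra taps are accepted without an error and without inflating the count, so the misuse is tolerated rather than rejected.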

The software systems we build should be fault-tolerant.

Important -> FAULT != FAILURE

A failure is when the system as a whole stops working and stops providing the required service. A fault, on the other hand, is one part of the system deviating from its expected behavior. Faults are to be expected, and it is the unhandled ones that lead to failure.
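Here is a small sketch of that difference. Everything in it is hypothetical: `TransientFault` stands for any fault a component might raise, and `make_flaky` simulates a component that misbehaves a fixed number of times. The caller retries and, if that is not enough, serves a fallback, so the fault never turns into a failure for the user.

```python
class TransientFault(Exception):
    """Stands in for any fault a component may raise (timeout, bad disk read...)."""


def call_with_retries(operation, attempts=3, fallback=None):
    """Retry a faulty operation; serve a fallback if every attempt faults."""
    for _ in range(attempts):
        try:
            return operation()
        except TransientFault:
            continue  # the fault is handled here, so it does not become a failure
    return fallback


def make_flaky(fail_times):
    """Build an operation that faults `fail_times` times and then succeeds."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= fail_times:
            raise TransientFault("simulated fault")
        return "ok"

    return operation


recovered = call_with_retries(make_flaky(2))  # third attempt succeeds
degraded = call_with_retries(make_flaky(5), fallback="cached response")
```

In a real system you would usually add a delay between attempts and only retry operations that are safe to repeat; both are left out here to keep the sketch short.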

Some Causes:

Hardware Fault:

One of the best-known causes of system failure is hardware failure. The cause can be anything: a disk crash, faulty RAM, a sudden power cut without backup power, and so on.
One way to handle this is to have redundant hardware components. For example, one can keep a standby machine in sync with the one in production so that, if a fault occurs, the standby can take over almost immediately, thus reducing downtime.
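The standby idea can be sketched as below. This is a toy model under heavy assumptions: `Node` and `ReplicatedStore` are hypothetical names, replication is a plain synchronous copy, and health is a flag flipped by hand rather than detected by real health checks.

```python
class Node:
    """A toy machine holding a copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.healthy = True


class ReplicatedStore:
    """Keeps a standby in sync with the primary and swaps them on a fault."""

    def __init__(self):
        self.primary = Node("primary")
        self.standby = Node("standby")

    def _failover_if_needed(self):
        # Promote the standby when the primary is down and the standby is up.
        if not self.primary.healthy and self.standby.healthy:
            self.primary, self.standby = self.standby, self.primary

    def write(self, key, value):
        self._failover_if_needed()
        self.primary.data[key] = value
        if self.standby.healthy:
            self.standby.data[key] = value  # simplified synchronous replication

    def read(self, key):
        self._failover_if_needed()
        return self.primary.data.get(key)


store = ReplicatedStore()
store.write("photo:1", "cat.jpg")
store.primary.healthy = False        # simulate a disk crash on the primary
value = store.read("photo:1")        # served by the promoted standby
serving_node = store.primary.name
```

The read after the simulated crash still succeeds because the standby already had the data. The sketch does not cover the hard parts of real failover, such as detecting the fault reliably or bringing the old primary back in sync.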

Software Fault:

Bugs! Yeah, we have all been there. Unhandled exceptions, faulty code, misuse of shared resources like CPU, memory and disk, corrupted responses, weird behavior on unexpected inputs, and cascading failures, where the failure of one component triggers the failure of another, and so on.
Example: Ariane 5 Flight 501, where a software bug caused the rocket to veer off course and self-destruct shortly after launch.
Writing clean code, thinking through the edge cases, analysing the system's behavior and handling errors correctly all help greatly in building reliable software.
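The Ariane 5 failure is widely reported to have involved converting a 64-bit floating-point value into a 16-bit signed integer without a range check. The sketch below is not the flight software, just an illustration of handling that edge case explicitly. The function names are made up, and which of the two behaviors is right depends entirely on the system.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_checked(value: float) -> int:
    """Convert to a 16-bit signed integer, raising instead of overflowing."""
    result = int(value)  # truncates toward zero
    if not INT16_MIN <= result <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a signed 16-bit integer")
    return result


def to_int16_saturating(value: float) -> int:
    """Convert to a 16-bit signed integer, clamping out-of-range values."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


in_range = to_int16_checked(1234.9)
clamped = to_int16_saturating(1_000_000.0)

try:
    to_int16_checked(40_000.0)
    overflow_detected = False
except OverflowError:
    overflow_detected = True  # the caller now gets to decide what to do
```

Either way, the out-of-range input is an anticipated case with a defined outcome instead of an unhandled exception somewhere deep in the system.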

Human Errors:

Humans are unreliable. Period! We won't even be surprised if someone trips over a wire in a data center and the system goes down.
These errors can be greatly reduced by designing APIs well and by using monitoring tools to find where things actually went wrong. It also helps to cut down on manual work and automate it instead. Having a sandbox for testing production code is also very helpful here.

Final takeaways related to Reliability:

The system should be fault-tolerant.
Automate whatever we can.
Have redundant or backup systems in production.
Whatever happens, the service should never go down.
Let there be unavoidable faults, but there should be no failures.
At any cost, try to avoid bugs :)

PS: Big thanks to Martin Kleppmann for the book “Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems”
Cover image courtesy: Image by Free-Photos from Pixabay