DEV Community: Samuel Johnson R

Getting started with building bigger, faster and scalable systems (Part 2)

Samuel Johnson R — Fri, 31 Jul 2020 20:52:14 +0000

This part will get you started with what scalability means in the context of data systems.

Scalability:

Martin L. Abbott once said in Scalability Rules: 50 Principles for Scaling Web Sites,

Don’t accept that the application is too complex or that you release code too often as excuses that you can’t roll back. No sane pilot would take off in an airplane without the ability to land, and no sane engineer would roll code that they could not pull back off in an emergency.

Okay, so what is scalability?

Imagine you built a piece of software and released it to customers, and it has been reliable for some days. One night, a shout-out from some celebrity makes it go viral, and the load jumps from hundreds of concurrent users to hundreds of thousands and keeps growing. Will your system be able to process that much data and cope with the increased load? The ability of a system to cope with such increased load is called scalability.

And yeah, it is not just “black and white”: it is meaningless to say “This system is scalable” or “This system doesn’t scale”. Scalability needs proper thought and prior discussion: how do we add resources to handle the load increase, and if the load keeps growing this way, how are we going to cope with it?

Scalability can be understood by the following:

Defining the load.
Defining the performance.
Maintaining performance as the load increases.

How can one define a load?

Load can be described with a few numbers called load parameters.
First of all, one should know how much load the system can handle and what happens if the load increases drastically. Which parameters to consider depends entirely on the purpose of the system.

Some of the parameters one can usually consider are:

Read to write ratio: In software like Facebook/Instagram, the number of reads is far greater than the number of writes.
Requests per second.
Number of concurrent connections: Imagine a chat room where all users are connected at the same time.
Database query delay.
Cache hit and miss rates, and so on…
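
To make a few of these parameters concrete, here is a minimal sketch that derives them from a list of request records. The `Request` record, its fields and the `load_parameters` helper are made up for illustration; a real system would pull these numbers from its logs or metrics.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical record of one handled request.
    kind: str        # "read" or "write"
    cache_hit: bool  # True if it was served from the cache


def load_parameters(requests, window_seconds):
    """Summarise the requests seen during a window of `window_seconds`."""
    total = len(requests)
    reads = sum(1 for r in requests if r.kind == "read")
    writes = total - reads
    hits = sum(1 for r in requests if r.cache_hit)
    return {
        "requests_per_second": total / window_seconds,
        # No writes at all means the ratio is unbounded.
        "read_write_ratio": reads / writes if writes else float("inf"),
        "cache_hit_ratio": hits / total if total else 0.0,
    }


# Made-up sample: 8 reads (6 of them cache hits) and 2 writes in 5 seconds.
sample = (
    [Request("read", True)] * 6
    + [Request("read", False)] * 2
    + [Request("write", False)] * 2
)
params = load_parameters(sample, window_seconds=5)
print(params)
```

Which of these numbers actually matters depends on the system: a chat service would care more about concurrent connections, which this sketch does not track.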

Coming to a real-world example (key takeaways from Link):
By 2012, Twitter, one of the most heavily loaded data systems, was serving around 4,600 post-tweet requests per second on average and around 300,000 read-tweet requests per second. The problem they faced with writes was not the number of requests itself, but that each tweet had to be delivered to all of the author's followers.

For example, if Rihanna, who has around 94 million followers on Twitter, posts a tweet, it has to reach all of those millions of followers. That was their first scaling problem.

At first they used a traditional relational DB with classic joins: whenever a user requested their timeline, they looked up all the people that user follows, fetched their tweets and merged them sorted by time.
This put a heavy load on reads, as the timeline queries were huge.

Then they introduced a cache for each user’s timeline: whenever a person tweets, the tweet is added to the cached timeline of every one of their followers. Reads get much faster this way, but writes for a user with a huge number of followers slow down.

Then they moved to a hybrid of the first and second methods, which made scaling to millions possible.
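
The three approaches can be sketched as a toy in-memory model. This is only an illustration, not Twitter's actual implementation: the `Timelines` class, the `celebrity_threshold` cut-off and the use of a counter instead of timestamps are all assumptions. Authors below the threshold have their tweets pushed into their followers' cached timelines on write (second method), while tweets from authors above it are merged in at read time (first method).

```python
from collections import defaultdict
from itertools import count


class Timelines:
    """Toy model of a hybrid home timeline (illustrative only)."""

    def __init__(self, celebrity_threshold):
        self.celebrity_threshold = celebrity_threshold  # hypothetical cut-off
        self.followers = defaultdict(set)  # author -> users following them
        self.following = defaultdict(set)  # user -> authors they follow
        self.tweets = defaultdict(list)    # author -> [(seq, author, text)]
        self.cache = defaultdict(list)     # user -> precomputed timeline
        self._seq = count()                # stand-in for a timestamp

    def follow(self, user, author):
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author):
        return len(self.followers[author]) >= self.celebrity_threshold

    def post(self, author, text):
        entry = (next(self._seq), author, text)
        self.tweets[author].append(entry)
        if not self.is_celebrity(author):
            # Fan out on write: one cache insert per follower.
            for user in self.followers[author]:
                self.cache[user].append(entry)

    def home_timeline(self, user):
        # Start from the cached entries, then merge celebrity tweets on read.
        entries = set(self.cache[user])
        for author in self.following[user]:
            if self.is_celebrity(author):
                entries.update(self.tweets[author])
        # Newest first; the set also removes duplicates.
        return [(a, t) for _, a, t in sorted(entries, reverse=True)]


t = Timelines(celebrity_threshold=3)
t.follow("alice", "bob")
for fan in ("alice", "carol", "dave"):
    t.follow(fan, "star")
t.post("bob", "hi")       # pushed into alice's cached timeline
t.post("star", "hello")   # not fanned out, merged when timelines are read
print(t.home_timeline("alice"))
```

The trade-off shows up directly: a post by `star` costs one write no matter how many followers there are, but every follower pays a little extra on each read.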

Takeaways from this example:

Here the load of the system is the number of reads and writes on user timelines.
They balanced it by introducing the right components, like a cache, and by altering the architecture as needed.

How can one define the performance?

Defining the load and the load parameters is very important, and knowing what will happen when the load increases is equally important.

We can ask ourselves:

What happens if a load parameter increases while the resources stay the same: how will the system behave?
When a load parameter increases, how much do the resources need to increase to keep the performance the same?

Performance is usually measured by response time, i.e. the time between the client sending a request and receiving the response. Latency is often confused with response time.

Latency and response time are not the same:
Latency is the time a request spends waiting to be handled, whereas response time is what the client sees: the waiting time plus the time to process the request, network delays and queuing delays.

Average response time is often taken as the performance metric, but it doesn’t tell us how many users actually experienced a delay.

That’s where the median comes into place. If we collect the response times of a large number of requests and sort them in ascending order, the midpoint is the median. For example, if it is 100ms, then half of the requests had a response time below 100ms and the other half took longer. The median is also known as p50 or the 50th percentile. Similarly, p95, p99 and p99.9 are good percentiles to track.
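
Here is a small sketch of how the mean and the percentiles can tell different stories. The response times are made up, and `percentile` is a hypothetical helper using the simple nearest-rank method (other definitions interpolate between samples and give slightly different values).

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Made-up response times in milliseconds, with one very slow request.
response_times_ms = [20, 25, 30, 35, 40, 45, 50, 60, 80, 1500]

mean = statistics.mean(response_times_ms)
p50 = percentile(response_times_ms, 50)
p95 = percentile(response_times_ms, 95)
p99 = percentile(response_times_ms, 99)
print(f"mean={mean}ms p50={p50}ms p95={p95}ms p99={p99}ms")
```

With this sample the mean is 188.5ms even though nine out of ten requests finished within 80ms. The p50 of 40ms describes the typical request, and the high percentiles expose the slow outlier.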

Having such metrics is very important. For example, Amazon has found that a 100ms increase in response time reduces their sales by 1%. Refer.

A simple test with Apache Bench (ab) gives a good basic picture of the system's performance, including the percentiles, mean and median of the response time. More rigorous tests could be done for further million dollar worth decisions anyway :D

Performance and load have been defined. Now how do we maintain good performance as the load increases? 💭

To be very honest, there is no magical way to make every system hyper fast and scale to infinity and back. The architecture has to evolve gradually as the load increases, i.e. your architecture may work terrifically with the current load, but if the load gets 10 times bigger you will likely have to rethink it, so it should be flexible enough to change.

One of the first thoughts that comes to mind is scaling up and scaling out.

Scaling up and Scaling out

One of the most basic ways to improve performance is to add resources to the machine serving the requests. This obviously increases the performance, but tech has a limit and you can’t keep on adding more RAM and CPU forever. It costs a hell of a lot after a point and eventually becomes nearly impossible, although maintaining such a single-machine system is typically simple and easy. This method is called scaling up.

Scaling out is adding more low-powered machines which work together to solve a problem or handle requests. This method does add operational complexity, but it can keep scaling further on demand.
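
A very rough sketch of the scaling-out idea: estimate how many small machines a target load needs, then spread requests across them in round-robin order. The capacity numbers, the `app-N` server names and both helpers are assumptions for illustration; real load balancers also handle health checks, uneven request costs and much more.

```python
import math
from collections import Counter
from itertools import cycle


def machines_needed(target_rps, rps_per_machine):
    """Machines required if each one handles `rps_per_machine` requests/sec."""
    return math.ceil(target_rps / rps_per_machine)


class RoundRobinBalancer:
    """Hands out servers one after another, wrapping around at the end."""

    def __init__(self, servers):
        self._servers = cycle(servers)

    def route(self):
        return next(self._servers)


# Assumed numbers: 2500 requests/sec overall, 1000 requests/sec per machine.
servers = [f"app-{i}" for i in range(machines_needed(2500, 1000))]
balancer = RoundRobinBalancer(servers)
hits = Counter(balancer.route() for _ in range(9))
print(hits)
```

Adding capacity here means appending one more entry to `servers`, which is where the "on demand" part comes from. The operational cost is everything this sketch leaves out.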

Usually a combination of both is used, depending on the software and the use case. There is no such thing as a perfect architecture. It all depends on the use case.

An architecture that scales well is one that knows its load parameters, is built on correct assumptions and decisions around them, and operates well.

Everything is a trade-off. Trading ‘time’ for ‘space’ and vice versa. Trading ‘scalability’ for ‘ease of maintenance’ and so on.

More on how to build and scale complex systems will probably follow in future posts.

Final takeaways related to Scalability:

Understanding the load is important.
The median is typically a better metric than the mean when considering response time.
We should be ready for heavy load and have a clear plan on how to cope with it.
Having clear load parameters specific to the system is needed.
Scaling up and scaling out each have their own benefits and drawbacks.
The system should be flexible for future changes.

PS: Again, big thanks to Martin Kleppmann for the book “Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems”.

Getting started with building bigger, faster and scalable systems (Part 1)

Samuel Johnson R — Thu, 23 Jul 2020 19:34:03 +0000

It feels awesome to check out this dev community. I just felt like re-posting my post from Medium here to get started with this great platform. Looking forward to greater learning and more fun :D

The following is a series on system design concepts which I feel are much needed for every beginner to intermediate designer/developer of large-scale applications. Learning along the way :)

Part 1 -> Introduction to Data Systems and Reliability (This post)
Part 2 -> Scalability and Maintainability (Part 2)

Alan Kay once said,

The Internet was done so well that most people think of it as a natural resource like the Pacific Ocean, rather than something that was man-made. When was the last time a technology with a scale like that was so error-free? The Web, in comparison, is a joke. The Web was done by amateurs.

Lately, “data systems” has become an umbrella term under which databases, queues, caches and other components typically fall.

Why even?
Nowadays, a huge variety of tools for data storage and processing have come out, each gradually optimized in its own way for its respective use cases.

For Example:

Redis:
Traditional usage -> Datastore (in-memory caching)
Present usage -> Datastore and Message Queue
Apache Kafka:
Traditional usage -> Message Queue
Present usage -> Message Queue with database-like durability
Thus the boundaries between such categories are becoming blurred.

So application developer == data system engineer?

It has become a need for every application developer to be a ‘subtle’ data system engineer, as they are responsible for combining various components into a composite data system which works seamlessly together. For example, they write the transactional datastore code as well as the cache invalidation, which must work together to produce consistent results.
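
As a minimal sketch of that last example, here is a read-through cache that is invalidated on every write. Plain dicts stand in for the datastore and the cache, and the `ProfileStore` class is made up for illustration; a real setup also has to think about concurrent writers and partial failures, which this ignores.

```python
class ProfileStore:
    """Toy datastore with a cache in front of it (illustrative only)."""

    def __init__(self):
        self.db = {}       # stand-in for a transactional datastore
        self.cache = {}    # stand-in for a cache such as Redis
        self.db_reads = 0  # counts how often reads had to hit the datastore

    def read(self, key):
        if key in self.cache:
            return self.cache[key]
        self.db_reads += 1
        value = self.db.get(key)
        if value is not None:
            self.cache[key] = value  # fill the cache for later reads
        return value

    def write(self, key, value):
        self.db[key] = value
        # Invalidate so the next read cannot return the stale value.
        self.cache.pop(key, None)


store = ProfileStore()
store.write("bio", "v1")
first = store.read("bio")   # cache miss, goes to the datastore
second = store.read("bio")  # served from the cache
store.write("bio", "v2")
third = store.read("bio")   # cache was invalidated, goes to the datastore
print(first, second, third, store.db_reads)
```

If `write` forgot the invalidation step, `third` would still be the old value, which is exactly the kind of inconsistency the application developer is now responsible for.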

Basic important concerns of software design:

Reliability
Scalability (Part 2)
Maintainability (Part 2)

Okay, So What’s Reliability?
Whatever fault may occur, be it hardware, software or even human error, the system should continue working correctly. Period!

Let’s take the example of a social-networking app like Instagram.

Adding a photo should put the photo where the user expects it to be.
Even if the user uses the software in a wrong way, the software should be able to tolerate it. Example: If your software supports only one 👍 (like) per person per photo and a user tries to like a photo n times, the software should tolerate it and still count a single like.
The software's performance shouldn't degrade under the expected load and data volume. Example: If you have designed a system for an expected 1M users with a 1000:1 read to write ratio, then the system should handle that load.
Unauthorized access should be blocked. Example: A user’s account should be accessible only by them, and any other access attempt should be blocked.

Things will go wrong. Things may fail. They are called Faults.

The software system we build should be fault-tolerant.
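
Going back to the one-like-per-person example, tolerating that kind of misuse can be as simple as making the operation idempotent. This is a minimal sketch with a made-up `Photo` class; storing the likes as a set of user ids is one possible design, not the only one.

```python
class Photo:
    """Toy photo that allows at most one like per user."""

    def __init__(self):
        self._liked_by = set()  # ids of users who liked this photo

    def like(self, user_id):
        """Record a like. Returns True only if it was newly recorded."""
        if user_id in self._liked_by:
            return False  # repeated like: tolerated, but changes nothing
        self._liked_by.add(user_id)
        return True

    @property
    def like_count(self):
        return len(self._liked_by)


photo = Photo()
results = [photo.like("user-1") for _ in range(5)]  # same user, five taps
photo.like("user-2")
print(results, photo.like_count)
```

Five taps from the same user are accepted without an error but only the first one counts, so the photo ends up with two likes.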

Important -> FAULT != FAILURE

A failure is when the system as a whole stops working and stops providing the required service, whereas a fault is a deviation of one part of the system from its expected behavior. Faults are to be expected, and unhandled faults lead to failures.

Some Causes:

Hardware Fault:

One of the best-known causes of system failure is a hardware fault. It can have many causes, including a disk crash, faulty RAM, a sudden power cut without backup and so on.
One way to handle this is to have redundant hardware components, i.e. one can keep a second machine in sync with the one in production, so that if some fault occurs the second machine can take over in no time, thus reducing downtime.
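
The redundancy idea can be sketched with two in-memory "machines". Everything here is hypothetical (the `Machine` and `RedundantPair` classes, the `healthy` flag, `ConnectionError` as the fault signal), and real replication has to deal with lag and split-brain situations that this toy ignores.

```python
class Machine:
    """Toy machine that stores data and can be marked as broken."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.data = {}

    def write(self, key, value):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        self.data[key] = value

    def read(self, key):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return self.data[key]


class RedundantPair:
    """Keeps a standby in sync and promotes it when the primary fails."""

    def __init__(self, primary, standby):
        self.primary = primary
        self.standby = standby

    def _fail_over(self):
        self.primary, self.standby = self.standby, self.primary

    def write(self, key, value):
        try:
            self.primary.write(key, value)
        except ConnectionError:
            self._fail_over()
            self.primary.write(key, value)  # raises if both are down
        if self.standby.healthy:
            self.standby.write(key, value)  # mirror to keep it in sync

    def read(self, key):
        try:
            return self.primary.read(key)
        except ConnectionError:
            self._fail_over()
            return self.primary.read(key)


a, b = Machine("a"), Machine("b")
pair = RedundantPair(primary=a, standby=b)
pair.write("x", 1)
a.healthy = False           # simulate a hardware fault on the primary
value = pair.read("x")      # served by the promoted standby
pair.write("y", 2)
print(value, pair.primary.name, b.data)
```

The fault on machine `a` still happens, but because `b` was kept in sync the caller never sees a failure.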

Software Fault:

Bugs! Yeah, we have all been there. Unhandled exceptions, faulty code, badly used shared resources like CPU, memory or disk, corrupted responses, weird behavior on unexpected inputs, and cascading failures across lots of components, i.e. the failure of one component leads to the failure of another and so on!
Example: Ariane 5 Flight 501, where a bug in the rocket's software resulted in the loss of the rocket.
Writing clean code, thinking through the edge cases, analysing the system's behavior and handling errors correctly all help greatly in building great software.

Human Errors:

Humans are unreliable. Period! We won't even be surprised if a human trips over a wire in a DC and the system goes down.
These errors can be much reduced with well-designed APIs and with monitoring tools that show where things actually went wrong, and also by automating as much of the manual work as possible. Having a sandbox for testing code before it reaches production is also very helpful here.

Final takeaways related to Reliability:

The system should be fault-tolerant.
Automate whatever we can.
Have redundant or backup systems in production.
Whatever happens, the service should never go down.
Let there be unavoidable faults, but there should be no failures.
In any case, try to avoid bugs :)

PS: Big thanks to Martin Kleppmann for the book “Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems”.
Cover image courtesy: Image by Free-Photos from Pixabay