Here are some notes on this original video from Tim Berglund. Check it out!
As far as the client is concerned, a distributed system is a set of computers that acts as a single one.
Characteristics
- They operate concurrently
- Fail independently
- They don't share a global clock
Three topics
1) Storage
You start off with a single relational database; you usually get more reads than writes.
Read replication
To mitigate the excessive load a DB might be exposed to when reads become too much, you can use read replication: one master node for writes and several replicas for reads. This is fine until the master node saturates (usually on writes, as reads can keep scaling by adding replicas).
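As a minimal sketch of that routing (plain Python dicts standing in for the master and its replicas, with synchronous replication only to keep the example short):

```python
import random

# A toy sketch: plain dicts stand in for the master and two read replicas,
# and replication is synchronous here only to keep the example short.
master = {}
replicas = [{}, {}]

def write(key, value):
    """All writes go to the master, which copies them to the replicas."""
    master[key] = value
    for replica in replicas:
        replica[key] = value

def read(key):
    """Reads are spread across the replicas, taking load off the master."""
    return random.choice(replicas).get(key)

write("Tim", "Americano")
print(read("Tim"))  # "Americano"
```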
More info on read replication here
Sharding
You split the read replication scheme by a key (say first name: A-F on one shard, G-N on the next, O-Z on the last).
| ✅ Solves | ❌ Creates problems |
|---|---|
| You now have more room to distribute your writes | Consistency is broken |
| | Data model is broken (no JOINs across shards) |
To solve latency issues, you may consider denormalizing your DB (i.e., duplicating data and removing relations between tables so you don't need JOINs). Please don't.
More info on sharding here
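A toy sketch of that key-range routing, with plain dicts standing in for three hypothetical shards covering A-F, G-N and O-Z:

```python
# A toy sketch of range-based sharding: three dicts stand in for the
# hypothetical shards, keyed by the first letter of the first name.
shards = {"A-F": {}, "G-N": {}, "O-Z": {}}

def shard_for(name):
    """Route a key to its shard by first letter."""
    first = name[0].upper()
    if first <= "F":
        return shards["A-F"]
    if first <= "N":
        return shards["G-N"]
    return shards["O-Z"]

shard_for("Tim")["Tim"] = "Americano"
print("Tim" in shards["O-Z"])  # True
```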
Consistent hashing
Say you have a key/value database. You hash the key and store the value on the shard that owns that part of the hash space. In this example, each shard indexes 1/8th of the content, indexed by the hash's first hex digit. If I want to store
{key: "Tim", value: "Americano"}
I'd hash the key first
{hashedKey: "9F72", value: "Americano"}
find the corresponding node (in this case A000, since A000 ≥ 9F72 > 8000) and store it there.
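A minimal sketch of that lookup, assuming eight hypothetical nodes placed evenly on a 16-bit ring; each node owns the hash range just below its own position:

```python
# A minimal sketch of the lookup above, assuming eight hypothetical nodes
# on a 16-bit ring; each node owns the hash range below its own position.
NODES = ["2000", "4000", "6000", "8000", "A000", "C000", "E000", "0000"]

def node_for(hashed_key):
    """Walk clockwise to the first node at or past the hashed key."""
    h = int(hashed_key, 16)
    for node in NODES[:-1]:
        if h <= int(node, 16):
            return node
    return "0000"  # keys past E000 wrap around to the first node

print(node_for("9F72"))  # "A000", matching the example above
```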
Replication
To guard against system failure (i.e., one node going down), you can replicate the data to other nodes (say, to the next two nodes on the ring as well).
❌ The problem that comes with this is inconsistency, since you are now writing to more than one node. What if you want to write to 3 nodes but one is down? When it comes back online it will be inconsistent until it catches up.
Consistency or R+W>N
If the number of nodes I write to (W) plus the number of nodes I read from (R) is greater than the number of replicas (N), you have strong consistency: every read overlaps the latest write in at least one node.
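The rule is just arithmetic; a tiny illustration with N = 3 replicas assumed:

```python
# The R + W > N rule as a one-line check, with N = 3 replicas assumed.
def strongly_consistent(r, w, n):
    """True when every read quorum overlaps every write quorum."""
    return r + w > n

print(strongly_consistent(r=2, w=2, n=3))  # True: quorums must overlap
print(strongly_consistent(r=1, w=1, n=3))  # False: a read may miss the write
```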
CAP theorem
- Consistent: When I read something I want it to be the latest version of it.
- Available: every request gets a (non-error) response.
- Partition tolerant: if part of the system gets cut off from the rest, the system as a whole must keep working.

The theorem basically states that in real-life applications you must sometimes give up one of the three. Example: if a partition happens and a read request comes in, you can either:
- a) Give the best answer you can, although it may not be consistent
- b) Sacrifice availability and return an error to ensure that no inconsistent reply is sent.
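A toy illustration of that choice, with a hypothetical replica holding a possibly-stale value during a partition:

```python
# A toy illustration: during a partition, a hypothetical replica holds a
# possibly-stale value and must pick between options a) and b) above.
stale_value = "Americano"
partitioned = True

def read_available():
    """Option a): answer with the best value you have, possibly stale."""
    return stale_value

def read_consistent():
    """Option b): refuse to answer rather than risk a stale reply."""
    if partitioned:
        raise RuntimeError("partition in progress, try again later")
    return stale_value
```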
2) Computation
Distributed computation is basically like moving from a single threaded computation to a multithreaded one, but several times worse.
Imagine you have a bunch of data sitting in a lot of computers and you need to do stuff to it.
Old way: MapReduce
Pushes all your data through two functions
- Map
- Reduce
Example: Say you have the sentence
I want to eat two burgers with fries
and drink two sodas with ice
Map
When you push it through a map function, it makes a "stateless" count of each word (it won't have any memory yet, so all counts will effectively be 1):
{I, 1}
{want, 1}
{to, 1}
{eat, 1}
{two, 1}
{burgers, 1}
{with, 1}
{fries, 1}
{and, 1}
{drink, 1}
{two, 1}
{sodas, 1}
{with, 1}
{ice, 1}
Shuffle
Since this is a distributed computation, the shuffle step moves pairs with the same word across the network so they end up next to each other.
Reduce
The reduce function then adds up all usages of each single word:
{I, 1}
{want, 1}
{to, 1}
{eat, 1}
{two, 2}
{burgers, 1}
{with, 2}
{fries, 1}
{and, 1}
{drink, 1}
{sodas, 1}
{ice, 1}
We have now managed to split the map work and the reduce work among many computers.
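The whole word count above fits in a few lines of Python, with the map, shuffle and reduce steps made explicit:

```python
from collections import defaultdict

# The word count above in a few lines, with map/shuffle/reduce made explicit.
sentence = ("I want to eat two burgers with fries "
            "and drink two sodas with ice")

# Map: emit a "stateless" (word, 1) pair for every word.
mapped = [(word, 1) for word in sentence.split()]

# Shuffle: move pairs with the same word next to each other.
shuffled = defaultdict(list)
for word, count in mapped:
    shuffled[word].append(count)

# Reduce: sum the counts for each word.
reduced = {word: sum(counts) for word, counts in shuffled.items()}
print(reduced["two"], reduced["with"])  # 2 2
```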
More info on map-reduce here
Spark
In Spark, map-reduce becomes transform-action; it's the same scatter/gather paradigm.
Spark is also storage agnostic, unlike Hadoop (which is tied to HDFS).
Kafka
In Kafka everything is a stream, processed in real time.
3) Messaging
- Systems talk to each other through messages instead of through shared databases.
- Messages are consumed by subscribers
- Created by producers
- Organized into topics
- Processed by brokers
- Usually persistent in the long term, but you can change the retention period to whatever you want.
❌ Problems you may encounter:
- A topic gets too big (huge retention period, messages are huge, read and writes are too fast...)
- Reliability
- Guarantee delivery even if a computer is down
Apache Kafka
⚠ When scaling, you lose system-wide order. You still have partition-wide order.
Producers create events and assign them to partitions (not really randomly, but for the sake of the argument let's assume so). Each partition has a consumer on the other side processing whichever event is first in the queue.
The great thing is that anyone can subscribe to the topic: from the same event stream, one consumer might store the events in a MySQL DB while another does some form of real-time processing.
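This isn't real Kafka, but a toy in-memory model makes the partition-wide ordering visible: events with the same key land in the same partition, so their relative order is preserved there.

```python
from collections import defaultdict

# Not real Kafka: a toy in-memory topic with three partitions. Events with
# the same key hash to the same partition, so order holds per partition.
NUM_PARTITIONS = 3
topic = defaultdict(list)  # partition id -> ordered list of (key, event)

def produce(key, event):
    partition = hash(key) % NUM_PARTITIONS
    topic[partition].append((key, event))

for i in range(5):
    produce(f"user-{i}", f"event-{i}")
produce("user-0", "event-5")  # same key: lands after event-0, in order

# Each consumer would read one partition front to back, in partition order.
for partition, events in sorted(topic.items()):
    print(partition, events)
```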
More info on kafka here
Done!
I hope these short notes help you! Don't forget to check the original video and share it!
Have a wonderful day!