Thanks for the really interesting article @uyouthe
When the database is distributed, can there be situations where a user on one server will get different results from a user accessing a different server?
I'm thinking that there are basically two scenarios:
In one scenario all data is local to, say, each user. In that case we can use sharding to separate the databases, and there are no consistency problems: User A goes to database A, user B goes to database B and everything is fine.
However, in the second scenario, if I need to get data that is not "native" to my server, then I either have to access the other server directly for that data, which presumably lowers scalability, or I need to use some cached version of that data which may not be up to date. Is that a reasonable assessment or is there more to it?
First of all, thanks!
No, data inconsistencies are not possible in distributed databases. As soon as we go distributed, a CRDT algorithms steps in to ensure data consistency. You basically can’t go distributed without CRDT, and Riak got you covered. This is why you can access the data through any node – conflict resolving and syncing made under the hood.
You either go full distributed or not distributed at all. With just a master-slave replication, slaves just copy the whole dataset and thus any inconsistencies aren’t possible.
Thank you for your reply! Your answer prompted me to do some reading. If I understand correctly, it seems that this kind of approach relies on the idea of "eventual consistency." However, if that is the case, it does seem that different nodes can potentially return different versions of the same information. That is, a node can potentially answer a query with data that is not up-to-date (over some finite interval of time) even when there are other nodes that do have the up-to-date information. This is something I am interested in, but do not have experience with, so do let me know if I've misunderstood, or if Riak works differently from "basic" eventual consistency...
Yes, you’re right. At distributed system, you can go for ACID, but it will be slower. However, Riak seems to use eventual consistency and vector clock:
In CRDTs, the Cap theorem is always taking place.
Its also extremely important to look into the use case of the setup.
It is quite possible that only small handful of data operations would need such level of database consistency. Where you can have large scale distributed DB for almost everything but one API endpoint.
Facebook for example is known to still extensively use MySQL throughout their system, during such situations (exactly where however is unknown).
Majority of data operations would then typically use "eventual consistency" where it makes alot more sense at their scale.
With some rather careful planning of API and data flows, along with DB sharding. And of course time to invest in the relevant dev work... it can be quite surprising how little data operations need true complete consistency.
Of course yes. Going that big, it's important to have a well-defined operations workflow.
Once you're using replicated MySQL with nontrivial lag among replicas (thing US to Europe or Asia), it's eventually consistent anyway unless you force your reads to go to the master.
Facebook's backend is still pretty much all MySQL, but backing another system called Tao.
We're a place where coders share, stay up-to-date and grow their careers.
We strive for transparency and don't collect excess data.