re: MongoDB has no use case

re: Set up sharding and you'll have to delete your foreign keys anyway. End up with a table with a few billion rows and changing the structure is going...
 

You mean, from the point of view of scaling? Otherwise, I guess MongoDB too would need some tuning that only an experienced DBA can do?

Largely that, but also all kinds of other little things. If we're talking about scaling, MongoDB specifically does fairly little for you (though replica sets are super easy to set up), but things like Riak make most of that work quite a bit easier.

Yes, of course, but please realize that this article was written in the spirit of fun. I don't mean any of the vitriol contained in it, and I hope that much is clear from the disclaimer and the postscript.

Sure, but the statement is sort of difficult to phrase any other way without losing the meaning. It's often annoying how people are stuck in the mindset that their DB choice is what makes their application great or not, or that it saves them from programmer errors.

Personally I find that the DB choice matters fairly little in the end, though of course specific use cases should be thought through. What matters most is whether you choose a DB that's constantly going to fight you, or one that will let you get on with your work.

Also, let's face it: it's your programmers who write your DB definitions. Guess where they can make errors if you let them put logic, constraints, and the like in the DB? And fixing your DB tends to be more work than fixing your code, largely because people's expertise is with the code, not the DB.

Could you please expand on that?

OK, so say you have a user table and you want to add a birthday column. You don't know the value for the existing users, so you have to default it to NULL or similar. You can either create the column in SQL with a default value of NULL, or you can use a document store - leave the objects in the database as they are and just add the birthday field, with a NULL default, to the model class in your code. Then, if you've written your DB code nicely, it will only set the values that were actually present in the DB, and you end up with a NULL birthday.
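
A minimal sketch of that pattern in Python - the User class, its fields, and the from_document helper are all made up for illustration:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class User:
        # Fields that have existed since the collection was created.
        name: str
        email: str
        # Newly added field: old documents simply lack this key,
        # so the default kicks in when they are loaded.
        birthday: Optional[str] = None

        @classmethod
        def from_document(cls, doc: dict) -> "User":
            # Copy only the keys the model knows about; anything missing
            # (like 'birthday' on pre-migration documents) keeps its default.
            known = cls.__dataclass_fields__.keys()
            return cls(**{k: v for k, v in doc.items() if k in known})

A document stored before the change, say {"name": "Ann", "email": "ann@example.com"}, loads with birthday set to None - no ALTER TABLE required.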

Similarly, even slightly more complicated migrations - merging first name and last name into one field, calculating a hash of a field for faster indexing, and so on - can be done in code in more interesting ways.

If you run a query like UPDATE user SET name_crc=CRC32(name), you have no control over the rate at which it processes data, and it might cause pretty heavy load on your DB depending on the DB and the underlying hardware.

If you make a script that runs through all the entries in your DB and does that in code, then you have much more control over the rate and other such details. It might take a bit more work, but after you've written your first one like that you can just reuse the same logic. You additionally gain things like the ability to lock each entry with a distributed lock across your code, so you won't end up with race conditions like this:

  1. A request incoming to your API loads the DB row
  2. UPDATE user SET name_crc=CRC32(name) processes the row
  3. Request completes and saves the data in it to the DB

In THIS specific example it doesn't matter much (your code should probably be setting the CRC32 properly at that point anyway), but in many other cases it does. If instead you do it in a script:

  1. Get a distributed lock for your DB object
  2. Load the DB object
  3. Process it
  4. Write it
  5. Release lock
  6. Next item

Assuming you use the same locking mechanism across your software (and you should), you can be pretty sure nothing else changed the underlying object at the same time.
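
As a rough sketch of that loop, assuming pymongo, with distributed_lock standing in for whatever lock service you actually use (etcd, Redis, ...):

    import time
    import zlib
    from contextlib import contextmanager

    from pymongo import MongoClient

    users = MongoClient("mongodb://localhost:27017").example_db.users

    @contextmanager
    def distributed_lock(key):
        # Stand-in: a real implementation would talk to etcd, Redis,
        # or whatever locking service the rest of the codebase uses.
        yield

    for doc in users.find({"name_crc": {"$exists": False}}, {"_id": 1}):
        with distributed_lock(f"user:{doc['_id']}"):         # 1. lock
            fresh = users.find_one({"_id": doc["_id"]})      # 2. load
            crc = zlib.crc32(fresh["name"].encode("utf-8"))  # 3. process
            users.update_one({"_id": fresh["_id"]},
                             {"$set": {"name_crc": crc}})    # 4. write
        # 5. the lock is released when the 'with' block exits
        time.sleep(0.01)  # 6. throttle before the next item

The sleep is the crude version of rate control; you could just as well make it adaptive based on replication lag or server load.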

So, you are of the opinion that MongoDB is best left to quick prototypes only?

No, but that's where it really shines. MongoDB does have a bunch of issues, but it's not like many of the same issues don't exist in other popular databases. It's basically a case of picking your poison.

I've personally found MongoDB scales fairly effortlessly to decent-sized environments, but I would probably try to find more use cases for Riak, Cassandra, and others as the load grows, to make sure I don't rely on MongoDB too much.

Sorry for the late reply!

or you can use a document store - leave the objects in the database as they are and just add the birthday field, with a NULL default, to the model class in your code.

I don't really see how missing fields are better than fields with NULL values. Ultimately, in the code, they'll both respond to logical comparisons equally nicely. 👀

If you make a script that runs through all the entries in your DB and does that in code, then you have much more control over the rate and other such details.

Again, I see no reason why we can't do this in a relational database.

Your distributed lock example, though, hits right home. I don't think this is something we'll ever manage in something like MySQL. Even read locks are tricky, from what I gather.

Personally I find that the DB choice matters fairly little in the end [...]

Honestly, I'm beginning to think that way, and the more I dive into discussions by senior developers, the more I see that foreign keys and other "nice" constraints are actually a stumbling block in the long run.

In the end, though, I have a nightmare scenario that keeps me from embracing document databases. Let's take the classic example of a blog: you store the posts and users' comments as an aggregate, and it makes a lot of sense. Imagine, though, that the management wants to know which comments were posted between 4 pm and 8 pm on a certain date. If it's a document store, I'll have to scan through all the aggregate fields. 😭

BUT, I'm also not sure if a SQL BETWEEN is going to be very useful as the number of rows hits millions.

Thoughts? :-)

I don't really see how missing fields are better than fields with NULL values. Ultimately, in the code, they'll both respond to logical comparisons equally nicely. 👀

If you make a script that runs through all the entries in your DB and does that in code, then you have much more control over the rate and other such details.

A missing value does not always equal NULL, and you might want your default to be different from NULL while still keeping the field nullable - there are lots of cases. Also, it's not so much about "not being able to" - I believe MOST SQL databases can nowadays do live schema changes without the whole database freezing for a long time - but more about habit: RDBMS users tend to just think differently.
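
To make the distinction concrete, in MongoDB (queried here via pymongo) a null field and a missing field can be told apart at query time - the collection and field names are just for illustration:

    from pymongo import MongoClient

    users = MongoClient("mongodb://localhost:27017").example_db.users

    # Matches documents where birthday is explicitly null AND
    # documents where the key is absent entirely:
    users.find({"birthday": None})

    # Matches only documents that lack the key:
    users.find({"birthday": {"$exists": False}})

    # Matches only documents where the key exists and holds null:
    users.find({"birthday": {"$type": "null"}})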

Your distributed lock example, though, hits right home. I don't think this is something we'll ever manage in something like MySQL. Even read locks are tricky, from what I gather.

MySQL can function as a decent host for that to some extent, with GET_LOCK(), but it can't easily scale past one server. Really you want something more like etcd.
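
For reference, a minimal sketch of GET_LOCK() from Python - the connection details and lock name are made up:

    import pymysql

    conn = pymysql.connect(host="localhost", user="app",
                           password="secret", database="example_db")

    with conn.cursor() as cur:
        # Block for up to 10 seconds trying to take the named lock.
        cur.execute("SELECT GET_LOCK('user:42', 10)")
        if cur.fetchone()[0] == 1:
            try:
                pass  # do the guarded work here
            finally:
                cur.execute("SELECT RELEASE_LOCK('user:42')")

The catch is that the lock lives inside a single mysqld process, so it only coordinates clients of that one server.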

Let's take the classic example of a blog: you store the posts and users' comments as an aggregate, and it makes a lot of sense. Imagine, though, that the management wants to know which comments were posted between 4 pm and 8 pm on a certain date. If it's a document store, I'll have to scan through all the aggregate fields.

There is such a thing as the management being wrong, and you, as the expert, being required to tell them they're wrong.

Also, you can design your schema wrong in any database, so why is it the fault of the database that you made a mistake? And if this is a change in requirements, why is it out of the question to spend development effort to change the data structure?

Lastly, you should pick the right tools for the job, not try to do everything in the one tool you happened to choose. In the case of a database, if you have read-only replicas for analytics and the like, you can pretty safely run even heavy queries there (MongoDB can probably do more complicated queries than MySQL with less effort, thanks to its aggregation and map-reduce systems). But really, when you start doing more complicated searches, you should use a system designed for complicated searches, such as Elasticsearch.
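
For example, the 4 pm to 8 pm comment question from above could be answered with an aggregation pipeline - pymongo again, with the posts collection and its title/comments/created_at fields assumed:

    from datetime import datetime
    from pymongo import MongoClient

    posts = MongoClient("mongodb://localhost:27017").blog.posts

    start = datetime(2019, 6, 1, 16, 0)  # 4 pm
    end = datetime(2019, 6, 1, 20, 0)    # 8 pm
    in_range = {"comments.created_at": {"$gte": start, "$lt": end}}

    pipeline = [
        # A first $match lets a multikey index on comments.created_at
        # narrow down candidate posts before unwinding.
        {"$match": in_range},
        {"$unwind": "$comments"},  # one output document per embedded comment
        # Re-match, since the first stage keeps whole posts containing
        # at least one matching comment.
        {"$match": in_range},
    ]
    for row in posts.aggregate(pipeline):
        print(row["title"], row["comments"]["created_at"])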

There's also denormalization for lazy (and smart) people: just save the data twice in different structures, so it can be fetched optimally in your different use cases.
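
A sketch of that, continuing the blog example (collection names assumed): each comment is written twice, once embedded for rendering the page and once flat for time-range queries:

    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017").blog

    def add_comment(post_id, comment):
        # Copy 1: embedded in the post, so rendering a page is one read.
        db.posts.update_one({"_id": post_id},
                            {"$push": {"comments": comment}})
        # Copy 2: a flat collection that's easy to index and query by time.
        db.comments.insert_one(dict(comment, post_id=post_id))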
