I recently read some post about yet another NoSQL solution. This reminded me of the project which was nearly killed (I left before it have died) - and I want to share the story with some analysis of how it happened - so that it may help others avoid such silly trap :)
It is about real commercial project for some large company which publishes music. The project was a web-application which helps company clients - e.g. musicians, song-writers, singers, or groups of them - to arrange their deals with company. So it had accounts with various data, recording sales, purchases and amounts of money to be paid etc.
Front-end was using several modern JS frameworks and worked as beautiful SPA, getting data in REST/JSON from backend.
Backend was in Java and used NoSQL database for storage. Or rather not single database, but several solutions.
When I was interviewed for this project, I asked "why using Cassandra? Do you have some bigdata processing?" They answered "No, but modern projects are often built with NoSQL".
Right. It was in 2014 - the peak of the hype about NoSQL.
Database needed just typical operations:
- storing users
- storing their products (from songs to t-shirts)
- storing their transactions (what is sold or bought etc)
- making various reports by calculating, joining, aggregating etc.
All this could be easily done by normal SQL database - like MySQL, Postgres etc. The main pros for NoSQL are usually either better handling of "multi-node" mode (SQL dbs usually offer sharding and replication only) - and sometimes easier approach for describing data (e.g. "schema-less").
Cassandra solution
Initially folks were using Cassandra. It is interesting "columnar" database. When you just start it, you won't see much difference at first. You can create tables, insert records here.
However soon you notice some problems:
- Cassandra is great for storing data, but not for extracting them - you can only fetch data by indexed keys and even this is not very efficient;
- searching with complex queries, using joins etc - is almost impossible.
To solve these limitations guys used two more storages over database:
- dedicated cache (Hazelcast) which allowed speed-up retrieval of recently stored or viewed records;
- search engine (Elastic Search) to allow extracting data in various ways.
Stupid thing is that any of this was a storage on its own. So really guys were using 3 databases to store the same data instead of one.
Of course at some point people become very tired of this.
OrientDB attempt
So our architects started thinking of another database engine. Of course also NoSQL. Because of hype. By then one of most promising was OrientDB.
It is graph DB which also may look as document-oriented storage. It boasted "multi-master" mode, almost unheard of in other databases by then.
And by rough experiments it worked far better and easier than solution with Cassandra. It even allowed some kinds of joins etc.
So we spent efforts of 2-3 people for about half-year to rewrite or generalize various programming interfaces - in order that application can be switched from Cassandra to Orient one day.
Regretfully this failed miserably. It appeared that by then OrientDB had several bugs and one of them quite nasty. When we update records we usually "lock" them so that other users of DB see everything in consistent way and can't partially update the same record at the same time.
And the bug was - sometimes OrientDB didn't unlock some records after operation. This only happened in "multi-master" mode. While we were developing in single-master, everything worked well.
The bug was acknowledged by developers and it was told it is going to be fixed in Orient version 2.0, but we were not ready to update code once more - and anyway it was not fixed in 2.0 preview available by then.
Conclusion
- I won't say NoSQL is bad. But it is important to understand - they are different things and often for different goals. And we should remember all NoSQL databases are different from each other, and may require study and investigation.
- Of course we should try new databases, new solutions sometimes. Developers should progress in their knowledge.
- However, changing our main "business-database" of the application from SQL to some NoSQL solution will most probably be very painful and with unpredictable results.
- Better strategy is to use SQL and NoSQL along. If you see that some tables would be nice to be moved to NoSQL (e.g. due to their size) - let's try! It is always easier to migrate single table back if something went wrong. It's just as keeping static content on different server for web-site.
- Bicycle with square wheels, shown above - it is really good for some kind of non-flat road. This is well known from geometry. But we shouldn't try using it everywhere. It is the same with NoSQL. :)
Don't hang on hype! Be wise and cautious - and you'll never get unhappy because of NoSQL solutions!
Top comments (9)
I'm afraid this is not quite correct. Graph DBs give some specific advantage - "graph" queries. It is hard or impossible to do with normal SQL (popular interview question about some Oracle feature I think).
But there are other types of queries (this is mostly about certain joins) which are really hard to simulate on graph database. We made some simple substitution for some of them and did "manual join" (extracting data and doing this in code) for others.
Regretfully I'm not really sage about Orient - as you probably noticed this situation was over 5 years ago. Orient had turbulent times since then - but recently it somewhat rectified.
As about "schema-less" - note that nowadays RDBMS can store JSON data and operate on their fields. Which really makes them good even without schema :)
We use Cassandra at work and one of my chat bots is missing half it's features cause I have no clue how to do then in NOSql, but would of taken 5 minutes in SQL. My boss insists there is a solution but hasn't found one yet. Using only NOSql for everything has been a struggle for me at least.
Cassandra is great for certain tasks, like pushing tons of data into it very fast, performing bulk processing etc. It is used as low-level storage in some other special databases (e.g. time-series databases)...
Regretfully, exactly as you said, some operations are painful or impossible with it...
Exactly, I can not stress enough how hard it is to use Cassandra, and by extension probably any non-relational, database for data that is very relational...
Thanks a lot for the link to book and to your list of resources!
As a side note to this:
While microservice architecture ceased to be novelty for some years already. I rarely found project nowadays which doesn't utilize this approach. However You are very right, it seems to me. But not only because of themselves. Lot of troubles comes from fact that projects get new requirements and features over time. Monolithic architecture suffers from such extensions. Microservices also suffer from these extensions. With microservices it is just somewhat easier to do housekeeping, partial updates etc. But still no way to do things "very right - and from beginning" because at beginning we don't know what exactly would be "right" some time later. (my meek personal opinion)
I think I'd better try to come up with some good example instead of bewildering people with vague statements :) As this may take time I'll probably do this in separate post and drop a link here when I have it.
Thanks for this point!
Luke, here is my first attempt!
dev.to/rodiongork/rewrite-this-sql...
BTW great hairstyle. I feel envious!
Honestly, I'm not sure there is sensible benefit in this. Scala definitely failed to become "the language of the future". It will surely work, but considering memory consumption and the language being overcomplicated with unclear ideas of meeting Scala 3 - I personally won't be glad to undermine project from beginning.
Perhaps I should add post "how we killed the project with Scala" :) but it was recommender, which is somewhat different.
Which however sometimes can end up in great entanglement of dependencies between microservices, technically losing both separation and autonomy. But still microservice organization allows to keep order longer. :)
I usually can't draw hard line between monolith and microservices in real projects. Former monolith ones often get some satellite services and become more like set of microservices. On other hand in microservices structure often one or few become larger over years and resemble monolitic microservice :)
I personally don't really like the term NoSQL. What does it mean? "Everything else but SQL"? What is SQL then, anyway? Can we say that KSQL, Elasticserch SQL or Hive are SQL?
Instead, we can clearly classify database engines by their breed. Like, Cassandra is a key-value store, as well as Redis and DynamoDB. MongoDB is a document database. Elasticsearch is a document-oriented search database. MariaDB or PostgreSQL are RDBMSes.
After we have done this simple and way more precise classification, we can find out if the tool suits the job. Do you need queries? You'll have a hard time querying key-value stores. But it's very easy to query MongoDB, but you have to pay attention to your document schema. If you used ORMs before, you probably would be quite happy using MongoDB, unless you overuse relations between tables. Do you need to index and search a massive set of documents with somewhat loose schema? Then, you can try Elasticsearch or maybe even be happy with MongoDB full-text search, but no key-value store will ever provide you with such a capability.
You can still combine the high transaction throughput of Cassandra with a queryable model by making change-feed processing, projecting data from Cassandra to another database that has better support for queries. In that case, you will have to deal with some eventual consistency but it might be a tradeoff that is acceptable.
My point here is: please don't fall to a trap of "SQL vs NoSQL" discussion. These terms are vague and bias-prone. Sometimes saying "I could do it with SQL in 5 minutes" mean "I know how to do it with PostgreSQL in five minutes, maybe using Oracle will take me an hour, and with MongoDB, it will take 30 minutes but it will be easier to maintain. And there's no way I can use Redis". Such a reasoning has much more value.