By no means have I seen everything there is to see in the world of development, but I have been around long enough to see the change from needing a single database which housed all your data, to this world where there’s SO MUCH data that you couldn’t possibly store it all in a single database, on a single server, or in many cases, not a dozen servers.
As the volume of data has increased, we’ve seen new patterns emerge such as microservices, database sharding and more. Many times those tasked with adapting to these changes try to take our existing toolsets such as languages, or database technologies and we try to force them to handle those changes along the way. Ever heard that saying, “When you have a hammer, everything looks like a nail”. Check it out – there’s an entire Wikipedia article dedicated to it here.
I suggest that it is incumbent on you, as a developer, as a manager, or a director, or a leader…someone in the capacity making technological decisions to make yourself aware of tools that already exist to solve the technical challenges you or your organization encounter.
For ANYONE who deals with large datasets, this is a fantastic reference. There was a quote in Chapter 3 that absolutely spoke to my soul – “You’re probably not going to implement your own storage engine from scratch, but you do need to select a storage engine that is appropriate for your application, from the many that are available”. Before I go any further – that statement about you’re probably not going to write your own storage engine…in MOST cases, you probably shouldn’t!
So let’s start with the easiest one for me, as I’ve been around the technology for so long – your relational database. I’m not going to pick on any particular RDBMS because if you’ve worked on a project where a database system was your storage engine, then you’ve probably seen this.
Everything was great – you have your database server handling your transactions and maybe have some input / output screens…now, the customers want some analytical reporting. Sure, no problem…SQL handles GROUP BY’s and HAVING clauses, let’s make this happen!
Time passes…and now, those analytical queries are competing with your transactional queries. You now have deadlocks in the system, pages are taking longer to load. Maybe pages are crashing. Point is things aren’t great. So you go and enlist the help of some database admins / developers and you get some good indexes created, write some more performant code in stored procedures, etc. Things are good again.
Now, customers want some custom filtering capabilities on the UI that is fronting the database…”No sweat”. Maybe write some complex stored procs, maybe some dynamic SQL, and you’re good to go.
Well @#$%@#$%…now the users want to be able to search and filter that data that is joining across 20 different tables in the database! What are you going to do, tell them “No”?! Yeah….let’s see how long that lasts…
So now you go back and enlist some more help to try and get things under control, but the indexes only get you so far (and consequently slow down your writes, hanging your transactions longer). Now you find yourself writing much more complicated code to do rather “simple” things, at least in the customer’s mind. I mean seriously – your end-user sees a grid with 50 columns in it, why can’t they filter the data in those columns? They have no idea that those 50 columns are across 20 different tables in the database and that it’s not going to be easy. For you, as a savvy SQL developer, you know you can work some magic and stage things into temporary tables to try and filter down the data before you even do the massive JOINs, and you MIGHT even get a halfway decent solution together.
As the volume of data in tables increases, at some point, what you did previously is not going to stand up to the test of time either, and now you’re going to find yourself creating denormalized versions of the same data and indexing it 8 ways from Sunday – AND YOU’RE STILL GOING TO HAVE PROBLEMS. Let’s not even get started on some of the things that would really send this into tail-spin such as full-text searching, or scoring for relevancy. We’re talking basic searches without wildcards and all that – exact match searches…and it’s still a nightmare.
So what we have here is the database hammer – and what a massive hammer it is! Depending on your database system of choice, there’s a good reason it is the hammer that many a developer choose to wield (and I’m not just talking about database developers – I’m talking about ALL developers)…if you’re using a system like SQL Server or Oracle, those things are the Swiss Army Knives of database systems!
They’ve got job schedulers, reporting systems, ETL tools, analytics tools, multiple storage formats, different types of indexes…you name it, they’ve got it. But just because you can do it, does it mean you should do it?
Before anyone thinks I’m picking on the SQL Servers and the Oracles of the world – please know that I’m not I’ve seen this in other areas as well. I’m sure you’ve seen things done in a programming language that would have been better suited for a shell script, or using an incredibly complex microservices architecture when a simple monolith would have done just fine – because that’s the tools that the developers knew and loved.
This is where I get to the point…in the story above, at SOME point, someone should have looked at the amount of effort and heartache and said, “This doesn’t feel right”. For people with experience in various systems, that tingling probably started BEFORE any work took place, but surely for everyone involved they had to be aware they were spending too much time and effort for such little returns and constant baby-sitting the system.
Again, just because you can make something work, doesn’t necessarily mean it’s the correct approach. I’ve found in my career that for some weird reason, many development teams would much prefer writing and maintaining (sometimes painfully so) tens of thousands of lines of code, rather than finding and running systems that were designed to solve the exact problem.
If you are someone involved in projects that feel like they’re going down the wrong path, you should invest some time to at least find if there are any solutions out there that seem to fit the problem you have. This doesn’t mean that you should just keep adding tech to your stack willy-nilly because supporting and keeping solutions operational requires knowledge and time as well…this is where educated, calculated decisions should be made.
Revisiting the database story above – it’s not unreasonable to think that hundreds of hours of time was invested in just creating solutions that would work. It’s also not unreasonable to think that hundreds more hours were spent trouble-shooting and identifying performance issues.
The problem, typically, is we as developers are looking for specific answers using specific technologies – “How do I make this query faster in SQL Server?”. It’s so easy to get tunnel vision when you have a prescribed set of tools, to live within ONLY those tools…
If someone had looked outside the database world, they would have seen that there are a few popular search engine technologies that answer many of the problems encountered, and using those technologies would have likely solved more than just the search issue. This is what I mean by “rolling your own” – in a sense, a custom search engine was created (likely with the wrong underlying storage engine if you’ve been following along with our Designing Data-Intensive Application series), whereas choosing a purpose built search engine would have provided features such as…
- Fully featured search – full text searching, wildcards, etc.
- Faster search performance
- Less storage space required
- Reduced load on transactional database for increased performance
- Search relevance scores
- Searching across multiple indexes simultaneously (search for “Joe” in computers and people)
How much time would be spent implementing and understanding the new system? Let’s assume it was the SAME amount of time as the database – is it a sunk cost? Was it a wash? Likely not as you got the benefits listed above…so if nothing else, you have a system that’s running more smoothly as a whole, and now rather than focusing on writing complex, magical code to squeeze the last bit of performance out of your database, you can focus on improving the user experience and adding features to the product…
Honestly, this isn’t the easiest question to answer. For example, Googling “performant search”, three of the top five search results in Google were related to how to make a SQL search faster. This is where I believe investing some of your time into things like podcasts such as https://www.codingblocks.net, browsing hacker news https://news.ycombinator.com/, visiting Apache’s open source projects page organized by category https://projects.apache.org/projects.html?category, or attending conferences, meetups, etc. These are all ways to grow your awareness and ultimately help yourself, your team and your organization.
I’m not suggesting you should always be looking to install new products. I AM suggesting that you should take some time to keep yourself in-tune with what’s out in the ecosystem so you can, at the very least, know HOW challenges similar to yours were solved and WHY they were addressed in the ways they were. Sometime it makes sense to roll your own and other times it makes more sense to add another tool to your tool-belt. When you’ve found that right balance, you’ve grown as a professional.