Let’s clear the air quick — SQL is awesome. It’s the most consistently used query language for well over two decades. There’s a ton of love for SQL from thousands of analysts and developers around the world. And honestly, I probably wouldn’t even have a job if it weren’t for SQL simply based on how ubiquitous it is and how much it helps building products.
So before getting into my rant below, I want to say THANK YOU to the people who’ve worked on SQL and SQL systems. You all must have moved mountains to make it all happen, and I deeply appreciate it.
</appreciation>
<rant>
All that said… SQL can sometimes SUCK. Here’s a bunch of reasons why.
1. SQL Syntax is WAY too strict
Can you spot the errors in the following query?
cool_films as (
select
title,
"Description: " + description,
length,
release_year,
from film
where description like '%cool%'
where title like '%cool%'
),
select * from cool_films
How many did you find? Because there are five errors!
- There’s a missing
with
at the beginning for the common table expression - Trailing comma after
release_year,
- … and another trailing comma after the closing parens
),
- Double quotes don’t refer to strings, they refer to identifiers (usually column identifiers)
- You can’t use
where
twice; you have to useand
if you want to specify more filtering conditions
I guarantee you most people who write SQL for work run into at least one of these errors once a week or more. These errors are incredibly frustrating and they pile onto what is already a high learning curve for SQL.
These syntax rules also don’t add much. Is the above SQL that difficult to read? Or even difficult to parse before executing? No! Let me run my query!
2. UPPERCASE IS JUST GREAT, IT’S SUPER CLEAR
IF YOU’RE LOOKING AT QUERY LIKE THIS:
SELECT
something
FROM a_table
WHERE im_not_yelled_at
ORDER BY my_sanity ASC
DO YOU REALLY FIND THIS MORE READABLE THAN:
select
something
from a_table
where im_not_yelled_at
order by my_sanity asc
WHO WANTS THEIR QUERIES TO YELL AT THEM?
HAVE I MADE MY POINT YET??
3. Aliases don’t work after the Select block
Suppose we have the following query:
select
title,
description,
(publish_year / 100) as century
from books
where century between 16 and 19
This will return an error because the alias century
can’t be used in the where
clause. Some SQL engines will allow you to use having
here instead, but it’s not standard SQL.
The historical reason for this is that everything in the select
block is supposed to happen last — meaning the SQL engine will process the line where century like between 16 and 19
first, and it’s going to be like, “huh, I haven’t seen this century
before!”
I’m no expert on building SQL engines, but I don’t really understand why this has to be a rule. Is there a logic error or such a big performance hit that it’s impossible for SQL engines to look for aliases in the select
clause before executing the where
clause? Is doing a quick scan of some text truly that hard to do?
4. Select statements are not dynamic
When I’m iterating and in the middle of working on a query, I often want to do something like this:
with
something_i_copy_pasted (
select
... 50 different columns
from a_big_table
),
-- probably many other CTEs with complicated joins...
select
*, except column "foo", and format column "bar" better
from something_i_copy_pasted
In these cases, I end up having to copy/paste the select
statements across multiple CTEs and modify them slightly. This ends up being a tedious task when I really want to change something small and see what the results look like.
5. Sharing queries & code is very difficult
The oft recommended way to share SQL query snippets is through Views. I’m sure some teams use Views correctly and don’t run into issues, but I haven’t seen one.
My main issue with Views is that they can’t take input variables. For instance, it’d be amazing to have a View where you can modify a specific where
condition or change/remove a join
, or be able to slot in a CTE somewhere. These types of adjustments are nearly impossible without some very complicated Stored Procedures, which are yet another common source of issues.
Unlike a programming language where I can often copy/paste things from Stack Overflow or use a package manager and utilize a library, in SQL I can do none of those things. There are no SQL “packages” that just do a cross join unnest(x)
, or do some gnarly window functions. Above all, there’s no built-in support for a collaborative codebase — most teams have to roll their own version of shared SQL snippets in Github. All this sucks!
6. No variables in ANSI SQL
Yes yes, variables exist in some SQL dialects, but they’re not standard.
Here’s an example where variables could be helpful:
select
title,
description,
(publish_year / 100) as century,
case
when century < 0 then (century * -1) || ' B.C.'
when century > 0 then century || ' A.D.'
as formatted_century
from books
Alas, aliases can’t be reused! This is another situation where I need CTE, or write some really ugly SQL. In my opinion, both are less than ideal.
7. Bad, cryptic errors
Every time I get an error from my SQL query, I want to put my head through a wall. Often times, the errors should be simple or easy to fix, but SQL error messages give little information on how.
Here’s a forced example:
select
first_name
from actor a
join actor b on a.actor_id = b.actor_id
#=> Query Error: error: column reference "first_name" is ambiguous
The error is clear on what’s wrong — there are two columns called first_name
, and the engine doesn’t know which one you want. Makes sense! But would it be SO HARD to tell me how to fix this simple error?
What I really want is this:
select
first_name
from actor a
join actor b on a.actor_id = b.actor_id
#=> Query Error: Column reference "first_name" is ambiguous. Available columns are a.first_name, b.first_name.
Easy peasy! What about misspellings?
select
first_namee
from actor a
#=> Query Error: error: column "first_namee" does not exist
-- nope! this is what i want:
select
first_namee
from actor a
#=> Query Error: Column "first_namee" does not exist. Did you mean "first_name"?
Did someone say “syntax error”?
select
first_name
from actor a
order by last_name
where first_name = 'Brad'
#=> Query Error: error: syntax error at or near "where"
-- "where" do i begin? no thanks! just tell me what the problem is:
select
first_name
from actor a
order by last_name
where first_name = 'Brad'
#=> Query Error: Syntax error at or near "where". "where" should come before "order by"
I could go on and on.
There are many errors to sift through, and I’m sure some have ambiguous or unclear solutions, and I know some of you will shout “use an IDE!”, etc. etc.
(side note: most SQL IDE’s are terrible, but I digress…)
If SQL engines returned with helpful error messages in the first place, no one would need a workaround, or bug their coworkers, or furiously google for fixable issues. Better error messages would be the single biggest UX improvement of SQL that I can think of.
8. SQL is impossible to debug
In many programming languages, there’s built-in support for a dropping debug statements, stopping execution, and being able to poke around and see what your program is doing. This is immensely helpful whether you’re building something new or debugging a problem in your code.
In SQL, there’s no equivalent. Beginners usually don’t know what to do when they’re faced by query that “isn’t working” or — worse yet — a query that silently gives unexpected results and you didn’t know.
Over time, you build an intuition on how to manually “debug” a query, doing things like:
- Starting with a
limit 1
orwhere id = '123'
to limit your results and iterate quickly - Doing a
left join other on a = other.b where a = null
to surface join issues - Running CTEs one by one so you know how the query is “building” from one CTE to the next
I’m sure there’s other methods of “debugging” out there. But these feel like workarounds to me. What I really want is a way to “step through” my query and see how the SQL engine is processing it, and what types of results it gets along the way.
Believe me, I get that building a debugger into a SQL engine would be ridiculously difficult. You’re usually not running your query directly on the database server — you’re sending your query over the network to be validated, parsed, and executed, often on huge datasets. You can’t quite “peek” into how a query is getting executed like you could in a programming language.
… unless there’s a genius out there who can figure it out! I don’t know what’s actually possible, but any progress here would be an immense help.
9. Locked into relational algebra in ANSI SQL
I have to give credit where credit’s due — relational algebra is great, and it’s really useful for a query language to describe how to slice and dice data.
Sometimes, though, I just want a for loop. Just let me drop into a lambda and modify some things here and there. In a similar vein to my other complaints, functions/lambdas do exist in some SQL dialects, but they’re not part of ANSI SQL.
As a result, when you’re locked into relational algebra, it can often feel like a square peg in a round hole. Funky joins, compounding window functions, complex sub-selects — all of which are valid solutions that can, at the same time, feel very “advanced” and sometimes even hacky, even for a seasoned SQL user.
In reality, I want SQL to be as easy as Excel, or as smooth as dropping into a REPL to dig around what’s possible. Users aren’t going to get their queries right from step one, and the nature of relational algebra is that you’re going to get your queries wrong A LOT before you get them right. It’s so demotivating to go through it, and in reality it doesn’t have to be this way.
10. SQL implementations are inconsistent
Of course, the most common complaint of all.
With PostgreSQL, MySQL, Spark SQL, Presto SQL, SQL Server, Oracle SQL, and probably many others… we just can’t get a break! Each of these SQL implementations have slight idiosyncrasies that make it very painful to transfer queries from one system to another.
This issue truly rears its head when you want to use Google or Stack Overflow for solutions. You can’t always search for “how do I do X in SQL”, because some of the answers may be in a specific SQL dialect and won’t help you. For someone new to SQL, this is brutal, and it can take a long time to resolve simple issues.
I don’t know what the root problem here is. There is indeed an accepted SQL Standard, but nonetheless there are many deviations from it. Perhaps if there was recognized implementation of the SQL Standard as a standalone SQL Parser library that any application can utilize, then the industry could converge on the standard naturally.
But alas, here we are.
My hope is that at least some of this resonated with you. Maybe if enough people feel strongly about what can improve, then SQL really will improve.
We spend so much time about the UX of web and mobile products, but not enough, IMO, on developer or analyst products. One day, though, one day.
Thanks for listening to my TED talk.
Top comments (16)
On to #3 Aliases don’t work after the Select block
This is very much a matter of specific dialects - therefore of their engines and parsers.
For some, what you have there would be allowed and would do what you expect, but for others not.
In my current workplace:
For HiveQL the problem seems to me to be the parser, as it's quite trivial to wrap the SELECT as a derived table and then do same WHERE clause. When used with the Hive-on-Spark engine, I'm fairly sure that the underlying engine does it in the one pass.
Which is to say that often the rant is about the implementation not the SQL language (but once again, more on this later).
Interestingly, your comment that "Some SQL engines will allow you to use having here instead" - that's not something I've encountered so I'll assume you're correct. I've only ever used HAVING with a GROUP BY.
Maybe this is a good point to bring in that modern SQL uses at least three variants of the WHERE clause. There's:
where each applies to a separate layer/round of under-the-hood processing. The overview being:
Note: I'm deliberately leaving out the use of WHERE as a de-facto JOIN controller. That's a topic really about JOINs and has even more issues about dialect, parser and engine variations.
Again, dialects vary: in HiveQL you get the windowing functions but not the QUALIFY clause forcing another round of a table-within-a-table.
p.s. predating the use of AS for aliases, some dialects had a (NAMED) mechanism. Strangely these applied at the text level before any real SQL parsing. This would let you specify a name for something inside an expression and re-use it elsewhere in the query - thus only typing it once and no risk of typos in the additional uses. I know dialects where the AS also works like a copy-paste "macro" rather than indicating something should be calculated once at run-time and then re-used rather than independently re-derived. The irony is that by now most engines have optimisers that will spot the repeated derivations regardless of whether they have the same alias or not.
Thanks for the detailed explanation, though I believe we agree that this still hearkens back to #10.
Ah, nothing quite like a rant to air some heartfelt observations. I approve of the honesty you've expressed here so I'll try to merely give views based on my experience. I'll apologise ahead if I come across as too "teachy".
With a list of ten things there, I'll use a series of response posts wherever I think I have something to say rather than attempting it all in one go.
So, #1 SQL Syntax
I think it's fair to say that the syntax of SQL is a bit of a mess. It's clearly just evolved and meandered over the years and we're probably lucky that it has not become utterly dreadful.
That said:
Expressions of angst are one thing, but declaring features you don't (yet) like as "errors" is overreach when you're not showing how you'll avoid inconsistencies. For example you don't indicate which interpretation you think should be made for your dangling commas. I can easily imagine two versions of you with different obvious opinions:
The funny thing, in teaching many people SQL over the years it is the latter that I've had instructees suggest as obvious - often when the SELECT was just for supplying an INSERT - and who then disliked the bother of doing
CAST( NULL AS INTEGER)
such as is often required by various parsers or engines.As for "syntax rules also don’t add much" I would mainly say - they add clarity (to a degree anyway). When you work with millions (and even billions) of rows, that clarity is more than just handy.
p.s. actually there are some gnarlier issues that you may not have encountered yet. I have a habit of putting CASE expressions inside their own brackets
()
because some dialects can't parse them when nested inside more expressions e.g. 3 + ( CASE WHEN col = 'A' THEN 1 ELSE 0 END )Hi @geraldew - thanks for the detailed responses! I'll try to answer them one by one, but I want to mention that I appreciate the conversation!
Ok here goes
Are you talking about Query languages, specifically? Because this is a feature in many modern programming languages, including Ruby & Python
I agree there is a need to specify identifiers, and I do believe that backticks would be much more useful than double quotes. Theoretically this type of feature could be behind a feature-flag to preserve backwards compat.
The fix here would be to ignore everything past the dangling comma. This is the expected behavior in many modern programming languages. Expecting a dangling comma to have a
null
in either a CTE orselect
block seems very strange and unexpected.I do believe there's a tradeoff between clarity and UX. In this case, I believe the tradeoffs for better usability are worth it.
Number 10 SQL implementations are inconsistent
I've jumped to this one because a couple of my other responses alluded to it.
In a rough sense, you're completely right on this point. The variations are maddening, and you can be years into using a dialect before realising that you've been presuming it worked the same as some other one before finding there was a slight difference.
However, from my position, where I've been watching some of these evolve over the decades, I can say I'm glad things never got set in stone by the "standard". While there have been inevitable dead ends, by and large it has been the variations and vendor experiments that have let SQL somehow stay SQL and yet gather features that we mostly find useful.
The caveat though, is that SQL has always been the QWERTY of relational scripting work. You couldn't call it a well designed "language" because it simply isn't. As you point out elsewhere, its flaws are such that there's no clear path to "fix" it.
It doesn't even really do "relational" all that well - you'll find criticisms of that kind going a long way back if you search for them. As dialects and engines evolve with new features such as table expressions, it's tempting to say they're drifting even further from being "relational". There's always a lot to argue and discuss.
When I'm teaching colleagues new to it, my first words are usually "SQL is crap language". Like the "standard" keyboard, we all know we should have something better but we persevere with what we have. I don't really see any way of defending that (and the matter of alternative relational languages is yet another topic).
With respect, I can't disagree with this more. Drifting SQL dialects is a massive pain point, one also felt by Java on the JVM, and by browsers in the early days. In the grand scheme of things this really hurts the users and creates a sense of vendor lock-in that isn't necessary.
Next, #4. Select statements are not dynamic
Certainly some data engines provide features just like this, so I guess your point is that these have not migrated into such standard use as to be in them all, and to then not be handled with a language standard.
For example, Teradata has two of these:
Neither is the exactly the same as what you want. And all present some form of management issue when what you're writing needs to shift from exploratory code to being a production script of data actions in sequence to run again and again.
e.g. To redo a temporary within a session you still have to do a DROP TABLE to get rid of it before recreating it differently.
See complaint #10 😄
It doesn't help if some SQL dialect, somewhere, has a feature. I need it for all of them!
Next #2 UPPERCASE
Ah, the classic language argument - and one with no winners.
Except of course, this is not a requirement at all with SQL, it's just a common style. Do what you like, your code will work.
Personally I like it, but then I date from programming in Modula-2 where it was mandatory as well as a time before colour displays and syntax highlighting. In the green-on-black screens and crude dot matrix printouts, the UPPERCASE words stood out nicely.
Maybe that colours my view, but it means with your examples, yes, the one with uppercase keywords was easier for me to read.
I suspect what you want/need is a syntax highlighting editor that does CASE display change along with colour.
And we haven't even discussed line and indent formatting of SQL scripts, which is where the real bun fights take place. I'd say the various habits and preferences of that make the uppercase/lowercase thing look like a mild issue.
Of course! This is one of many the "petty" complaints I had.
The usage of uppercase and formatting & such is very subjective. In my view, there's at least a few things to note against the use of uppercase:
I'm always glad to see people who know SQL well call out its flaws. It is amazing to me that SQL is still the best option we have available to talk to our DBs. Even Typescript has gotten wide adoption as a better Javascript, but for SQL I don't expect any alternative (including my own) to gain wide adoption soon.
You might be interested in my library/language Plisqin, although if you've never used a Lisp-family language it might be too inaccessible: docs.racket-lang.org/plisqin/index...
Items 3 and 6 can be easily solved using a local variable for
century
like soItem 4 can be solved in a lot of ways, I think defining a list of column names would be easiest.
Item 5 is trivial -- you simply create a normal Racket procedure which takes whatever variables it needs. This procedure builds and returns the query. Just like a view, you can build a larger query with the result by adding where clauses or joins or whatever you want.
Item 10 is unfortunately a huge pain for me.
One of my biggest gripes about SQL is how often I end up repeating joins. For example, imagine every Book has exactly one Author. If I have a Book and want to talk about the AuthorName, I have to manually include the join from Book to Author every single time! [1] Why should I even have to know that a join is involved? I know that AuthorName is a valid property of a Book, I shouldn't even have to care whether it is a direct property of the Book table or a property of some joined table. Plisqin allows me to define a function "AuthorName given an instance of Book" and then I can reuse that function just the same as I would every other property of Book.
[1]: Yes, I could create a "BookWithAuthor" view, but then maybe I need a "BookWithAuthorAndPublisher" view also, and a "BookWithPublisher" view... The combinations become unmanageable very quickly.
I can live with uppercase, but then again, I've enjoyed programming Modula-2/Oberon in the past, and it's easy enough to automate it either way. But know what really grinds my gears? Excessive quoting of everything that isn't a keyword. Backtick heaven. (Seems more common in some databases than others)
I'm surprised, too, that there isn't more tooling around SQL-as-a-language, as opposed to specifc RDBMS. Compare it with HTML (another bloated beast). There you can have different syntaxes in template systems (like HAML), and alternate convenient ways to enter things (like Emmet). Not aware of anything like that for SQL. (And no, "expressive ORMS" aren't that, they fall into the same unreadable trap like DSLs that generate HTML. Or JSX, in the worst case).
Thanks, just learning SQL and this was therapeutic. I'm not crazy, SQL is just deeply deeply flawed and extremely hard to learn. No reason it can't more elegantly be like real code where you call other tables you create like re-usable tables. (this is possible, but seems hard/annoying)
The relative immaturity is astonishing for such a massively used 'language'.
As a developer, I miss being able to refactor code and navigate to symbols.
For example, if I rename a column, I'd like it to automatically get renamed in any view or stored procedures that reference it. Unfortunately, this does isn't the case. I have to remember to find all places where that column might be used and rename it manually.
Or, if I'm referencing a view in my code and I want to peek its definition, I'd like to be able to press F12 and navigate to it. This is more of a tooling than a language issue, but it's not something I usually see in SQL IDEs.
It just feels like user code in the database is disjointed from the schema.
While I do think SQL leaves a number of things to desire, I don't see much in the objections you raise here.
Some items have already been adressed by @geraldew , I just wanted to add:
6. No variables in ANSI SQL
There is a good reason for that - variables are an imperative device and in many cases using them would force a definite execution order, which makes it harder (impossible) to optimize the query. This is not unique to SQL, you see this in functional languages as well.
For the same reason, if your SQL dialect lets you use user-defined functions, then they should typically be free of side-effects. If they are not, unexpected (bad) things will happen.
Now, about your example, you wrote:
It now looks like you would need to duplicate the publish_year/100 twice, and this is what you would solve with a variable.
Now, typically you'd want to push this type of formatting to a reporting tool. But if you really want to do it in a query, you'd simply use a built-in date formatting function. And, you'd make your life considerably easier by storing the publish_year as a date.
With that in place, the century expression would become (showing postgres examples):
to_char(publish_date, 'CC')
and
abs(cast(to_char(publish_date, 'CC') as int))||' '||to_char(publish_date, 'B.C.')
Now, you could still nitpick over the duplication of the 'CC' formatting of the publish_date but the point is that usually when people say they need variables in SQL, they really don't, they just need to use built-in functions to do the job for you.
(see: postgresql.org/docs/8.2/functions-...)
Thanks for responding, @rolandbouman
With respect, this is a classic example of "this isn't the product's problem, it's the user's problem." Yes, of course - users have been dealing with this and using workaround for years now. My question is - can this be better?
You're saying it's impossible to have some sense of variables in SQL, with some sane guardrails, and be able to optimize a query? I doubt that, given that there are SQL dialects that have variables.
If dividing one number by another static number is an execution that must be done before others, I think that's a fine tradeoff for most/all users. There must be a better way.