DEV Community

Cover image for 10 Things I Hate About SQL

10 Things I Hate About SQL

Shaker Islam on February 02, 2021

Let’s clear the air quick — SQL is awesome. It’s the most consistently used query language for well over two decades. There’s a ton of love for SQL...
Collapse
 
geraldew profile image
geraldew

On to #3 Aliases don’t work after the Select block

This is very much a matter of specific dialects - therefore of their engines and parsers.
For some, what you have there would be allowed and would do what you expect, but for others not.

In my current workplace:

  • for Teradata SQL that would work
  • for HiveQL it would not.

For HiveQL the problem seems to me to be the parser, as it's quite trivial to wrap the SELECT as a derived table and then do same WHERE clause. When used with the Hive-on-Spark engine, I'm fairly sure that the underlying engine does it in the one pass.

Which is to say that often the rant is about the implementation not the SQL language (but once again, more on this later).

Interestingly, your comment that "Some SQL engines will allow you to use having here instead" - that's not something I've encountered so I'll assume you're correct. I've only ever used HAVING with a GROUP BY.

  • (aside: gosh aren't these uppercase keywords handy when talking _about_SQL ?)

Maybe this is a good point to bring in that modern SQL uses at least three variants of the WHERE clause. There's:

  • WHERE
  • HAVING
  • QUALIFY

where each applies to a separate layer/round of under-the-hood processing. The overview being:

  • WHERE - applies first, filtering rows
  • HAVING - applies next, after the aggregations, typically for the rollups of a GROUP BY
  • QUALIFY - applies last, after the OLAP/analytical/windowing functions have been derived

Note: I'm deliberately leaving out the use of WHERE as a de-facto JOIN controller. That's a topic really about JOINs and has even more issues about dialect, parser and engine variations.

Again, dialects vary: in HiveQL you get the windowing functions but not the QUALIFY clause forcing another round of a table-within-a-table.

  • * I use the old fashioned term of "derived table" for the replacement of a FROM clause with an inner SELECT. The modern trend is to do these using a WITH clause instead. There's much more to it than that, so that's yet another topic.

p.s. predating the use of AS for aliases, some dialects had a (NAMED) mechanism. Strangely these applied at the text level before any real SQL parsing. This would let you specify a name for something inside an expression and re-use it elsewhere in the query - thus only typing it once and no risk of typos in the additional uses. I know dialects where the AS also works like a copy-paste "macro" rather than indicating something should be calculated once at run-time and then re-used rather than independently re-derived. The irony is that by now most engines have optimisers that will spot the repeated derivations regardless of whether they have the same alias or not.

Collapse
 
shaqq profile image
Shaker Islam

Thanks for the detailed explanation, though I believe we agree that this still hearkens back to #10.

Collapse
 
geraldew profile image
geraldew

Ah, nothing quite like a rant to air some heartfelt observations. I approve of the honesty you've expressed here so I'll try to merely give views based on my experience. I'll apologise ahead if I come across as too "teachy".

With a list of ten things there, I'll use a series of response posts wherever I think I have something to say rather than attempting it all in one go.

So, #1 SQL Syntax

I think it's fair to say that the syntax of SQL is a bit of a mess. It's clearly just evolved and meandered over the years and we're probably lucky that it has not become utterly dreadful.

That said:

  • I can't think of any languages that tolerate lists with a dangling comma - with this applying to both your examples;
  • the usage of single and double quotes took some time to evolve into the semi-standard of now. Their usage varies a lot among programming languages so there's no single comparison to make with all those. I still deal daily with two dialects with contrasting approaches (more on this later). In quite a few dialects they were interchangeable. The main need for them comes from two issues:
    • not really having reserved words - that would disallow a column named "from" or "table" - therefore the quotes let you be clear that you didn't mean the reserved word;
    • allowing for column/table names with spaces in them (me, I would have just said No to that, but once any of them allowed it then the others would all need to as well)* ;
  • regarding multiple WHERE clauses I'm guessing you think it's obvious that having more than would be an AND combination. From where I sit, neither AND nor OR are more obvious than the other, so I'm happy we have to be clear about which is being used. As we'll see later, WHERE has a more distinct meaning than just limiting things.

Expressions of angst are one thing, but declaring features you don't (yet) like as "errors" is overreach when you're not showing how you'll avoid inconsistencies. For example you don't indicate which interpretation you think should be made for your dangling commas. I can easily imagine two versions of you with different obvious opinions:

  • In one, the dangling comma should just be ignored as if you didn't mean it;
  • In another, it obviously implies a NULL expression as an extra column.

The funny thing, in teaching many people SQL over the years it is the latter that I've had instructees suggest as obvious - often when the SELECT was just for supplying an INSERT - and who then disliked the bother of doing CAST( NULL AS INTEGER) such as is often required by various parsers or engines.

As for "syntax rules also don’t add much" I would mainly say - they add clarity (to a degree anyway). When you work with millions (and even billions) of rows, that clarity is more than just handy.

p.s. actually there are some gnarlier issues that you may not have encountered yet. I have a habit of putting CASE expressions inside their own brackets () because some dialects can't parse them when nested inside more expressions e.g. 3 + ( CASE WHEN col = 'A' THEN 1 ELSE 0 END )

  • * a curious reality about SQL is that it's not unusual to be using dialect A to deal with tables/columns constructed with a dialect B. While the early waves of relational databases locked their tables within their own engines, soon APIs and drivers enabled engines to access each other's tables. This isn't just a problem about databases, one runs into similar problems with file systems - a reason to use Linux to fix Windows systems is to avoid the latter having some impossible filenames (a whole other topic that).
Collapse
 
shaqq profile image
Shaker Islam

Hi @geraldew - thanks for the detailed responses! I'll try to answer them one by one, but I want to mention that I appreciate the conversation!

Ok here goes

I can't think of any languages that tolerate lists with a dangling comma - with this applying to both your examples

Are you talking about Query languages, specifically? Because this is a feature in many modern programming languages, including Ruby & Python

the usage of single and double quotes took some time to evolve into the semi-standard of now. Their usage varies a lot among programming languages so there's no single comparison to make with all those. I still deal daily with two dialects with contrasting approaches (more on this later). In quite a few dialects they were interchangeable. The main need for them comes from two issues:

I agree there is a need to specify identifiers, and I do believe that backticks would be much more useful than double quotes. Theoretically this type of feature could be behind a feature-flag to preserve backwards compat.

Expressions of angst are one thing, but declaring features you don't (yet) like as "errors" is overreach when you're not showing how you'll avoid inconsistencies. For example you don't indicate which interpretation you think should be made for your dangling commas. I can easily imagine two versions of you with different obvious opinions:

The fix here would be to ignore everything past the dangling comma. This is the expected behavior in many modern programming languages. Expecting a dangling comma to have a null in either a CTE or select block seems very strange and unexpected.

As for "syntax rules also don’t add much" I would mainly say - they add clarity (to a degree anyway). When you work with millions (and even billions) of rows, that clarity is more than just handy.

I do believe there's a tradeoff between clarity and UX. In this case, I believe the tradeoffs for better usability are worth it.

Collapse
 
geraldew profile image
geraldew

Number 10 SQL implementations are inconsistent

I've jumped to this one because a couple of my other responses alluded to it.

In a rough sense, you're completely right on this point. The variations are maddening, and you can be years into using a dialect before realising that you've been presuming it worked the same as some other one before finding there was a slight difference.

However, from my position, where I've been watching some of these evolve over the decades, I can say I'm glad things never got set in stone by the "standard". While there have been inevitable dead ends, by and large it has been the variations and vendor experiments that have let SQL somehow stay SQL and yet gather features that we mostly find useful.

The caveat though, is that SQL has always been the QWERTY of relational scripting work. You couldn't call it a well designed "language" because it simply isn't. As you point out elsewhere, its flaws are such that there's no clear path to "fix" it.

It doesn't even really do "relational" all that well - you'll find criticisms of that kind going a long way back if you search for them. As dialects and engines evolve with new features such as table expressions, it's tempting to say they're drifting even further from being "relational". There's always a lot to argue and discuss.

When I'm teaching colleagues new to it, my first words are usually "SQL is crap language". Like the "standard" keyboard, we all know we should have something better but we persevere with what we have. I don't really see any way of defending that (and the matter of alternative relational languages is yet another topic).

Collapse
 
shaqq profile image
Shaker Islam

However, from my position, where I've been watching some of these evolve over the decades, I can say I'm glad things never got set in stone by the "standard". While there have been inevitable dead ends, by and large it has been the variations and vendor experiments that have let SQL somehow stay SQL and yet gather features that we mostly find useful.

With respect, I can't disagree with this more. Drifting SQL dialects is a massive pain point, one also felt by Java on the JVM, and by browsers in the early days. In the grand scheme of things this really hurts the users and creates a sense of vendor lock-in that isn't necessary.

Collapse
 
geraldew profile image
geraldew

Next, #4. Select statements are not dynamic

Certainly some data engines provide features just like this, so I guess your point is that these have not migrated into such standard use as to be in them all, and to then not be handled with a language standard.

For example, Teradata has two of these:

  • Temporary tables, whose existence only lasts with the session. Disconnect and the table vanishes.
  • Global Temporary tables, whose definition survives across sessions but the data in them vanishes with the sessions.

Neither is the exactly the same as what you want. And all present some form of management issue when what you're writing needs to shift from exploratory code to being a production script of data actions in sequence to run again and again.

e.g. To redo a temporary within a session you still have to do a DROP TABLE to get rid of it before recreating it differently.

Collapse
 
shaqq profile image
Shaker Islam

See complaint #10 😄

It doesn't help if some SQL dialect, somewhere, has a feature. I need it for all of them!

Collapse
 
geraldew profile image
geraldew

Next #2 UPPERCASE

Ah, the classic language argument - and one with no winners.

Except of course, this is not a requirement at all with SQL, it's just a common style. Do what you like, your code will work.

Personally I like it, but then I date from programming in Modula-2 where it was mandatory as well as a time before colour displays and syntax highlighting. In the green-on-black screens and crude dot matrix printouts, the UPPERCASE words stood out nicely.

Maybe that colours my view, but it means with your examples, yes, the one with uppercase keywords was easier for me to read.

I suspect what you want/need is a syntax highlighting editor that does CASE display change along with colour.

And we haven't even discussed line and indent formatting of SQL scripts, which is where the real bun fights take place. I'd say the various habits and preferences of that make the uppercase/lowercase thing look like a mild issue.

Collapse
 
shaqq profile image
Shaker Islam

Of course! This is one of many the "petty" complaints I had.

The usage of uppercase and formatting & such is very subjective. In my view, there's at least a few things to note against the use of uppercase:

  1. Uppercase as a format is much harder to type. Often when we're writing queries, we want to quickly iterate, and even small roadblocks can be frustrating when we want to focus on analysis
  2. While it's true that uppercase was much more common for older programming languages, it's not true for popular programming languages in the past 20 years (that I'm aware of). It's simply not a format that the next generation of analysts & engineers are used to, and we should adjust.
Collapse
 
defaultkramer profile image
Ryan Kramer • Edited

I'm always glad to see people who know SQL well call out its flaws. It is amazing to me that SQL is still the best option we have available to talk to our DBs. Even Typescript has gotten wide adoption as a better Javascript, but for SQL I don't expect any alternative (including my own) to gain wide adoption soon.

You might be interested in my library/language Plisqin, although if you've never used a Lisp-family language it might be too inaccessible: docs.racket-lang.org/plisqin/index...

Items 3 and 6 can be easily solved using a local variable for century like so

(from b 'books
      (define century
        (%%scalar b".publish_year / 100"))
      (select century)
      (%%where century" between 16 and 19"))
Enter fullscreen mode Exit fullscreen mode

Item 4 can be solved in a lot of ways, I think defining a list of column names would be easiest.

Item 5 is trivial -- you simply create a normal Racket procedure which takes whatever variables it needs. This procedure builds and returns the query. Just like a view, you can build a larger query with the result by adding where clauses or joins or whatever you want.

Item 10 is unfortunately a huge pain for me.

One of my biggest gripes about SQL is how often I end up repeating joins. For example, imagine every Book has exactly one Author. If I have a Book and want to talk about the AuthorName, I have to manually include the join from Book to Author every single time! [1] Why should I even have to know that a join is involved? I know that AuthorName is a valid property of a Book, I shouldn't even have to care whether it is a direct property of the Book table or a property of some joined table. Plisqin allows me to define a function "AuthorName given an instance of Book" and then I can reuse that function just the same as I would every other property of Book.

[1]: Yes, I could create a "BookWithAuthor" view, but then maybe I need a "BookWithAuthorAndPublisher" view also, and a "BookWithPublisher" view... The combinations become unmanageable very quickly.

Collapse
 
mhd profile image
Michael Dingler

I can live with uppercase, but then again, I've enjoyed programming Modula-2/Oberon in the past, and it's easy enough to automate it either way. But know what really grinds my gears? Excessive quoting of everything that isn't a keyword. Backtick heaven. (Seems more common in some databases than others)

I'm surprised, too, that there isn't more tooling around SQL-as-a-language, as opposed to specifc RDBMS. Compare it with HTML (another bloated beast). There you can have different syntaxes in template systems (like HAML), and alternate convenient ways to enter things (like Emmet). Not aware of anything like that for SQL. (And no, "expressive ORMS" aren't that, they fall into the same unreadable trap like DSLs that generate HTML. Or JSX, in the worst case).

Collapse
 
alexktracermain profile image
AlexKTracerMain

Thanks, just learning SQL and this was therapeutic. I'm not crazy, SQL is just deeply deeply flawed and extremely hard to learn. No reason it can't more elegantly be like real code where you call other tables you create like re-usable tables. (this is possible, but seems hard/annoying)

The relative immaturity is astonishing for such a massively used 'language'.

Collapse
 
antonio_nakialfirevi_3 profile image
Antonio Nakić-Alfirević • Edited

As a developer, I miss being able to refactor code and navigate to symbols.

For example, if I rename a column, I'd like it to automatically get renamed in any view or stored procedures that reference it. Unfortunately, this does isn't the case. I have to remember to find all places where that column might be used and rename it manually.

Or, if I'm referencing a view in my code and I want to peek its definition, I'd like to be able to press F12 and navigate to it. This is more of a tooling than a language issue, but it's not something I usually see in SQL IDEs.

It just feels like user code in the database is disjointed from the schema.

Collapse
 
rolandbouman profile image
Roland Bouman

While I do think SQL leaves a number of things to desire, I don't see much in the objections you raise here.

Some items have already been adressed by @geraldew , I just wanted to add:

6. No variables in ANSI SQL

There is a good reason for that - variables are an imperative device and in many cases using them would force a definite execution order, which makes it harder (impossible) to optimize the query. This is not unique to SQL, you see this in functional languages as well.

For the same reason, if your SQL dialect lets you use user-defined functions, then they should typically be free of side-effects. If they are not, unexpected (bad) things will happen.

Now, about your example, you wrote:

(publish_year / 100) as century,
case
    when century < 0 then (century * -1) || ' B.C.'
    when century > 0 then century || ' A.D.'
as formatted_century
Enter fullscreen mode Exit fullscreen mode

It now looks like you would need to duplicate the publish_year/100 twice, and this is what you would solve with a variable.

Now, typically you'd want to push this type of formatting to a reporting tool. But if you really want to do it in a query, you'd simply use a built-in date formatting function. And, you'd make your life considerably easier by storing the publish_year as a date.

With that in place, the century expression would become (showing postgres examples):

to_char(publish_date, 'CC')

and

abs(cast(to_char(publish_date, 'CC') as int))||' '||to_char(publish_date, 'B.C.')

Now, you could still nitpick over the duplication of the 'CC' formatting of the publish_date but the point is that usually when people say they need variables in SQL, they really don't, they just need to use built-in functions to do the job for you.
(see: postgresql.org/docs/8.2/functions-...)

Collapse
 
shaqq profile image
Shaker Islam • Edited

Thanks for responding, @rolandbouman

Now, typically you'd want to push this type of formatting to a reporting tool. But if you really want to do it in a query, you'd simply use a built-in date formatting function. And, you'd make your life considerably easier by storing the publish_year as a date.

With respect, this is a classic example of "this isn't the product's problem, it's the user's problem." Yes, of course - users have been dealing with this and using workaround for years now. My question is - can this be better?

variables are an imperative device and in many cases using them would force a definite execution order, which makes it harder (impossible) to optimize the query

You're saying it's impossible to have some sense of variables in SQL, with some sane guardrails, and be able to optimize a query? I doubt that, given that there are SQL dialects that have variables.

If dividing one number by another static number is an execution that must be done before others, I think that's a fine tradeoff for most/all users. There must be a better way.