Let’s clear the air quick — SQL is awesome. It’s the most consistently used query language for well over two decades. There’s a ton of love for SQL...
For further actions, you may consider blocking this person and/or reporting abuse
On to #3 Aliases don’t work after the Select block
This is very much a matter of specific dialects - therefore of their engines and parsers.
For some, what you have there would be allowed and would do what you expect, but for others not.
In my current workplace:
For HiveQL the problem seems to me to be the parser, as it's quite trivial to wrap the SELECT as a derived table and then do same WHERE clause. When used with the Hive-on-Spark engine, I'm fairly sure that the underlying engine does it in the one pass.
Which is to say that often the rant is about the implementation not the SQL language (but once again, more on this later).
Interestingly, your comment that "Some SQL engines will allow you to use having here instead" - that's not something I've encountered so I'll assume you're correct. I've only ever used HAVING with a GROUP BY.
Maybe this is a good point to bring in that modern SQL uses at least three variants of the WHERE clause. There's:
where each applies to a separate layer/round of under-the-hood processing. The overview being:
Note: I'm deliberately leaving out the use of WHERE as a de-facto JOIN controller. That's a topic really about JOINs and has even more issues about dialect, parser and engine variations.
Again, dialects vary: in HiveQL you get the windowing functions but not the QUALIFY clause forcing another round of a table-within-a-table.
p.s. predating the use of AS for aliases, some dialects had a (NAMED) mechanism. Strangely these applied at the text level before any real SQL parsing. This would let you specify a name for something inside an expression and re-use it elsewhere in the query - thus only typing it once and no risk of typos in the additional uses. I know dialects where the AS also works like a copy-paste "macro" rather than indicating something should be calculated once at run-time and then re-used rather than independently re-derived. The irony is that by now most engines have optimisers that will spot the repeated derivations regardless of whether they have the same alias or not.
Thanks for the detailed explanation, though I believe we agree that this still hearkens back to #10.
Ah, nothing quite like a rant to air some heartfelt observations. I approve of the honesty you've expressed here so I'll try to merely give views based on my experience. I'll apologise ahead if I come across as too "teachy".
With a list of ten things there, I'll use a series of response posts wherever I think I have something to say rather than attempting it all in one go.
So, #1 SQL Syntax
I think it's fair to say that the syntax of SQL is a bit of a mess. It's clearly just evolved and meandered over the years and we're probably lucky that it has not become utterly dreadful.
That said:
Expressions of angst are one thing, but declaring features you don't (yet) like as "errors" is overreach when you're not showing how you'll avoid inconsistencies. For example you don't indicate which interpretation you think should be made for your dangling commas. I can easily imagine two versions of you with different obvious opinions:
The funny thing, in teaching many people SQL over the years it is the latter that I've had instructees suggest as obvious - often when the SELECT was just for supplying an INSERT - and who then disliked the bother of doing
CAST( NULL AS INTEGER)
such as is often required by various parsers or engines.As for "syntax rules also don’t add much" I would mainly say - they add clarity (to a degree anyway). When you work with millions (and even billions) of rows, that clarity is more than just handy.
p.s. actually there are some gnarlier issues that you may not have encountered yet. I have a habit of putting CASE expressions inside their own brackets
()
because some dialects can't parse them when nested inside more expressions e.g. 3 + ( CASE WHEN col = 'A' THEN 1 ELSE 0 END )Hi @geraldew - thanks for the detailed responses! I'll try to answer them one by one, but I want to mention that I appreciate the conversation!
Ok here goes
Are you talking about Query languages, specifically? Because this is a feature in many modern programming languages, including Ruby & Python
I agree there is a need to specify identifiers, and I do believe that backticks would be much more useful than double quotes. Theoretically this type of feature could be behind a feature-flag to preserve backwards compat.
The fix here would be to ignore everything past the dangling comma. This is the expected behavior in many modern programming languages. Expecting a dangling comma to have a
null
in either a CTE orselect
block seems very strange and unexpected.I do believe there's a tradeoff between clarity and UX. In this case, I believe the tradeoffs for better usability are worth it.
Number 10 SQL implementations are inconsistent
I've jumped to this one because a couple of my other responses alluded to it.
In a rough sense, you're completely right on this point. The variations are maddening, and you can be years into using a dialect before realising that you've been presuming it worked the same as some other one before finding there was a slight difference.
However, from my position, where I've been watching some of these evolve over the decades, I can say I'm glad things never got set in stone by the "standard". While there have been inevitable dead ends, by and large it has been the variations and vendor experiments that have let SQL somehow stay SQL and yet gather features that we mostly find useful.
The caveat though, is that SQL has always been the QWERTY of relational scripting work. You couldn't call it a well designed "language" because it simply isn't. As you point out elsewhere, its flaws are such that there's no clear path to "fix" it.
It doesn't even really do "relational" all that well - you'll find criticisms of that kind going a long way back if you search for them. As dialects and engines evolve with new features such as table expressions, it's tempting to say they're drifting even further from being "relational". There's always a lot to argue and discuss.
When I'm teaching colleagues new to it, my first words are usually "SQL is crap language". Like the "standard" keyboard, we all know we should have something better but we persevere with what we have. I don't really see any way of defending that (and the matter of alternative relational languages is yet another topic).
With respect, I can't disagree with this more. Drifting SQL dialects is a massive pain point, one also felt by Java on the JVM, and by browsers in the early days. In the grand scheme of things this really hurts the users and creates a sense of vendor lock-in that isn't necessary.
Next, #4. Select statements are not dynamic
Certainly some data engines provide features just like this, so I guess your point is that these have not migrated into such standard use as to be in them all, and to then not be handled with a language standard.
For example, Teradata has two of these:
Neither is the exactly the same as what you want. And all present some form of management issue when what you're writing needs to shift from exploratory code to being a production script of data actions in sequence to run again and again.
e.g. To redo a temporary within a session you still have to do a DROP TABLE to get rid of it before recreating it differently.
See complaint #10 😄
It doesn't help if some SQL dialect, somewhere, has a feature. I need it for all of them!
Next #2 UPPERCASE
Ah, the classic language argument - and one with no winners.
Except of course, this is not a requirement at all with SQL, it's just a common style. Do what you like, your code will work.
Personally I like it, but then I date from programming in Modula-2 where it was mandatory as well as a time before colour displays and syntax highlighting. In the green-on-black screens and crude dot matrix printouts, the UPPERCASE words stood out nicely.
Maybe that colours my view, but it means with your examples, yes, the one with uppercase keywords was easier for me to read.
I suspect what you want/need is a syntax highlighting editor that does CASE display change along with colour.
And we haven't even discussed line and indent formatting of SQL scripts, which is where the real bun fights take place. I'd say the various habits and preferences of that make the uppercase/lowercase thing look like a mild issue.
Of course! This is one of many the "petty" complaints I had.
The usage of uppercase and formatting & such is very subjective. In my view, there's at least a few things to note against the use of uppercase:
I'm always glad to see people who know SQL well call out its flaws. It is amazing to me that SQL is still the best option we have available to talk to our DBs. Even Typescript has gotten wide adoption as a better Javascript, but for SQL I don't expect any alternative (including my own) to gain wide adoption soon.
You might be interested in my library/language Plisqin, although if you've never used a Lisp-family language it might be too inaccessible: docs.racket-lang.org/plisqin/index...
Items 3 and 6 can be easily solved using a local variable for
century
like soItem 4 can be solved in a lot of ways, I think defining a list of column names would be easiest.
Item 5 is trivial -- you simply create a normal Racket procedure which takes whatever variables it needs. This procedure builds and returns the query. Just like a view, you can build a larger query with the result by adding where clauses or joins or whatever you want.
Item 10 is unfortunately a huge pain for me.
One of my biggest gripes about SQL is how often I end up repeating joins. For example, imagine every Book has exactly one Author. If I have a Book and want to talk about the AuthorName, I have to manually include the join from Book to Author every single time! [1] Why should I even have to know that a join is involved? I know that AuthorName is a valid property of a Book, I shouldn't even have to care whether it is a direct property of the Book table or a property of some joined table. Plisqin allows me to define a function "AuthorName given an instance of Book" and then I can reuse that function just the same as I would every other property of Book.
[1]: Yes, I could create a "BookWithAuthor" view, but then maybe I need a "BookWithAuthorAndPublisher" view also, and a "BookWithPublisher" view... The combinations become unmanageable very quickly.
I can live with uppercase, but then again, I've enjoyed programming Modula-2/Oberon in the past, and it's easy enough to automate it either way. But know what really grinds my gears? Excessive quoting of everything that isn't a keyword. Backtick heaven. (Seems more common in some databases than others)
I'm surprised, too, that there isn't more tooling around SQL-as-a-language, as opposed to specifc RDBMS. Compare it with HTML (another bloated beast). There you can have different syntaxes in template systems (like HAML), and alternate convenient ways to enter things (like Emmet). Not aware of anything like that for SQL. (And no, "expressive ORMS" aren't that, they fall into the same unreadable trap like DSLs that generate HTML. Or JSX, in the worst case).
Thanks, just learning SQL and this was therapeutic. I'm not crazy, SQL is just deeply deeply flawed and extremely hard to learn. No reason it can't more elegantly be like real code where you call other tables you create like re-usable tables. (this is possible, but seems hard/annoying)
The relative immaturity is astonishing for such a massively used 'language'.
As a developer, I miss being able to refactor code and navigate to symbols.
For example, if I rename a column, I'd like it to automatically get renamed in any view or stored procedures that reference it. Unfortunately, this does isn't the case. I have to remember to find all places where that column might be used and rename it manually.
Or, if I'm referencing a view in my code and I want to peek its definition, I'd like to be able to press F12 and navigate to it. This is more of a tooling than a language issue, but it's not something I usually see in SQL IDEs.
It just feels like user code in the database is disjointed from the schema.
While I do think SQL leaves a number of things to desire, I don't see much in the objections you raise here.
Some items have already been adressed by @geraldew , I just wanted to add:
6. No variables in ANSI SQL
There is a good reason for that - variables are an imperative device and in many cases using them would force a definite execution order, which makes it harder (impossible) to optimize the query. This is not unique to SQL, you see this in functional languages as well.
For the same reason, if your SQL dialect lets you use user-defined functions, then they should typically be free of side-effects. If they are not, unexpected (bad) things will happen.
Now, about your example, you wrote:
It now looks like you would need to duplicate the publish_year/100 twice, and this is what you would solve with a variable.
Now, typically you'd want to push this type of formatting to a reporting tool. But if you really want to do it in a query, you'd simply use a built-in date formatting function. And, you'd make your life considerably easier by storing the publish_year as a date.
With that in place, the century expression would become (showing postgres examples):
to_char(publish_date, 'CC')
and
abs(cast(to_char(publish_date, 'CC') as int))||' '||to_char(publish_date, 'B.C.')
Now, you could still nitpick over the duplication of the 'CC' formatting of the publish_date but the point is that usually when people say they need variables in SQL, they really don't, they just need to use built-in functions to do the job for you.
(see: postgresql.org/docs/8.2/functions-...)
Thanks for responding, @rolandbouman
With respect, this is a classic example of "this isn't the product's problem, it's the user's problem." Yes, of course - users have been dealing with this and using workaround for years now. My question is - can this be better?
You're saying it's impossible to have some sense of variables in SQL, with some sane guardrails, and be able to optimize a query? I doubt that, given that there are SQL dialects that have variables.
If dividing one number by another static number is an execution that must be done before others, I think that's a fine tradeoff for most/all users. There must be a better way.