david duymelinck

Posted on Sep 7

My 2 cents on "SQL needed structure"

#webdev #sql #database

I was reading the article and thought: why not run the same experiment to understand the author's viewpoint?

The IMDB data

On the page where the datasets are documented I already see that the datasets aren't fully normalized.

A few examples:

In title.basics.tsv, genres is a string array. The genres belong in a separate table, linked to the titles through a pivot table.
title.crew.tsv is just a mess of pivot tables. It should be split into title_directors and title_writers, although these tables aren't needed because of the next dataset.
title.principals.tsv is also a mess, but a more subtle one. The category column should be in its own table. I would make a title_crew table with the title id, name id, category id, job title and ordering, and a title_character table with the title id, name id and character.
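
To make the list above concrete, here is a minimal sketch of what those normalized tables could look like in SQLite, driven from Python. Every table and column name beyond the ones mentioned above (genres, categories, character_name, and so on) is my own assumption for illustration, and add_genres assumes the genres value arrives as a comma-separated string.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions,
# not the layout of the IMDB files or of the article's database.
SCHEMA = """
create table titles (tconst text primary key, primary_title text not null);
create table names  (nconst text primary key, primary_name text not null);

-- genres leave the string array: one lookup table plus a pivot table
create table genres (id integer primary key, name text not null unique);
create table title_genres (
    tconst   text    not null references titles (tconst),
    genre_id integer not null references genres (id),
    primary key (tconst, genre_id)
);

-- the category column gets its own table
create table categories (id integer primary key, name text not null unique);
create table title_crew (
    tconst      text    not null references titles (tconst),
    nconst      text    not null references names (nconst),
    category_id integer not null references categories (id),
    job_title   text,
    ordering    integer not null
);
create table title_character (
    tconst         text not null references titles (tconst),
    nconst         text not null references names (nconst),
    character_name text not null
);
"""


def create_database():
    """Return an in-memory SQLite connection with the sketch schema applied."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_genres(conn, tconst, genres_field):
    """Split one genres value (assumed comma-separated) into pivot rows."""
    for genre in genres_field.split(","):
        # Reuse an existing genre row if the name is already known.
        conn.execute("insert or ignore into genres (name) values (?)", (genre,))
        conn.execute(
            "insert into title_genres (tconst, genre_id) "
            "select ?, id from genres where name = ?",
            (tconst, genre),
        )
```

Even this small sketch needs two joins to get from a title to its genre names, which is exactly the kind of query the article objects to.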

In the article the author complained about the joins in the queries. I don't want to know how tedious he would find it to query the fully normalized tables for these datasets.

The main problem for the author and my view

As I understand it the main problem is the transformation from data storage to data display format.

For this data I would use a graph database, because it is relationship heavy and each item usually has only a few fields.
It would be easier to create a single query, but I would not do the JSON transformation in the query.

With this last point I think I got to the main flaw of the post.
If you want to store the data as documents that is OK. There is a database type that allows you to do that.
The thing is that most of the time an application needs multiple forms of the same data for different displays. That is the reason the data is stored in relational and graph databases.
Those databases have their own ways to query the stored data as efficiently as possible, and once you commit to a database type you should accept that.

The problem I have with the solution of the author is the use of sub-queries. For a single query in this case I rely on UNION.

select primary_title, null as director_name, null as player_name, null as character_name
from titles
where tconst = 'tt3890160'

union

select null as primary_title, n.primary_name as director_name, null as player_name, null as character_name
from principals
         inner join names n on principals.nconst = n.nconst
where tconst = 'tt3890160'
  and category = 'director'

union

select null as primary_title, null as director_name, n.primary_name as player_name, characters as character_name
from principals
         inner join names n on principals.nconst = n.nconst
where tconst = 'tt3890160'
  and (category = 'actor' or category = 'actress')

This produces rows that the application can filter by column value to build the desired display form.
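
As a rough sketch of that application-side step, the Python below runs the same UNION query against SQLite and folds the rows into one document. The function names, the shape of the resulting dictionary, and the demo rows (tt_demo, nm_demo1, nm_demo2) are made up for illustration; only the table and column names come from the query above.

```python
import sqlite3

# The UNION query from the post, with the title id as a named parameter.
UNION_QUERY = """
select primary_title, null as director_name, null as player_name, null as character_name
from titles
where tconst = :tconst
union
select null, n.primary_name, null, null
from principals
         inner join names n on principals.nconst = n.nconst
where tconst = :tconst
  and category = 'director'
union
select null, null, n.primary_name, characters
from principals
         inner join names n on principals.nconst = n.nconst
where tconst = :tconst
  and (category = 'actor' or category = 'actress')
"""


def demo_connection():
    """In-memory database with placeholder rows (not real IMDB data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table titles (tconst text, primary_title text);
        create table names (nconst text, primary_name text);
        create table principals (tconst text, nconst text, category text, characters text);
        insert into titles values ('tt_demo', 'Demo Film');
        insert into names values ('nm_demo1', 'Demo Director'), ('nm_demo2', 'Demo Actor');
        insert into principals values
            ('tt_demo', 'nm_demo1', 'director', null),
            ('tt_demo', 'nm_demo2', 'actor', 'Demo Character');
    """)
    return conn


def title_document(conn, tconst):
    """Fold the UNION rows into one display document, keyed on which column is set."""
    doc = {"title": None, "directors": [], "cast": []}
    for title, director, player, character in conn.execute(UNION_QUERY, {"tconst": tconst}):
        if title is not None:
            doc["title"] = title
        elif director is not None:
            doc["directors"].append(director)
        elif player is not None:
            doc["cast"].append({"name": player, "character": character})
    return doc
```

Calling title_document(demo_connection(), "tt_demo") gives one dictionary per title. The row order of a UNION is not guaranteed, so sort the lists in the application if the order matters.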

I assume that the author looks at data storage with a frontend rendering mindset. From that angle the document form is the best fit, because the frontend can get the data with a single request.
When the frontend is rendered on the server, the single-request optimization is not needed and you can query the database in the way that is most efficient for the database type you chose.

The one thing I want you to remember from this post is that the display form should not dictate the storage form.

Top comments (4)

Ravavyr • Sep 12 • Edited

Oh, I totally forgot myself. Of course I threw it into ChatGPT to see what it would say:

Flaws in the original query:

Multiple UNION scans: The same principals table is scanned twice, once for directors and once for actors/actresses. That’s redundant.
OR condition (actor or actress): Slows queries; better handled with IN.
NULL placeholders: Forces wide unions that aren’t efficient if the goal is structured data. Returning role + value is cleaner.
No indexing hint: Filtering on tconst and category benefits from a composite index.

This is the optimized query it provided using union all:

select t.primary_title,
null as director_name,
null as player_name,
null as character_name
from titles t
where t.tconst = 'tt3890160'

union all

select null as primary_title,
n.primary_name as director_name,
null as player_name,
null as character_name
from principals p
join names n on p.nconst = n.nconst
where p.tconst = 'tt3890160'
and p.category = 'director'

union all

select null as primary_title,
null as director_name,
n.primary_name as player_name,
p.characters as character_name
from principals p
join names n on p.nconst = n.nconst
where p.tconst = 'tt3890160'
and p.category in ('actor','actress');

However, the query below can get you the same data in a slightly different output format, so you'd need to slightly adjust your processing code.

select t.primary_title,
       case when p.category = 'director' then n.primary_name end as director_name,
       case when p.category in ('actor','actress') then n.primary_name end as player_name,
       case when p.category in ('actor','actress') then p.characters end as character_name
from titles t
         left join principals p on t.tconst = p.tconst
         left join names n on p.nconst = n.nconst
where t.tconst = 'tt3890160';

david duymelinck • Sep 12

I ran the left join query and it took twice as long as the union query.
My database is SQLite, and the tables mirror the IMDB import files, with no indexes.

I agree I didn't tweak the union for maximum performance. But that was not the goal of the union example. The query in the article used subqueries to get to a result set, and those are notoriously slow.

Ravavyr • Sep 12

"display form should not dictate storage form"...
But it's nice when they are very similar, plus if your display object matches something you can easily pull from your storage then the queries will be more optimal too.

Also sometimes a JOIN is slower than just querying two tables at once. It depends on how large the dataset is but a JOIN will be slower when it's more data.

for example:
select null as primary_title, n.primary_name as director_name, null as player_name, null as character_name from principals inner join names n on principals.nconst = n.nconst where tconst = 'tt3890160' and category = 'director'

can be:
SELECT null as primary_title, n.primary_name as director_name, null as player_name, null as character_name FROM principals as p, names as n WHERE p.nconst = n.nconst AND tconst = 'tt3890160' AND category = 'director'

Setting null values does seem like a bad approach, but that's a different discussion as I don't know the reasoning behind it. My focus was just doing a two-table query versus a join where possible.
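
Whether the two forms really behave differently depends on the database engine, and comparing the query plans is a quick way to check. Below is a minimal sketch for SQLite using Python's sqlite3 module. The helper names and the placeholder row are assumptions for illustration, and the select list is trimmed to the one column that matters. SQLite's documentation describes the comma as just another inner join operator, so I'd expect the same plan for both, but treat that as something to verify on your own data.

```python
import sqlite3

# The same lookup written with an explicit join and with a comma-separated table list.
EXPLICIT_JOIN = """
select n.primary_name as director_name
from principals as p inner join names as n on p.nconst = n.nconst
where p.tconst = 'tt3890160' and p.category = 'director'
"""

IMPLICIT_JOIN = """
select n.primary_name as director_name
from principals as p, names as n
where p.nconst = n.nconst and p.tconst = 'tt3890160' and p.category = 'director'
"""


def demo_connection():
    """In-memory tables without indexes, holding one placeholder row each."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table principals (tconst text, nconst text, category text, characters text);
        create table names (nconst text, primary_name text);
        insert into principals values ('tt3890160', 'nm_demo', 'director', null);
        insert into names values ('nm_demo', 'Placeholder Director');
    """)
    return conn


def query_plan(conn, sql):
    """Return the detail column of SQLite's EXPLAIN QUERY PLAN output."""
    return [row[3] for row in conn.execute("explain query plan " + sql)]


def compare(conn):
    """Return both plans side by side so they can be inspected or diffed."""
    return query_plan(conn, EXPLICIT_JOIN), query_plan(conn, IMPLICIT_JOIN)
```

If compare(demo_connection()) returns two identical lists, the engine treats both spellings the same way and any speed difference has to come from somewhere else, such as missing indexes.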

david duymelinck • Sep 12

"But it's nice when they are very similar"

I agree. If the application allows you to store the data as close to the expected output as possible, you should do it.