I was reading the article, and I thought why I won't do the same experiment to understand the viewpoint of the author.
The IMDB data
On the page where the datasets are documented I already see that the datasets aren't fully normalized.
A few examples:
- in title.basics.tsv is genres a string array. The genres belong in a separate table and there should be a pivot table.
- title.crew.tsv is just a mess of pivot tables. This should be title_directors and title_writers. But these tables aren't needed because of the following dataset.
- title.principals.tsv is also a mess but more subtle. The category column should be in its own table. I would make a title_crew table with the title id, name id, category id, job title and ordering. And also a title_character table with title id, name id and character.
In the article the author was complaining about joins in queries, I don't want to know how tedious he will find it when he sees the fully normalized tables for the datasets.
The main problem for the author and my view
As I understand it the main problem is the transformation from data storage to data display format.
For this data I would use a graph database because it is relationship heavy and the data of the items is most of the times a few fields.
It would be easier to create a single query, but I would not do the JSON transformation in the query.
With this last point I think I got to the main flaw of the post.
If you want to store the data as documents that is OK. There is a database type that allows you to do that.
The thing is that most of the times an application needs multiple forms of the same data for different displays. That is the reason the data is stored in relational and graph database types.
Those databases have their own ways to query the stored data as efficient as possible, and once you commit to a database type you should accept that.
The problem I have with the solution of the author is the use of sub-queries. For a single query in this case I rely on UNION
.
select primary_title, null as director_name, null as player_name, null as character_name
from titles
where tconst = 'tt3890160'
union
select null as primary_title, n.primary_name as director_name, null as player_name, null as character_name
from principals
inner join names n on principals.nconst = n.nconst
where tconst = 'tt3890160'
and category = 'director'
union
select null as primary_title, null as director_name, n.primary_name as player_name, characters as character_name
from principals
inner join names n on principals.nconst = n.nconst
where tconst = 'tt3890160'
and (category = 'actor' or category = 'actress')
This creates rows that the application can filter on the column values to get to the wanted display form.
I assume that the author looks at the data storage with a frontend rendering mindset. And then is the document form the best way, because it can get the data with a single request.
When the frontend is rendered on the server, the single request optimization is not needed and you can query the database the most efficient way for the database type you choose.
The one thing I want you to remember from this post is that the display form should not dictate the storage form.
Top comments (0)