For years, I wasn’t working with relational databases. But in my heart, I’ve always loved them and thrived when developing with them.
What follows is my iterating through some SQL statements to both better understand the database and also the content. I’ve been using Blazer, as written about on Dev.to by Molly. As part of my job, I have access to the DEV.to environment to run these queries.
One task in front of me is to improve on the Feed Algorithm. To understand that, I need to understand the data. In the current incarnations we rely heavily on two scores:
To understand the algorithms, I need to look at those values. Here’s a query to understand the possible range of scores:
SELECT min(score), max(score), avg(score), stddev(score), percentile_cont(0.5) within GROUP(ORDER BY score) mean, percentile_cont(0.95) within GROUP(ORDER BY score) percentile_95 FROM articles;
There’s a lot of variance in the distribution of the score, and anything we add to that score for sorting will be inconsistent.
SELECT min(hotness_score), max(hotness_score), avg(hotness_score), stddev(hotness_score), percentile_cont(0.5) within GROUP(ORDER BY hotness_score) mean, percentile_cont(0.95) within GROUP(ORDER BY hotness_score) percentile_95 FROM articles;
And the situation is even more swingy with the hotness score. The underlying implementation works for smaller communities, but as DEV.to grows, the swingyness becomes more and more notable.
The particular challenge is that the score or hotness score is the starting point for all feed queries. In otherwords, to get a feed tailored for the user, there’s a lot of variance to overcome.
So I decided to approach from a different angle. What I have is incomplete, but I’m considering the structure as a strong competetor to compete against the current feed algorithm.
First, I wanted to build some successes by writing some quick and easy queries. You know, practice a little bit. In the examples, I have
:user_id as the placeholder for my personal DEV.to account number.
The following query is published articles that exclude those published by people I’ve blocked:
SELECT articles.id, title, published_at FROM articles LEFT OUTER JOIN user_blocks ON user_blocks.blocked_id = articles.user_id AND user_blocks.blocked_id IS NULL AND user_blocks.blocker_id = :user_id WHERE published = true ORDER BY published_at DESC LIMIT 20;
Select the tags that I follow:
SELECT tags.* FROM tags INNER JOIN follows ON tags.id = follows.followable_id AND follows.followable_type = 'ActsAsTaggableOn::Tag' WHERE follows.follower_id = :user_id AND follows.follower_type = 'User';
Select the organizations that I follow:
SELECT organizations.* FROM organizations INNER JOIN follows ON organizations.id = follows.followable_id AND follows.followable_type = 'Organization' WHERE follows.follower_id = :user_id AND follows.follower_type = 'User';
Select the users that I follow:
SELECT users.* FROM users INNER JOIN follows ON users.id = follows.followable_id AND follows.followable_type = 'User' WHERE follows.follower_id = :user_id AND follows.follower_type = 'User';
Then I started doing computations in query. This query creates a freshness score based solely on the publication date and excluding articles from those people I blocked:
SELECT (1 - (current_date - published_at::date) * 0.1) AS freshness_score, articles.title, STRING_AGG(tags.name, ', ') AS tag_names FROM articles LEFT OUTER JOIN user_blocks ON user_blocks.blocked_id = articles.user_id AND user_blocks.blocked_id IS NULL AND user_blocks.blocker_id = :user_id LEFT OUTER JOIN taggings ON taggings.taggable_id = articles.id AND taggable_type = 'Article' INNER JOIN tags ON taggings.tag_id = tags.id WHERE published = true GROUP BY freshness_score, articles.title ORDER BY freshness_score DESC LIMIT 20
You can see the query in Blazer.
Next, I started thinking about what I wanted out of this. My initial brainstorm was that I want to create a scalar value for an article:
- If the article has one tag that I follow, I want to have a
- If it has two or more tags that I follow, I want it to have a
- If it has no articles that I follow, I want it to have a
Using the above, I then want to select all articles that were published within the last 7 days. Articles published today have a score of 1. All other articles have a score of
1 - ((number of days since published) * 0.1). I want to multiply the above scalar value for the article by the score to create a confidence value. And I want the query sorted from highest confidence value to lowest.
In wanted a query that tells me how many of an article’s tags I follow of the top published articles.
The algorithm embedded in the following SQL statement is as follows:
- The highest relevance score is 1.
- Anything posted prior to today has a lower cap. One day old is 0.9, two days old is 0.8, etc.
- If I don’t follow any of the tags, I’m not overly interested. The more tags that I follow that are part on the article, the higher the value.
- If I don’t follow the user, that’s a slight penalty.
- If I don’t follow the organization, that’s a slight penalty.
- Give a bit more weight to someone I follow leaving a comment.
- Give a bit of weight to the # of comments (but don’t over indulge in flame wars)
- Really discredit spam.
The scalars that I chose in the SQL statement are arbitrary, and we’d definitely want to tweak the values. The query’s not yet ready for primetime, but I’m curious to see it’s effect.
WITH top_articles AS ( SELECT articles.id, ( (1 - (current_date - published_at::date) * 0.05) * (CASE articles.spaminess_rating WHEN 0 THEN 1 ELSE 0.05 END) * (CASE ABS(articles.experience_level_rating - 5) WHEN 0 THEN 1 WHEN 1 THEN 0.98 WHEN 2 THEN 0.97 WHEN 3 THEN 0.96 WHEN 4 THEN 0.95 WHEN 5 THEN 0.94 ELSE 0.93 END) * (CASE COUNT(comments.id) WHEN 0 THEN 0.95 WHEN 1 THEN 0.98 WHEN 2 THEN 0.99 ELSE 1 END) * (CASE COUNT(followed_tags.follower_id) WHEN 0 THEN 0.4 WHEN 1 THEN 0.9 ELSE 1 END) * (CASE articles.comments_count WHEN 0 THEN 0.9 WHEN 1 THEN 0.94 WHEN 2 THEN 0.95 WHEN 3 THEN 0.98 WHEN 4 THEN 0.999 ELSE 1 END) * (CASE COUNT(followed_user.follower_id) WHEN 0 THEN 0.8 WHEN 1 THEN 1 ELSE 1 END) * (CASE COUNT(followed_org.follower_id) WHEN 0 THEN 0.95 WHEN 1 THEN 1 ELSE 1 END)) AS relevance_score, articles.experience_level_rating, articles.published_at, articles.comments_count, STRING_AGG(tags.name, ', ') AS tag_names, articles.title FROM articles LEFT OUTER JOIN taggings ON taggings.taggable_id = articles.id AND taggable_type = 'Article' INNER JOIN tags ON taggings.tag_id = tags.id LEFT OUTER JOIN follows AS followed_tags ON tags.id = followed_tags.followable_id AND followed_tags.followable_type = 'ActsAsTaggableOn::Tag' AND followed_tags.follower_type = 'User' AND followed_tags.follower_id = :user_id LEFT OUTER JOIN follows AS followed_user ON articles.user_id = followed_user.followable_id AND followed_user.followable_type = 'User' AND followed_user.follower_id = :user_id AND followed_user.follower_type = 'User' LEFT OUTER JOIN follows AS followed_org ON articles.organization_id = followed_org.followable_id AND followed_org.followable_type = 'Organization' AND followed_org.follower_id = :user_id AND followed_org.follower_type = 'User' LEFT OUTER JOIN comments ON comments.commentable_id = articles.id AND comments.commentable_type = 'Article' AND followed_user.followable_id = comments.user_id AND followed_user.followable_type = 'User' LEFT OUTER JOIN user_blocks ON user_blocks.blocked_id = articles.user_id AND user_blocks.blocked_id IS NULL AND user_blocks.blocker_id = :user_id WHERE published = true GROUP BY articles.id, articles.title, articles.published_at, articles.comments_count, articles.experience_level_rating, articles.spaminess_rating ORDER BY relevance_score DESC, articles.published_at DESC LIMIT 25) SELECT articles.* FROM articles INNER JOIN top_articles ON top_articles.id = articles.id;
The end result is a
relevance_score between 0 and 1 (well I suppose there could be negative numbers but that’s if we have even older articles).
There are more things I want to consider, and could add into the above query:
- I need to account for what to do with moderator’s up and down voting.
- I need to account for folks who’ve given a negative weight to a tag.
I also went ahead and submitted a draft PR to begin having more conversations.