Or How I Spent an Afternoon Building Out More Creases for Extensibility in Forem’s Feed Query
In DEV’s relevance feed, we utilize the Articles::Feeds::VariantQuery to perform the correct query for the current DEV feed experiment variants. For those curious, you can look to at the ./config/field_test.yml at github.com/forem/forem.
For each variant, we choose a order by lever. we had two order by levers:
relevancy_score_and_publication_date
final_order_by_random_weighted_to_score
The one I want to write about is the final_order_by_random_weighted_to_score
. It had the following SQL fragment: RANDOM() ^ (1.0 / greatest(articles.score, 0.1)) DESC
. Let’s ignore everything except the RANDOM()
.
I wrote the following Github issue to enumerate the problem: As a person who wants to explore different sort orders I want a bit of control over fully random numbers.
The consequence of having this RANDOM()
is time you run the query, you get a new random number. The impact is that depending on the use of RANDOM()
when you refresh the page you could see significantly different results.
Yet there is utility in randomization. What to do?
A BASIC Sidebar
In junior high, my dad got a new computer for the family. I started writing some BASIC programs to simulate a simple siege game I had created. The castle defenders attempted to ward off the assailants. It involved lots of randomization. Every time I ran the simulation, I got the exact same result.
What I didn’t realize was that I needed to first seed the randomizer with a random seed. Without that randomized seed, my version of BASIC used the same seed. This meant that when I called the RANDOM
function in BASIC, the sequence was always the same.
My dice were always going to return the sequence of results. But this was well before Google and household internet. So I remained stumped until a year later when I stumbled upon a BASIC manual and had an aha moment.
Back to the Query at Hand
I had two task at hand:
- Create a result set that had a repeatable sequence of random numbers.
- “Join” that result set into the existing
Articles::Feeds::VariantQuery
implementation.
When working with SQL, I always strive to build from the inside out.
Create a Result Set That Had a Repeatable Sequence of Random Numbers
To explore this, I set about writing the simplest query I could make. After some fits and starts, I settled on the following query.
WITH seeder AS (SELECT setseed(0.5))
SELECT *, RANDOM() AS randomized_value
FROM generate_series(1,10), seeder
The above query is logically equivalent to the query below.
SELECT *, RANDOM() AS randomized_value
FROM generate_series(1,10),
(SELECT setseed(0.5)) as seeder
Both have the result set of:
generate_series | setseed | randomized_value |
---|---|---|
1 | empty string | 0.2499104186659835 |
2 | empty string | 0.520017612227381 |
3 | empty string | 0.4611753978720401 |
4 | empty string | 0.22773722382948947 |
5 | empty string | 0.18489966987045392 |
6 | empty string | 0.06851542705898694 |
7 | empty string | 0.12457768622157062 |
8 | empty string | 0.2790778552363484 |
9 | empty string | 0.14696090303700515 |
10 | empty string | 0.6336384228449994 |
The above queries may warrant a bit of discussion. First, it is rare for me to SELECT FROM
two “tables” without a join condition. The generate_series
and setseed
column come from the generate_series(1,10)
and seeder
“tables.” The randomized_value
is the result of the RANDOM()
function.
Now, each time I run it with setseed(0.5)
I get the same randomized_value. If I change the 0.5
to 0.75
I get different values, but again each time I run the query with the seed value, I get the same results.
I had my proof of concept. Next came the hard part. Integrating it into the complicated query.
“Join” That Result Set into the Existing Articles::Feeds::Variantquery
Implementation
This next step took a bit. The primary challenge was in thinking about the result sets.
Conceptually there are two queries happening for the feed:
- Inner Query
- Get a subset of relevant articles for the given user.
- Outer Query
- Order that subset of relevant articles for the given user.
To achieve the desired goal, I needed to add the randomized_value to the list of “columns” for the inner query of relevant articles. With that, we could then do the sorting.
My implementation constraint is ensuring that the the outer query would be an ActiveRecord::Relation
; In sticking to that constraint, I could then use ActiveRecord’s query interface to chain includes
, order
, where
statements. Rather powerful.
My solution was to insert a middle query. Again, the inner query is responsible for retrieving the relevant subset. Then the middle query is responsible for appending the randomized_value
column to the relevant subset. And finally the outer query could then sort using this new randomized_value
instead of RANDOM()
.
Conclusion
When I started writing this post, I thought I might dump a bunch of SQL. But I did some grooming to highlight the difference.
The original query without the seeded randomizer
SELECT "articles".*
FROM "articles"
INNER JOIN (
SELECT articles.id, 1 AS relevancy_score
FROM articles
LEFT OUTER JOIN user_blocks
ON user_blocks.blocked_id = articles.user_id
AND user_blocks.blocked_id IS NULL
AND user_blocks.blocker_id = 322
WHERE articles.published = true
AND articles.published_at > '2022-05-19 20:12:37.777414'
AND articles.published_at < '2022-06-03 20:12:37.777493'
GROUP BY articles.id, articles.published_at
ORDER BY relevancy_score DESC,
articles.published_at DESC
LIMIT 50
) AS article_relevancies
ON articles.id = article_relevancies.id
ORDER BY RANDOM() ^ (1.0 / greatest(articles.score, 0.1)) DESC
The updated query using the seeded randomizer
SELECT "articles".*
FROM "articles"
INNER JOIN (
WITH seeder AS (SELECT setseed(0.5))
SELECT inner_article_relevancies.id,
inner_article_relevancies.relevancy_score,
RANDOM() AS randomized_value
FROM seeder,
(SELECT articles.id, 1 AS relevancy_score
FROM articles
LEFT OUTER JOIN user_blocks
ON user_blocks.blocked_id = articles.user_id
AND user_blocks.blocked_id IS NULL
AND user_blocks.blocker_id = 322
WHERE articles.published = true
AND articles.published_at > '2022-05-19 20:12:37.777414'
AND articles.published_at < '2022-06-03 20:12:37.777493'
GROUP BY articles.id, articles.published_at
ORDER BY relevancy_score DESC,
articles.published_at DESC
LIMIT 50) AS inner_article_relevancies
) AS article_relevancies
ON articles.id = article_relevancies.id
ORDER BY article_relevancies.randomized_value ^
(1.0 / greatest(articles.score, 0.1)) DESC
The main difference between the two? The randomization moves from the outer query to the middle query.
And for those curious, below is the corresponding pull request that incorporates this logic into the feed.
Adding ability to "seed" feed's SQL randomization #17827
What type of PR is this? (check all applicable)
- [x] Feature
Description
The commit includes three changes that work to allow us to "seed" the randomization. When using the same seed, and randomizer should/must return the same sequence of numbers.
First, I added the "seed" parameter to the variant query. If you want the same sequence of random numbers from query to query, pass the same seed. Otherwise, we'll use a Ruby generated random seed.
Second, I added a virtual column to the virtual article_relevancies
table. If you provide a seed, and call the query twice, the first row in each of article_relevancies
will have the same randomized_value
, likewise the second row, etc.
Third, I amended the existing order_by_lever
that had a RANDOM()
postgresql function call. I replaced that with the randomized_value
which is computed based on the provided seed.
Fundamentally the idea is to allow for randomness but introduce possible repeatability; a hard thing considering that some of the inner selection criteria is not fixed (e.g. number of reactions can change from moment to moment).
I wrote a complimentary blog post to further explain what's happening.
Related Tickets & Documents
Related to forem/forem#17826
QA Instructions, Screenshots, Recordings
Please read the above description and complimentary blog post and ask clarifying questions.
UI accessibility concerns?
Added/updated tests?
- [x] No, and this is why: the tests already verify the changed behavior.
[Forem core team only] How will this change be communicated?
- [x] I will share this change internally with the appropriate teams
Top comments (2)
Nice!!!
Very interesting!