DEV Community: Jérôme Parent-Lévesque

How to safely rename STI models in Rails

Jérôme Parent-Lévesque — Fri, 18 Aug 2023 18:39:32 +0000

In Rails, Single Table Inheritance (STI) models store their full model name (including any module namespaces) in a type column. This column is used by ActiveRecord to determine which model to instantiate when loading a record from the database. This means that renaming such models isn't as easy as just changing the class name; it must also involve a data migration to update the values stored as type. However, how can we safely perform this in a live, production environment?

This is a challenge that we recently ran into at Potloc while working on modularization of our codebase. This involved namespacing all of our models under packs, which meant that STI models' type values also had to be updated.

Last year, Shopify Engineering published a blog post about this same issue (albeit for Polymorphic models) in which they suggest changing entirely what is stored as type in the database. However, they mention that:

Our solution adds complexity. It’s probably not worth it for most use cases

And this was indeed how we felt for our use case. We wanted to perform this in a way that would have no impact on the way Rails works, and all while having zero downtime.

The Solution

Let's jump right into the final solution for those who don't need all the details and just want a quick step-by-step guide!

1. In a first deployment:
   - Rename the model to whatever you need
   - Create, using the old model name, a new model that inherits from the renamed model but that is otherwise empty
   - Remove all uses of the old model in the codebase
   - Make sure that everywhere the type name was being used (whether as a raw string or through #sti_name), both the new and old type names are now supported
2. Migrate the data in the type column of all database records to reflect the new model name
3. In a final deployment, remove the deprecated classes and old type names used in the codebase

Step 1: Renaming the model

To help navigate these steps, let's use a simple example:
Your team is currently modularizing the codebase and wants to create a new pack for its aerospace 🚀 division. You are therefore tasked with moving an STI model named Rocket (say this model is under a base Vehicle model and a vehicles database table) into a new namespace: Aerospace::Rocket.

You can start by renaming the model directly:

# models/aerospace/rocket.rb
module Aerospace
  class Rocket < Vehicle
    # ...
  end
end

Then, here comes the neat trick: We will create a sub-type of Aerospace::Rocket using the old model name:

# models/rocket.rb
class Rocket < Aerospace::Rocket; end

Notice that this model is completely empty. In fact, we shouldn't use it anywhere in the codebase (except for its #sti_name, we'll come back to that later).

This is not by accident. It turns out that ActiveRecord, under the hood, will use the sti_name of the current model, as well as the sti_name of any child models when querying records!
This means that by making the old model name inherit from the new one, we get for free the following behaviour:

Aerospace::Rocket.all.to_sql
# => SELECT * FROM vehicles WHERE type IN ('Aerospace::Rocket', 'Rocket');

This will therefore pave the way for us to then run a data migration that changes all Rocket types stored in the database to Aerospace::Rocket without breaking anything! 🎉
But before we do that, we have to take care of a couple more cases.

First, we want all new records created to use the new type name. This simply means replacing all uses of Rocket by Aerospace::Rocket in the codebase.

Second, if this model's #sti_name or its raw string ("Rocket") were used anywhere (for example in active record queries) we now have to make sure to support both the new and the old names.
In a typical ActiveRecord query, this might look something like this:

# From:
fleet.vehicles.where(type: Rocket.sti_name)
# To:
fleet.vehicles.where(type: [Aerospace::Rocket.sti_name, Rocket.sti_name])
# Or, better yet:
Aerospace::Rocket.where(fleet: fleet) # assuming Vehicle belongs_to :fleet

However, there might be other instances in your code where you might be using the #sti_name in a different way. You'll need to individually take a look at each of these. For example, since at Potloc we are using GraphQL and have some Enum types defined for STI models, we had to make sure that both possible type values would coerce to the same enum value that is sent back from the API.
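As an illustration of the enum case, here is a minimal plain-Ruby sketch in which both type names resolve to the same API value. The mapping constant, the `vehicle_kind` helper and the `"ROCKET"` value are hypothetical names chosen for this example, not our actual GraphQL code.

```ruby
# Hypothetical mapping from stored STI type names to an API enum value.
# Both the old and the new type name point to the same value for as long
# as records with the old name may still exist in the database.
VEHICLE_KIND_BY_TYPE = {
  "Aerospace::Rocket" => "ROCKET",
  "Rocket"            => "ROCKET", # deprecated name, to be removed in the cleanup step
}.freeze

# Coerce a stored type name into the enum value sent back by the API.
def vehicle_kind(type_name)
  VEHICLE_KIND_BY_TYPE.fetch(type_name)
end
```

Using `fetch` makes an unexpected type name fail loudly instead of silently returning nil, which helps catch any type value that was missed.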

Step 2: The data migration

That was the hard part! After step 1 is deployed, the rest is pretty much just business-as-usual when working in a continuous deployment environment.

In this step, we need to rename all old type names stored in the database to the new one. We can achieve this with a data migration (a good guide for this is the strong-migrations gem readme).
Note that this step may vary depending on your team's choice of how to run data migrations, but no matter the approach the following command (or equivalent) needs to be run in the production environment:

Vehicle.where(type: Rocket.sti_name).update_all(type: Aerospace::Rocket.sti_name)
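To see why records stay loadable before, during and after this update, here is a small simulation in plain Ruby (no ActiveRecord involved). The `load_record` and `migrate_types!` helpers and the in-memory rows are hypothetical stand-ins for what Rails and the database do. This is only a sketch of the idea, not Rails' actual implementation.

```ruby
# Plain-Ruby stand-ins for the models from the example.
class Vehicle; end

module Aerospace
  class Rocket < Vehicle; end
end

# Deprecated, empty subclass that keeps the old type name loadable.
class Rocket < Aerospace::Rocket; end

# Resolve a stored type string to a class and instantiate it.
def load_record(row)
  Object.const_get(row[:type]).new
end

# In-memory equivalent of the update_all call above.
def migrate_types!(rows, from:, to:)
  rows.each { |row| row[:type] = to if row[:type] == from }
end

rows = [{ type: "Rocket" }, { type: "Aerospace::Rocket" }]
# Rows with either type name load as a kind of Aerospace::Rocket.
rows.each { |row| load_record(row).is_a?(Aerospace::Rocket) }
migrate_types!(rows, from: "Rocket", to: "Aerospace::Rocket")
```

Because the old class inherits from the new one, a row is a valid Aerospace::Rocket whether or not the migration has reached it yet.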

Step 3: Cleanup

We should now be at a point where no records in the database are using the old sti_name anymore and any newly created records are all stored using the new name as type.

We can therefore clean up everything!

First, we can remove the old Rocket model (the one that was empty and inherited from Aerospace::Rocket).
And finally, we can remove any special logic we added in Step 1 to support both Rocket.sti_name and Aerospace::Rocket.sti_name to now only support the latter.

And that's it! Migration complete! 🔥

Conclusion

It took a few steps, but by leveraging Rails' mechanism that fetches database records matching any of a model's children's #sti_names, we were able to rename our Rocket model:

- without any downtime, and
- without any changes to Rails' handling of STI models

Additionally, although this blog post didn't cover it, a similar process can also be used for renaming models used in Polymorphic associations. This might be the subject of a future article.

Hopefully this guide can help you to easily rename STI models, especially when it comes to modularization of your large Rails monoliths (something we can strongly recommend after a few months of trying packs-rails internally)!

Interested in what we do at Potloc? Come join us! We are hiring 🚀

Automatic "Ready for Review" GitHub Action

Jérôme Parent-Lévesque — Fri, 01 Apr 2022 18:40:47 +0000

TLDR: We wanted a GitHub Action to automatically assign reviewers and mark a draft pull request as "Ready for review" after our test suite passes. The final code can be found in this gist here.

At Potloc, our continuous integration process involves, among other things, a GitHub workflow running on each push that tests the code against our full test suite. This check must pass for a pull request to be merged.

Our test suite has gotten to a size where it is difficult to run on a personal computer in a reasonable amount of time, hence our developers usually rely on this GitHub workflow to run the full test suite.

The process looks something like this:

Push code for a new feature
Create a new Pull Request in "Draft" mode
Wait for all the tests to pass
Mark the Pull Request as "Ready for review" and assign reviewers

Note that we consider it a good practice to wait until tests pass before assigning reviewers in order to prevent notifying them only to realize that some more changes are necessary.

In practice, we have an in-house tool to help us automate most of these tasks through the GitHub CLI, but for a long time we didn't have a way to automatically mark a pull request as "Ready for review" when all the tests passed, meaning we had to wait and periodically check the status of each of our PRs.

Inspired by Artur Dryomov's excellent post on Autonomous GitHub Pull Requests, we set out to create a GitHub Action to help us automate this.

Solution:

At the moment of creating the draft pull request, we want to be able to specify what to do in the event that all tests pass.

To achieve this, we will use a label named autoready that we can put on our pull requests to signify that this PR should be automatically marked as "Ready for review" when all tests pass.

In addition, we want to be able to automatically assign reviewers when that happens. For that, we will be using a specific comment format that looks like this:

autoready-reviewers: reviewer1,reviewer2,organization/team1

Our workflow will automatically detect comments like this and assign each of the listed individual or team reviewers.

Implementation

GitHub Workflow configuration

Our workflow should run after each run of our Test workflow and use its output status to determine whether or not to mark the pull request as "Ready for review".
.github/workflows/ready_for_review.yml:



name: Ready For Review
on:
  workflow_run:
    workflows: ["Test"]
    branches-ignore: [main]
    types:
      - completed
jobs:
  mark_as_ready_for_review:
    runs-on: self-hosted
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    steps:
      - name: Checkout Code
        uses: actions/checkout@v3
      - name: Mark as Ready for Review
        run: bash .github/workflows/mark_as_ready_for_review.sh "${{ secrets.ACCESS_TOKEN }}" "${{ join(github.event.workflow_run.pull_requests.*.number) }}"

This will run our custom script mark_as_ready_for_review.sh after each successful run of the Test workflow.

Some noteworthy points:

We need the Checkout Code action to get the latest version of this mark_as_ready_for_review.sh script.
Our script takes a couple of arguments as input:
1. A GitHub access token of the "user" on behalf of whom we will be performing these automatic actions. In our case, we have a dedicated bot account for this. We store this value in a GitHub secret secrets.ACCESS_TOKEN.
2. A comma-separated list of all pull request IDs associated with this workflow run. Since a workflow run is attached to a particular commit hash, it is possible that multiple PRs have that same commit hash as HEAD.

Bash Script

Here is the script dissected and explained (scroll to the bottom for the full script):



#!/bin/bash
set -euo pipefail # Exit on errors, on unset variables, and on failures inside pipelines

Our inputs and constants:



TOKEN="${1}"
PR_NUMBERS="${2}"
LABEL="autoready" # the name of the 'label' on the PR used to detect whether or not this script should run
REPO="your-repository" # the name of your repository on GitHub
ORGANIZATION="potloc" # the name of your GitHub organization or user to which the repository belongs

Then, we want to repeat the whole thing for as many pull requests as have been passed as input:



# Split the numbers string (comma-delimited)
for pr_number in $(echo $PR_NUMBERS | tr "," "\n"); do

Fetch the labels from the pull request. We will also need the Node ID of the PR to use GitHub's GraphQL API in a later step, so we also grab this at the same time.



# Get the node_id (and labels) from the PR number
# - https://docs.github.com/en/graphql/guides/using-global-node-ids
# - https://docs.github.com/en/rest/reference/pulls#get-a-pull-request
out=$(curl \
        --fail \
        --silent \
        --show-error \
        --header "Accept: application/vnd.github.v3+json" \
        --header "Authorization: token ${TOKEN}" \
        --request "GET" \
        --url "https://api.github.com/repos/${ORGANIZATION}/${REPO}/pulls/${pr_number}"
      )
node_id=$(jq -r '.node_id' <<< $out)
contains_label=$(jq "any(.labels[].name == \"${LABEL}\"; .)" <<< $out)
comments_url=$(jq -r ".comments_url" <<< $out)

# Check if the PR contains the label we want
if [ "$contains_label" == "true" ]; then
  # Continued below

Note that we use jq to simplify parsing of the JSON body returned by the GitHub API. This needs to be installed on the workers that will run this Workflow.

If the label exists on the PR, then we can mark it as "Ready for review". This operation is only available through GitHub's GraphQL API, hence the different request. This is where we make use of the previously-retrieved node_id:



# Mark the PR as ready for review
curl \
  --fail \
  --silent \
  --show-error \
  --header "Content-Type: application/json" \
  --header "Authorization: token ${TOKEN}" \
  --request "POST" \
  --data "{ \"query\": \"mutation { markPullRequestReadyForReview(input: { pullRequestId: \\\"${node_id}\\\" }) { pullRequest { id } } }\" }" \
  --url https://api.github.com/graphql
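The nested escaping in the --data argument above is easy to get wrong. As an illustration only (our script stays in bash), the same request body can be built with a JSON library, sketched here in Ruby. The `ready_for_review_body` helper name is hypothetical. The mutation is the same one used in the curl call.

```ruby
require "json"

# Build the JSON request body for the markPullRequestReadyForReview mutation.
# node_id is the GraphQL node ID of the pull request.
def ready_for_review_body(node_id)
  # String#to_json adds the surrounding quotes and escapes the value.
  query = "mutation { markPullRequestReadyForReview(input: { pullRequestId: #{node_id.to_json} }) { pullRequest { id } } }"
  JSON.generate({ "query" => query })
end
```

The resulting string is what would be passed to --data, with the library taking care of both levels of quoting.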

Delete the label to prevent this script from running again for this PR:



# Remove the label
curl \
  --request "DELETE" \
  --header "Accept: application/vnd.github.v3+json" \
  --header "Authorization: token ${TOKEN}" \
  --url "https://api.github.com/repos/${ORGANIZATION}/${REPO}/issues/${pr_number}/labels/${LABEL}"

Finally, we want to find which reviewers to assign to this PR. To do this, we fetch all comments on the PR and use a regex to find a comment matching our autoready-reviewers: format we defined:



# Get the comments on the PR
comments_out=$(curl \
                --fail \
                --silent \
                --show-error \
                --header "Accept: application/vnd.github.v3+json" \
                --header "Authorization: token ${TOKEN}" \
                --request "GET" \
                --url "$comments_url")

# Look for a comment matching the 'autoready-reviewers: ' pattern
# If found, assign the mentioned reviewers to review this PR
jq -r ".[].body" <<< $comments_out | while IFS='' read comment; do
  if [[ $comment =~ autoready-reviewers:[[:space:]]([a-zA-Z0-9,\-\/]+) ]]; then
    all_reviewers=${BASH_REMATCH[1]} # Get the first matching group of the regex (the comma-separated list of reviewers)

Using this list of reviewers, we differentiate between teams (e.g. potloc/devs) and individuals to assign by looking for the / character:



# Split the reviewers between teams and individuals
reviewers_array=()
team_reviewers_array=()
for reviewer in $(echo $all_reviewers | tr "," "\n"); do
  if [[ $reviewer =~ [a-zA-Z0-9,\-]+\/[a-zA-Z0-9,\-]+ ]]; then
    # In the case of a team reviewer, only take the part of the username after the '/':
    slug_array=(${reviewer//\// })
    team_slug=${slug_array[1]}
    team_reviewers_array+=("\"$team_slug\"")
  else
    reviewers_array+=("\"$reviewer\"")
  fi
done

# Join the array elements into a single comma-separated string:
reviewers=$(IFS=, ; echo "${reviewers_array[*]}")
team_reviewers=$(IFS=, ; echo "${team_reviewers_array[*]}")
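For readers who find the bash parsing hard to follow, the same logic (match the comment, then split teams from individuals) can be sketched in Ruby. This is an illustration under the assumption that the comment format is exactly the one defined earlier. The `parse_reviewers` helper and the constant name are hypothetical.

```ruby
# Same comment format as the bash regex above.
REVIEWERS_PATTERN = %r{autoready-reviewers:\s([a-zA-Z0-9,\-/]+)}

# Returns [individual_reviewers, team_slugs], or nil when the comment
# does not contain the autoready-reviewers pattern.
def parse_reviewers(comment)
  match = REVIEWERS_PATTERN.match(comment)
  return nil unless match

  names = match[1].split(",").reject(&:empty?)
  teams, individuals = names.partition { |name| name.include?("/") }
  # For teams, only keep the part after the '/' (the team slug).
  [individuals, teams.map { |team| team.split("/", 2).last }]
end
```

For the comment `autoready-reviewers: reviewer1,reviewer2,organization/team1`, this yields the two individual reviewers and the `team1` slug as separate lists.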

The very last step is to make the API call to assign these individuals and teams as reviewers to the PR:



# Assign reviewers
curl \
  --fail \
  --silent \
  --show-error \
  --output /dev/null \
  --header "Accept: application/vnd.github.v3+json" \
  --header "Authorization: token ${TOKEN}" \
  --request "POST" \
  --url "https://api.github.com/repos/${ORGANIZATION}/${REPO}/pulls/${pr_number}/requested_reviewers" \
  --data "{\"reviewers\":[${reviewers}], \"team_reviewers\":[${team_reviewers}]}"

Conclusion

And that is it! Now, to use this tool we can put the autoready label on a draft pull request and write a comment in the form autoready-reviewers: reviewer1,reviewer2,organization/team1.

In practice, at Potloc, we have a little in-house helper tool that does these steps for us using the GitHub CLI and tty-prompt to ease the selection of reviewers/teams and the formatting of this comment.

And this is what it looks like on GitHub's interface!


Full code:
https://gist.github.com/jeromepl/02e70f3ea4a4e8103da6f96f14eb213c

Optimally Taking Out Extra Survey Respondents

Jérôme Parent-Lévesque — Tue, 05 Oct 2021 17:54:34 +0000

Sometimes when analysing the results of a survey, one needs to remove some respondents from their sample. This is something we do fairly commonly at Potloc in order to obtain a more representative sample of the target population in our surveys. In other words, we use this as a way of performing stratified sampling.

We use a system of quotas to keep track of every agreement on the respondent sample we make with our clients. We have three types of quotas, each corresponding to a different way of assessing whether their target is met or not.

To match targets exactly (like we want to achieve for stratified sampling) we use one of these quota types - the strict quota type. For example, our clients might want exactly 50 respondents who work as electricians. No matter whether we have one respondent missing or one more than 50 in this category, our quota is not achieved.

The second type of quota we use is the minimum type. This type, as the name suggests, simply indicates that we must have at least as many respondents of a specific category as the target number.

The third and final type of quota is the weighted type. As we often use a weighting process to obtain a more representative sample of our population, we make sure to communicate with our clients where survey responses may be weighted. This communication in turn gets converted into quotas of type weighted which behave similarly to minimum quotas, but with more flexibility. The targets don't need to be matched exactly and will instead be achieved through an independent weighting process (don't worry, this will be explained in more detail in Step 2 below). The "minimum" for this type of quota is (arbitrarily) set to 50% of the target as a way to limit the scale of the weights (this way weights should rarely be more than 2).

The image above shows an example of a combination of quotas we could have. Here, we want to end up with a minimum of one respondent who has a cat, exactly one respondent who is a doctor, and, after weighting, one effective respondent whose name contains the letter 'a' and two effective respondents whose names are shorter than 7 letters.

Imagine now that we have received the following responses to our survey:

In this initial state, we have:

1 respondent who has a cat (Alice)
2 doctors (Bob and Catherine)
2 respondents whose name contains the letter 'a' (Alice and Catherine)
2 respondents whose name is shorter than 7 letters (Alice and Bob)

The minimum quota is therefore satisfied (but Alice cannot be removed without breaking it) and there is one doctor too many. For weighted quotas, we always have at least 50% of the target number of respondents. As we will see later, the weighted quotas will be useful in determining the optimal respondents to take out.

We will keep referring to these quotas and respondents throughout this article to provide a practical example of how we select respondents to take out.
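To make the three quota types concrete, here is a minimal Ruby sketch that evaluates the example quotas against the three respondents. The attribute names, the status symbols (`:met`, `:exceeded`, `:missing`) and the helper names are assumptions made for this illustration, not our production data model.

```ruby
# Respondents from the running example (attribute names are illustrative).
RESPONDENTS = [
  { name: "Alice",     has_cat: true,  doctor: false },
  { name: "Bob",       has_cat: false, doctor: true  },
  { name: "Catherine", has_cat: false, doctor: true  },
].freeze

QUOTAS = [
  { label: "has a cat",            type: :minimum,  target: 1, match: ->(r) { r[:has_cat] } },
  { label: "is a doctor",          type: :strict,   target: 1, match: ->(r) { r[:doctor] } },
  { label: "name contains 'a'",    type: :weighted, target: 1, match: ->(r) { r[:name].downcase.include?("a") } },
  { label: "name under 7 letters", type: :weighted, target: 2, match: ->(r) { r[:name].length < 7 } },
].freeze

# Status of a quota given the raw (unweighted) number of matching respondents.
def quota_status(quota, count)
  target = quota[:target]
  case quota[:type]
  when :strict
    return :met if count == target
    count > target ? :exceeded : :missing
  when :minimum
    count >= target ? :met : :missing
  when :weighted
    # Weighted quotas only require 50% of the target before weighting.
    count >= target * 0.5 ? :met : :missing
  end
end

# One [label, count, status] entry per quota.
def quota_report(respondents, quotas)
  quotas.map do |quota|
    count = respondents.count { |r| quota[:match].call(r) }
    [quota[:label], count, quota_status(quota, count)]
  end
end
```

Calling `quota_report(RESPONDENTS, QUOTAS)` reports the doctor quota as exceeded (2 respondents for a target of 1) and the three other quotas as met, which matches the initial state described above.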

We set out to find the optimal selection of respondents to take out given a set of quotas such as this one. Below is the full step-by-step explanation of the algorithm we use to perform this and an example of how it is applied to this fictional set of quotas and respondents.

Step 1

We first identified that by determining which respondents belonged to which quotas, we could split the respondents into 3 different categories:

Respondents that cannot be taken out are respondents that belong to quotas for which the target is not exceeded. For example, our minimum quota "has a cat" has a target of 1 and only Alice fits into this category. Therefore, Alice cannot be taken out as otherwise the "has a cat" quota would be broken. The same goes for quotas with fewer respondents than the target, for example if the target was to have 2 respondents who own a cat.

Respondents that should be taken out as a priority are respondents that belong specifically to a strict quota for which the target is exceeded. Since for this type of quota we want to end up with exactly the target number of respondents, we have to take out respondents belonging to this quota until that target is matched. This group takes priority over the Respondents that cannot be taken out as we prioritise taking out respondents in exceeded strict quotas until those quotas are satisfied.

Respondents that may be taken out are all remaining respondents. These respondents may or may not belong to any quota. If they do, then that quota's target has to be exceeded — otherwise they would be in the Respondents that cannot be taken out category. Note that these respondents logically cannot belong to any strict quota since those belonging to this type of quota must fit in one of the first 2 categories.

Going back to our example, our three respondents would fall into the following groups:

The respondents that should be taken out as a priority group includes both doctors (Bob and Catherine) as there is one too many to satisfy the strict quota. Alice cannot be taken out because she is the only respondent who has a cat. The last group is empty as all respondents already belong to other groups.
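The categorisation can be sketched in a few lines of Ruby. This is a simplified illustration: quotas are reduced to a type, a target and a list of member names, and it assumes that a weighted quota protects its members only when removing one would bring its count under 50% of the target. The `floor_for` and `categorise` names are hypothetical.

```ruby
EXAMPLE_QUOTAS = [
  { type: :minimum,  target: 1, members: %w[Alice] },           # has a cat
  { type: :strict,   target: 1, members: %w[Bob Catherine] },   # is a doctor
  { type: :weighted, target: 1, members: %w[Alice Catherine] }, # name contains 'a'
  { type: :weighted, target: 2, members: %w[Alice Bob] },       # name under 7 letters
].freeze

# Smallest member count a quota tolerates (assumed: 50% of target for weighted quotas).
def floor_for(quota)
  quota[:type] == :weighted ? quota[:target] * 0.5 : quota[:target]
end

def categorise(names, quotas)
  groups = { priority: [], cannot: [], optional: [] }
  names.each do |name|
    own = quotas.select { |q| q[:members].include?(name) }
    if own.any? { |q| q[:type] == :strict && q[:members].size > q[:target] }
      groups[:priority] << name # exceeded strict quota wins over everything else
    elsif own.any? { |q| q[:members].size - 1 < floor_for(q) }
      groups[:cannot] << name   # removing them would break one of their quotas
    else
      groups[:optional] << name
    end
  end
  groups
end
```

On the example, `categorise(%w[Alice Bob Catherine], EXAMPLE_QUOTAS)` puts Bob and Catherine in the priority group, Alice in the group that cannot be taken out, and leaves the last group empty.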

Step 2

Now that we have a categorisation of each respondent, we are almost ready to start taking out respondents. However, since our objective is to optimally take out respondents, we need to compute one more piece of data related to weighted-type quotas.

First, we need to define a bit better what we mean by optimally here.
Our survey results are usually calculated on weighted data in order to better match the target population demographics.

In other words, as part of our survey workflow, we compute a weight for each respondent and use it as a multiplicative factor to scale the "importance" of each survey response. This is a process called weighting.

The weights can be interpreted as a measure of the quality of our respondent sample by looking at their distribution. The further away from the value of 1 the weights are, the worse the quality. Indeed, a small weight indicates that we have too many similar respondents and a large weight indicates that we are missing respondents with similar characteristics.
For more details on the weighting process I invite you to read my previous blog post on Generalized Weighting.

Thus, when taking out extra respondents, we would like to ensure that our weighting quality will be unaffected. This is the key to our notion of optimality — we want not only to satisfy all quotas but also to obtain the highest possible weighting quality as a result.

To achieve this, we compute weights for each respondent based on the targets of the weighted quotas. Using a raked weighting algorithm, we use all weighted quota numbers as "targets" to obtain respondent weights.

In our example, we obtain the following weights by using the targets from the two weighted quotas:

Notice that once each respondent's (numerical) answer is multiplied by their weight, the weighted quota targets are matched perfectly! The count of respondents whose name contains the letter 'a' becomes 1 (from 2), as it is now the sum of 0.59 and 0.41. Meanwhile, the count of respondents whose name is shorter than 7 letters stays at 2, although the weight of each respondent differs.
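The arithmetic can be checked with a short Ruby snippet. The weights below are the rounded values quoted in this article (0.41 for Catherine, 1.41 for Bob, and 0.59 for Alice, inferred from the sum above). The constant and helper names are chosen for this illustration.

```ruby
WEIGHTS = { "Alice" => 0.59, "Bob" => 1.41, "Catherine" => 0.41 }.freeze

WEIGHTED_QUOTAS = {
  "name contains 'a'"           => %w[Alice Catherine],
  "name shorter than 7 letters" => %w[Alice Bob],
}.freeze

# Effective (weighted) number of respondents in a quota.
def weighted_count(members, weights)
  members.sum { |name| weights.fetch(name) }
end

EFFECTIVE_COUNTS = WEIGHTED_QUOTAS.transform_values do |members|
  weighted_count(members, WEIGHTS).round(2)
end
```

`EFFECTIVE_COUNTS` comes out at 1.0 for the first quota and 2.0 for the second, which are exactly the two weighted targets.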

In the next step, we will be removing respondents with the smallest weights first whenever we cannot decide who to take out!

Step 3

Our respondents now belong to one of the 3 categories presented in Step 1 and each have a weight resulting from the raked weighting computation from Step 2. It is now time to start taking out respondents.

The core of the strategy here is to take out respondents one-by-one. After each respondent that is taken out, our quotas and weighting need to be updated, meaning that steps 1 and 2 need to be performed again! We therefore perform this step in a loop where in each iteration we recompute the first 2 steps before choosing and taking out 1 respondent.

This respondent is chosen according to the given priority list:

Select a pool of respondents to pick from:
- If there are any respondents in the Respondents that should be taken out as a priority category, then limit our selection to this group only
- Otherwise, if there are any Respondents that may be taken out, select this group
From this pool, select the optimal respondent to be taken out:
- The optimal respondent corresponds to the respondent with the smallest weight, since a small weight indicates that we have many similar respondents
- In the case of a tie, or if there are no weighted quotas, we remove the last respondent to have answered the survey. (Note: the statistically correct thing to do here would be to remove a random respondent from the pool, but we choose this approach as it is idempotent — we can re-run the algorithm and the selected respondents will be the same. Additionally, this replicates the behaviour of traditional sampling tools that have quota-stops)
Take out the selected respondent and repeat from Step 1 until all strict quotas are met!
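The selection rule can be sketched as a single Ruby function. It assumes the groups were computed as in Step 1 and the weights as in Step 2. The `pick_respondent` name, the group keys and the `answered_order` argument (names ordered from first to last to have answered) are hypothetical.

```ruby
# Pick the next respondent to take out, following the priority list above.
# groups:         { priority: [...], optional: [...] }
# weights:        name => weight (may be empty when there are no weighted quotas)
# answered_order: names ordered from the first to the last to have answered
def pick_respondent(groups, weights, answered_order)
  pool = groups[:priority].empty? ? groups[:optional] : groups[:priority]
  return nil if pool.empty?

  # Smallest weight first; ties go to the most recent respondent.
  pool.min_by { |name| [weights.fetch(name, 1.0), -answered_order.index(name)] }
end
```

Respondents without a weight default to 1.0 here, so when there are no weighted quotas the choice falls back entirely on the answer order.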

Here's how this would play out in our fictional example:

We have 2 respondents in the Respondents that should be taken out as a priority category (Bob and Catherine) and thus only these respondents are taken into consideration
To determine who to remove from these two respondents, we take a look at their data:

Since Catherine has the smallest weight (0.41 vs. 1.41), she is taken out. Intuitively, this makes sense as we had one respondent too many whose name contained the letter 'a' to satisfy the weighted quota without even applying a weighting.

We are now left with one respondent whose name contains the letter 'a' and two respondents whose name is shorter than 7 letters, meaning that our final weights will be exactly 1 — the optimal value for weights!

Additionally, now that respondent "Catherine" has been taken out, all of our quotas are satisfied and we can stop the algorithm here.

Conclusion

Using this process, we are able to remove respondents so as to match our quotas as best as we can, while also leading to a better survey weighting. Indeed, since we always remove respondents with the smallest weights, our weighting gets progressively better as the minimum weight gets closer and closer to 1 (the optimal value). This means that the final data presented for this survey — after the weighting step — will be more representative of the target population, a win for both Potloc and our clients!


Appendix - The case of multiple overlapping `strict` quotas

We might sometimes have respondents that correspond to multiple different strict quotas. In this scenario, it is more complicated to select the optimal respondents to take out, as it is not always obvious what the smallest possible set of respondents to remove is in order to satisfy all such quotas. It is, for example, possible to take out a respondent (let's call them respondent A) because they belong to a strict quota that has exceeded its target. However, we might then take out other respondents who also belong to this quota because they belong to another strict quota that was exceeded as well. This could break the first quota if it had already met its target exactly. In this scenario, respondent A can now be reinstated, as the strict quota they belonged to is now under its target.

To alleviate this problem while avoiding the complicated and expensive decision processes from the field of operations research, we employ two mechanisms.

First, we try to more optimally pick which strict quota respondents to take out first. To do this, we also consider the following factors:

Whether the respondent can be disqualified or not (if all of its quotas are exceeded)
The number of exceeded strict quotas a respondent is a part of (more = higher priority)
The total number of strict quotas a respondent is a part of (fewer = higher priority)
- This is used to minimise the impact of taking out respondents on other quotas
The minimum difference between the current count and the target count of strict quotas that are exceeding their target (bigger = higher priority, as there is more room to remove respondents)

Second, we add a final step at the end of the process in which we restore any respondents that can be restored without breaking any quota. This solves the issue highlighted in the example above.

Using these two approximations we are able to get a result that is close to optimal, for a minimal cost.

Generalized Raking for Survey Weighting

Jérôme Parent-Lévesque — Tue, 29 Jun 2021 15:28:44 +0000

In the world of surveys, it is very common that our acquired responses need to be weighted in order to achieve a sample that is representative of some target population. This process of weighting simply consists of assigning a weight (a.k.a. factor) to each respondent, and calculating all survey results as a weighted sum of respondents.

For example, we might have surveyed 100 male respondents and 150 female respondents but were targeting a male / female ratio of 48% / 52%. In this simple case, we could achieve the target ratio by weighting the male responses by a factor of 0.48 / (100 / (100 + 150)) = 1.2 and weighting the female responses by 0.52 / (150 / (100 + 150)) = 0.867.
The technical term for this method of computing weights is Post-Stratification.
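The calculation above can be written as a small Ruby function: each group's weight is its target share divided by its observed share. The `post_stratification_weights` name is chosen for this illustration.

```ruby
# Post-stratification: weight = target share / observed share, per group.
# counts:        group => number of respondents
# target_shares: group => desired proportion (summing to 1)
def post_stratification_weights(counts, target_shares)
  total = counts.values.sum.to_f
  counts.map { |group, count| [group, target_shares.fetch(group) / (count / total)] }.to_h
end

GENDER_WEIGHTS = post_stratification_weights(
  { male: 100, female: 150 },
  { male: 0.48, female: 0.52 }
)
```

With these weights, the 100 male responses count for 120 effective respondents out of 250 (48%) and the 150 female responses for 130 (52%).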

However, in a more complex scenario, where we have many different measurable demographic targets, how can we determine weights for all the survey respondents?

Raking

At Potloc, it is very common that our clients desire survey populations matching a lot of such targets. For example, we might have targets looking like this:

42% male
58% female
20% students
80% non-students
15% dog owners
...

In this setting, weights cannot be calculated using a simple ratio as in the male/female example shown above. Here, we instead need to rely on more involved algorithms, notably a process called Raking.

Iterative Proportional Fitting

One common approach to solve the problem of finding good weights that will satisfy our demographic targets is Iterative Proportional Fitting. Typically in the industry, when the term "raking" is used it refers to this algorithm. In this method, weights for each respondent are computed for a single target at a time using Post-Stratification. By iteratively computing this for each target and repeating a few times, the weights end up converging to values that satisfy our targets.
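Here is a minimal Ruby sketch of Iterative Proportional Fitting on a made-up sample of four respondents. The sample, the fixed number of sweeps and the function name are assumptions for illustration. The targets reuse the gender and student percentages listed above, converted to absolute counts for four respondents.

```ruby
# Iterative Proportional Fitting on a tiny, made-up sample.
# respondents: one array of category names per respondent
# targets:     category name => desired weighted count (absolute, not a percentage)
def iterative_proportional_fitting(respondents, targets, sweeps: 50)
  weights = Array.new(respondents.size, 1.0)
  sweeps.times do
    targets.each do |category, target|
      members = respondents.each_index.select { |i| respondents[i].include?(category) }
      current = members.sum { |i| weights[i] }
      next if current.zero?

      # Post-stratification step for this single target.
      factor = target / current
      members.each { |i| weights[i] *= factor }
    end
  end
  weights
end

SAMPLE = [%w[male student], %w[male non_student], %w[female student], %w[female non_student]].freeze
# 42% / 58% and 20% / 80% of 4 respondents.
TARGETS = { "male" => 1.68, "female" => 2.32, "student" => 0.8, "non_student" => 3.2 }.freeze
IPF_WEIGHTS = iterative_proportional_fitting(SAMPLE, TARGETS)
```

A production implementation would typically stop once the weights change by less than some tolerance rather than running a fixed number of sweeps.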

Great! Problem solved!

...but what if we could do even better? 🤔

Generalized Raking

Beyond satisfying the demographic targets, the most desirable property for the weights is that they should be as close as possible to 1. Indeed, weights that are really large mean that those respondents' responses will count for a lot more than the "average" respondent in our survey results. For example, a respondent with a weight of 10 will count for 10 times more than the average respondent, and 100 times more than a respondent with weight 0.1. Similarly, small weights mean that some responses will have very little impact on the final results.

Unfortunately, Iterative Proportional Fitting does nothing to encourage weights to be close to 1, which leads to sub-optimal weights. This is where Generalized Raking, an algorithm introduced by Deville et al. (1992), comes into play.

Note: This is where we get into the more mathematical part of this blog post 🤓. Don't care about this part? No worries! Simply skip to the next section!

The authors of this paper formulated the weighting problem as a constrained optimization problem where the objective is for the weights to be as close to one as possible and the constraint is that the targets are matched. Mathematically, this looks like:

$$\underset{w}{\arg\min} \; G(w) \quad \text{s.t.} \quad X^{T} w = T$$

$$G(x) = x (\log(x) - 1) + 1$$

where $G (x)$ is the raking function which encourages weights to be close to 1, $w$ is the vector of weights, $T$ is the vector of targets (in absolute numbers, not percentages) and $X$ is the $(numRespondents \times numTargets)$ matrix of responses. The matrix $X$ is binary where cells are filled with a '1' if the respondent belongs to the target category and '0' otherwise.

In other words, this is saying that we want to optimize the weights to be as close to 1 as possible while satisfying the target constraints. This is achieved by minimizing the function $G(x)$, whose global minimum is at $x = 1$.
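
A quick numerical check of that minimum, as a minimal sketch (the function name `raking_distance` is chosen for this example only):

```python
import math


def raking_distance(x):
    """G(x) = x (log(x) - 1) + 1, defined for x > 0."""
    return x * (math.log(x) - 1) + 1


for x in (0.1, 0.5, 1.0, 2.0, 10.0):
    print(f"G({x}) = {raking_distance(x):.4f}")
```

The value is exactly 0 at $x = 1$ and positive everywhere else, so weights are penalized the further they move away from 1. Note also that the derivative of $G$ is $\log(x)$, whose inverse is $e^{x}$.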

The Generalized Raking Algorithm

While it is possible to solve this optimization problem using general methods such as Sequential Least Squares Programming, the authors of Generalized Raking have devised a more efficient and robust algorithm for this specific problem:

1. Initialize variables:
   - A $(numRespondents \times 1)$ vector $w$ to ones
   - A $(numTargets \times 1)$ vector $λ$ to zeros
   - A $(numRespondents \times numRespondents)$ square matrix $H$ to the identity matrix
2. While the weights have not converged, repeat:
   1. $λ = λ + (X^{T} H X)^{-1} (T - X^{T} w)$
   2. $w = G^{-1}(X λ)$
   3. $H = diag((G^{-1})'(X λ))$

Here, $G^{-1}(x)$ is the inverse of the derivative of the raking function, i.e. $e^{x}$.
$(G^{-1})'$ is its derivative, in this case also $e^{x}$.

The final value of $w$ corresponds to the weighting factors we are looking for!

Implementation

While there are many implementations of this algorithm in R, we were not able to find one in Ruby that could play well with our codebase and be easily maintainable.
We therefore decided to make our own and to share it here for anyone looking for something similar. We started by making an implementation in Python with the popular NumPy library:

import numpy as np

def raking_inverse(x):
  return np.exp(x)

def d_raking_inverse(x):
  return np.exp(x)

def graking(X, T, max_steps=500, tolerance=1e-6):
  # Based on algo in (Deville et al., 1992) explained in detail on page 37 in
  # https://orca.cf.ac.uk/109727/1/2018daviesgpphd.pdf

  # Initialize variables - Step 1
  n, m = X.shape
  L = np.zeros(m) # Lagrange multipliers (lambda)
  w = np.ones(n) # Our weights (will get progressively updated)
  H = np.eye(n)
  success = False

  for step in range(max_steps):
    L += np.dot(np.linalg.pinv(np.dot(np.dot(X.T, H), X)), (T - np.dot(X.T, w))) # Step 2.1
    w = raking_inverse(np.dot(X, L)) # Step 2.2
    H = np.diag(d_raking_inverse(np.dot(X, L))) # Step 2.3

    # Termination condition:
    loss = np.max(np.abs(np.dot(X.T, w) - T) / T)
    if loss < tolerance:
      success = True
      break

  if not success:
    raise Exception("Did not converge")
  return w

Ruby Implementation

After validating the algorithm in Python, we then proceeded to replicate it in Ruby. For this, we had to find an equivalent to NumPy, which we found in Numo. Numo is an awesome library for vector and matrix operations, and its linalg sub-library was perfect for us as we needed to compute a matrix pseudo-inverse. This allowed us to translate the code to Ruby almost line by line:

require "numo/narray"
require "numo/linalg"

def raking_inverse(x)
  Numo::NMath.exp(x)
end

def d_raking_inverse(x)
  Numo::NMath.exp(x)
end

def graking(x, targets, max_steps: 500, tolerance: 1e-6)
  # Based on algo in (Deville et al., 1992) explained in detail on page 37 in
  # https://orca.cf.ac.uk/109727/1/2018daviesgpphd.pdf

  # Initialize variables - Step 1
  n, m = x.shape
  lambdas = Numo::DFloat.zeros(m) # Lagrange multipliers (lambda)
  w = Numo::DFloat.ones(n) # Our weights (will get progressively updated)
  h_matrix_diagonal = Numo::DFloat.ones(n) # Diagonal of H, stored as a flat vector

  success = false

  max_steps.times do
    lambdas += Numo::Linalg.pinv((x.transpose * h_matrix_diagonal).dot(x)).dot(targets - x.transpose.dot(w)) # Step 2.1
    w = raking_inverse(x.dot(lambdas)) # Step 2.2
    h_matrix_diagonal = d_raking_inverse(x.dot(lambdas)) # Step 2.3

    # Termination condition:
    loss = ((targets - x.transpose.dot(w)).abs / targets).max
    if loss < tolerance
      success = true
      break
    end
  end

  raise StandardError, "Did not converge" unless success
  w
end

You may have noticed that the code doesn't exactly match the algorithm described above, notably in steps 2.1 and 2.3. This is because we have found it to be vastly faster with Numo to store the diagonal matrix $H$ as a flat vector, h_matrix_diagonal, since it only contains values on its diagonal. As a result, the product $X^{T} H$ can be computed as the transpose of $X$ multiplied element-wise by h_matrix_diagonal, making use of Numo's implicit broadcasting.
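
The same equivalence can be checked in NumPy, whose broadcasting rules work the same way for this case. The snippet below is only a sketch with random illustrative data; the variable names are chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(6, 3)).astype(float)  # 6 respondents, 3 targets
h = rng.uniform(0.5, 2.0, size=6)                  # diagonal of H as a flat vector

dense = X.T @ np.diag(h) @ X  # builds the full 6x6 diagonal matrix H
broadcast = (X.T * h) @ X     # scales column i of X.T by h[i], no 6x6 matrix needed

print(np.allclose(dense, broadcast))
```

Both expressions give the same $(numTargets \times numTargets)$ matrix, but the broadcast version never allocates the $(numRespondents \times numRespondents)$ matrix, which matters when there are thousands of respondents.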

In practice, we optimize this code a bit further by exiting early whenever possible (for example, if our loss becomes NaN) and by accepting an optional initial value for the lambdas vector, for cases where we believe we have a better starting point than the default.
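
Those two refinements could look roughly like the following Python sketch. This is not our production code: the function name `graking_warm`, the `initial_lambdas` parameter, the exception types and the choice to return the lambdas alongside the weights are all assumptions made for illustration.

```python
import numpy as np


def graking_warm(X, T, initial_lambdas=None, max_steps=500, tolerance=1e-6):
    """Generalized raking with an optional warm start and a NaN early exit."""
    n, m = X.shape
    if initial_lambdas is None:
        lambdas = np.zeros(m)
    else:
        lambdas = np.asarray(initial_lambdas, dtype=float).copy()
    w = np.exp(X @ lambdas)  # equals a vector of ones when lambdas is all zeros

    for _ in range(max_steps):
        h = w  # the derivative of exp is exp, so the diagonal of H equals w
        lambdas = lambdas + np.linalg.pinv((X.T * h) @ X) @ (T - X.T @ w)
        w = np.exp(X @ lambdas)

        loss = np.max(np.abs(X.T @ w - T) / T)
        if np.isnan(loss):
            # Further iterations cannot recover from NaN, so stop right away
            raise ValueError("Diverged: loss is NaN")
        if loss < tolerance:
            return w, lambdas

    raise RuntimeError("Did not converge")


# Hypothetical data: columns are "female", "male" and "young"
X = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 0]], dtype=float)
T = np.array([2.5, 2.5, 2.0])

w, lambdas = graking_warm(X, T)
# Re-running from the previous solution should converge almost immediately
w_again, _ = graking_warm(X, T, initial_lambdas=lambdas, max_steps=1)
```

Returning the lambdas makes it possible to reuse them as a starting point the next time weights are computed on similar data.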

With these few lines of code, we are now able to support complex survey weighting scenarios while having all of our code in our beautiful Ruby monolith 🎉

Interested in what we do at Potloc? Come join us! We are hiring 🚀
