DEV Community: Potloc

How to safely rename STI models in Rails

Jérôme Parent-Lévesque — Fri, 18 Aug 2023 18:39:32 +0000

In Rails, Single Table Inheritance (STI) models store their full model name (including any module namespaces) in a type column. This column is used by ActiveRecord to determine which model to instantiate when loading a record from the database. This means that renaming such models isn't as easy as just changing the class name; it must also involve a data migration to update the values stored as type. However, how can we safely perform this in a live, production environment?

This is a challenge that we recently ran into at Potloc while working on the modularization of our codebase. This involved namespacing all of our models under packs, which meant that our STI models' type values also had to be updated.

Last year, Shopify Engineering published a blog post about this same issue (albeit for polymorphic models), in which they suggest entirely changing the nature of what is stored as type in the database. However, they mention that:

Our solution adds complexity. It’s probably not worth it for most use cases

And this was indeed how we felt for our use case. We wanted to perform this in a way that would have no impact on the way Rails works, and all while having zero downtime.

The Solution

Let's jump right into the final solution for those who don't need all the details and just want a quick step-by-step guide!

1. In a first deployment:
   - Rename the model to whatever you need
   - Create, using the old model name, a new model that inherits from the renamed model but that is otherwise empty
   - Remove all uses of the old model in the codebase
   - Make sure that everywhere the type name was being used (whether as a raw string or through #sti_name), both the new and old type names are now supported
2. Migrate the data in the type column of all database records to reflect the new model name
3. In a final deployment, remove the deprecated classes and old type names used in the codebase

Step 1: Renaming the model

To help navigate these steps, let's use a simple example:
Your team is currently modularizing the codebase and wants to create a new pack for their aerospace 🚀 division. You are therefore tasked with moving an STI model named Rocket (say this model is under a base Vehicle model and vehicles database table) into a new namespace: Aerospace::Rocket.

You can start by renaming the model directly:

# models/aerospace/rocket.rb
module Aerospace
  class Rocket < Vehicle
    # ...
  end
end

Then, here comes the neat trick: We will create a sub-type of Aerospace::Rocket using the old model name:

# models/rocket.rb
class Rocket < Aerospace::Rocket; end

Notice that this model is completely empty. In fact, we shouldn't use it anywhere in the codebase (except for its #sti_name, we'll come back to that later).

This is not by accident. It turns out that ActiveRecord, under the hood, will use the sti_name of the current model, as well as the sti_name of any child models when querying records!
This means that by making the old model name inherit from the new one, we get for free the following behaviour:

Aerospace::Rocket.all.to_sql
# => SELECT * FROM vehicles WHERE type IN ('Aerospace::Rocket', 'Rocket');
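To build intuition for why the empty subclass is enough, here is a minimal plain-Ruby sketch of the idea. It is not ActiveRecord's actual implementation: each class simply reports its own name plus the names of all of its descendants, which is the list that ends up in the IN (...) clause. The type_values helper and the inherited bookkeeping are hypothetical names used for illustration only.

```ruby
# Plain-Ruby imitation of STI type resolution (no Rails required).
class Vehicle
  class << self
    # Track direct subclasses as they are defined.
    def inherited(subclass)
      super
      direct_subclasses << subclass
    end

    def direct_subclasses
      @direct_subclasses ||= []
    end

    # Stand-in for ActiveRecord's sti_name: the full class name.
    def sti_name
      name
    end

    # Hypothetical helper: own type name followed by every descendant's.
    def type_values
      [sti_name] + direct_subclasses.flat_map(&:type_values)
    end
  end
end

module Aerospace
  class Rocket < Vehicle; end
end

# The deprecated name, kept only as an empty subclass of the new one.
class Rocket < Aerospace::Rocket; end

puts Aerospace::Rocket.type_values.inspect
# => ["Aerospace::Rocket", "Rocket"]
```

Because the old Rocket class sits below Aerospace::Rocket in the hierarchy, its name is picked up automatically, and it disappears from the list as soon as the class is deleted.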

This will therefore pave the way for us to then run a data migration that changes all Rocket types stored in the database to Aerospace::Rocket without breaking anything! 🎉
But before we do that, we have to take care of a couple more cases.

First, we want all new records created to use the new type name. This simply means replacing all uses of Rocket by Aerospace::Rocket in the codebase.

Second, if this model's #sti_name or its raw string ("Rocket") was used anywhere (for example in ActiveRecord queries), we now have to make sure to support both the new and the old names.
In a typical ActiveRecord query, this might look something like this:

# From:
fleet.vehicles.where(type: Rocket.sti_name)
# To:
fleet.vehicles.where(type: [Aerospace::Rocket.sti_name, Rocket.sti_name])
# Or, better yet:
Aerospace::Rocket.where(fleets: fleet)

However, there might be other instances in your code where you might be using the #sti_name in a different way. You'll need to individually take a look at each of these. For example, since at Potloc we are using GraphQL and have some Enum types defined for STI models, we had to make sure that both possible type values would coerce to the same enum value that is sent back from the API.
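As a rough illustration of that kind of coercion, here is a minimal plain-Ruby sketch in which both stored type names map to one API-facing value. The VEHICLE_KIND_BY_TYPE constant, the vehicle_kind helper and the "ROCKET" value are all hypothetical; a real GraphQL enum would be wired differently.

```ruby
# Hypothetical mapping from stored STI type names to an API enum value.
VEHICLE_KIND_BY_TYPE = {
  "Aerospace::Rocket" => "ROCKET", # new name
  "Rocket"            => "ROCKET"  # deprecated name, to be removed in the final deployment
}.freeze

# Returns the enum value for a stored type, failing loudly on unknown types.
def vehicle_kind(type_name)
  VEHICLE_KIND_BY_TYPE.fetch(type_name) do
    raise ArgumentError, "unknown vehicle type: #{type_name.inspect}"
  end
end

puts vehicle_kind("Rocket")            # => ROCKET
puts vehicle_kind("Aerospace::Rocket") # => ROCKET
```

Keeping the deprecated entry in one explicit place makes it easy to find and delete during cleanup.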

Step 2: The data migration

That was the hard part! After step 1 is deployed, the rest is pretty much just business-as-usual when working in a continuous deployment environment.

In this step, we need to rename all old type names stored in the database to the new one. We can achieve this with a data migration (a good guide for this is the strong-migrations gem readme).
Note that this step may vary depending on your team's choice of how to run data migrations, but no matter the approach the following command (or equivalent) needs to be run in the production environment:

Vehicle.where(type: Rocket.sti_name).update_all(type: Aerospace::Rocket.sti_name)
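On a large table you may prefer to rename in batches rather than in a single statement. The following plain-Ruby simulation illustrates why that is safe once step 1 is deployed: at every point during the migration, a lookup on both type names still finds every rocket. The array of hashes stands in for the vehicles table, and the Car row and the batch size of 2 are made up for the example.

```ruby
OLD_TYPE = "Rocket"
NEW_TYPE = "Aerospace::Rocket"

# Simulated vehicles table: five rockets stored under the old type, plus one other vehicle.
rows = (1..5).map { |id| { id: id, type: OLD_TYPE } }
rows << { id: 6, type: "Car" }

# What a query on Aerospace::Rocket matches while the empty Rocket subclass exists.
find_rockets = ->(table) { table.select { |row| [NEW_TYPE, OLD_TYPE].include?(row[:type]) } }

counts_during_migration = []
rows.select { |row| row[:type] == OLD_TYPE }.each_slice(2) do |batch|
  batch.each { |row| row[:type] = NEW_TYPE } # one simulated UPDATE per batch
  counts_during_migration << find_rockets.call(rows).size
end

puts counts_during_migration.inspect             # => [5, 5, 5]
puts rows.count { |row| row[:type] == OLD_TYPE } # => 0
```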

Step 3: Cleanup

We should now be at a point where no records in the database are using the old sti_name anymore and any newly created records are all stored using the new name as type.

We can therefore clean up everything!

First, we can remove the old Rocket model (the one that was empty and inherited from Aerospace::Rocket).
And finally, we can remove any special logic we added in Step 1 to support both Rocket.sti_name and Aerospace::Rocket.sti_name to now only support the latter.

And that's it! Migration complete! 🔥

Conclusion

It took a few steps, but by leveraging Rails' mechanism that fetches database records matching any of a model's children #sti_names, we were able to rename our Rocket model:

- without any downtime, and
- without any changes to Rails' handling of STI models

Additionally, although this blog post didn't cover it, a similar process can also be used for renaming models used in Polymorphic associations. This might be the subject of a future article.

Hopefully this guide can help you to easily rename STI models, especially when it comes to modularization of your large Rails monoliths (something we can strongly recommend after a few months of trying packs-rails internally)!

Interested in what we do at Potloc? Come join us! We are hiring 🚀

Data Analytics at Potloc I: Making data integrity your priority with Elementary & Meltano

Stéphane Burwash — Fri, 06 Jan 2023 03:28:33 +0000

Foreword

This is the first of a series of small blog posts where we describe plugins that our data engineering team at Potloc developed in order to solve business issues that we were facing.

These features were developed to enhance our current stack, which consists of:

Meltano as our DataOps platform + Extract / Load tool
dbt as our data transformation tool
Airflow
Bigquery as our data warehouse
AWS Fargate as our hosting infrastructure, managed through terraform

In this article, it is assumed that the reader has basic knowledge of:

Meltano
dbt

We hope that you enjoy the article! If you have any questions, feel free to reach out.

This article was not sponsored by Elementary in any way, we're just big fans 😉.

Data Integrity - It's more than a buzzword

Data integrity is an essential component of a data pipeline. Without integrity and trust, your pipeline is essentially worthless, and those hundreds of hours you spent integrating new sources, optimizing load times, and modeling raw data into usable insights go down the drain. More than once, our data team has created "production ready" dashboards, only to realise that the integrity / business logic behind the dashboard was completely flawed. What was supposed to be a 2-day project became a 3-week debacle.

To circumvent falling into the data quality trap, you can write tests using a number of powerful open-source data integrity solutions such as Great Expectations, Soda Core or even natively using dbt tests to validate that your data is doing what it's supposed to. But as you write more and more tests, you can start running into some issues, mainly:

Tracking over time: How do you keep track of test results over time? How are you progressing in terms of tackling these issues?

This was the starting point for our quest to find a long-term data integrity solution at Potloc. We were attempting to map integrity issues in user-inputted data. We also wanted to give our team an accurate report of their progress as they resolved these issues one-by-one in the source data.

Readability: As you go from 5 to 50 to 500 to 5000+ tests, reading results becomes exponentially more complicated and time-intensive.

Reproducibility: Once you have detected an integrity issue, how do you quickly reproduce the test to be able to investigate?

Unknown unknowns: While it is possible to test for every known possible issue, it can be harder / impossible to test for unknown unknowns such as dataset shifts, anomalies, large spikes in row count, etc.

These issues can be circumvented by integrating Elementary into your workflow.

Features

Elementary is an open-source tool for data observability and validation that wraps around an existing dbt project. It allows users to graduate from simply having integrity tests to using them in order to improve confidence in your data product.

Here are only some of the reasons why I personally love Elementary:

1. The UI, the glorious UI:

Elementary and its associated CLI (command line interface) edr natively allow you to generate a static HTML file containing test results. This file can either be viewed locally, sent through slack or even hosted in a cloud service to be viewed as a webpage.

From this UI, you can view your most recent test results, historical test results, check model run times and even view lineage graphs.

If you have any failures in tests run, you can view samples of offending entries or copy the SQL query that generated these errors to quickly investigate.

You can play around with Elementary's demo project to get a feel for it.

2. Stored results and historical views

Elementary integrates with your dbt project in order to store all test runs and uses on-run-end hooks to upload results. This all happens within the dbt package, without the need for a separate connection to the data warehouse.

This allows us to view test results over time, view progress from run to run, and use test results for internal reporting.

It can sometimes be hard to share integrity metrics with the rest of your non-data-literate team. Having easy access to results & metrics such as row count directly in your warehouse allows you to create integrity dashboards curated for business use cases, giving your team the opportunity to start tackling issues and take stock of their progress.

3. Anomaly detection tests

As mentioned above, it is hard to deal with unknown unknowns and issues that arise over longer periods of time (days, weeks, or even months). Even if your data is clean when your model first goes into production, it does not mean that mistakes/issues cannot slip in as time goes on. A supported table requires constant monitoring, especially if business logic has been hard-coded.

This task can be greatly alleviated by making use of Elementary's native anomaly detection tests, which monitor for shifts at the table and column level for metrics such as:

Row count
Null count
Percentage of missing
Freshness
Etc.

A full list of all anomaly metrics Elementary tests for can be found here.

By basing itself on past results rather than hard-coded baselines (ex: an increase of 10% in row count or half-day delay in freshness), Elementary can easily be integrated out-of-the-box without needing to fine-tune from pipeline to pipeline.
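To make the idea of a history-based baseline concrete, here is a minimal sketch of a z-score check in plain Ruby. This is only an illustration of the general technique, not necessarily how Elementary computes its anomalies; the anomalous? helper, the threshold of 3 standard deviations and the row counts are all assumptions made for the example.

```ruby
# Flag a new observation when it sits too many standard deviations away
# from the mean of past observations.
def anomalous?(history, value, threshold: 3.0)
  mean = history.sum.to_f / history.size
  variance = history.sum { |x| (x - mean)**2 } / history.size
  std_dev = Math.sqrt(variance)
  return value != mean if std_dev.zero? # flat history: any change is unusual

  ((value - mean) / std_dev).abs > threshold
end

daily_row_counts = [1000, 1020, 980, 1010, 990, 1005, 995] # made-up history

puts anomalous?(daily_row_counts, 1015) # => false
puts anomalous?(daily_row_counts, 1600) # => true
```

Since the baseline is recomputed from recent history, a permanent change in volume stops being flagged once enough new observations have accumulated.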

At Potloc, we mainly use this feature to identify freshness issues. Elementary allows you to easily set up freshness checks without having to specify hard deadlines (ex: 12h since last sync, 24h since last updated, etc.). This means that we can change our upstream Extract/Load job schedules without having to change our downstream tests; the tool will automatically flag the issue, and then adapt to the new schedule as it becomes the norm. This also makes the initial setup quick and painless.

Integrating Elementary into your existing Meltano Project

We developed an Elementary plugin for Meltano using the Meltano EDK that can easily be integrated into your project.

To add it, simply run:

meltano add utility elementary

This should add the following code snippet to your meltano.yml file:

  - name: elementary
    variant: elementary
    pip_url: elementary-data==<EDR VERSION> git+https://github.com/potloc/elementary-ext.git

You will also need to add the following snippet to your packages.yml file:

packages:
  - package: elementary-data/elementary
    version: <DBT PACKAGE VERSION>
    ## Docs: <https://docs.elementary-data.com>

As you can see, we have 2 elements we now need to complete:

EDR version
dbt package version

Both of these can be found in the Elementary quickstart documentation or in their respective package indexes (PyPI and dbt packages).

It is important that both of these versions are aligned in accordance with creator releases. If CLI and dbt package versions are misaligned, errors can ensue.

At the time of writing this article, we would be using EDR VERSION = 0.6.3 & DBT PACKAGE VERSION = 0.6.6

Note: We will be working on making this process easier so that you do not need to specify package versions.

Next, you need to set all of your environment variables for Elementary so that they use the same as your existing dbt project. A typical setup could look like this:

      - name: elementary
        namespace: elementary
        pip_url: elementary-data[platform]==0.6.3 git+https://github.com/potloc/elementary-ext.git
        executable: elementary_invoker
        settings:
        - name: project_dir
          kind: string
          value: ${MELTANO_PROJECT_ROOT}/transform/
        - name: profiles_dir
          kind: string
          value: ${MELTANO_PROJECT_ROOT}/transform/profiles/platform/
        - name: file_path
          kind: string
          value: ${MELTANO_PROJECT_ROOT}/path/to/report.html
        - name: skip_pre_invoke
          env: ELEMENTARY_EXT_SKIP_PRE_INVOKE
          kind: boolean
          value: true
          description: Whether to skip pre-invoke hooks which automatically run dbt clean and deps
        - name: slack-token
          kind: password
        - name: slack-channel-name
          kind: string
          value: elementary-notifs
        config:
          profiles-dir: ${MELTANO_PROJECT_ROOT}/transform/profiles/platform/
          file-path: ${MELTANO_PROJECT_ROOT}/path/to/report.html
          slack-channel-name: your_channel_name
          skip_pre_invoke: true
        commands:

          initialize:
            args: initialize
            executable: elementary_extension
          describe:
            args: describe
            executable: elementary_extension
          monitor-report:
            args: monitor-report
            executable: elementary_extension
          monitor-send-report:
            args: monitor-send-report
            executable: elementary_extension

Make sure to replace platform in the pip_url with your own data platform, which should match the one specified in your dbt profile (we use bigquery).

After this, simply follow the instructions in the Elementary Quickstart Guide to get the plugin up and running.

Generating your first report

Once you have got elementary up and running, it's time to generate your first report. Simply run the command

meltano invoke elementary:monitor-report

and a brand new report should be generated at the specified file-path (${MELTANO_PROJECT_ROOT}/path/to/report.html in our case).

Next steps

Once you've generated your first report, the sky is the limit in terms of integrating Elementary into your workflow.

The Elementary team has made it incredibly easy to send a report as a slack message. At Potloc, we receive reports twice a day to monitor the state of our pipeline.

You can also set up hosting for your report on s3 or send slack alerts when an error occurs. Experiment and find what best works for you!

A quick closing statement

While incredibly powerful, Elementary is not a replacement for best practices.
When writing tests, ensure that they are pertinent and targeted.
Tests should be written to identify data integrity issues that can compromise business insights, not simply to identify null values.
If you write too many tests without thinking of the meaning behind them, you run the risk of falling into the "too many errors = no errors" paradigm where you have so many warnings that it's impossible to differentiate between actual issues and unfixable noise.

I speak from experience; at one point, we had multiple tests in our pipeline that returned a warning of over 5 000 erroneous values, with one going up to 180 000. These errors were unactionable, and yet the tests remained. Even with Elementary, this made it hard for us to differentiate between useless warnings and actual integrity issues that needed to be resolved.

Make sure to reach out to the Elementary team if you have any questions about their product!


How to optimize factory creation.

Clément Morisset — Wed, 21 Dec 2022 20:24:45 +0000

At Potloc we have a test stack which is pretty standard in the Rails ecosystem. We run tests with RSpec, we use FactoryBot for setting up our test data, Capybara for user interactions, GitHub Actions as a CI, etc.

These great tools allow us to code at a fast pace with good test coverage. But this pace comes at a cost. The more the team grows, the bigger the codebase gets and the more tests get written.

🤕 The issue

As developers, we usually take care to optimize our own code and queries, and we are used to testing new implementations against all the edge cases. But test optimization is likely a topic that we put at the bottom of the list, if we ever think about it at all.
That is, until the moment when, slowly but surely, you end up with a CI that takes ages to run and a whole test suite to speed up.

Let’s see how we took on this challenge at Potloc.

For the purpose of demonstration we will rely on this simple test file:

RSpec.describe Purging::QuestionnaireWorker, type: :worker do
  let(:questionnaire) { create(:questionnaire) }

  describe "#perform" do
    it "destroys a survey form" do
      expect { subject.perform(questionnaire.id) }.to change(Questionnaire, :count).by(-1)
    end

    context "given associations" do
      it "destroys a questionnaire and its associations" do
        create(:question, :postal_code, questionnaire: questionnaire)

        expect { subject.perform(questionnaire.id) }.to change(SurveyQuestion, :count).by(-1)
      end
    end
  end
end

🧑‍🚒 The solutions

The FactoryBot gem is used in almost all of our spec files, and it makes our setup much easier than fixtures would.
Here is the tradeoff: the easier the gem is to use, the harder it becomes to keep its usage under control. And when the time comes to tackle slow tests, the best bet is to start digging into your factories, because they are likely the primary reason why your test suite is slowing down.

Avoiding the factory cascades

To quote Evil Martians, a factory cascade is an

uncontrollable process of generating excess data through nested factory invocations.

In our test we use two factories, questionnaire and question, which can be represented as trees like this:

questionnaire
|
|---- survey

question
|
|---- survey

In this simple example each factory calls a nested factory. This means that every time we create a questionnaire factory, we also create a survey factory.

To get a better view of which objects are created in our spec file, we can use test-prof, a powerful gem that provides a collection of tools to analyse your test suite performance. One of these tools is really useful for identifying factory cascades: the factory profiler.

If we run the factory profiler with FPROF=1 bundle exec rspec, it will generate the following report:

[TEST PROF INFO] Factories usage

 Total: 6
 Total top-level: 3
 Total time: 00:01.005 (out of 00:26.087)
 Total uniq factories: 3

   total   top-level     total time      time per call      top-level time               name

       3           0        0.6911s            0.2304s             0.0000s             survey
       2           2        0.6397s            0.3199s             0.6397s             questionnaire
       1           1        0.3658s            0.3658s             0.3658s             question

The most interesting insight is the difference between the total and the top-level columns. The bigger this difference, the more factory cascades you have, meaning that you are creating records you never asked for.

Let's take the survey: we don't instantiate any surveys directly in our test file, yet 3 are created during execution.

The easiest workaround is to instantiate a survey at the top level and to associate the other factories with it.

RSpec.describe Purging::QuestionnaireWorker, type: :worker do
  let(:survey) { create(:survey) }
  let(:questionnaire) { create(:questionnaire, survey: survey) }
  ....

  it "destroys a questionnaire and its associations" do
    create(:question, :postal_code, questionnaire: questionnaire, survey: survey)
    ....

If we run the factory profiler we now have a different report:

[TEST PROF INFO] Factories usage

 Total: 5
 Total top-level: 5
 Total time: 00:00.868 (out of 01:09.067)
 Total uniq factories: 3

   total   top-level     total time      time per call      top-level time               name

       2           2        0.6986s            0.3493s             0.6986s             survey
       2           2        0.0973s            0.0487s             0.0973s             questionnaire
       1           1        0.0729s            0.0729s             0.0729s             question

Nice! No more factory cascades: the total and the top-level columns are now the same. We now create 5 factories instead of 6, and the time spent creating factories dropped from 1.005s to 0.868s (roughly 14% less).
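The difference between total and top-level counts can be reproduced with a toy factory registry in plain Ruby. This is only a sketch of the bookkeeping, not how FactoryBot or the factory profiler are implemented: the ToyFactories class and the simulate helper are hypothetical, and every non-survey factory is assumed to build a survey unless one is passed in.

```ruby
# Toy factory registry that counts total vs. top-level creations.
class ToyFactories
  attr_reader :total, :top_level

  def initialize
    @total = Hash.new(0)
    @top_level = Hash.new(0)
    @depth = 0
  end

  def create(name, **overrides)
    @total[name] += 1
    @top_level[name] += 1 if @depth.zero?
    @depth += 1
    # Cascade: build a nested survey only if the caller did not provide one.
    create(:survey) if name != :survey && !overrides.key?(:survey)
    { name: name, **overrides }
  ensure
    @depth -= 1
  end
end

# Two examples, as in the spec file: each builds its own questionnaire,
# and the second one also builds a question.
def simulate(explicit_survey:)
  factories = ToyFactories.new
  2.times do |example|
    attrs = explicit_survey ? { survey: factories.create(:survey) } : {}
    questionnaire = factories.create(:questionnaire, **attrs)
    factories.create(:question, questionnaire: questionnaire, **attrs) if example == 1
  end
  factories
end

cascading = simulate(explicit_survey: false)
explicit  = simulate(explicit_survey: true)

puts "cascading: total=#{cascading.total.values.sum} top-level=#{cascading.top_level.values.sum}"
# => cascading: total=6 top-level=3
puts "explicit:  total=#{explicit.total.values.sum} top-level=#{explicit.top_level.values.sum}"
# => explicit:  total=5 top-level=5
```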

The caveat of this method is that it can be heavy to maintain. Thankfully, test-prof has a recipe called FactoryDefault. Removing factory cascades manually is good enough most of the time, but if you want to go further you can follow the documentation.

That being said, test-prof has even more to offer, it’s time to introduce an awesome helper named let_it_be.

Reuse the factory you need

Let's bring a little bit of magic and introduce a new way to set up a shared test data.

let_it_be is a helper that allows you to reuse the same record across your whole spec file. In our example we don't need to create 2 surveys and 2 questionnaires; we can reuse the same ones throughout the file.

RSpec.describe Purging::QuestionnaireWorker, type: :worker do
  let_it_be(:survey) { create(:survey) }
  let_it_be(:questionnaire) { create(:questionnaire, survey: survey) }
  ...
end

If we run the factory profiler we now have a different report:

[TEST PROF INFO] Factories usage

 Total: 3
 Total top-level: 3
 Total time: 00:00.272 (out of 00:24.264)
 Total uniq factories: 3

   total   top-level     total time      time per call      top-level time               name

       1           1        0.2024s            0.2024s             0.2024s             survey
       1           1        0.0323s            0.0323s             0.0323s             questionnaire
       1           1        0.0375s            0.0375s             0.0375s             question

So now we only create the factories we need, by reusing the same ones throughout our file.
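The difference in how often the setup block runs can be sketched in plain Ruby. This simulation only models the evaluation count (once per example versus once per file) and leaves out everything else the real helper does; the setup_runs function and the :survey_record placeholder are hypothetical.

```ruby
# Counts how many times the setup block runs across `examples` examples.
def setup_runs(examples:, shared:)
  runs = 0
  build = -> { runs += 1; :survey_record }
  shared_value = build.call if shared # let_it_be-style: built once, before all examples

  examples.times do
    memo = nil # let-style memo, reset for every example
    survey = -> { shared ? shared_value : (memo ||= build.call) }
    2.times { survey.call } # referencing it twice within one example
  end
  runs
end

puts setup_runs(examples: 3, shared: false) # => 3
puts setup_runs(examples: 3, shared: true)  # => 1
```

The saving grows with the number of examples in the file, which is why the effect is larger on real spec files than on this two-example one.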

Be aware that let_it_be comes with caveats. I strongly encourage you to read the documentation and use this powerful helper in accordance with your needs.

🚀 Conclusion

Let’s take a step back and relish our improvements:

	Initial	Without cascades	With let_it_be
Factories creation time	00:01.00	00:00.868	00:00.272

Numbers look nice for this simple example. But what is the impact in real life at Potloc?

So far we have only applied this recipe to one specific folder of our codebase. Below are the results of profiling that folder locally:

Before we spent 3.50 min in factories creation, now 2 min. (~ -50%)
Before we created 6824 factories, now 4378. (~ -35%)

test-prof is the swiss army knife we needed to speed up our test suite. It’s still a long journey but by embracing this topic we have already taken an important step!

Want to go further? Watch this 99 problems of slow test talk by Vladimir Dementyev


Automatic "Ready for Review" Github Action

Jérôme Parent-Lévesque — Fri, 01 Apr 2022 18:40:47 +0000

TLDR: We wanted a GitHub Action to automatically assign reviewers and mark a draft pull request as "Ready for review" after our test suite passes. The final code can be found in this gist here.

At Potloc, our continuous integration process involves, among other things, a GitHub workflow running on each push that tests the code against our full test suite. This check must pass for a pull request to be merged.

Our test suite has gotten to a size where it is difficult to run on a personal computer in a reasonable amount of time, hence our developers usually rely on this GitHub workflow to run the full test suite.

The process looks something like this:

Push code for a new feature
Create a new Pull Request in "Draft" mode
Wait for all the tests to pass
Mark the Pull Request as "Ready for review" and assign reviewers

Note that we consider it a good practice to wait until tests pass before assigning reviewers in order to prevent notifying them only to realize that some more changes are necessary.

In practice, we have an in-house tool to help us automate most of these tasks through the GitHub CLI, but for a long time we didn't have a way to automatically mark a pull request as "Ready for review" when all the tests passed, meaning we had to wait and periodically check the status of each of our PRs.

Inspired by Artur Dryomov's excellent post on Autonomous GitHub Pull Requests, we set out to create a GitHub Action to help us automate this.

Solution:

At the moment of creating the draft pull request, we want to be able to specify what to do in the event that all tests pass.

To achieve this, we will use a label named autoready that we can put on our pull requests to signify that the PR should be automatically marked as "Ready for review" when all tests pass.

In addition, we want to be able to automatically assign reviewers when that happens. For that, we will be using a specific comment format that looks like this:

autoready-reviewers: reviewer1,reviewer2,organization/team1

Our workflow will automatically detect comments like this and assign each of the listed individual or team reviewers.
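To show what this parsing amounts to, here is a small sketch in Ruby of how such a comment could be turned into lists of individual and team reviewers. It is an illustration only (the workflow itself is a bash script); the REVIEWERS_PATTERN constant and the reviewers_payload helper are hypothetical names, and the output keys mirror the reviewers and team_reviewers fields sent to GitHub.

```ruby
require "json"

# Matches "autoready-reviewers: a,b,org/team" and captures the comma-separated list.
REVIEWERS_PATTERN = %r{autoready-reviewers:\s([a-zA-Z0-9,\-/]+)}

# Returns the reviewers payload for a comment, or nil when it does not match.
def reviewers_payload(comment)
  match = REVIEWERS_PATTERN.match(comment)
  return nil unless match

  names = match[1].split(",").reject(&:empty?)
  teams, individuals = names.partition { |name| name.include?("/") }
  {
    "reviewers" => individuals,
    "team_reviewers" => teams.map { |team| team.split("/", 2).last } # keep only the team slug
  }
end

payload = reviewers_payload("autoready-reviewers: reviewer1,reviewer2,organization/team1")
puts JSON.generate(payload)
# => {"reviewers":["reviewer1","reviewer2"],"team_reviewers":["team1"]}
```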

Implementation

GitHub Workflow configuration

Our workflow should run after each run of our Test workflow and use its output status to determine whether or not to mark the pull request as "Ready for review".
.github/workflows/ready_for_review.yml:



name: Ready For Review
on:
  workflow_run:
    workflows: ["Test"]
    branches-ignore: [main]
    types:
      - completed
jobs:
  mark_as_ready_for_review:
    runs-on: self-hosted
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    steps:
      - name: Checkout Code
        uses: actions/checkout@v3
      - name: Mark as Ready for Review
        run: bash .github/workflows/mark_as_ready_for_review.sh "${{ secrets.ACCESS_TOKEN }}" "${{ join(github.event.workflow_run.pull_requests.*.number) }}"

This will run our custom script mark_as_ready_for_review.sh after each successful run of the Test workflow.

Some noteworthy points:

We need the Checkout Code action to get the latest version of this mark_as_ready_for_review.sh script.
Our script takes a couple of arguments as input:
1. A GitHub access token of the "user" on behalf of whom we will be performing these automatic actions. In our case, we have a dedicated bot account for this. We store this value in a GitHub secret secrets.ACCESS_TOKEN.
2. A comma-separated list of all pull request IDs associated with this workflow run. Since a workflow run is attached to a particular commit hash, it is possible that multiple PRs have that same commit hash as HEAD.

Bash Script

Here is the script dissected and explained (scroll to the bottom for the full script):



#!/bin/bash
set -euo pipefail # Exit on errors, unset variables, and failures anywhere in a pipeline

Our inputs and constants:



TOKEN="${1}"
PR_NUMBERS="${2}"
LABEL="autoready" # the name of the 'label' on the PR used to detect whether or not this script should run
REPO="your-repository" # the name of your repository on GitHub
ORGANIZATION="potloc" # the name of your GitHub organization or user to which the repository belongs

Then, we want to repeat the whole thing for as many pull requests as have been passed as input:



# Split the numbers string (comma-delimited)
for pr_number in $(echo $PR_NUMBERS | tr "," "\n"); do

Fetch the labels from the pull request. We will also need the Node ID of the PR to use GitHub's GraphQL API in a later step, so we also grab this at the same time.



# Get the node_id (and labels) from the PR number
# - https://docs.github.com/en/graphql/guides/using-global-node-ids
# - https://docs.github.com/en/rest/reference/pulls#get-a-pull-request
out=$(curl \
        --fail \
        --silent \
        --show-error \
        --header "Accept: application/vnd.github.v3+json" \
        --header "Authorization: token ${TOKEN}" \
        --request "GET" \
        --url "https://api.github.com/repos/${ORGANIZATION}/${REPO}/pulls/${pr_number}"
      )
node_id=$(jq -r '.node_id' <<< $out)
contains_label=$(jq "any(.labels[].name == \"${LABEL}\"; .)" <<< $out)
comments_url=$(jq -r ".comments_url" <<< $out)

# Check if the PR contains the label we want
if [ "$contains_label" == "true" ]; then
  # Continued below

Note that we use jq to simplify parsing of the JSON body returned by the GitHub API. This needs to be installed on the workers that will run this Workflow.

If the label exists on the PR, then we can mark it as "Ready for review". This operation is only available through GitHub's GraphQL API, hence the different request. This is where we make use of the previously-retrieved node_id:



# Mark the PR as ready for review
curl \
  --fail \
  --silent \
  --show-error \
  --header "Content-Type: application/json" \
  --header "Authorization: token ${TOKEN}" \
  --request "POST" \
  --data "{ \"query\": \"mutation { markPullRequestReadyForReview(input: { pullRequestId: \\\"${node_id}\\\" }) { pullRequest { id } } }\" }" \
  --url https://api.github.com/graphql
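The nested escaping in that --data argument is easy to get wrong. As a point of comparison, here is a sketch in Ruby of how the same request body could be assembled with a JSON library instead of by hand. The ready_for_review_body helper and the node ID are made up for the example; only the mutation itself comes from the script.

```ruby
require "json"

# Builds the JSON request body for the markPullRequestReadyForReview mutation.
def ready_for_review_body(node_id)
  # node_id.to_json yields a double-quoted, escaped string literal.
  query = "mutation { markPullRequestReadyForReview(input: { pullRequestId: #{node_id.to_json} }) { pullRequest { id } } }"
  JSON.generate("query" => query)
end

body = ready_for_review_body("PR_example123") # made-up node ID
puts body
```

Letting the library handle quoting means each layer (the GraphQL string and the JSON envelope) is escaped exactly once.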

Delete the label to prevent this script from running again for this PR:



# Remove the label
curl \
  --request "DELETE" \
  --header "Accept: application/vnd.github.v3+json" \
  --header "Authorization: token ${TOKEN}" \
  --url "https://api.github.com/repos/${ORGANIZATION}/${REPO}/issues/${pr_number}/labels/${LABEL}"

Finally, we want to find which reviewers to assign to this PR. To do this, we fetch all comments on the PR and use a regex to find a comment matching our autoready-reviewers: format we defined:



# Get the comments on the PR
comments_out=$(curl \
                --fail \
                --silent \
                --show-error \
                --header "Accept: application/vnd.github.v3+json" \
                --header "Authorization: token ${TOKEN}" \
                --request "GET" \
                --url "$comments_url")

# Look for a comment matching the 'autoready-reviewers: ' pattern
# If found, assign the mentioned reviewers to review this PR
jq -r ".[].body" <<< $comments_out | while IFS='' read comment; do
  if [[ $comment =~ autoready-reviewers:[[:space:]]([a-zA-Z0-9,\-\/]+) ]]; then
    all_reviewers=${BASH_REMATCH[1]} # Get the first matching group of the regex (the comma-separated list of reviewers)

Using this list of reviewers, we differentiate between teams (e.g. potloc/devs) and individuals to assign by looking for the / character:



# Split the reviewers between teams and individuals
reviewers_array=()
team_reviewers_array=()
for reviewer in $(echo $all_reviewers | tr "," "\n"); do
  if [[ $reviewer =~ [a-zA-Z0-9,\-]+\/[a-zA-Z0-9,\-]+ ]]; then
    # In the case of a team reviewer, only take the part of the username after the '/':
    slug_array=(${reviewer//\// })
    team_slug=${slug_array[1]}
    team_reviewers_array+=("\"$team_slug\"")
  else
    reviewers_array+=("\"$reviewer\"")
  fi
done

# Join the array elements into a single comma-separated string:
reviewers=$(IFS=, ; echo "${reviewers_array[*]}")
team_reviewers=$(IFS=, ; echo "${team_reviewers_array[*]}")

The very last step is to make the API call to assign these individuals and teams as reviewers to the PR:



# Assign reviewers
curl \
  --fail \
  --silent \
  --show-error \
  --output /dev/null \
  --header "Accept: application/vnd.github.v3+json" \
  --header "Authorization: token ${TOKEN}" \
  --request "POST" \
  --url "https://api.github.com/repos/${ORGANIZATION}/${REPO}/pulls/${pr_number}/requested_reviewers" \
  --data "{\"reviewers\":[${reviewers}], \"team_reviewers\":[${team_reviewers}]}"
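For readers who find the Bash string handling hard to follow, here is a minimal Python sketch of the same parsing logic: find the autoready-reviewers: comment, split teams from individuals, and build the JSON body sent to the requested_reviewers endpoint. The function name and the sample usernames (alice, bob) are hypothetical; the script above remains the actual implementation.

```python
import json
import re

# Same character class as the Bash regex: letters, digits, ',', '-' and '/'.
REVIEWERS_PATTERN = re.compile(r"autoready-reviewers:\s([a-zA-Z0-9,\-/]+)")


def build_reviewers_payload(comment):
    """Return the JSON body for the requested_reviewers call, or None if no match."""
    match = REVIEWERS_PATTERN.search(comment)
    if match is None:
        return None

    reviewers, team_reviewers = [], []
    for reviewer in match.group(1).split(","):
        if not reviewer:
            continue  # ignore empty entries such as a trailing comma
        if "/" in reviewer:
            # Team reviewer: only keep the slug after the '/'.
            team_reviewers.append(reviewer.split("/", 1)[1])
        else:
            reviewers.append(reviewer)

    return json.dumps({"reviewers": reviewers, "team_reviewers": team_reviewers})


# Hypothetical comment body, using the format described in this article.
payload = build_reviewers_payload("autoready-reviewers: alice,bob,potloc/devs")
print(payload)
```

Building the body with a JSON serializer avoids the manual quoting that the Bash version needs for each array element.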

Conclusion

And that is it! Now, to use this tool we can put the autoready label on a draft pull request and write a comment in the form autoready-reviewers: reviewer1,reviewer2,organization/team1.

In practice, at Potloc, we have a small in-house helper tool that performs these steps for us, using the GitHub CLI and tty-prompt to ease the selection of reviewers/teams and the formatting of this comment.

And this is what it looks like on GitHub's interface!

Interested in what we do at Potloc? Come join us! We are hiring 🚀

Full code:
https://gist.github.com/jeromepl/02e70f3ea4a4e8103da6f96f14eb213c

Streamline a service desk with JIRA automation

Clément Morisset — Mon, 17 Jan 2022 12:22:38 +0000

At Potloc, our service desk is dedicated to the operations teams, who can open a ticket for a bug or a service request. The more our web application grows, the more maintenance it requires. As a result, we've seen a growing number of tickets pertaining to the same project.

The need

We wanted a way to link these tickets together to gather additional context.

Were there any issues on this project in the past?
What was the solution?
Which developer worked on it?

These questions can be crucial in getting the visibility required when it comes to debugging.

The solution

One word: Automation

This aptly named feature allows you to automate tasks through a series of actions triggered by specific events.

Let’s link all created tickets for the same project using automation.

The how

In our particular case the event is When a new issue is created. The entry point for grouping tickets by project is the URL field: each request includes a link that makes the issue easy to locate. In 99% of cases this URL contains the ID of a project. So the first action to perform after an issue is created is to scan the URL in order to extract the project ID.

1. Extract the survey ID

You can have a hidden field that is not displayed to the user but will allow you to store the extracted ID.
Let's add a New component > New action > Edit an issue and the field you want to use. For the purpose of this article we will use Root cause.

Above we are using Smart values, according to Jira documentation:

Smart values allow you to access issue data within Jira. For example, you can use the following smart values to send a Slack message that includes the issue key and issue summary: {{issue.key}} {{issue.summary}}

customfield_10364: is the ID for our URL field that contains the project ID. You can go to settings or inspect your input to find it.

substringBetween: is a built-in function that returns the text between the given parameters.

From this moment your field will contain the ID. But for the rest of the flow to see the up-to-date issue (with the extracted ID), we have to re-fetch the issue data by adding a new action, Re-fetch issue data.

Next, we would like to link issues with the same project ID.

2. Find linked issues

Jira has an action for this called Lookup issues.

cf[10359] : is the identifier of our Root cause field (replace the number with your ID)

issue.customfield_10359 : is the identifier of the Root cause field for the other issues

issueKey != "issue.key" : we exclude our current issue from the lookup

⚠️ Careful! Your fields will be named differently according to your need.

What we do here is search via JQL for previously-created issues whose extracted ID matches that of the current issue, while making sure the current issue itself is excluded from the search.

Note: In our context, Root cause is a text field. It's likely that you have to find another operator if the type of your field is different.
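To make the two automation steps more concrete, here is a rough Python sketch of what they compute: pulling the ID out of a URL (a stand-in for substringBetween) and assembling the lookup query. The URL shape, the issue key DESK-42 and the use of the ~ (contains) operator for a text field are assumptions for illustration; the real rule uses Jira smart values rather than Python.

```python
def substring_between(text, start, end):
    """Rough stand-in for Jira's substringBetween: text found between two markers."""
    begin = text.find(start)
    if begin == -1:
        return ""
    begin += len(start)
    finish = text.find(end, begin)
    return "" if finish == -1 else text[begin:finish]


def lookup_jql(project_id, current_issue_key, field_id=10359):
    # cf[<id>] addresses a custom field by its ID; '~' is the JQL "contains"
    # operator used with text fields (an assumption about the field type).
    return f'cf[{field_id}] ~ "{project_id}" AND issueKey != {current_issue_key}'


# Hypothetical URL shape; adapt the markers to your own URLs.
url = "https://app.example.com/projects/1234/edit"
project_id = substring_between(url, "/projects/", "/")
print(project_id)
print(lookup_jql(project_id, "DESK-42"))
```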

3. Link similar issues

Finally, we need to link all results from our previous query with our current issue using the Link issue to action.

That's it. From now on you should see previously-created tickets for this project on all your tickets.

Here’s the final result:

Pro tip: You could add a condition as a guard clause to stop the automation if the Lookup issues action does not return anything.

Happy automation. 🎉

Interested in what we do at Potloc? Come join us! We are hiring 🚀

Optimally Taking Out Extra Survey Respondents

Jérôme Parent-Lévesque — Tue, 05 Oct 2021 17:54:34 +0000

Sometimes when analysing the results of a survey, one needs to remove some respondents from their sample. This is something we do fairly commonly at Potloc in order to obtain a more representative sample of the target population in our surveys. In other words, we use this as a way of performing stratified sampling.

We use a system of quotas to keep track of every agreement on the respondent sample we make with our clients. We have three types of quotas, each corresponding to a different way of assessing whether their target is met or not.

To match targets exactly (like we want to achieve for stratified sampling) we use one of these quota types - the strict quota type. For example, our clients might want exactly 50 respondents who work as electricians. No matter whether we have one respondent missing or one more than 50 in this category, our quota is not achieved.

The second type of quota we use is the minimum type. This type, as the name suggests, simply indicates that we must have at least as many respondents of a specific category as the target number.

The third and final type of quota is the weighted type. As we often use a weighting process to obtain a more representative sample of our population, we make sure to communicate with our clients where survey responses may be weighted. This communication in turn gets converted into quotas of type weighted which behave similarly to minimum quotas, but with more flexibility. The targets don't need to be matched exactly and will instead be achieved through an independent weighting process (don't worry, this will be explained in more details in Step 2 below). The "minimum" for this type of quota is (arbitrarily) set to 50% of the target as a way to limit the scale of the weights (this way weights should rarely be more than 2).

The image above shows an example of a combination of quotas we could have. Here, we want to end up with a minimum of one respondent who has a cat, exactly one respondent who is a doctor, and, after weighting, one effective respondent whose name contains the letter 'a' and two effective respondents whose name is shorter than 7 letters.

Imagine now that we have received the following responses to our survey:

In this initial state, we have:

1 respondent who has a cat (Alice)
2 doctors (Bob and Catherine)
2 respondents whose name contains the letter 'a' (Alice and Catherine)
2 respondents whose name is shorter than 7 letters (Alice and Bob)

The minimum quota is therefore satisfied (but Alice cannot be removed without breaking it) and there is one doctor too many. For both weighted quotas, we have at least 50% of the target number of respondents. As we will see later, the weighted quotas will be useful in determining the optimal respondents to take out.

We will keep referring to these quotas and respondents throughout this article to provide a practical example of how we select respondents to take out.

We set out to find the optimal selection of respondents to take out given a set of quotas such as this one. Below is the full step-by-step explanation of the algorithm we use to perform this and an example of how it is applied to this fictional set of quotas and respondents.

Step 1

We first identified that by determining which respondents belonged to which quotas, we could split the respondents into 3 different categories:

Respondents that cannot be taken out are respondents that belong to quotas for which the target is not exceeded. For example, our minimum quota "has a cat" has a target of 1 and only Alice fits into this category. Therefore, Alice cannot be taken out as otherwise the "has a cat" quota would be broken. The same goes for quotas with fewer respondents than the target, for example if the target was to have 2 respondents who own a cat.

Respondents that should be taken out as a priority are respondents that belong specifically to a strict quota for which the target is exceeded. Since for this type of quota we want to end up with exactly the target number of respondents, we have to take out respondents belonging to this quota until that target is matched. This group takes priority over the Respondents that cannot be taken out as we prioritise taking out respondents in exceeded strict quotas until those quotas are satisfied.

Respondents that may be taken out are all remaining respondents. These respondents may or may not belong to any quota. If they do, then that quota's target has to be exceeded — otherwise they would be in the Respondents that cannot be taken out category. Note that these respondents logically cannot belong to any strict quota since those belonging to this type of quota must fit in one of the first 2 categories.

Going back to our example, our three respondents would fall into the following groups:

The respondents that should be taken out as a priority group includes both doctors (Bob and Catherine) as there is one too many to satisfy the strict quota. Alice cannot be taken out because she is the only respondent who has a cat. The last group is empty as all respondents already belong to other groups.
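The categorisation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not our production code: the data structures and function names are hypothetical, and "not exceeded" is interpreted as "removing one more respondent would drop the quota below its minimum" (50% of the target for weighted quotas).

```python
def minimum_required(quota):
    # Assumption: weighted quotas only need 50% of their target.
    if quota["type"] == "weighted":
        return 0.5 * quota["target"]
    return quota["target"]


def categorise(respondents, quotas):
    """Split respondents into (priority, cannot_take_out, may_take_out)."""
    priority, cannot, may = [], [], []
    for name in respondents:
        own = [q for q in quotas if name in q["members"]]
        if any(q["type"] == "strict" and len(q["members"]) > q["target"] for q in own):
            # Exceeded strict quota: takes priority over the "cannot" group.
            priority.append(name)
        elif any(len(q["members"]) - 1 < minimum_required(q) for q in own):
            # Removing this respondent would break one of their quotas.
            cannot.append(name)
        else:
            may.append(name)
    return priority, cannot, may


respondents = ["Alice", "Bob", "Catherine"]
quotas = [
    {"type": "minimum", "target": 1, "members": {"Alice"}},               # has a cat
    {"type": "strict", "target": 1, "members": {"Bob", "Catherine"}},     # doctor
    {"type": "weighted", "target": 1, "members": {"Alice", "Catherine"}}, # name contains 'a'
    {"type": "weighted", "target": 2, "members": {"Alice", "Bob"}},       # name shorter than 7
]
print(categorise(respondents, quotas))
```

With the fictional respondents, both doctors land in the priority group, Alice cannot be taken out, and the last group is empty.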

Step 2

Now that we have a categorisation of each respondent, we are almost ready to start taking out respondents. However, since our objective is to optimally take out respondents, we need to compute one more piece of data related to weighted-type quotas.

First, we need to define a bit better what we mean by optimally here.
Our survey results are usually calculated on weighted data in order to better match the target population demographics.

In other words, as part of our survey workflow, we compute a weight for each respondent and use it as a multiplicative factor to scale the "importance" of each survey response. This is a process called weighting.

The weights can be interpreted as a measure of the quality of our respondents sample by looking at their distribution. The further away from the value of 1 the weights are, the worse the quality. Indeed, a small weight indicates that we have too many similar respondents and a large weight indicates that we are missing respondents with similar characteristics.
For more details on the weighting process I invite you to read my previous blog post on Generalized Weighting.

Thus, when taking out extra respondents, we would like to ensure that our weighting quality will be unaffected. This is the key to our notion of optimality — we want not only to satisfy all quotas but also to obtain the highest possible weighting quality as a result.

To achieve this, we compute weights for each respondent based on the targets of the weighted quotas. Using a raked weighting algorithm, we use all weighted quota numbers as "targets" to obtain respondent weights.

In our example, we obtain the following weights by using the targets from the two weighted quotas:

Notice that by multiplying each respondent's (numerical) answer by their weight, the weighted quota targets are matched perfectly! The count of respondents whose name contains the letter 'a' becomes 1 (from 2) as it is now the sum of 0.59 and 0.41. Meanwhile, the count of respondents whose name is shorter than 7 letters stays 2, although the weight of each respondent differs.

In the next step, we will be removing respondents with the smallest weights first whenever we cannot decide who to take out!
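As a quick sanity check, the effective counts can be recomputed from the weights quoted above (0.59 for Alice, 1.41 for Bob, 0.41 for Catherine). The dictionary layout and helper name below are only illustrative.

```python
# Weights quoted in this article, rounded to two decimals.
weights = {"Alice": 0.59, "Bob": 1.41, "Catherine": 0.41}

weighted_quotas = {
    "name contains 'a'": {"members": ["Alice", "Catherine"], "target": 1},
    "name shorter than 7 letters": {"members": ["Alice", "Bob"], "target": 2},
}


def effective_count(members, weights):
    """Weighted number of respondents in a quota: the sum of their weights."""
    return sum(weights[name] for name in members)


for label, quota in weighted_quotas.items():
    count = effective_count(quota["members"], weights)
    print(f"{label}: {count:.2f} effective respondents (target {quota['target']})")
```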

Step 3

Our respondents now belong to one of the 3 categories presented in Step 1 and each have a weight resulting from the raked weighting computation from Step 2. It is now time to start taking out respondents.

The core of the strategy here is to take out respondents one-by-one. After each respondent that is taken out, our quotas and weighting need to be updated, meaning that steps 1 and 2 need to be performed again! We therefore perform this step in a loop where in each iteration we recompute the first 2 steps before choosing and taking out 1 respondent.

This respondent is chosen according to the given priority list:

Select a pool of respondents to pick from:
- If there are any respondents in the Respondents that should be taken out as a priority category, then limit our selection to this group only
- Otherwise, if there are any Respondents that may be taken out, select this group
From this pool, select the optimal respondent to be taken out:
- The optimal respondent corresponds to the respondent with the smallest weight, since a small weight indicates that we have many similar respondents
- In the case of a tie, or if there are no weighted quotas, we remove the last respondent to have answered the survey. (Note: the statistically correct thing to do here would be to remove a random respondent from the pool, but we choose this approach as it is idempotent — we can re-run the algorithm and the selected respondents will be the same. Additionally, this replicates the behaviour of traditional sampling tools that have quota-stops)
Take out the selected respondent and repeat from Step 1 until all strict quotas are met!

Here's how this would play out in our fictional example:

We have 2 respondents in the Respondents that should be taken out as a priority category (Bob and Catherine) and thus only these respondents are taken into consideration
To determine who to remove from these two respondents, we take a look at their data:

Since Catherine has the smallest weight (0.41 vs. 1.41), she is taken out. Intuitively, this makes sense as we had one respondent too many whose name contained the letter 'a' to satisfy the weighted quota without even applying a weighting.

We are now left with one respondent whose name contains the letter 'a' and two respondents whose name is shorter than 7 letters, meaning that our final weights will be exactly 1 — the optimal value for weights!

Additionally, now that respondent "Catherine" has been taken out, all of our quotas are satisfied and we can stop the algorithm here.
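The selection rule for a single iteration can be sketched as follows. This is a simplified illustration: the function name is hypothetical, and the answer order (1 = first to answer) is an assumed representation of "the last respondent to have answered the survey".

```python
def pick_respondent(priority, may_take_out, weights, answer_order):
    """Pick the next respondent to take out, or None if nobody can be removed."""
    # The priority group, when non-empty, is the only pool considered.
    pool = priority or may_take_out
    if not pool:
        return None
    # Smallest weight first; ties (or missing weights) go to the latest answer.
    return min(pool, key=lambda r: (weights.get(r, 1.0), -answer_order[r]))


weights = {"Alice": 0.59, "Bob": 1.41, "Catherine": 0.41}
answer_order = {"Alice": 1, "Bob": 2, "Catherine": 3}  # assumed order of answers

print(pick_respondent(["Bob", "Catherine"], [], weights, answer_order))
```

In the full algorithm this call sits inside a loop that recomputes the categories and the weights after every removal.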

Conclusion

Using this process, we are able to remove respondents so as to match our quotas as best as we can, while also leading to a better survey weighting. Indeed, since we always remove respondents with the smallest weights, our weighting gets progressively better as the minimum weight gets closer and closer to 1 (the optimal value). This means that the final data presented for this survey — after the weighting step — will be more representative of the target population, a win for both Potloc and our clients!

Interested in what we do at Potloc? Come join us! We are hiring 🚀

Appendix - The case of multiple overlapping `strict` quotas

We might sometimes have respondents that correspond to multiple different strict quotas. In this scenario, it is more complicated to select the optimal respondents to take out, as it is not always obvious what the smallest possible set of respondents to remove is in order to satisfy all such quotas. It is, for example, possible to have a respondent (let's call them respondent A) that we take out because they belong to a strict quota which has exceeded its target. We may then take out other respondents who also belong to this quota because they belong to another strict quota which was exceeded as well. This could break the first quota if it had met its target exactly. In this scenario, we end up with a respondent (respondent A) who can now be reinstated, as the strict quota they belonged to is now under its target.

To alleviate this problem while avoiding the complicated and expensive decision-process solutions from the field of operations research, we employ two mechanisms.

First, we try to more optimally pick which strict quota respondents to take out first. To do this, we also consider the following factors:

Whether the respondent can be disqualified or not (if all of its quotas are exceeded)
The number of exceeded strict quotas a respondent is a part of (more = higher priority)
The total number of strict quotas a respondent is a part of (fewer = higher priority)
- This is used to minimise the impact of taking out respondents on other quotas
The minimum difference between the current count and the target count of strict quotas that are exceeding their target (bigger = higher priority, as there is more room to remove respondents)

Second, we add a final step at the end of the process in which we restore respondents that can be reinstated without breaking any quota. This solves the issue highlighted in the example above.

Using these two approximations we are able to get a result that is close to optimal, for a minimal cost.
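The restoration step can be sketched like this. It is a simplified sketch that assumes only strict quotas can be broken by adding a respondent back, and that the list passed in contains respondents who were taken out because of strict quotas; the names and data layout are hypothetical.

```python
def restore_respondents(removed, kept, strict_quotas):
    """Reinstate removed respondents when no strict quota would exceed its target."""
    kept = set(kept)
    restored = []
    for respondent in removed:
        fits = all(
            len(quota["members"] & kept) + 1 <= quota["target"]
            for quota in strict_quotas
            if respondent in quota["members"]
        )
        if fits:
            kept.add(respondent)
            restored.append(respondent)
    return restored


# Respondent A was taken out for Q1, then B was taken out for Q2,
# which left Q1 one respondent short of its target.
strict_quotas = [
    {"name": "Q1", "target": 2, "members": {"A", "B", "C"}},
    {"name": "Q2", "target": 1, "members": {"B", "D"}},
]
restored = restore_respondents(removed=["A", "B"], kept=["C", "D"], strict_quotas=strict_quotas)
print(restored)
```

Here A is reinstated because Q1 is back under its target, while B stays out since restoring them would exceed both quotas.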

How to use Sentry for profiling a test suite

Clément Morisset — Mon, 20 Sep 2021 20:11:37 +0000

At Potloc, one of our core values is learning. That's why, each quarter, the development team has dedicated time to explore new things. This aptly named Dev Happiness Week allows us to tackle a new pattern or an open source project, write a blog post, and so on.

One of my initial projects was to monitor our test suite. The faster our test suite grew, the more seconds were added to our CI. But without any monitoring of how long each test takes to run, it was complicated to identify and speed up the slowest ones. Let's fix this by asking the following question:

What is the minimal valuable move to have a test profiling dashboard ?

We use Sentry for error tracking, and it has a great performance monitoring feature that allows you to track queries and get transaction durations.

Hence the following steps will show you how to twist the performance overview page of Sentry into a test suite monitoring. 🤓

Disclaimer: The following example assumes that you use the RSpec gem and that you already have a dedicated Sentry environment for it (eg: app-test-suite).

We would like to satisfy three specifications:

Cover all of our test suite
Don't flood our monitoring with unnecessary data (eg: unit tests that run in under 1 second)
Make it easy to locate our slower tests

Cover all of our test suite

The easy one. Let's start by creating a profiling.rb file that will be run in our test suite, add an around hook and compute the elapsed time between the beginning of a run and the end.

#spec/support/config/profiling.rb

RSpec.configure do |config|
  config.around(:example) do |example|
    example_start = Time.now

    example.run

    example_end = Time.now
    elapsed_time_in_seconds = example_end - example_start
  end
end

Nice, we now cover the first bullet point of our specification.

Don't flood our monitoring with unnecessary data

The purpose here is to avoid flooding Sentry when everything works as expected. We only want to send data when a test is slow or takes longer than expected.

We don't have the same expectations for a system test and a unit test. Hence we have to set different thresholds accordingly:

#spec/support/config/profiling.rb

SLOW_EXECUTION = ["system"].freeze
MEDIUM_EXECUTION = ["integration", "export"].freeze
SLOW_EXECUTION_THRESHOLD = 6
MEDIUM_EXECUTION_THRESHOLD = 3
DEFAULT_EXECUTION_THRESHOLD = 1

RSpec.configure do |config|
  config.around(:example) do |example|
    category = spec_category(example)
    example_start = Time.now

    example.run

    example_end = Time.now
    elapsed_time_in_seconds = example_end - example_start
    if elapsed_time_in_seconds > threshold_for(category)
      print "Slow alert!"
    end
  end
end

# input: "./spec/integration/restaurants/owner_spec.rb:16"
# output: "integration"
def spec_category(example)
  location = example.metadata[:location].to_s
  location.gsub("./spec/", "")
          .scan(%r{^[^/]*})[0]
end

def threshold_for(category)
  case category
  when *SLOW_EXECUTION then SLOW_EXECUTION_THRESHOLD
  when *MEDIUM_EXECUTION then MEDIUM_EXECUTION_THRESHOLD
  else
    DEFAULT_EXECUTION_THRESHOLD
  end
end

First we set our threshold constants, which correspond to maximum execution times.
Then we add a spec_category method that allows us to identify the type of test, and we check whether the runtime is higher than expected.
If it is, we print a beautiful message.

We are almost there! We just have to generate a custom Sentry transaction in order to populate the performance table with our profiling data.

#spec/support/config/profiling.rb

...

RSpec.configure do |config|
  config.around(:example) do |example|
    category = spec_category(example)
    example_start = Time.now
    transaction = Sentry::Transaction.new(op: category.upcase, hub: Sentry.get_current_hub)

    example.run

    ...

    if elapsed_time_in_seconds > threshold_for(category)
      transaction.finish
    end

    ...

Our Sentry::Transaction takes 2 arguments:

op : used for the name of the operation (eg: sql.query ). In our case we will populate it with the category of our spec.
hub: Sentry requires a hub; we use the current one. According to the documentation, "You can think of the hub as the central point that our SDKs use to route an event to Sentry"

By finishing our transaction we send our event to Sentry. From now on you should see your first tests appear on the performance dashboard.

Feel free to read the Sentry documentation for a better understanding of the performance table.

⚠️ Because we are repurposing Sentry's performance dashboard for our test suite, we have to deal with Sentry's default behaviour, such as sending an event whenever a failure occurs in your environment.

That's why it's necessary to add the following configuration in our Sentry test initializer.

#spec/support/config/sentry.rb

Sentry.init do |config|
  config.dsn = "https://4**************************************3"
  config.excluded_exceptions += ["Exception"]
end

In our profiling context, we don't want to send an event when a test fails on our CI. To avoid noise and flooding, you have to exclude all exceptions.

Make it easy to locate our slower tests

By default the transaction name corresponds to the name of the controller where the transaction has been run. In the screenshot above it shows that our test uses the method template in ChartsController. It's not really convenient to identify what test this concerns.
We need to tweak our transaction a little bit more:

#spec/support/config/profiling.rb

...

RSpec.configure do |config|
  config.around(:example) do |example|
    ...

    if elapsed_time_in_seconds > threshold_for(category)
      assign_custom_transaction_name(example)
      transaction.finish
    end
  end
end

...

def assign_custom_transaction_name(example)
  Sentry.configure_scope do |scope|
    scope.set_transaction_name(example.metadata[:location])
  end
end

By using the configuration scope we are able to update our transaction name. According to documentation, "The scope will hold useful information that should be sent along with the event".

Here we assign our test path as the transaction name. That's it.
Run your tests again and you should see each slow test listed under its file path.

For the purpose of development we didn't add a guard clause so far, but you can check if you are on your CI environment before profiling your tests.

The entire profiling.rb file should look like this.

#spec/support/config/profiling.rb

SLOW_EXECUTION = ["system"].freeze
MEDIUM_EXECUTION = ["integration", "export"].freeze
SLOW_EXECUTION_THRESHOLD = 6
MEDIUM_EXECUTION_THRESHOLD = 3
DEFAULT_EXECUTION_THRESHOLD = 1

RSpec.configure do |config|
  if ENV["CI"]
    config.around(:example) do |example|
      category = spec_category(example)
      example_start = Time.now
      transaction = Sentry::Transaction.new(op: category.upcase, hub: Sentry.get_current_hub)

      example.run

      example_end = Time.now
      elapsed_time_in_seconds = example_end - example_start

      if elapsed_time_in_seconds > threshold_for(category)
        assign_custom_transaction_name(example)
        transaction.finish
      end
    end
  end
end

# input: "./spec/integration/restaurants/owner_spec.rb:16"
# output: "integration"
def spec_category(example)
  location = example.metadata[:location].to_s
  location.gsub("./spec/", "")
          .scan(%r{^[^/]*})[0]
end

def threshold_for(category)
  case category
  when *SLOW_EXECUTION then SLOW_EXECUTION_THRESHOLD
  when *MEDIUM_EXECUTION then MEDIUM_EXECUTION_THRESHOLD
  else
    DEFAULT_EXECUTION_THRESHOLD
  end
end

def assign_custom_transaction_name(example)
  Sentry.configure_scope do |scope|
    scope.set_transaction_name(example.metadata[:location])
  end
end

That's all, happy profiling ! 🎉

Interested in what we do at Potloc? Come join us! We are hiring 🚀

OAuth Tokens & Potlock gem

Thibault Couraud — Fri, 17 Sep 2021 15:57:25 +0000

A bit of context 👋🏽

When calling APIs that use OAuth as their authentication process, you need to generate an access token. And to get an access token, we have to use a refresh token stored on the server.

Here's the OAuth workflow to generate this access token:

Image source: developer.ebay.com

So what was the need? 🤔

An access token expires after a certain time (minutes, hours or days, depending on the provider), so we need to refresh it from time to time.

The issue was that different processes were refreshing the token at the same time, invalidating each other's freshly generated access tokens.

So we had to find a way to be sure that only one process can refresh the token.

Here comes the gem 🚀

Today we introduce our new gem: Potlock - a Distributed Read-Write lock using redis

(available on Github here: GitHub - potloc/potlock)

This brand new gem only allows one simultaneous reader or writer. And if the lock is taken, any readers or writers who come along will have to wait.

Here's an example of how we use this gem at Potloc:

def token
  lock = Potlock::Client.new(key: "snapchat_api")

  # Fetch the token, refresh it if not present
  token = lock.fetch { refresh_token! }

  # A token is invalid when empty or expired
  raise InvalidToken unless valid?(token)

  token
rescue InvalidToken => _e
  # Generate and save a new token
  lock.set { refresh_token! }
  token = lock.get
end

This way, we are sure that all the processes will have the same valid access token and won't overwrite it at the same time 🎉

Interested in what we do at Potloc? Come join us! We are hiring 🚀

Generalized Raking for Survey Weighting

Jérôme Parent-Lévesque — Tue, 29 Jun 2021 15:28:44 +0000

In the world of surveys, it is very common that our acquired responses need to be weighted in order to achieve a sample that is representative of some target population. This process of weighting simply consists of assigning a weight (a.k.a. factor) to each respondent, and calculating all survey results as a weighted sum of respondents.

For example, we might have surveyed 100 male respondents and 150 female respondents but were targeting a male / female ratio of 48% / 52%. In this simple case, we could achieve the target ratio by weighting the male responses by a factor of 0.48 / (100 / (100 + 150)) = 1.2 and weighting the female responses by 0.52 / (150 / (100 + 150)) = 0.867.
The technical term for this method of computing weights is Post-Stratification.
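The arithmetic above generalises to any number of groups: each weight is the target share divided by the observed share. A minimal sketch, with a hypothetical function name:

```python
def post_stratification_weights(sample_counts, target_shares):
    """Weight per group = target share / observed share in the sample."""
    total = sum(sample_counts.values())
    return {
        group: target_shares[group] / (count / total)
        for group, count in sample_counts.items()
    }


weights = post_stratification_weights(
    sample_counts={"male": 100, "female": 150},
    target_shares={"male": 0.48, "female": 0.52},
)
print({group: round(weight, 3) for group, weight in weights.items()})
```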

However, in a more complex scenario, where we have many different measurable demographic targets, how can we determine weights for all the survey respondents?

Raking

At Potloc, it is very common that our clients desire survey populations matching a lot of such targets. For example, we might have targets looking like this:

42% male
58% female
20% students
80% non-students
15% dog owners
...

In this setting, weights cannot be calculated using a simple ratio as in the male/female example shown above. Here, we instead need to rely on more involved algorithms, notably a process called Raking.

Iterative Proportional Fitting

One common approach to solve the problem of finding good weights that will satisfy our demographic targets is Iterative Proportional Fitting. Typically in the industry, when the term "raking" is used it refers to this algorithm. In this method, weights for each respondent are computed for a single target at a time using Post-Stratification. By iteratively computing this for each target and repeating a few times, the weights end up converging to values that satisfy our targets.
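Here is a minimal pure-Python sketch of Iterative Proportional Fitting on two dimensions. The sample of ten respondents is made up for illustration, the function names are hypothetical, and the sketch assumes every target category has at least one respondent (otherwise the division below would fail).

```python
def iterative_proportional_fitting(respondents, targets, iterations=200):
    """Rake weights one dimension at a time until the targets are matched."""
    n = len(respondents)
    weights = [1.0] * n
    for _ in range(iterations):
        for dimension, shares in targets.items():
            for category, share in shares.items():
                members = [i for i, r in enumerate(respondents) if r[dimension] == category]
                current = sum(weights[i] for i in members)
                # Post-stratification step for this single category.
                factor = share * n / current
                for i in members:
                    weights[i] *= factor
    return weights


def weighted_share(respondents, weights, dimension, category):
    matching = sum(w for r, w in zip(respondents, weights) if r[dimension] == category)
    return matching / sum(weights)


# Made-up sample of 10 respondents.
respondents = (
    [{"gender": "male", "student": "yes"}] * 2
    + [{"gender": "male", "student": "no"}] * 3
    + [{"gender": "female", "student": "yes"}] * 1
    + [{"gender": "female", "student": "no"}] * 4
)
targets = {
    "gender": {"male": 0.42, "female": 0.58},
    "student": {"yes": 0.20, "no": 0.80},
}

weights = iterative_proportional_fitting(respondents, targets)
print(round(weighted_share(respondents, weights, "gender", "male"), 4))
print(round(weighted_share(respondents, weights, "student", "yes"), 4))
```

A fixed iteration count keeps the sketch short; a real implementation would stop once the weighted shares are within a tolerance of the targets.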

Great! Problem solved!

...but what if we could do even better? 🤔

Generalized Raking

Beyond satisfying the demographic targets, the most desirable property for the weights is that they should be as close as possible to 1. Indeed, weights that are really large mean that those respondents' responses will count for a lot more than the "average" respondent in our survey results. For example, a respondent with a weight of 10 will count for 10 times more than the average respondent, and 100 times more than a respondent with weight 0.1 . Similarly, small weights mean that some responses will have very little impact on the final results.

Unfortunately, Iterative Proportional Fitting does nothing to encourage weights to be close to 1, which leads to sub-optimal weights. This is where Generalized Raking, an algorithm introduced by Deville et al. (1992), comes into play.

Note: This is where we get into the more mathematical part of this blog post 🤓. Don't care about this part? No worries! Simply skip to the next section!

The authors of this paper formulated the weighting problem as a constrained optimization problem where the objective is that the weights are as close to one as possible and where the constraint is that the targets are matched. Mathematically this looks like this:

$$\operatorname*{argmin}_{w} \; G(w) \quad \text{s.t.} \quad X^T w = T$$

$$G(x) = x(\log(x) - 1) + 1$$

where $G(x)$ is the raking function which encourages weights to be close to 1, $w$ is the vector of weights, $T$ is the vector of targets (in absolute numbers, not percentages) and $X$ is the $(\text{numRespondents} \times \text{numTargets})$ matrix of responses. The matrix $X$ is binary: a cell is filled with a '1' if the respondent belongs to the target category and '0' otherwise.

In other words, this is saying that we want to optimize the weights to be as close to 1 as possible while satisfying the target constraints. This is achieved by minimizing the function $G(x)$ which looks like this (notice the global minimum at $x = 1$!):
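The shape of the raking function is easy to check numerically. The short sketch below (with a hypothetical function name) evaluates it at a few points around 1:

```python
import math


def raking_objective(x):
    """G(x) = x (log(x) - 1) + 1, defined for x > 0."""
    return x * (math.log(x) - 1) + 1


for x in (0.1, 0.5, 1.0, 2.0, 10.0):
    print(f"G({x}) = {raking_objective(x):.4f}")
```

The value is zero at a weight of 1 and grows as the weight moves away from 1 in either direction.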

The Generalized Raking Algorithm

While it is possible to solve this optimization problem using general methods such as Sequential Least Squares Programming, the authors of Generalized Raking have devised a more efficient and robust algorithm for this specific problem:

1. Initialize variables
   - A $(\text{numRespondents} \times 1)$ vector $w$ to ones
   - A $(\text{numTargets} \times 1)$ vector $\lambda$ to zeros
   - A $(\text{numRespondents} \times \text{numRespondents})$ square matrix $H$ to the identity matrix
2. While the weights have not converged, repeat:
   1. $\lambda = \lambda + (X^T H X)^{-1} (T - X^T w)$
   2. $w = G^{-1}(X \lambda)$
   3. $H = \text{diag}({G^{-1}}'(X \lambda))$

Here, $G^{-1}(x)$ is the inverse of the derivative of the raking function, i.e. $e^x$, and ${G^{-1}}'$ is its derivative, in this case also $e^x$.
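For readers who want to see where $e^x$ comes from, here is a short derivation. It follows the notation used in this post, where $G^{-1}$ denotes the inverse of $G'$ rather than of $G$ itself. The last two lines are our own reading of the update step as a Newton iteration on the constraint, under the assumption that the objective is applied to each weight separately.

```latex
\begin{aligned}
G(x) &= x(\log(x) - 1) + 1 \\
G'(x) &= (\log(x) - 1) + x \cdot \tfrac{1}{x} = \log(x) \\
y = \log(x) \;&\Longrightarrow\; x = e^{y}, \quad \text{so } G^{-1}(y) = e^{y} \text{ and } {G^{-1}}'(y) = e^{y} \\
\text{Stationarity: } G'(w_i) = (X\lambda)_i \;&\Longrightarrow\; w = G^{-1}(X\lambda) \\
\frac{\partial}{\partial \lambda}\, X^T G^{-1}(X\lambda) &= X^T \,\text{diag}\big({G^{-1}}'(X\lambda)\big)\, X = X^T H X
\end{aligned}
```

Under that reading, the update of $\lambda$ is a Newton step that drives $X^T w$ towards $T$, with $X^T H X$ playing the role of the Jacobian.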

The final value of $w$ corresponds to the weighting factors we are looking for!

Implementation

While there are many implementations of this algorithm in R, we were not able to find one in Ruby that could play well with our codebase and be easily maintainable.
We therefore decided to make our own and to share it here for anyone looking for something similar. We started by making an implementation in Python with the popular NumPy library:

import numpy as np

def raking_inverse(x):
  return np.exp(x)

def d_raking_inverse(x):
  return np.exp(x)

def graking(X, T, max_steps=500, tolerance=1e-6):
  # Based on algo in (Deville et al., 1992) explained in detail on page 37 in
  # https://orca.cf.ac.uk/109727/1/2018daviesgpphd.pdf

  # Initialize variables - Step 1
  n, m = X.shape
  L = np.zeros(m) # Lagrange multipliers (lambda)
  w = np.ones(n) # Our weights (will get progressively updated)
  H = np.eye(n)
  success = False

  for step in range(max_steps):
    L += np.dot(np.linalg.pinv(np.dot(np.dot(X.T, H), X)), (T - np.dot(X.T, w))) # Step 2.1
    w = raking_inverse(np.dot(X, L)) # Step 2.2
    H = np.diag(d_raking_inverse(np.dot(X, L))) # Step 2.3

    # Termination condition:
    loss = np.max(np.abs(np.dot(X.T, w) - T) / T)
    if loss < tolerance:
      success = True
      break

  if not success: raise Exception("Did not converge")
  return w
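To watch the loop converge without needing NumPy, here is a minimal sketch of a special case: every respondent belongs to exactly one target category, so $X^T H X$ is diagonal and the pseudo-inverse becomes a simple division. The helper name `graking_disjoint` and the sample data are assumptions made for this example, and the sketch assumes every category has at least one respondent. It is not a replacement for the general solver.

```python
import math

def graking_disjoint(category_of, targets, max_steps=500, tolerance=1e-6):
    """Raking loop for the special case of disjoint target categories.

    category_of[i] is the category index of respondent i and targets[j] is
    the target count for category j (hypothetical helper for illustration).
    """
    m = len(targets)
    lambdas = [0.0] * m
    weights = [1.0] * len(category_of)
    h_diag = [1.0] * len(category_of)

    def category_sum(values, j):
        return sum(v for v, c in zip(values, category_of) if c == j)

    for _ in range(max_steps):
        # Step 2.1, using the weights and H from the previous iteration
        lambdas = [
            lambdas[j] + (targets[j] - category_sum(weights, j)) / category_sum(h_diag, j)
            for j in range(m)
        ]
        # Steps 2.2 and 2.3: both are exp(X lambda) for the raking function
        weights = [math.exp(lambdas[c]) for c in category_of]
        h_diag = list(weights)

        # Termination condition: largest relative gap to the targets
        loss = max(abs(category_sum(weights, j) - targets[j]) / targets[j] for j in range(m))
        if loss < tolerance:
            return weights

    raise RuntimeError("Did not converge")

# Three women and one man in the sample, with a target of two of each.
weights = graking_disjoint([0, 0, 0, 1], [2.0, 2.0])
print([round(w, 3) for w in weights])  # [0.667, 0.667, 0.667, 2.0]
```

In this simple case the result matches what we would compute by hand: each weight is the category's target divided by the number of respondents in that category.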

Ruby Implementation

After validating the algorithm in Python, we proceeded to replicate it in Ruby. For this, we had to find an equivalent to NumPy, which we found in Numo. Numo is an awesome library for vector and matrix operations, and its linalg sub-library was perfect for us as we needed to compute a matrix pseudo-inverse. This allowed us to translate the code to Ruby almost line by line:

require "numo/narray"
require "numo/linalg"

def raking_inverse(x)
  Numo::NMath.exp(x)
end

def d_raking_inverse(x)
  Numo::NMath.exp(x)
end

def graking(x, targets, max_steps: 500, tolerance: 1e-6)
  # Based on algo in (Deville et al., 1992) explained in detail on page 37 in
  # https://orca.cf.ac.uk/109727/1/2018daviesgpphd.pdf
  # Note: names are lowercase because Ruby treats capitalized identifiers as
  # constants, which cannot be method parameters or be reassigned in a method.

  # Initialize variables - Step 1
  n, m = x.shape
  lambdas = Numo::DFloat.zeros(m) # Lagrange multipliers (lambda)
  w = Numo::DFloat.ones(n) # Our weights (will get progressively updated)
  h_matrix_diagonal = Numo::DFloat.ones(n) # Diagonal of H, stored as a flat vector

  success = false

  max_steps.times do
    lambdas += Numo::Linalg.pinv((x.transpose * h_matrix_diagonal).dot(x)).dot(targets - x.transpose.dot(w)) # Step 2.1
    w = raking_inverse(x.dot(lambdas)) # Step 2.2
    h_matrix_diagonal = d_raking_inverse(x.dot(lambdas)) # Step 2.3

    # Termination condition:
    loss = ((targets - x.transpose.dot(w)).abs / targets).max
    if loss < tolerance
      success = true
      break
    end
  end

  raise StandardError, "Did not converge" unless success
  w
end

You may have noticed that the code doesn't exactly match the algorithm described above, notably in steps 2.1 and 2.3. This is because we have found it to be vastly faster with Numo to store the diagonal matrix $H$ as a flat vector h_matrix_diagonal, since all of its off-diagonal entries are zero. As a result, the product $X^T H$ can be rewritten as x.transpose * h_matrix_diagonal, making use of Numo's implicit broadcasting.
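The equivalence between multiplying by the full diagonal matrix and scaling by its diagonal can be checked by hand. The sketch below does so with plain Python lists rather than Numo, on a made-up matrix with three respondents and two targets.

```python
# Hypothetical example: 3 respondents, 2 target categories.
X = [[1, 0],
     [1, 1],
     [0, 1]]
h = [2.0, 3.0, 5.0]  # diagonal of H

n, m = len(X), len(X[0])

# Dense route: build the full n x n diagonal matrix, then compute X^T H.
H = [[h[i] if i == k else 0.0 for k in range(n)] for i in range(n)]
dense = [[sum(X[i][j] * H[i][k] for i in range(n)) for k in range(n)] for j in range(m)]

# Diagonal route: scale the entry for respondent k by h[k].
scaled = [[X[k][j] * h[k] for k in range(n)] for j in range(m)]

assert dense == scaled
print(scaled)  # [[2.0, 3.0, 0.0], [0.0, 3.0, 5.0]]
```

The dense route stores and multiplies by $n^2$ values, almost all of them zero, whereas the diagonal route only touches $n$ values. That difference grows quickly with the number of respondents.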

In practice, we optimize this code a bit further by exiting early whenever possible (for example, if the loss becomes NaN) and by accepting an optional initial value for the lambdas vector, for cases where we believe we have a better starting point than the default zeros.
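These two refinements could look something like the sketch below, which is an illustration and not our production code. It is reduced to a single target category so that it fits in a few lines of plain Python. The function name `rake_single_target`, its parameters, and the choice to start the weight at the exponential of the initial lambda are all assumptions made for this example.

```python
import math

def rake_single_target(n_members, target, initial_lambda=0.0,
                       max_steps=500, tolerance=1e-6):
    """Scalar raking loop for one target category (hypothetical helper).

    Returns (weight, steps) and raises as soon as the loss becomes NaN.
    """
    lam = initial_lambda
    weight = math.exp(lam)  # assumption: start from the weight implied by lambda
    for step in range(1, max_steps + 1):
        # Step 2.1 in scalar form: X^T H X is n_members * weight
        lam += (target - n_members * weight) / (n_members * weight)
        weight = math.exp(lam)  # Steps 2.2 and 2.3 coincide for exp
        loss = abs(n_members * weight - target) / target
        if math.isnan(loss):
            # No point in running the remaining steps once the loss is NaN
            raise ArithmeticError(f"Loss became NaN at step {step}")
        if loss < tolerance:
            return weight, step
    raise RuntimeError("Did not converge")

cold_weight, cold_steps = rake_single_target(100, 200.0)
warm_weight, warm_steps = rake_single_target(100, 200.0, initial_lambda=0.69)
print(cold_steps, warm_steps)
```

Starting from a lambda that is already close to the solution (here 0.69, close to the logarithm of 2) needs fewer iterations than starting from zero. This is useful when a similar weighting was solved recently.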

With these few lines of code, we are now able to support complex survey weighting scenarios while having all of our code in our beautiful Ruby monolith 🎉

Interested in what we do at Potloc? Come join us! We are hiring 🚀
