
Jimmy Yeung

Confident feature release - use dry-run

Background

We have feature releases every day, and in every case we would love to verify whether the business logic of that feature is correct.

A typical approach is to:

  • Have a feature flag: all the new code is hidden behind it, and we only switch to the new flow when the flag is on (a minimal sketch follows this list).
  • Run unit tests, integration tests and stress tests for the scenarios we can think of.
  • Have the QA team carry out end-to-end and even exploratory tests before flipping the feature flag.
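
For illustration, a minimal feature-flag gate could look like the sketch below. The flag name, view names and the settings-based flag are all hypothetical; a flag library such as django-waffle serves the same purpose.

# All new logic stays behind the flag until we deliberately flip it.
from django.conf import settings
from django.http import JsonResponse


def legacy_checkout_flow(request):
    # Existing, battle-tested flow (placeholder body).
    return JsonResponse({"flow": "legacy"})


def new_checkout_flow(request):
    # New flow, hidden behind the flag (placeholder body).
    return JsonResponse({"flow": "new"})


def checkout(request):
    # Only switch to the new flow when the feature flag is on.
    if getattr(settings, "NEW_CHECKOUT_FLOW_ENABLED", False):
        return new_checkout_flow(request)
    return legacy_checkout_flow(request)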

It looks promising. But sometimes the flow is so complicated that our test suites may not cover every possible scenario. How can we be more confident?

Dry Run Mode

In one project (a web application using Django, React and Postgres), we were doing a large refactoring of an existing feature. We introduced a dry run mechanism so that we could run the old endpoint and the new endpoint simultaneously, with the new endpoint only in dry run mode. In this particular case, that meant creating new db columns prefixed with dry_run, and only writing to those dry_run_* columns while in dry run.
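
As a concrete (hypothetical) example, the table could gain shadow columns alongside the real ones. dry_run_column_1, dry_run_column_2 and date_modified match the update_fields snippet later in this post; the other field names and types are made up.

from django.db import models


class SomeTable(models.Model):
    # Columns owned by the old (source-of-truth) flow.
    column_1 = models.CharField(max_length=64)
    column_2 = models.IntegerField()

    # Shadow columns that only the new endpoint writes while in dry run mode.
    dry_run_column_1 = models.CharField(max_length=64, null=True)
    dry_run_column_2 = models.IntegerField(null=True)

    date_modified = models.DateTimeField(auto_now=True)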

Behaviour

With the feature flag OFF

  • Use the old endpoint as the source of truth.
  • Also call the new endpoint, passing a new boolean request parameter called dry_run. When dry_run is True, we only update the new dry_run_* columns.
  • Report any discrepancy whenever a dry_run_* value is NOT the same as its original counterpart (a sketch follows below).

If, with the feature flag OFF, we observe no errors, the dry_run_* data is correct and QA passes, then we can toggle the feature flag with confidence.
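
A rough sketch of what this could look like on the backend, reusing the hypothetical SomeTable model above. The view, compute_new_values and the periodic discrepancy check are illustrative only, not our exact implementation.

import logging

from django.http import JsonResponse

from myapp.models import SomeTable  # hypothetical app path

logger = logging.getLogger(__name__)


def compute_new_values(row):
    # Placeholder for the refactored business logic.
    return row.column_1, row.column_2


def new_endpoint(request, row_id):
    dry_run = request.GET.get("dry_run") == "true"
    row = SomeTable.objects.get(id=row_id)
    column_1, column_2 = compute_new_values(row)

    if dry_run:
        # Only touch the shadow columns; the old endpoint stays the source of truth.
        row.dry_run_column_1, row.dry_run_column_2 = column_1, column_2
        row.save(update_fields=["dry_run_column_1", "dry_run_column_2", "date_modified"])
    else:
        # After the flag is flipped, the new endpoint owns the real columns.
        row.column_1, row.column_2 = column_1, column_2
        row.save(update_fields=["column_1", "column_2", "date_modified"])
    return JsonResponse({"dry_run": dry_run})


def report_discrepancies():
    # Run periodically: flag rows where the dry run result diverges from the
    # value written by the old (source-of-truth) endpoint.
    for row in SomeTable.objects.exclude(dry_run_column_1=None):
        if (row.dry_run_column_1, row.dry_run_column_2) != (row.column_1, row.column_2):
            logger.warning("Dry run mismatch for SomeTable id=%s", row.id)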

Benefits

The dry run mechanism indeed helps us detect missing logic against real user behaviour. That is valuable because it is always hard to mimic every user behaviour in a QA environment.

Some Gotchas

Although it brings benefits, we also need to be cautious not to affect production results. Here are some directions to think about, with examples:

1. Avoid saving stale data

Usually we use .save() in Django for updating. By default it generates a query like UPDATE … SET <ALL_FIELDS> … WHERE id = …, updating every column. For some critical queries we apply a pessimistic lock with select_for_update, so that the rows returned by the SELECT are locked until the entire transaction commits.
However, not all update queries use a pessimistic lock. Reasons could be:

  • Only one endpoint updates the table - we rely on that and skip the lock for a small performance gain.
  • The rows rarely hit concurrency issues, so introducing locking is not worthwhile.
  • etc.

In the parallel-run world, however, both flows write to the same rows, so saving stale data becomes a big concern. At the same time, it might not be worthwhile to add locking to the original flow because that would slow it down.
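
For reference, the pessimistic lock mentioned above looks roughly like this in Django. This is only a sketch reusing the hypothetical SomeTable model from earlier; the selected rows stay locked until the transaction commits.

from django.db import transaction

from myapp.models import SomeTable  # hypothetical app path


def update_critical_row(row_id, new_value):
    with transaction.atomic():
        # SELECT ... FOR UPDATE: other transactions trying to update these rows
        # block until this transaction commits or rolls back.
        row = SomeTable.objects.select_for_update().get(id=row_id)
        row.column_1 = new_value
        row.save()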

How we’re trying to solve this problem

  • We made use of the update_fields argument of .save() to specify exactly which fields to update. This way each flow only touches its own columns and no locking is needed. We should remove these field lists as part of the feature flag clean-up to avoid confusion.
# Old endpoint: keep updating every concrete column except the dry_run_* ones

update_fields = [
    f.name
    for f in SomeTable._meta.get_fields()
    if f.name not in ("id", "dry_run_column_1", "dry_run_column_2")
    # f.concrete is a boolean flag that indicates whether the field has a
    # database column associated with it. Source:
    # https://docs.djangoproject.com/en/4.2/ref/models/fields/#django.db.models.Field.concrete
    and f.concrete
]
row.save(update_fields=update_fields)

# New endpoint: only touch the dry_run_* columns (plus the modified timestamp)

update_fields = ["date_modified", "dry_run_column_1", "dry_run_column_2"]
row.save(update_fields=update_fields)


2. Keep performance in mind

Since we are running two flows simultaneously, extra overhead is unavoidable. We should always keep performance in mind and incur as little overhead as possible.

How we’re trying to solve this problem

  • In the React frontend, we usually call the new endpoint with the void operator instead of await, so the frontend fires the dry-run request without waiting for its response before proceeding.

3. Avoid performing the same side effects as the original flow when in dry run mode

It sounds easy, but it needs extra care. For example, make sure the endpoint is effectively read-only if nothing needs to be written in dry run mode. One way of doing this is running

SET SESSION CHARACTERISTICS AS TRANSACTION read only;

in Postgres to ensure we only perform read-only operations against the db. We should also avoid publishing any user-interaction analytics events (if there are any) while in dry run mode.
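
A related per-transaction variant in Django could look like the sketch below (model and function names are hypothetical). Any accidental write inside the block then fails loudly with "cannot execute UPDATE in a read-only transaction" instead of silently touching production data.

from django.db import connection, transaction

from myapp.models import SomeTable  # hypothetical app path


def run_new_logic_read_only(row_id):
    with transaction.atomic():
        with connection.cursor() as cursor:
            # Applies to this transaction only; must run before any other query.
            cursor.execute("SET TRANSACTION READ ONLY")
        row = SomeTable.objects.get(id=row_id)
        # ... run the new business logic here; any row.save() would now raise.
        return row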

4. Avoid anything which could disrupt the original flow

For example, the new endpoint could raise an error, but we definitely don't want to show that error to users while in dry run mode, because it has nothing to do with the flow they are actually on.
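
One way to do this is to wrap the call into the dry run path so that failures are logged and reported, never propagated to the user-facing flow. A minimal sketch, with a hypothetical wrapper:

import logging

logger = logging.getLogger(__name__)


def run_dry_run_safely(dry_run_call, row_id):
    # dry_run_call is whatever function triggers the new flow in dry run mode.
    try:
        dry_run_call(row_id)
    except Exception:
        # Swallow and report: a dry run failure must never disrupt the real flow.
        logger.exception("Dry run failed for row %s", row_id)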

5. Deprecate the dry run mode as part of the clean-up

To keep the codebase clean, we should deprecate the dry run mode afterwards by:

  • Removing the dry_run_* columns - make sure to sync with other stakeholders and clean up everything in downstream usages (a migration sketch follows this list).
  • Removing the code branches that become unnecessary once the feature flag is on.
  • etc.
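
The column removal itself is an ordinary migration. A hypothetical sketch (app name, migration dependency and model name are made up):

from django.db import migrations


class Migration(migrations.Migration):
    dependencies = [("myapp", "0042_previous_migration")]

    operations = [
        migrations.RemoveField(model_name="sometable", name="dry_run_column_1"),
        migrations.RemoveField(model_name="sometable", name="dry_run_column_2"),
    ]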

Conclusion

To me, running the new logic in dry run mode in production is a double-edged sword. On one hand, it helps ensure the business logic is correct; on the other hand, it also creates openings that may affect the original flow. We just need to handle it with extra care.

Thus it may not be worth doing a dry run for every feature release. But it is worth considering when the feature is of paramount importance (e.g. directly tied to the company's income).

TL;DR: do it at your own risk ;D Hope that's helpful.
