Aram Panasenco

The simplest Git branching flow for dbt Cloud

There are many posts about Git branching strategies out there, but they're either light on details or heavy on complexity. My aim here is to define the simplest possible production-grade Git branching strategy for an analytics engineering team. Ideally, nothing can be removed from it and nothing needs to be added. If you disagree, leave a comment below!

The simplest feature branching flow

The absolute simplest feature branching flow is described very well in this official dbt article. There is a main branch off of which you create your feature branches. The main branch corresponds to the production schema, and pull requests from feature branches ideally go to temporary schemas. In these temporary schemas, only the modified models should be built, deferring to the state of main for everything else (aka slim CI).
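
dbt Cloud's built-in CI jobs handle this for you. As a rough dbt Core equivalent, assuming the production run's artifacts have been downloaded into a local prod-run-artifacts directory, the CI step boils down to something like:

dbt build --select state:modified+ --defer --state prod-run-artifacts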

Another name for this branching methodology is trunk-based development.

Diagram showing a simple trunk-based development flow

Consolidating models from multiple pull requests in one schema

Ideally, your data visualization tool should be dynamic enough to easily switch between different schemas in your data warehouse. That way, users trying to do user acceptance testing (UAT) can just point the data viz tool to the pull request schema containing the change they're reviewing.

However, if your data visualization tool doesn't support easily switching between schemas (e.g. Tableau), the best you can do for UAT is to consolidate certain models in a single schema. The simplest way to perform this consolidation is to use an implementation like the one below for your generate_schema_name macro:

{% macro generate_schema_name(custom_schema_name, node) -%}
    {#- In pull request runs, models that opt in by naming this PR's schema
        are routed to their central UAT schema instead. -#}
    {%- if target.name == "pull-request"
        and node.config.meta.get("replace_schema_with_uat", none) == target.schema
        and node.config.meta.get("schema_uat", none) is not none -%}
        {{ node.config.meta.schema_uat }}
    {%- else -%}
        {#- All other runs keep the default schema resolution. -#}
        {{ custom_schema_name or target.schema }}
    {%- endif -%}
{%- endmacro %}

With the above definition of generate_schema_name, suppose a dbt model you're working on has the following metadata attributes set:

{{
    config(
        meta={
            "replace_schema_with_uat": "dbt_cloud_pr_1234",
            "schema_uat": "uat",
        },
    )
}}

Then for pull request job runs only, if the pull request ID is "1234", the model's schema will be uat instead of dbt_cloud_pr_1234. IDE development and production jobs won't be affected.

If a model doesn't have either replace_schema_with_uat or schema_uat set, this macro will always keep the default schema for it.

To take advantage of the macro, you would:

  1. Set the target name of your pull request job to "pull-request" in the job's settings in dbt Cloud.
  2. Define the metadata attribute schema_uat for your models, either in the config block like above or in dbt_project.yml (see the sketch after this list). This defines the name of the central UAT schema for your models. Note that different models can have different central UAT schemas.
  3. Create a pull request with your changes and note the name of the schema automatically generated for the pull request. Initially all models for your pull request will be in that schema.
  4. Set your replace_schema_with_uat metadata attribute to the name of the pull request schema (for example dbt_cloud_pr_1234). Commit and push the changes. Now the affected models will be materialized in the central UAT schema defined by the attribute schema_uat instead of the pull request schema.
  5. Suppose you've merged the changes and have opened a new PR with more changes. Since the name of the PR schema won't be identical to the one defined in replace_schema_with_uat, all models will once again materialize in the PR schema. This forces developers to manually set which models they want materialized in the central UAT schema every time. This is good because it prevents unintended conflicts between PRs. The metadata attribute replace_schema_with_uat can be safely left with its original value - it won't hurt anything.
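
For step 2, a minimal dbt_project.yml sketch could look like the following. The project name my_project and the marts folder are placeholders for your own project structure:

models:
  my_project:
    marts:
      +meta:
        schema_uat: uat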

If two people are modifying the same model in different pull requests, and they both set replace_schema_with_uat for that model to their corresponding pull request schemas, then the table/view in the central schema will reflect the logic of the one who pushed last. In such cases, developers will have to coordinate and take turns. Two versions of the same model can't go through central UAT at the same time.

Obviously, feel free to change or extend the macro. For example, you could add an additional schema_prod attribute for models whose production schema you want to override as well, as sketched below.
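
A sketch of that extension, assuming your production job's target name is "production":

{% macro generate_schema_name(custom_schema_name, node) -%}
    {#- In production runs, models that define schema_prod get that schema. -#}
    {%- if target.name == "production"
        and node.config.meta.get("schema_prod", none) is not none -%}
        {{ node.config.meta.schema_prod }}
    {%- elif target.name == "pull-request"
        and node.config.meta.get("replace_schema_with_uat", none) == target.schema
        and node.config.meta.get("schema_uat", none) is not none -%}
        {{ node.config.meta.schema_uat }}
    {%- else -%}
        {{ custom_schema_name or target.schema }}
    {%- endif -%}
{%- endmacro %}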

Now you don't need a long-lived uat branch to perform UAT from a central location! You can still make do with one long-lived main branch and many short-lived feature branches.

Adding a pre-production environment

Starting with one main branch for production and doing all your testing in feature branches/pull requests will probably work just fine for small to medium-sized organizations. Larger organizations may need additional environments. However, that doesn't mean you need to create long-lived branches!

By default, trunk-based development advocates for release branches. However, I believe that creating all those branches is overkill for data teams, and instead advocate for the simpler release-from-trunk methodology.

If we want to have a pre-production environment, we can still utilize the main branch for both the production and the pre-production environments by tagging commits that are ready for production release.

This way, the latest commit in main is always deployed to the pre-production environment (whatever you want to call it). When the team feels confident that the change can be pushed to production, they tag that commit with a production release version number, and a separate CI process that watches for tags then pushes the changes to the production environment.
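
For example, cutting a release is just a matter of tagging the commit (v1.4.0 here is a made-up version number):

git tag v1.4.0
git push origin v1.4.0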

Now you have your temporary schemas, one for each pull request, the 'bleeding edge' main that points to the pre-production environment, and the production environment that only gets updated when a new version is tagged in main.

Note that the CI that's built into dbt Cloud can support the basic feature branching flow out of the box, but it doesn't support git tag release strategies. This pushes folks unnecessarily into creating multiple branches for multiple environments in situations where simple tags would have served them just fine.

One option is to manually update the environment's "custom branch" in dbt Cloud settings every time there's a new release.

Screenshot of environment settings in dbt Cloud

The other option is to do the same thing, but automatically via the API as soon as a commit in the main branch is tagged. There's an existing project that can be used as a reference. I'll update the post if I get around to creating an automated process myself.
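
If I were to sketch that automation today, it might look something like the hypothetical GitHub Actions workflow below. Everything dbt-Cloud-specific in it (the v3 environments endpoint, the custom_branch field, the placeholder IDs) is an assumption to verify against the current Administrative API documentation:

# Hypothetical workflow: when a release tag is pushed, point the
# production environment's custom branch at that tag.
name: release-to-production
on:
  push:
    tags:
      - "v*"
jobs:
  update-custom-branch:
    runs-on: ubuntu-latest
    steps:
      - name: Update dbt Cloud environment custom branch
        env:
          # Placeholder IDs -- substitute your own account, project,
          # and environment IDs.
          ACCOUNT_ID: "12345"
          PROJECT_ID: "67890"
          ENVIRONMENT_ID: "11111"
        run: |
          # Endpoint and payload are assumptions -- confirm against the
          # dbt Cloud Administrative API docs before relying on this.
          curl -X POST \
            -H "Authorization: Token ${{ secrets.DBT_CLOUD_API_TOKEN }}" \
            -H "Content-Type: application/json" \
            -d '{"custom_branch": "${{ github.ref_name }}"}' \
            "https://cloud.getdbt.com/api/v3/accounts/$ACCOUNT_ID/projects/$PROJECT_ID/environments/$ENVIRONMENT_ID/"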

Adding a second pre-production environment

For some organizations, one pre-production environment is not enough, and they insist on two. This is still easy to do! We just have to utilize release candidate tags for the new pre-production environment.

Suppose our pre-pre-production environment is named TEST, and our pre-production environment is named STAGE. TEST corresponds to the latest commit in the main branch - that's the 'bleeding edge'. STAGE corresponds to the latest release candidate tag on the main branch. In semantic versioning, this would be achieved by adding the suffix -rc.N to the name of the release it's targeting. For example, if our goal is to create production release v12.0.0, our STAGE environment commits would be tagged v12.0.0-rc.1, then v12.0.0-rc.2, and so on. Suppose on v12.0.0-rc.5 we finally feel confident enough to push to production. We would then add the tag v12.0.0 to the same commit, which would constitute a full release and then be automatically deployed to production.
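
In practice, assuming v12.0.0-rc.5 is the candidate being promoted, the Git side is just two tags on the same commit:

git tag v12.0.0-rc.5
git push origin v12.0.0-rc.5

# After STAGE validation passes, tag the same commit as the full release.
# The ^{} suffix resolves the rc tag to the commit it points to.
git tag v12.0.0 v12.0.0-rc.5^{}
git push origin v12.0.0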

Need more environments/branches/options?

There are many Git branching models and variations to choose from. See this overview to learn more. Do you believe you've found an even simpler flow? Let me know in the comments!
