DEV Community

Karan Singh Chandel

I Built a CLI Tool to Fix dbt's Governance Problem Using DataHub

The Problem

I use dbt to write transformations, manage dependencies, run tests, and keep everything in version control. For building data pipelines, dbt is excellent.

But when it comes to governance, dbt falls short.

I wanted to know who owns each model. I wanted to enforce documentation standards. I wanted to catch when someone depends on deprecated data. And I wanted these checks to run during development, in my terminal and in CI, not after deployment.

dbt doesn't do this. It's a transformation tool, not a governance platform.

So I built a CLI tool that brings governance into the dbt workflow by connecting it to DataHub.


Pain Points with dbt (When It Comes to Governance)

No Ownership Tracking

dbt doesn't require or track an owner for each model. When dim_customer_metrics breaks at 2 AM, who do you contact? The original author might have left the company months ago. You can note an owner in meta or on a group, but nothing enforces it or keeps it current.

Documentation is Optional

You can add descriptions in dbt, but nothing enforces them. Half my models have no description. The other half say "customer table," which tells you nothing.

No Awareness of Deprecated Data

dbt doesn't know if an upstream source is deprecated. You can depend on a table scheduled for removal and dbt won't warn you. You find out when production breaks.

No Data Classification

Which models contain PII? Which are safe to share externally? dbt has no built-in tagging for sensitivity or compliance. Teams track this in spreadsheets that go stale.

Governance Happens After the Fact

Even if you try to enforce standards, it's manual. Someone reviews a PR, maybe checks ownership, maybe doesn't. It's inconsistent and depends on who's reviewing.


The Solution: DataHub as the Governance Platform

DataHub solves exactly what dbt lacks. It's a metadata platform that tracks:

  • Ownership: who's responsible for each dataset
  • Domains: which business area owns the data
  • Deprecation status: what's scheduled for removal
  • Tags: PII classification, sensitivity levels
  • Descriptions: rich documentation
  • Lineage: upstream and downstream dependencies

DataHub has the governance metadata. The question was: how do I use it during dbt development?


What I Built

I built dbt-datahub-governance, a command-line tool that validates your dbt models against governance rules stored in DataHub.

The idea is simple:

  1. You have dbt models
  2. You have governance metadata in DataHub (ownership, tags, etc.)
  3. This CLI checks if your models meet the governance standards

How It Works

When you run the CLI:

  1. Reads your dbt manifest - gets all your models from manifest.json
  2. Looks up each model in DataHub - fetches ownership, domain, tags, deprecation status
  3. Checks against your rules - does this model have an owner? a description? is it using deprecated data?
  4. Reports what passed and failed - with clear messages about what to fix
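
The first two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the DbtModel class and the load_manifest, load_models, and dataset_urn helpers are hypothetical names, and the URN layout assumes models were ingested into DataHub under the dbt platform with database.schema.name dataset names, which may differ on your instance.

```python
import json
from dataclasses import dataclass, field


@dataclass
class DbtModel:
    """Hypothetical container for the manifest fields the checks need."""
    unique_id: str
    name: str
    description: str
    dataset_name: str  # database.schema.identifier
    upstream_ids: list[str] = field(default_factory=list)


def load_manifest(path: str) -> dict:
    """Read a dbt manifest.json from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def load_models(manifest: dict) -> list[DbtModel]:
    """Keep only nodes whose resource_type is 'model' (skips tests, seeds, ...)."""
    models = []
    for unique_id, node in manifest.get("nodes", {}).items():
        if node.get("resource_type") != "model":
            continue
        identifier = node.get("alias") or node["name"]
        parts = [node.get("database"), node.get("schema"), identifier]
        models.append(
            DbtModel(
                unique_id=unique_id,
                name=node["name"],
                description=node.get("description", ""),
                dataset_name=".".join(p for p in parts if p),
                upstream_ids=list(node.get("depends_on", {}).get("nodes", [])),
            )
        )
    return models


def dataset_urn(model: DbtModel, platform: str = "dbt", env: str = "PROD") -> str:
    """Build the DataHub dataset URN used to look the model up (assumed naming)."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{model.dataset_name},{env})"


# A tiny in-memory manifest so the sketch runs without a dbt project.
SAMPLE_MANIFEST = {
    "nodes": {
        "model.shop.dim_customer_metrics": {
            "resource_type": "model",
            "name": "dim_customer_metrics",
            "database": "analytics",
            "schema": "marts",
            "description": "",
            "depends_on": {"nodes": ["model.shop.stg_customers"]},
        },
        "test.shop.not_null_id": {"resource_type": "test", "name": "not_null_id"},
    }
}

models = load_models(SAMPLE_MANIFEST)
for model in models:
    print(model.name, "->", dataset_urn(model))
```

With a real project you would pass load_manifest("target/manifest.json") to load_models instead of the sample dictionary.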

CLI Commands

The main command is validate:

dbt-datahub-governance validate --manifest target/manifest.json

Other useful commands:

  • init - creates a starter config file
  • list-rules - shows all available rules
  • test-connection - checks if DataHub is reachable

No DataHub Yet? Use Dry Run

dbt-datahub-governance validate --manifest target/manifest.json --dry-run

This runs validation without connecting to DataHub. Good for trying out the tool first.


Architecture

[Figure: flow chart of how the system components connect]


The Rules

| Rule | What It Enforces |
| --- | --- |
| require_owner | Every model must have an owner in DataHub |
| require_description | Models must have meaningful descriptions |
| no_deprecated_upstream | Can't depend on deprecated datasets |
| require_domain | Models must belong to a business domain |
| require_tags | Required tags (like PII) must be present |
| upstream_must_have_owner | Dependencies must have owners too |
| naming_convention | Model names follow your patterns |

Each rule can be set as error or warning.
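
To make the error/warning distinction concrete, here is a rough sketch of two rules and how their severities could decide whether a CI job fails. Everything in it is an assumption for illustration: the Finding class, the check functions, the 20-character threshold for a "meaningful" description, and the convention that only errors produce a non-zero exit code are not taken from the tool's source.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    """One rule violation (hypothetical structure)."""
    rule: str
    model: str
    severity: str  # "error" or "warning"
    message: str


# Assumed placeholders that should not count as documentation.
PLACEHOLDERS = {"", "todo", "tbd", "n/a"}


def check_require_description(
    model: str, description: str, severity: str = "error", min_length: int = 20
) -> Optional[Finding]:
    """Flag empty, placeholder, or very short descriptions."""
    text = description.strip()
    if text.lower() in PLACEHOLDERS or len(text) < min_length:
        return Finding("require_description", model, severity,
                       f"description is missing or shorter than {min_length} characters")
    return None


def check_naming_convention(
    model: str, pattern: str, severity: str = "warning"
) -> Optional[Finding]:
    """Flag model names that do not fully match the configured regex."""
    if re.fullmatch(pattern, model) is None:
        return Finding("naming_convention", model, severity,
                       f"name does not match pattern {pattern!r}")
    return None


def exit_code(findings: list[Finding]) -> int:
    """Assumed convention: warnings are reported, only errors fail the run."""
    return 1 if any(f.severity == "error" for f in findings) else 0


pattern = r"(dim|fct|stg)_[a-z0-9_]+"
candidates = [
    check_require_description("dim_customer_metrics", "customer table"),
    check_naming_convention("CustomerMetrics", pattern),
]
findings = [f for f in candidates if f is not None]
for f in findings:
    print(f"[{f.severity}] {f.rule}: {f.model}: {f.message}")
print("exit code:", exit_code(findings))
```

Keeping severity on the finding rather than in the rule logic lets a team start a new rule as a warning and promote it to an error once existing models are cleaned up.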


Why DataHub?

I chose DataHub because it already tracks everything I needed:

  • Single source of truth - ownership, domains, tags in one place
  • Clean API - easy to fetch metadata with the Python SDK
  • Lineage aware - knows what depends on what
  • Extensible - custom metadata works too

The data was already there. I just needed to use it at the right time.


Quick Start

1. Clone and Install


git clone https://github.com/karan0207/dbt-datahub-cli.git
cd dbt-datahub-cli
pip install -e ".[all]"

2. Create Config

dbt-datahub-governance init

This creates a governance.yml where you enable the rules you want.
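
As a rough idea of what enabling rules could look like once the file is parsed, the sketch below resolves a user's rule settings from a plain dictionary. The layout shown (a top-level "rules" key with "enabled" and "severity" per rule) is a hypothetical stand-in, not the real governance.yml schema, so check the generated file for the actual keys. Only the rule names come from the rules table in this post.

```python
KNOWN_RULES = {
    "require_owner",
    "require_description",
    "no_deprecated_upstream",
    "require_domain",
    "require_tags",
    "upstream_must_have_owner",
    "naming_convention",
}
VALID_SEVERITIES = {"error", "warning"}


def resolve_rules(config: dict) -> dict:
    """Return the enabled rules with their settings, rejecting typos early."""
    resolved = {}
    for name, overrides in config.get("rules", {}).items():
        if name not in KNOWN_RULES:
            raise ValueError(f"unknown rule: {name}")
        # Assumed defaults: a listed rule is enabled and reported as an error.
        settings = {"enabled": True, "severity": "error", **(overrides or {})}
        if settings["severity"] not in VALID_SEVERITIES:
            raise ValueError(f"{name}: severity must be 'error' or 'warning'")
        if settings["enabled"]:
            resolved[name] = settings
    return resolved


# Hypothetical parsed config, as a YAML loader might return it.
example_config = {
    "rules": {
        "require_owner": None,
        "require_description": {"severity": "warning"},
        "require_domain": {"enabled": False},
    }
}

active = resolve_rules(example_config)
for name, settings in active.items():
    print(name, settings["severity"])
```

Failing fast on an unknown rule name or severity means a typo in the config cannot silently disable a check.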

3. Connect to DataHub

export DATAHUB_GMS_URL="http://your-datahub:8080"
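
If you want to catch a missing or malformed URL before any network call, a small pre-flight check like the one below can help. It is a sketch under assumptions: resolve_gms_url is a hypothetical helper, and the tool itself may read or validate DATAHUB_GMS_URL differently.

```python
import os
from typing import Mapping, Optional
from urllib.parse import urlparse


def resolve_gms_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Read DATAHUB_GMS_URL and make sure it looks like an http(s) URL."""
    env = os.environ if env is None else env
    url = env.get("DATAHUB_GMS_URL", "").strip().rstrip("/")
    if not url:
        raise ValueError("DATAHUB_GMS_URL is not set")
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"DATAHUB_GMS_URL is not a valid http(s) URL: {url!r}")
    return url


# Passing a dict instead of os.environ keeps the example self-contained.
print(resolve_gms_url({"DATAHUB_GMS_URL": "http://your-datahub:8080/"}))
```

A value such as your-datahub:8080 without a scheme is rejected here, which is a common cause of confusing connection errors later.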

4. Run Validation

dbt-datahub-governance validate --manifest examples/sample_manifest.json

Sample Output

[Figure: validation output in the CLI]


Web Dashboard

Prefer a GUI? I also built a Streamlit dashboard:

dbt-datahub-governance dashboard

[Figure: Streamlit dashboard]

Same validation engine, visual interface.


Before & After

| Before | After |
| --- | --- |
| Ownership unknown | Ownership enforced via DataHub |
| Descriptions are "nice to have" | Descriptions required to merge |
| Deprecated deps break prod | Deprecated deps blocked in CI |
| Governance is manual | Governance runs automatically |

Try it out: https://github.com/karan0207/dbt-datahub-cli

Questions or ideas for new rules? Drop a comment or open an issue.
