Martin

Designing a Docker-Powered Transform Command with Auto-Versioning in Python

In my previous post, I wrote about DataTracker's storage architecture (hashes, objects, and SQLite metadata). This follow-up is about what I think is the most technically interesting command: transform.

If you have not read the first post, quick context: DataTracker is a local CLI tool for versioning datasets (files or directories) with git-like commands (add, update, history, compare, diff, export, etc.).

This article focuses on one question:

How do you run a data transformation in Docker and still keep version history useful instead of chaotic?


Why transform Exists at All

Most data versioning tools stop at "store versions". That is useful, but in real workflows the interesting part is what happens between versions:

  • cleaning
  • reshaping
  • converting formats
  • running scripts in reproducible environments

I wanted this to be one command, not several manual steps each time.

Without dt transform, the flow looks like this:

  1. Run some custom Docker command manually
  2. Hope the output is written where you expected
  3. Remember to call dt update afterward
  4. Pick a version number and message

That works until you forget step 3 once and history becomes incomplete.

So the design target became: run transform + apply versioning rules in one place.


The Command Surface

At a high level:

dt transform --input-data <path> --output-data <path> [options]

The key options are:

  • --image, --command (required unless using a preset)
  • --auto-track
  • --no-track
  • --dataset-id
  • --version
  • --message
  • --preset
  • --force

The command validates that Docker is available and the tracker is initialized, resolves paths, decides whether the output should be versioned, runs the transformation in a container, and then, depending on the flags, versions the output or creates a new dataset.


Design Principle: Separate "Run" from "Track Decision"

One thing I changed early was separating concerns:

  • Docker execution stays in docker_manager.py
  • tracking/versioning policy stays in transform.py
  • Click argument parsing stays in commands.py

That split made the logic easier to test and refactor. It also prevented the CLI layer from becoming an unmaintainable if-else block.


Mount Contract and Safety Checks

Internally the command mounts:

  • input at /input (read-only)
  • output at /output

and runs:

docker run --rm -v <input>:/input:ro -v <output>:/output <image> /bin/sh -c "<command>"

By default, I validate that the command references both /input and /output. This catches a surprisingly common user error: the transformation technically succeeds, but writes data to a path that is not mounted back to the host.

If someone knows exactly what they are doing and wants to bypass this, --force disables that check.

This is one of the recurring themes in the CLI design: safe default, escape hatch available.
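The check itself is tiny. A minimal sketch of the idea (the function name and error message are mine, not DataTracker's actual internals):

```python
def validate_command_mounts(command: str, force: bool = False) -> None:
    """Fail fast if the container command never references /input or /output.

    A command that ignores the mount points can still "succeed" inside the
    container while writing data to a path that never reaches the host.
    --force maps to force=True and skips the check entirely.
    """
    if force:
        return
    missing = [p for p in ("/input", "/output") if p not in command]
    if missing:
        raise ValueError(
            f"command does not reference {', '.join(missing)}; "
            "output may never reach the host (use --force to bypass)"
        )
```

A plain substring check is deliberately crude: it cannot prove the command writes anything, but it catches the most common mistake cheaply.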


Auto-Versioning: The Real Core

After Docker runs, the command decides what to do with output history.

The policy is:

  1. If --no-track is set: do not version output.
  2. Else if input path matches a tracked dataset: version output into that dataset.
  3. Else if input is untracked and --auto-track is set:
    • add input as a new dataset
    • then version output into that new dataset
  4. Else: run transform only, no versioning.

This gives predictable behavior for both cautious users and exploratory users.
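The rules above reduce to a small pure function. This is a sketch under my own naming (the enum and function are illustrative, not the project's real API):

```python
from enum import Enum, auto


class Action(Enum):
    VERSION_EXISTING = auto()    # version output into the tracked dataset
    TRACK_THEN_VERSION = auto()  # add input as a new dataset, then version output
    RUN_ONLY = auto()            # run the transform, no versioning


def decide_tracking(input_tracked: bool, auto_track: bool, no_track: bool) -> Action:
    """Map the flag combination to a versioning action, in priority order."""
    if auto_track and no_track:
        # Mutually exclusive flags: fail fast instead of inventing precedence.
        raise SystemExit("--auto-track and --no-track are mutually exclusive")
    if no_track:
        return Action.RUN_ONLY
    if input_tracked:
        return Action.VERSION_EXISTING
    if auto_track:
        return Action.TRACK_THEN_VERSION
    return Action.RUN_ONLY
```

Keeping the policy in one function like this makes it trivially unit-testable, independent of Docker and the CLI layer.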

Decision table

Once I had the rules written down, the behavior became much easier to reason about. In practice, the command reduces to this table:

| Input already tracked? | `--auto-track` | `--no-track` | Result |
|---|---|---|---|
| Yes | No | No | Run transform and version output into the existing dataset |
| Yes | Yes | No | Same result as above; input is already tracked, so `--auto-track` has nothing extra to do |
| Yes | No | Yes | Run transform only, do not version output |
| No | No | No | Run transform only, do not version output |
| No | Yes | No | Add input as a new dataset, then version output into that dataset |
| No | No | Yes | Run transform only, do not version output |
| Any | Yes | Yes | Invalid combination, exit with usage error |

Why path-based dataset lookup?

I use original dataset paths to infer identity when --dataset-id is not provided. It keeps the command convenient for everyday usage, while still allowing explicit control when needed.

Version increment behavior

For transform-generated versions, I intentionally use fractional increments (+0.1 by default) rather than forcing integer bumps.

Reasoning: many transforms are intermediate processing steps, not "major new source snapshot" events. Keeping the increments smaller prevents version numbers from ballooning unnecessarily and makes the history easier to scan.
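As a rough sketch of that increment logic (the helper is hypothetical; the real code may represent versions differently):

```python
def next_version(current: float, step: float = 0.1) -> float:
    """Bump a version number by a fractional step (default +0.1).

    round() to one decimal place keeps binary-float drift like
    1.2000000000000002 out of the displayed history.
    """
    return round(current + step, 1)
```

So a chain of intermediate transforms walks 1.0 → 1.1 → 1.2, while a deliberate `--version 2.0` still marks a major snapshot.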


Conflict Rules and Flags

Two flags are mutually exclusive by design:

  • --auto-track
  • --no-track

If both are provided, the command exits with a usage error.

This might look obvious, but explicit validation here matters because these conflicting options can otherwise produce silent, confusing behavior.

I would rather fail fast than invent precedence rules users have to memorize.


Presets: Turning Repetition into Reuse

The most annoying part of transform is the length of the command and the repetition: the image, command, and tracking flags are very often the same.

That led to transform presets in .data_tracker/presets_config.json.

A preset stores things like:

  • image
  • command
  • auto-track/no-track
  • force
  • message
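
As a rough illustration, a preset entry might look like this (the field names are my guess at the schema, not copied from the repo):

```json
{
  "clean-sales": {
    "image": "python:3.11-slim",
    "command": "python /input/clean.py --output /output/clean.csv",
    "auto_track": true,
    "force": false,
    "message": "clean sales data"
  }
}
```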

Then you can run:

dt transform --input-data ./raw --output-data ./processed --preset clean-sales

and optionally override any preset field from CLI.

Override hierarchy

The rule is simple and explicit:

CLI value > preset value > default value

That gives reusable defaults while keeping one-off runs easy.
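Per field, that hierarchy collapses to a one-liner. A sketch, with `None` standing in for "not provided":

```python
def resolve(cli_value, preset_value, default):
    """Resolve one option: CLI value > preset value > built-in default."""
    if cli_value is not None:
        return cli_value
    if preset_value is not None:
        return preset_value
    return default
```

Treating "absent" as `None` rather than a falsy value matters here: it lets a user explicitly pass `False` or an empty string on the CLI and still win over the preset.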

Preset management commands

I added a small CRUD interface:

# list presets
dt preset ls
dt preset ls --detailed
# add/remove presets
dt preset add <name> --image ... --command ... [flags]
dt preset remove <name>

The detailed listing is intentionally human-readable, so you can quickly review what a preset actually does before using it.


Failure Cases I Handled Explicitly

The most important ones:

  • Docker not installed
  • tracker not initialized
  • input path does not exist
  • transform command succeeds but output directory is empty
  • preset missing or malformed
  • tracking/version update fails after successful transform

I also added rollback behavior for one edge case: if --auto-track adds a dataset but the transform fails immediately after, the command removes the auto-added dataset to avoid leaving junk history.

This is not truly transactional, but it keeps state cleaner than a naive implementation.
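The shape of that rollback path, with a hypothetical tracker interface (the method names are illustrative):

```python
def transform_with_auto_track(tracker, input_path, run_transform):
    """Auto-track the input, run the transform, and undo the add on failure.

    Not truly transactional: if the process dies between add() and remove(),
    the dataset still leaks. But for ordinary exceptions it keeps history clean.
    """
    dataset_id = tracker.add(input_path)  # the --auto-track step
    try:
        run_transform()
    except Exception:
        tracker.remove(dataset_id)        # undo the auto-added dataset
        raise
    return dataset_id
```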


Example Workflow

# initialize
dt init

# track raw input
dt add ./data/raw.csv --title sales --message "raw export"

# run transform and auto-version output
dt transform \
  --input-data ./data/raw.csv \
  --output-data ./data/cleaned \
  --image python:3.11-slim \
  --command "python /input/clean.py --output /output/clean.csv" \
  --message "normalize + remove nulls"

# inspect what changed
dt history --name sales
dt compare --name sales # auto-compares latest two versions by default

What Is Still Imperfect

transform works well for my scope, but a few things are intentionally out of scope for now:

  • no remote execution (local Docker only)
  • no pipeline DAG orchestration (single command execution)
  • no built-in preset edit command (remove + add is currently enough)

I prefer this to overbuilding features before there are real users.


Lessons Learned from Building It

  1. Command design is mostly policy design. The hard part is not running Docker; it is defining clear, deterministic rules for when to version and how.
  2. Safety checks are worth a few extra lines. Validation around mount paths and conflicting flags prevented multiple confusing runs.
  3. Defaults should be opinionated, not rigid. Good defaults (/input, /output, auto behavior) plus escape hatches (--force, explicit --dataset-id, custom --version) make the tool usable for both normal and advanced scenarios.

Repo

Source code: github.com/martin-iflap/DataTracker

This project is open source under the MIT License. Contributions and feedback are welcome.

If you have ideas for improving transform (or the overall CLI design), feel free to open an issue or submit a pull request. The main goal of this project for me is to learn more about CLI design and data versioning, so I am very open to suggestions.
