DrBearhands

Posted on Nov 25, 2021 • Originally published at drbearhands.com

On automated versioning strategies for CI/CD pipelines

It is quite likely that one of your projects will need version numbers at some point in time. You might also want to generate them automatically in a CI/CD pipeline.

The popular way of reporting version numbers is semver: three numbers, MAJOR.MINOR.PATCH, denoting breaking changes, new backward compatible features, and backward compatible bugfixes, in that order. Unfortunately, this strategy is wrong in a way that means it will never result in a correct automated versioning system. At least, not beyond the trivial increment counter.

The problem lies in the words "backward compatible". It is basically impossible to make changes to code that won't break some theoretical dependent project. In many programming languages, adding new functions and leaving the old ones as is does not result in breaking changes, but doing that is often bad practice. New features are bloat. Here's another explanation about the problems with semver that is more in-depth.

In practice, developers label versions based on what they think won't break too many dependent projects. It is useful in practice, but an automated tool has no way of doing that.

If semver cannot be automated, is there a useful strategy that can?

A more "correct" versioning scheme would be MAJOR.PATCH.FEATURE, where MAJOR is for changes which the author thinks will break projects, PATCH is for changes that can break projects in ways the author thinks are niche, and FEATURE is ~~an inverse function of quality~~ for changes that maintain trace-equivalence with previously-existing endpoints. This versioning system of "descending chances of breakage" is not how we usually think about dependencies. We want the numbers that tell us what features can be used to come first. If I'm using dependency version X and read about a feature introduced in version Y, it should be immediately clear whether I can use the feature from Y in X. There is a conflict between using version numbers to denote the feature set and using version numbers to describe potential breaking changes.

This leads me to the following: versioning is multidimensional. We must first realize which dimensions are interesting for our use-case and, if we can, derive them automatically.

Let's look at a few strategies that are not semver.

Type-level "semver"

Elm uses automatic "semver", but it is not actually semver. Or rather, it is semver over the metalanguage of Elm types, not Elm itself. "Breaking changes" are changes to any pre-existing type definitions, MINOR changes are new type declarations, and PATCH is no changes in type declarations.

For instance, if we previously committed:

seven : Int
seven = 7

then

seven : Int
seven = 3

will be a PATCH version bump.

Type level semver indicates what will not break at compile time.
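To make the rule concrete, here is a rough Python sketch of a type-level classifier. It is not how Elm's tooling is implemented. It assumes, for illustration only, that an API can be flattened into a mapping from exposed names to type signatures written as strings. `Bump` and `classify_bump` are hypothetical names.

```python
from enum import Enum
from typing import Mapping


class Bump(Enum):
    PATCH = "patch"
    MINOR = "minor"
    MAJOR = "major"


def classify_bump(old_api: Mapping[str, str], new_api: Mapping[str, str]) -> Bump:
    """Classify a release by comparing exposed names and their type signatures.

    Assumption: an API is modelled as {name: signature-as-string}.
    """
    for name, signature in old_api.items():
        # A removed name or a changed signature can break compilation downstream.
        if new_api.get(name) != signature:
            return Bump.MAJOR
    # Every old name is still present and unchanged, so extra entries are additions.
    if len(new_api) > len(old_api):
        return Bump.MINOR
    return Bump.PATCH


old = {"seven": "Int"}
print(classify_bump(old, {"seven": "Int"}))                  # only the value changed
print(classify_bump(old, {"seven": "Int", "eight": "Int"}))  # new declaration
print(classify_bump(old, {"seven": "String"}))               # existing type changed
```

Note that the classifier never looks at values, which is exactly why changing `seven` from 7 to 3 counts as a patch.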

Commit SHA

Using commit hashes is probably the most precise way to specify dependency versions. It is also stateless, which comes with its own benefits. You can use it with shallow clones, for instance.

Unfortunately it is also a bit inflexible. It does not give you a notion of backward compatibility. If there is a security patch you won't be able to pull it in automatically. If you have two dependencies that share a sub-dependency but require different but compatible versions of it... tough shit.

Furthermore, it is hard for humans to read, and given only two version descriptors, there is no way to tell which one is the latest.

Commit SHAs are about the best identifiers you could have, assuming you don't rewrite your repository's history, which would break references.

git describe

Another tool that is available to us is git describe. This command spits out TAG-OFFSET-gCOMMIT, where TAG is the most recent (manually created) tag, OFFSET is the number of commits since that tag, and COMMIT is the current (abbreviated) commit SHA.

This has an advantage over plain commit SHAs in that it allows version numbers to be compared accurately. We might consider TAG to be feature information, and OFFSET or gCOMMIT to be compatibility information. It is still as inflexible as plain commit hashes, unfortunately.

The biggest problem with this approach is that it requires pulling the repository up to unknown depth. In a CI/CD pipeline you would generally pull only a shallow copy of the repository. Why slow down your jobs by downloading every commit since the beginning of time? Well, you rather have to with this strategy, or git describe might suddenly return something unexpected if the tag is more commits away than the clone depth.

Git describe gives us stateful identity with a partial order, potentially even a total order if tags have a total order. This strategy also only works assuming no history changes in your repository.
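As a sketch of how such a descriptor could be consumed, the Python below parses the TAG-OFFSET-gCOMMIT layout and compares two descriptors. It assumes the output comes from `git describe --long`, so the offset and commit are always present, and it assumes a single linear branch, so a larger offset from the same tag really means "newer". `Described`, `parse_describe` and `is_newer` are hypothetical names.

```python
import re
from typing import NamedTuple


class Described(NamedTuple):
    tag: str
    offset: int
    commit: str


# TAG may itself contain dashes, so anchor on the trailing "-OFFSET-gCOMMIT".
_DESCRIBE = re.compile(r"^(?P<tag>.+)-(?P<offset>\d+)-g(?P<commit>[0-9a-f]+)$")


def parse_describe(text: str) -> Described:
    """Split a `git describe --long` style string into its three parts."""
    match = _DESCRIBE.match(text.strip())
    if match is None:
        # For example a bare abbreviated SHA, which carries no tag or offset.
        raise ValueError(f"unexpected git describe output: {text!r}")
    return Described(match["tag"], int(match["offset"]), match["commit"])


def is_newer(a: Described, b: Described) -> bool:
    """True if a is newer than b. Only defined when both hang off the same tag."""
    if a.tag != b.tag:
        raise ValueError("different tags: comparing them needs an order on tags")
    return a.offset > b.offset


print(parse_describe("v1.2-14-g2414721"))
print(is_newer(parse_describe("v1.2-14-g2414721"), parse_describe("v1.2-3-gabc1234")))
```

Raising on anything that does not match the layout is one way to keep a bad descriptor from quietly becoming a version number.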

Release/build date

Dates are nice for humans, but horribly inaccurate. The downsides are obvious: no compatibility information, no support for branches. Also, dates are subjective at best due to timezones, daylight saving time, and other more obscure changes to our timekeeping standards.

You can get the date of a commit using

git show -s --format=%ci <commit>

Alternatively, you may be able to use the date when a commit was pushed to a central repository, provided you can configure said repository. That solves the subjectivity problem.

Dates technically only offer a preorder, but result in a total order quite often.
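To illustrate the timezone problem, here is a small Python sketch that parses the date layout printed by `--format=%ci` and converts it to UTC before comparing. The helper name `commit_instant` and the two sample dates are made up for the example.

```python
from datetime import datetime, timezone

# Layout printed by `git show -s --format=%ci`, e.g. "2021-11-25 09:30:00 +0100".
GIT_CI_FORMAT = "%Y-%m-%d %H:%M:%S %z"


def commit_instant(ci_date: str) -> datetime:
    """Parse a %ci date and express it in UTC so the local offset stops mattering."""
    return datetime.strptime(ci_date.strip(), GIT_CI_FORMAT).astimezone(timezone.utc)


first = "2021-11-26 01:00:00 +0900"   # committed at UTC+9
second = "2021-11-25 20:00:00 -0500"  # committed nine hours later, at UTC-5

# Comparing the calendar dates as written puts the commits in the wrong order.
print(first[:10] > second[:10])
# Comparing the actual instants puts them in the right order.
print(commit_instant(first) < commit_instant(second))
```

Once only the calendar day is kept, two commits made on the same day can no longer be told apart by date, which is where the preorder comes from.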

Not needing versions

This is a bit of a copout, but there are situations where you can avoid version numbers altogether. The Unison language uses hashes as identifiers, and as such identifiers point to exact objects. Adding version information would be redundant. Except perhaps as a global restriction.

Usage

So, no solution is perfect, and I'm sure there are more possibilities than these. How do we choose what is right for our project?

If you're making a library, I suggest following the standards people expect. Usually that will be semver. Haskell has Haskell PVP. It's likely that there are a few other standards that I'm not aware of. In cases where you're supposed to denote backward compatibility, just give up on automated versioning. This is a fuzzy quantitative problem that requires a human. It is more important to match people's assumptions than to automate versioning.

In general, I'm rather skeptical of type-level "semver". It prevents type mismatches, not behavioral changes. Type mismatches cause failures at compile time. Sure, the CI pipeline breaks, but that's what it's for. Unnoticed changes in runtime behavior are far more dangerous, and automatic semver might give us an unwarranted sense of security about them when all we've done is appease the CI pipeline. In fact, for the purpose of catching potential security issues, it would be best if compile-time errors bump only the lowest version pin possible. Compile-time failures can be caught before going into production. They are not something we need extra protection against in the form of

You should still use type-level "semver" when distributing through channels that expect it.

But what if you're not making libraries?

There are too many variables to give a single correct answer. I will instead detail, as an example, the choices I made on a project of my own.

By example

The project I want to version is a specification in the form of text in a pdf document. What kind of changes can I expect to make?

Since this is a specification, any addition of features is equally breaking. Implementations would have to support the new feature, therefore their implementation would break. However, we can perhaps subdivide changes in a more linguistic approach.

First there are semantic changes. The intention of the document changes. The Bedeutung from Frege's Über Sinn und Bedeutung.

Then there are changes in wording that do not change the intent of the document. Frege's Sinn.

Finally, there's plain old typos.

Unfortunately there is no automated tool that can reliably tell such changes apart. A comma can be an innocent mistake or completely change the meaning of a sentence. A repeated "not" can be a real double negation or a word that accidentally got typed twice in editing. If the potential interpretation of a sentence does not change, why would I even release a new version?

It appears sensible to consider any change a breaking change. If typos are so frequent they cause too many major version bumps, I just won't release typo corrections as often. So, semantic versioning (or a similar concept) does not seem like a good choice for this project.

Going back to why I want version numbers

People should easily be able to tell who has the newest version. I need a partial order, even better if it's total.
I want to be able to find the source/repository state of a specific document.
I thought about distinguishing between large changes and small changes, but thought better of it. Any change may be breaking, so let's act like it.

Commit hashes are insufficient because I want people to immediately know which version of a document is the latest.

Commit dates would work fine for the moment. I only have 1 branch that gets released in the wild, only work from my own timezone, and don't release more often than once daily. But I don't want to be forced to maintain that situation indefinitely.

Git describe would be nice, but my CI/CD pipeline is using shallow clones, so versioning may suddenly break if the last tag is farther away than the clone depth. I could parse the result of git describe and throw an error if it is unexpected. That way bad versions don't make it to release, but how should such a broken pipeline be fixed? Bump the release number? Increase clone depth? I don't particularly like either option.

But versioning is a multidimensional problem, so I'm using two strategies. Commit hashes and commit dates. That will give me both identifiers and comparability (most of the time).
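As a rough model of that two-dimensional version, the Python sketch below pairs a commit date (for comparing) with a commit hash (for identifying). `DocVersion` and its methods are hypothetical names, and the timestamps and hashes are placeholders. The timestamp is assumed to be ISO 8601, of which only the part before the `T` is kept.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DocVersion:
    released: date  # comparability: which document is newer
    commit: str     # identity: which repository state produced it

    @classmethod
    def from_commit(cls, iso_timestamp: str, sha: str) -> "DocVersion":
        # Keep only the calendar date of an ISO 8601 timestamp.
        return cls(date.fromisoformat(iso_timestamp.split("T", 1)[0]), sha)

    def is_newer_than(self, other: "DocVersion") -> bool:
        return self.released > other.released

    def __str__(self) -> str:
        return f"{self.released.isoformat()} ({self.commit})"


a = DocVersion.from_commit("2021-11-24T18:00:00+01:00", "1111111")
b = DocVersion.from_commit("2021-11-25T09:00:00+01:00", "2222222")
c = DocVersion.from_commit("2021-11-25T17:00:00+01:00", "3333333")

print(b.is_newer_than(a))                      # different days: comparable
print(b.is_newer_than(c), c.is_newer_than(b))  # same day: neither is newer
print(b == c)                                  # but the hash still tells them apart
```

The same-day case is the "most of the time" caveat: the date alone cannot order `b` and `c`, yet the hash keeps them distinct.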

In practice

I took the above concepts into practice with the following steps:

1) Make sure the version number is being used in the document.

I'm creating the document in latex and using this trick

2) Adapt the CI/CD pipeline, my build stage:

build:
  stage: build
  artifacts:
    paths:
    - ${DOC}.pdf
  image: <my latex container>
  script:
    # Keep only the calendar date of the ISO 8601 commit timestamp.
    - export DATE=$(echo ${CI_COMMIT_TIMESTAMP} | sed 's/T.*//')
    # The full commit SHA serves as the version identifier.
    - export VERSION=${CI_COMMIT_SHA}
    # Draft passes resolve references and citations without writing a PDF.
    - pdflatex ${ARGS} -jobname=${DOC} "\def\version{${VERSION}} \input{${DOC}.tex}" -draftmode
    - pdflatex ${ARGS} -jobname=${DOC} "\def\version{${VERSION}} \input{${DOC}.tex}" -draftmode > /dev/null
    - bibtex ${DOC}
    - pdflatex ${ARGS} -jobname=${DOC} "\def\version{${VERSION}} \input{${DOC}.tex}" -draftmode > /dev/null
    # Final pass writes the PDF and sets the document date to the commit date.
    - pdflatex ${ARGS} -jobname=${DOC} "\def\version{${VERSION}} \date{${DATE}} \input{${DOC}.tex}"

DOC and ARGS are defined elsewhere in the .gitlab-ci.yml file.

Latex requires you to repeat build commands. Don't ask, I don't know either.

That's it!

In conclusion

It is harder than one would expect to do automated versioning right. The main problem is statelessness. Versions generally depend on information from past commits, but that information may not be available for efficiency. It is also quite difficult to determine useful, correct semantics for version numbers.