Patrick Böker

Posted on Mar 26, 2021

Hi (and an introduction to the Raku CI bot)

#raku #ci #obs #azure

Hi!
I'm Patrick Böker, patrickbkr on GitHub, patrickb on IRC. I'm a software developer living in Germany near Stuttgart. I have dabbled with many different programming languages and tech stacks. Professionally I'm mostly working with Java, VB.net, VBA and Raku. In my free time I'm involved in the development of the Raku programming language itself.

I plan to write about any IT-related topics I feel are worth sharing. As I am currently most involved in the Raku language, I expect to write mostly about things related to it, but don't be surprised if a post covering some other topic slips in.

I was recently awarded a grant as part of the Raku Development Fund by the Perl Foundation to work on the Continuous Integration (CI) pipeline of the Rakudo compiler. Raku*do* is the rather fitting name of the compiler and runtime that runs Raku and is currently developed in unison with the language itself. I try to not mix up the two, Raku and Rakudo, but I will fail from time to time, so don't be too confused when I use them wrongly. I will post about my progress on that project here.

The rest of this post is an introduction to that project.

The Problem

It all started with an issue in our dear Problem Solving repository. That repository is the tool by which larger changes to the Raku language and related topics are discussed, reviewed and approved. This is akin to PEPs in Python or JEPs in Java.

That issue states that the development process of Rakudo has a set of deficiencies, which more than once resulted in broken releases and in our CI failing on master for longer periods of time. That hurts.

Deficiencies of our process include:

Our CI only covers a small set of environments
The CI is not entirely reliable
There are flappers in our test-suite
People can and do push changes directly to master, sometimes breaking it

The Plan

The software I intend to write as a part of the solution is named RakuCIBot or RCB for short. So let's address the problems one by one.

1. Our CI only covers a small set of environments

The Raku community doesn't have the financial resources and manpower to set up and take care of a custom CI solution on our own servers. So we try to make use of the free offerings some CI providers kindly provide to open source projects. It's not strictly necessary for the solution to be free, as we might be able to afford a moderate monthly cost for a CI offering, but as long as there are free offerings for open source projects out there, it makes sense to go with them. Our CI, as it currently stands, relies solely on Microsoft's Azure CI, as they offer agents running all three major operating systems (Windows, Linux and MacOS) and do not currently impose resource limits on projects. We previously used TravisCI, AppVeyor and CircleCI, but none of them both supports all three major OSes and comes without usage limits. (Some of those older CI integrations haven't been shut down yet, but we plan to do so soonish.)

When developing a programming language that directly interfaces with and abstracts the low level APIs of the operating system and (in case of our JIT compiler) even the hardware itself, one does want coverage of all the hardware, operating systems and libraries we can run on. Our current CI, which only runs on x86-64 and only on Windows, Linux and MacOS, doesn't suffice. We do want testing on, amongst others and in no particular order:

OpenBSD
AIX
SystemZ
ARM (especially now that Apple moved over to ARM)
musl (used in Alpine Linux which is very popular in containers)
glibc 2.17 (that's the oldest version in use in a major distro - CentOS 7)
Something Debian-y
Something RedHat-y

There currently exists no CI offering that provides support for all of these. Especially the rarer hardware platforms are difficult to find support for. There is one rather strange exception though. The Open Build Service, developed and hosted by SUSE, provides a free to use (and open source!) build infrastructure to be used for building packages for a range of Linux distributions on a very wide range of hardware platforms. The OBS is no CI platform, it's intended to be used to build (and deploy) distribution packages. But technically it does run tests as part of its normal work of creating packages and we do have approval to bend the OBS service into a CI platform for Rakudo.

So the plan is to rely on two CI offerings, AzureCI for MacOS and Windows, and OBS for Linux on all sorts of different Linux distributions and hardware platforms.

The Rakudo core projects use Git and GitHub for source code management. So it's desirable to have the CI integrated in GitHub. OBS' APIs make it tricky to use them as a CI directly and they do not offer any integration into GitHub (remember: OBS has no focus on being a CI platform currently), so we need a piece of software that acts as a middleman and does the bending that's necessary to use it as such. That's one of the tasks the RCB I plan to develop should perform.
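One small part of that bending is translating OBS build states into something GitHub can display. A minimal sketch in Python, assuming RCB reports results as GitHub check runs (a status plus an optional conclusion); the table and the function name are hypothetical, and the selection of OBS codes is only an illustration:

```python
from typing import Optional, Tuple

# Hypothetical translation table: OBS package build code -> (status, conclusion)
# of a GitHub check run. Which OBS codes matter is an assumption of this sketch.
OBS_TO_GITHUB = {
    "scheduled": ("queued", None),
    "blocked": ("queued", None),
    "building": ("in_progress", None),
    "succeeded": ("completed", "success"),
    "failed": ("completed", "failure"),
    "unresolvable": ("completed", "failure"),
    "broken": ("completed", "failure"),
}


def to_github_check(obs_code: str) -> Tuple[str, Optional[str]]:
    """Translate an OBS build code; unknown codes are treated as still queued."""
    return OBS_TO_GITHUB.get(obs_code, ("queued", None))
```

Treating unknown codes as "queued" is one possible choice; it keeps a PR from turning green or red on a state the middleman does not understand.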

Extending RCB to also test on other CI offerings should be possible. SourceHut (it's a paid service) for example would give us support for several BSDs.

2. The CI is not entirely reliable

The CIs we have used up to now have all been occasionally failing for reasons unrelated to the actual tests. Reasons include the CI platform being offline and the GitHub APIs being temporarily unavailable for the CI provider. As CI providers usually don't retry a build once it's failed, such failures are annoying. They alarm committers of a problem where there is none or block a PR from being merged. The only solution I could think of is to introduce a middleman that acts like a message queue. That's another task the RCB should perform. It should be able to receive push notifications from the respective APIs of GitHub and the CI backends to keep the CI processing responsive, but also poll for changes to be resilient to temporary outages. There should be a focus on making sure RCB will not accidentally miss or lose messages itself. Otherwise we'll have a system as unreliable as before, but more complex.
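To make the "don't lose messages" idea concrete, here is a minimal sketch in Python. It assumes every event carries a unique ID and that both the push handler and the poller write into the same persistent inbox; the class and its methods are hypothetical and say nothing about how RCB will actually be built:

```python
import sqlite3
from typing import List, Tuple


class EventInbox:
    """Stores each event exactly once, whether it arrived by push or by polling."""

    def __init__(self, path: str = ":memory:") -> None:
        # A file path instead of ":memory:" makes the inbox survive restarts.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS events ("
            " id TEXT PRIMARY KEY,"
            " payload TEXT NOT NULL,"
            " done INTEGER NOT NULL DEFAULT 0)"
        )

    def record(self, event_id: str, payload: str) -> bool:
        """Persist an event; return False if it was already known."""
        with self.db:  # commits on success, rolls back on error
            cur = self.db.execute(
                "INSERT OR IGNORE INTO events (id, payload) VALUES (?, ?)",
                (event_id, payload),
            )
        return cur.rowcount == 1

    def pending(self) -> List[Tuple[str, str]]:
        """Return unprocessed events in arrival order."""
        rows = self.db.execute(
            "SELECT id, payload FROM events WHERE done = 0 ORDER BY rowid"
        )
        return rows.fetchall()

    def mark_done(self, event_id: str) -> None:
        """Mark an event as processed, only after handling it succeeded."""
        with self.db:
            self.db.execute("UPDATE events SET done = 1 WHERE id = ?", (event_id,))
```

Because `record` is idempotent, the poller can blindly re-submit everything it sees; events already delivered by a push notification are simply ignored.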

3. There are flappers in our test-suite

Flappers are tests in our test-suite that are not entirely reliable. Such a test sometimes fails for no valid reason, producing a false negative. The more rarely a test fails spuriously, the more difficult it is to reproduce and fix. Preferably we want to fix flappers, but sometimes it's not easily possible to do so (missing manpower being the usual reason). So we should make sure flappers don't hurt the CI. The simple solution is to just re-run a CI run if it fails. That will help with the stability of our CI, but is counterproductive for actually fixing the flappers. When, in the case of a failed CI run, we just re-run the tests and don't do anything else, we basically hide an error. Error-hiding, aka CATCH { #`[Just ignore.] }, is a famous anti-pattern we should strive to avoid. So what we'll do instead is maintain a manually curated list of known flappers. That list is read by RCB, and only CI failures that match one of the listed flappers are re-run. In addition, RCB will keep track of the failures for each flapper, so we can get a feel for how often the flappers hit and ideally get a hint for fixing them.
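The re-run decision could look roughly like the following Python sketch. The list format (a name plus a regular expression matched against the build log), the example entry and all names are assumptions made up for illustration:

```python
import re
from collections import Counter

# Hypothetical flapper list: name -> pattern that identifies it in a build log.
FLAPPERS = {
    "t/example-flapper.t": re.compile(r"not ok \d+ - example flaky assertion"),
}

# How often each known flapper was seen, to help prioritise fixing them.
flapper_hits = Counter()


def should_rerun(log: str) -> bool:
    """Return True only if the failed log matches a known flapper, counting hits."""
    matched = [name for name, pattern in FLAPPERS.items() if pattern.search(log)]
    for name in matched:
        flapper_hits[name] += 1
    return bool(matched)
```

A failure that matches no entry is reported as a real failure right away, so nothing unknown gets hidden by a silent retry.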

4. People can and do push changes directly to master

We'll only allow pushing changes to master that have been successfully CI tested via a pull request (PR). A PR is a GitHubby thing where one proposes a change for review / CI test and which can be merged in a separate step. For this workflow to work it's absolutely necessary that our CI is reliable, otherwise PRs can and will become blocked by false negative CI results, providing an endless source of annoyance for our dear developers. We do not want that. They are already tormented on behalf of our users. We should spare them additional torture by the infrastructure.

To ease the pain of a more complex process of getting changes into the master branch, I want the RCB to automatically merge PRs when requested to do so (via a magic word in a comment of the PR) and the CI tests are successful. Then a PR submitter doesn't need to revisit her PR later to merge it.
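As a minimal sketch of that rule in Python: a PR is merged only when someone asked for it and every check succeeded. The magic word, the data class and the function are all hypothetical placeholders:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical magic word; the real one has not been decided.
MERGE_COMMAND = "{merge on success}"


@dataclass
class PullRequest:
    comments: List[str] = field(default_factory=list)
    check_conclusions: List[str] = field(default_factory=list)


def ready_to_merge(pr: PullRequest) -> bool:
    """True if a merge was requested in a comment and all CI checks succeeded."""
    requested = any(MERGE_COMMAND in comment for comment in pr.comments)
    # An empty list means no CI result yet, which must not count as success.
    all_green = bool(pr.check_conclusions) and all(
        conclusion == "success" for conclusion in pr.check_conclusions
    )
    return requested and all_green
```

A real implementation would presumably also check who wrote the comment; that is left out here.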

What's the state of things?

To reduce the chances of unforeseen difficulties as early as possible, I tried to approach the parts of this project with the largest potential for problems first. In my experience it's external components out of my control that are the most trouble. The external components of RCB are GitHub and the CI backends, namely AzureCI and OBS. So I started by looking into those first.

Interfaces: GitHub

The GitHub interfaces gave me the least trouble of the three. There is some good documentation that even includes a tutorial on how to integrate a CI server and there already is a Raku GitHub API library. I started extending the functionality of that library to also support the Checks and Pulls APIs. I'm not done with them yet. There is more to be implemented, e.g. extending the Issue API to support comments and implementing support for webhooks.

Interfaces: AzureCI

The AzureCI documentation is rather messy. In some part that's surely caused by the huge scope that the Azure services cover. But I nevertheless find it difficult to get at the information I'm interested in. The relevant subproject of Azure is Azure Pipelines, which is part of Azure DevOps. One specific bit of information I had almost thought impossible to get at is the build logs of the individual jobs in a stage. I finally found out that the only(?) way to access those logs is via the Timeline API. That API will return the individual bits of the log of a build (which is usually made up of multiple jobs that run in parallel) in temporal order. Using the parent IDs each individual log bit contains (they are not documented at all), it's possible to reconstruct the build logs of the individual jobs as they are seen on the web interface (that link will be dead in a few weeks, just ignore it then).
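The reconstruction boils down to grouping records by their parent. A minimal Python sketch, assuming each timeline record is a dictionary with "id", "parentId", "type", "name" and "order" keys; these field names and the sample data are assumptions of this sketch, and fetching the actual log text for each record is left out:

```python
from collections import defaultdict
from typing import Dict, List


def tasks_by_job(records: List[dict]) -> Dict[str, List[str]]:
    """Group task records under the name of their parent job, in run order."""
    # Map job IDs to job names so tasks can be attached via their parent ID.
    jobs = {r["id"]: r["name"] for r in records if r["type"] == "Job"}
    grouped = defaultdict(list)
    for record in sorted(records, key=lambda r: r.get("order", 0)):
        if record["type"] == "Task" and record.get("parentId") in jobs:
            grouped[jobs[record["parentId"]]].append(record["name"])
    return dict(grouped)


# Made-up sample data in the assumed shape.
SAMPLE_RECORDS = [
    {"id": "j1", "parentId": None, "type": "Job", "name": "linux"},
    {"id": "j2", "parentId": None, "type": "Job", "name": "windows"},
    {"id": "t2", "parentId": "j1", "type": "Task", "name": "test", "order": 2},
    {"id": "t1", "parentId": "j1", "type": "Task", "name": "build", "order": 1},
    {"id": "t3", "parentId": "j2", "type": "Task", "name": "build", "order": 1},
]
```

Calling `tasks_by_job(SAMPLE_RECORDS)` yields one ordered list of steps per job, which is the shape needed to stitch the per-job logs back together.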

Otherwise I think all the necessary bits to control AzureCI pipelines are there. I hope it's now a SMOP - in the literal sense.

Interfaces: OBS

OBS has an API. It's rather straightforward and smallish. It's necessary to understand how OBS itself works for the API to make sense though, as it rather literally maps to the workflows in OBS itself. Also it's an XML-only API, JSON is not supported. There is one unfortunate limitation in how OBS (and thus its API) works. In OBS it is not possible to retrieve the build log of a package apart from the most recent one. As a consequence, build jobs will have to happen serially, one after the other, instead of in parallel.
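Consuming an XML-only API mostly means walking an element tree. A minimal Python sketch, assuming a build result document with one result element per repository and architecture, each containing status elements per package; the sample document is made up and only approximates what OBS returns:

```python
import xml.etree.ElementTree as ET
from typing import Dict, Tuple


def parse_results(xml_text: str) -> Dict[Tuple[str, str, str], str]:
    """Map (repository, arch, package) to the build code reported for it."""
    root = ET.fromstring(xml_text)
    results = {}
    for result in root.iter("result"):
        for status in result.iter("status"):
            key = (result.get("repository"), result.get("arch"), status.get("package"))
            results[key] = status.get("code")
    return results


# Made-up sample in the assumed shape, not a captured OBS response.
SAMPLE = """
<resultlist>
  <result repository="openSUSE_Tumbleweed" arch="x86_64">
    <status package="rakudo" code="succeeded"/>
  </result>
  <result repository="openSUSE_Tumbleweed" arch="aarch64">
    <status package="rakudo" code="failed"/>
  </result>
</resultlist>
"""
```

With the results keyed by repository, architecture and package, RCB could report each platform as a separate check on GitHub.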

Core Architecture

I fleshed out the architecture of the core logic of the application already. I'll explain it in an upcoming post.

Next steps

Set up the frame of the application. That includes setting up a skeleton application with the initial setup stuff to get a do-nothing-application with some tests working, a container pipeline and a deployment script to push it to a server.
Get enough of the GitHub integration working to have RCB act as a GitHub CI backend.
Get the OBS integration working.
Flesh out the core logic.

More to come.

DEV Community