Edit 18.06.2021: Updated with a few notes at the bottom
Edit 30.07.2021: Updated the GitHub example with a GitLab example and renamed common-util to base-framework, as we standardized on that name on my current project (and I like it better than common-utils -.-). More notes at the bottom
Example code can be found here.
Monorepos have come up from time to time in discussions, especially since many of the big companies use this technique to structure their code. The last project I was on is nowhere near that big, and we also decided not to take it to the extreme that many others have. This post is about how we did it, how it worked for us, and what pitfalls we experienced.
First off, we changed to a monorepo while the organization was slowly moving code over to GitHub. At the same time, GitHub announced GitHub Actions and GitHub Package Registry. Before this, we had a typical BitBucket + Jenkins (per team) + Nexus setup, all behind Citrix and inaccessible from the internet. The organization itself has some hundreds of developers, but since every team has a lot of freedom when it comes to these choices, we decided to try a monorepo.
We only structured OUR product as a monorepo, not the whole organization. So this was a team choice, not an organizational one.
- Atomic changes. One pull request can contain changes in many places, e.g. contract + producer + consumer + documentation. Before, we had to do 4 pull requests, usually with days between them, so when doing QA of the consumer code people had to go back and forth to what was done in the contract and producer, and then verify the documentation.
- Easier to search the code base for examples.
- Common code / reuse. It is very easy to create common code in a monorepo, just make sure to have good conventions. E.g. we used Spring Boot's Auto Configuration in our common libraries and did not make too many new Maven modules. Instead we bundled many "features" into that common library, with their transitive dependencies declared as provided. I'll add an example of this in the GitHub example.
- Management of dependencies. We were very invested in Spring Boot, and before the monorepo, updates usually only reached applications that had changes, not the "stale" ones. With the monorepo, we only updated the parent pom and triggered a full redeploy (usually manually, since we don't have that many apps). This made it easy to stay on the latest dependencies all the time.
- Changes could be tracked across applications to the same commit. This made it easier in fault situations where multiple applications had been changed as part of the same feature.
- Large scale refactoring. When we decided to change formatting conventions to IntelliJ default, we did it for the whole product at once. We did other refactoring as we went along, but none that hit that hard in many applications and libraries.
- One place for code, tools, contracts and documentation. Yes, we even moved our documentation from Confluence to GitHub, using Asciidoc and the Asciidoc Maven Plugin to update a GitHub Pages site whenever the documentation changed. We had to get our non-technical people to learn Asciidoc, and they caught on more easily than anticipated.
- Mostly an independent project in the beginning, so few had dependencies on us. But the end result is a product that basically everyone in our organization has to consume.
- Max 12 developers (I think it was).
- About 20 different applications within our product.
- Very autonomous with a high degree of freedom for our team.
- Everything on Kubernetes and everyone could deploy to production.
- Trunk-based development. PR to master, deploy to dev and prod if tests go green, can optionally go to just dev if you are on a branch with name
- We don't share our Java POJOs with other teams, since that creates a dependency on all fields in the contract, not just the ones you use. It's up to the consumers to implement their own POJOs with just the fields they require, following the tolerant reader pattern: only fail on breaking changes in things you depend on. Internally we do share our own contracts, since all contract changes affect producers and consumers anyway and trigger redeploys. So we just made one module for all our internal contracts (contract-json / contract-avro), though it will add a bit of overhead in the build if they get very big.
- We changed our test strategy from the traditional test pyramid (unit tests on class level, component tests that started the context, and integration tests) to one inspired by Spotify. The reason was that we were iterating rather fast and often changed or refactored code, which always leads to people pulling their hair out over existing unit tests. We moved to tests more on the level of input + state = output, i.e. given, when, then on a more functional level, since these were our use cases within every application. This worked great for us, especially for iteration speed, since most changes did not change the functionality but just added to or refactored the code.
- We removed usage of external components in our tests. E.g. Embedded Kafka was often used, but it took a lot of time to run and was often error-prone (race conditions, for example). We could have fixed that, but we mostly did not get anything out of these tests except for testing Spring Kafka: we hit the same listener in our regular, more functionally oriented tests anyway, so the only difference was whether Spring Kafka or the test itself called our listener. There was also more need for these kinds of tests when we were new to Kafka, but as we became more familiar with it, these tests never caught faults that would happen in the environment. I wish we hadn't added Embedded Kafka to all our apps but just one or two for learning purposes, but you learn as you go.
- Almost no manual testing (there was some on the front-end).
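The tolerant reader idea mentioned above can be sketched in plain Java. This is only an illustration, not our actual code: the class OrderView and the fields orderId and amountInCents are made-up names, and the payload is assumed to be already parsed into a Map. In a real application a JSON library configured to ignore unknown properties would typically do this mapping.

```java
import java.util.Map;

// Hypothetical consumer-side view of a producer's contract:
// it only declares the fields this consumer actually uses.
final class OrderView {
    final String orderId;
    final long amountInCents;

    private OrderView(String orderId, long amountInCents) {
        this.orderId = orderId;
        this.amountInCents = amountInCents;
    }

    // Tolerant reader: pick the fields we depend on, ignore everything else,
    // and fail only when a field we need is missing or has the wrong type.
    static OrderView from(Map<String, Object> payload) {
        Object id = payload.get("orderId");
        Object amount = payload.get("amountInCents");
        if (!(id instanceof String)) {
            throw new IllegalArgumentException("orderId missing or not a string");
        }
        if (!(amount instanceof Number)) {
            throw new IllegalArgumentException("amountInCents missing or not a number");
        }
        return new OrderView((String) id, ((Number) amount).longValue());
    }
}

class TolerantReaderDemo {
    public static void main(String[] args) {
        // The producer added "newField"; this consumer does not care and keeps working.
        Map<String, Object> payload = Map.of("orderId", "A-1", "amountInCents", 1250, "newField", "x");
        OrderView view = OrderView.from(payload);
        System.out.println(view.orderId + " " + view.amountInCents);
    }
}
```

With this shape, the producer can add fields freely, and the consumer only breaks when something it actually reads is removed or changes type.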
- Check out tooling before starting with a monorepo! E.g. you don't want to build the whole project on every change. GitHub Actions supports path-scoped workflows (e.g. paths: apps/app1/**). This saved us, but I have seen people create their own shell scripts that check the git commit log.
- Use a CODEOWNERS file to automatically assign people with specific domain/application knowledge. Code owners can be scoped to a path, e.g.: /apps/app1/* @GitHubUser1. See: GitHub Doc. There is also an example in the code provided.
- Decide on some conventions early: code style, formatting, monorepo layout, testing, etc. We went pretty heavy on Spring Boot and some basic conventions. Even though Spring Boot was too heavy in some cases, this made it easier for anyone to jump into any application, because all code and applications were understandable for everybody (of course you had to learn what the app did, but the style, the libraries, etc. were the same).
- I've heard Gradle might be better for Monorepos, Bazel definitely is but that was too big of a leap for us at that time.
- If you want integration tests, maybe they can be of a more common nature? Often you use an abstraction on top, e.g. Hibernate, Spring Kafka or Kafka-Client. Instead of re-testing these libraries in every application, you could make one test module per technology (using e.g. Testcontainers) if you don't feel comfortable without these tests, e.g. test/kafka-integrationmodule. These would not have to run on every application change, but rather ad hoc or when upgrading third-party dependencies such as Spring Kafka. I won't cover this here.
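For those who end up writing their own "what changed?" tooling instead of using path-scoped workflows, the core logic is small. Below is a rough sketch in Java, under two assumptions that are mine and not taken from our setup: the list of changed paths comes from somewhere else (e.g. the output of a git diff), and any change under libs/ is treated as affecting every app. The class name AffectedApps is made up.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class AffectedApps {

    // Returns the apps to rebuild for a list of changed file paths (relative to the repo root).
    // Assumption: a change under apps/<name>/ affects only that app,
    // while a change under libs/ affects every app. Everything else (docs, tools) affects none.
    static Set<String> forChanges(List<String> changedPaths, Set<String> allApps) {
        Set<String> affected = new TreeSet<>();
        for (String path : changedPaths) {
            String[] parts = path.split("/");
            if (parts.length >= 2 && parts[0].equals("apps") && allApps.contains(parts[1])) {
                affected.add(parts[1]);
            } else if (parts[0].equals("libs")) {
                affected.addAll(allApps);
            }
        }
        return affected;
    }

    public static void main(String[] args) {
        Set<String> apps = Set.of("bar", "baz", "foo");
        // Only foo changed; the docs change triggers no app build.
        System.out.println(forChanges(List.of("apps/foo/pom.xml", "docs/index.adoc"), apps)); // prints [foo]
        // A shared library changed, so every app is rebuilt.
        System.out.println(forChanges(List.of("libs/contracts-json/order.json"), apps)); // prints [bar, baz, foo]
    }
}
```

Treating every libs/ change as a full rebuild is the blunt option; a finer-grained version would need to know which app depends on which library, which is exactly what the build tool already knows.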
First decide on a convention for the monorepo. In this example, the format will be:

    root
      .github
      .tools
      apps
        bar
        baz
        foo
      docs
      libs
        utils-common (changed to base-framework)
        contracts-json
The .github folder contains the workflows for GitHub Actions and a file for Dependabot. Dependabot is awesome.
Explanations of workflows:
We had one workflow per app. I think this can be shortened, but it works for us.
To build, we use:
mvn clean package --projects :bar --also-make --threads=2 --batch-mode
This runs the build for project :bar (only that app). One commit can trigger many workflows, e.g. changes to foo and bar will trigger the workflow for both, in parallel.
--also-make (-am for short) also builds the modules in the repo that the project depends on, so this will build common-utils (changed to base-framework) and contract-json, but not docs or other modules you might have in libs. See documentation here.
We went for 2 threads because, at the time of writing, that is how many CPU cores you get on GitHub Actions.
--batch-mode avoids printing every KB downloaded to the console when downloading dependencies.
The .tools folder contains different tools for your product, e.g. we had a lot of IntelliJ .http files to call our APIs. These were also used as examples for other teams on how to call our APIs, but mostly for our own usage.
The apps folder contains all applications. I have seen examples where people have used services etc., but for us, apps was what we went for.
Every app depends on libs/common-utils (changed to base-framework), which contains code that is enabled or disabled based on what's on the class path and on properties (through Spring Auto Config).
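To make the "enabled or disabled based on class path and properties" idea concrete, here is a small plain-Java sketch of the decision itself. It deliberately does not use Spring; Spring Boot provides annotations such as @ConditionalOnClass and @ConditionalOnProperty for this. The class FeatureToggle and the property name base-framework.demo.enabled are made up for illustration.

```java
import java.util.Map;

final class FeatureToggle {

    // True when the given class can be found on the class path (without initializing it).
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, FeatureToggle.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // A feature is enabled when its marker class is present
    // and the property is not explicitly set to "false".
    static boolean isEnabled(String markerClass, String propertyName, Map<String, String> properties) {
        boolean switchedOn = !"false".equalsIgnoreCase(properties.getOrDefault(propertyName, "true"));
        return isOnClasspath(markerClass) && switchedOn;
    }

    public static void main(String[] args) {
        // java.util.List is always present, so only the (hypothetical) property decides here.
        System.out.println(isEnabled("java.util.List", "base-framework.demo.enabled", Map.of()));
        System.out.println(isEnabled("java.util.List", "base-framework.demo.enabled",
                Map.of("base-framework.demo.enabled", "false")));
    }
}
```

This is why the transitive dependencies can be declared as provided: an app that does not bring, say, a Kafka client simply never activates the Kafka-related part of the common library.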
The docs folder contains all the documentation, both internal and external. We even included postmortems, but kept them in another folder that was not published to GitHub Pages, and since our repo was private, only we could access them.
We published using the gh-pages branch, but it seems this is not available for every repository. Read here. The example is still put up.
The libs folder contains all the common libraries.
We had multiple common libraries here, e.g. we had contracts that were only relevant for a few applications. We got these contracts as XSDs from external providers. To avoid having all our apps build these contracts (as they would if we had put them in a common module), we made separate modules for them that were only included as dependencies in the apps that needed them.
We had a module named test for applications only relevant for tests (e.g. an application that creates Kafka topics before all the other apps start when running locally using docker-compose) and possibly for integration tests (as mentioned above; we never used it for that, but we could and will when the need arises).
We also had multiple other dot-folders, like .docker for all the files needed to run parts of or the whole repo using docker and docker-compose.
After migrating most of our code base to the monorepo, developers often hated it when they had to work on non-monorepo applications. The developer experience was just, for us, better and easier. This is probably not only because of the new structure of our code; GitHub Actions has also solved many of the tooling problems that monorepos often have.
There was a lot of learning, and the culture had to change. This was especially hard since we could not put our regular development on pause while migrating, which basically split the team somewhat.
We migrated app by app, and this was a bit cumbersome for some parts. For example, we had some common libraries and contracts that were on Nexus and still highly used. When we copied these to the monorepo, we basically had 2 places to maintain them (since some apps still depended on the ones in Nexus, and migrated apps on those in the monorepo). This could have been solved more elegantly if all dependencies had been accessible from the internet, but the Nexus we had was on our internal network. If it had been accessible, we could have added a settings.xml file to resolve the dependencies in the monorepo, or the other way around: if the monorepo were the master, we would have had to publish the dependencies to GitHub Package Registry. Development on the apps that depended on Nexus was also done behind Citrix, so there were many walls to climb. We ended up duplicating for a while and it worked fine, but it was an extra inconvenience.
Overall, the decision to migrate to a monorepo was the right one given our situation, and the team is happier for it. But many things had to change for it to work, e.g. our test strategy. It would have worked without changing it, but then we would have ended up with tons of Maven modules if we had done it the same way as before.
It's been a while since this post was made and I'm currently at a new project and customer. They had already used a monorepo to some degree before, but it was a bit, how can I say it, overused or misused? E.g. there were a bunch of Maven modules for every kind of small lib you could imagine. One Maven module could have a few classes and that was it. This, combined with a lack of code stewardship, made the codebase somewhat of a nightmare, and when I came in they were already on the way to migrating everything into a new monorepo that used Kubernetes instead of Liberty.
This made me realize that code stewardship and shared ownership within the team over multiple years is a must. The last place I was at (the one described above) has not gotten that far yet. It's just a few years down the line for them, so these problems have yet to manifest themselves. So, I'm updating this post as a heads up! Luckily, we have a team that is now getting more autonomous, which hopefully will make it easier to keep up code ownership and stewardship over time. But consider this: in a monorepo, the broken window theory might apply to the whole repo (instead of just the current repo, if you have many).
Another note: at the new project we use GitLab instead. At least at the time of writing the original post, GitHub Actions did not have a manual step, so we made a convention of creating a dev/** branch if we wanted things out in dev (I don't know if GitHub has this feature now). GitLab, however, has the option to create a manual trigger. We use this to enable deployment of applications to any test environment (yes, back to a few of those again, but still not that many) no matter what branch. I kind of like this, it's easier.
A comment was posted asking for a GitLab example, so I updated the sample based on what we do at the place I am now. It also uses the manual trigger for side branches (mentioned above). Another side note: we are migrating a lot of applications and some of them lack a good test suite, so we also have manual triggers for prod deployments on those. This means we can specify which apps are "safe enough" for automatic deploy to prod and which are not. Our goal is to have automatic deploy to prod when merging to master, but not all apps are up to it yet.