Albert Zeyer

Posted on Sep 2, 2021

How to cleanup a branch (PR) with huge number of commits

#help #discuss #github #git

I was trying to implement some new feature in some larger somewhat messy project (RETURNN but not so relevant).

So I created a new branch, also made a GitHub draft PR (here), and started working on it.

While working on it, it turned out that I needed several other things fixed or extended first. There was not really a clear boundary of these other things (i.e. whether they are to be considered as totally independent), nor was it really clear from the beginning on what actually is needed. This only became clear more and more along the way.

In other cases, this was still not too much, and easy to manage. But for this particular thing, it became quite extreme.

So now I have 90 commits (when you look at the PR at some later point, maybe less because I already cleaned up).

What is a good strategy now to handle them? Some of them can be squashed together. The PR can also be split up because they touch upon several things, so each thing maybe could be moved over into an individual branch (and PR). But they partly also depend on each other in the way that they use a newly introduced feature/function, or tests would fail without the other thing, or so.

Squashing can potentially already clean up a lot. Because there are some commits which introduce TODO comments, and then later implement this. Or sometimes I made some change, and then later I rewrote the code to do it differently.

Is there some tool which can automatically tell which commits are good candidates to be squashed together? E.g. because they modify code in the same function or in nearby code.

My usual strategy (on smaller PRs) is that I first reorder the commits logically as far as possible, and then in the next step I squash commits together.

When there are other unrelated changes in the PR, I reorder them to be first. And then I create a new branch (and PR) consisting of the first set of changes. Then I wait for the test suite to run through for the new PR. Then I merge the PR into master. Then I rebase the main PR on master. And I repeat doing this process. It can require further work when the tests do not run through on some individual new PR.

This way of working takes lots of time, and I need to wait often. The test cases take about 10 minutes to finish on GitHub CI. So for everything I do, I need to wait 10 minutes. So most of the time I would just wait. This often seems very unproductive to me.

For the reordering and squashing and potential further changes, I use the Git GUI in PyCharm, by interactively rebasing again and again.

Am I doing something suboptimally? How can I do it better?

Are there other tools which can help me somehow?

Are there good strategies in general how to deal with this situation?

Maybe I could have avoided the situation somehow by working in a different way? Usually the branches (PRs) are much smaller, so this is much simpler to handle then. Should I actively try to keep changes minimal in some branch, so I don't end up in such situation?

I guess one thing which could avoid such situation is to reduce the amount of technical debt in this project. This is one of the reasons that it was not clear from the beginning what is actually needed.

Actually after formulating all this, I remembered that I did ask already a similar question on StackOverflow before on "How to find pairs/groups of most related commits" (my memories are bad...).
And actually I also did implement a similar script already here, description of the algorithm here. The algorithm is quadratic in the number of commits, and somewhat slow. So maybe already too slow here.

(Note this is a cross post from Reddit. But so far there is no real good solution.)

(I asked before whether dev.to is a good fit for a question like this. I usually use StackOverflow but usually they do not want such opinion-based questions.)

Top comments (15)

Albert Zeyer • Sep 2 '21

Hm, there was some post here, but now it's gone, while I was already answering. Here my answer:

We already do trunk-based development. And also we have lots of CI tests. And exactly of this, I was not merging it in before because still some tests were failing. In the end after everything was fixed, it turned out to be a huge number of commits. And this is the situation now. I.e. I want to split this up.

Usually this does not happen like this, and changes turn out to be smaller. But because we require that tests are always passing, one particular change, fix, or extension needs follow-up fixes of other things.

Benjioe • Sep 2 '21

Do you have some automatics test that's taking long times ?

Albert Zeyer • Sep 2 '21

There are about 50 GitHub CI jobs, each running a lot of different tests. The longest job takes about 10 minutes but most take only 1-3 minutes. But overall until GitHub runs through all of them, it takes maybe 12 minutes or so, once it started, because many run in parallel.

These are run on every push and for every PR. When I keep updating the PR, this can queue a lot of CI runs. And then sometimes I need to wait hours until I see the CI results.

This is a big problem. Because we must know that the tests are passing before we merge something in. So I'm often just waiting.

Benjioe • Sep 2 '21

And could-you rewrite thoses test using Test double to make them quicker ?

Albert Zeyer • Sep 2 '21

I think most tests on their own are already pretty small (most probably take about 0.1 sec or so). I don't think you can optimize much there.

This longer 10 minute test runs PyCharm CLI checks on code style. I haven't really found ways to speed it up.

Maybe with a lot of effort, you could somehow reduce it a bit. But even if you maybe reduce it by half (unrealistic), this problem would still persists, that you often need to wait (although of course to a lesser degree).

I think we are still doing fine with about 10 minutes per CI run. From what I have heard from Google and elsewhere, it can take much longer. And they also manage to still stay productive. So maybe they don't need to reiterate too often for one set of changes, or they don't wait each time for the tests to pass. But this is only because I now want to split it up and actually figure out working independent sets of commits (working = passing tests).

Benjioe • Sep 2 '21 • Edited

Do you run thoses fast test on your own machine before commit them (before CI) ?

Albert Zeyer • Sep 2 '21

Yes sure. I do that a lot. Esp while debugging.

For this cleanup and the search of good subsets of commits which, I maybe don't do it enough. I mostly look at the big list of commits, and find related commits (via my script, or just manually), and then structure and partition it logically, and then push it to individual branches/PRs, and then wait for CI to finish.

Running them locally (before or after the push) could maybe speed this up. But this would maybe distract me from continuing with the partitioning/structuring of the other commits, because then I would wait for the local tests to finish. This is still a couple of minutes, so I would get out of the flow and train of thoughts.

Or maybe I could have a separate checkout of the repo, where I run the tests locally in the background. Then I could continue with structuring or other things in the meanwhile. But managing this seems a bit annoying and still manual work.

This is basically why I posted the question here. Everything what I can think of does not really seem to be optimal. Maybe I'm not knowing about some better workflow, or better tools.

Benjioe • Sep 2 '21

Ok, I think I get it:

Your tests are too slow (but can't be faster) so launching them get you out of the flow and train of thoughts.
You want's independent commits who's can be revert without breaking changes.

So you write the full features with refectaroing without test, send them to the CI to check regressions, fix everythings and when it's good, rewrite history to have independent commits.

Albert Zeyer • Sep 2 '21

Yea, this is what I do. But this is a lot of work, with lots of effort, lots of trial and error, and lots of waiting. And I feel like there should be tools which at least support me on this to some degree, to make it a bit more convenient. Some of these things could be automated, or semi-automated. E.g. there could be a tool or script which automatically figures out these set of commits which can be applied individually without breaking the tests.

Benjioe • Sep 2 '21 • Edited

And If you don't test regression but only code style of the new code (configuring your PyCharm with the same rules as CLI) ... tests are still too slow ?

Albert Zeyer • Sep 2 '21

Yes. Checking the code style is actually slower than the regression tests. But also, the regression are even more important than the code style.

Benjioe • Sep 2 '21 • Edited

Yeah, I was thinking about Open/Close principle so you don't always have to launch regression testing.

So, yeah a tool like you do or like gitflow cli with a set of command to fix CI errors while keeping your history clean. (but I don't know any :'().

Or git rebase --autosquash

Benjioe • Sep 2 '21

Yes, I deleted it because your on a open source project, so that's not relevant ...

Albert Zeyer • Sep 2 '21

Yes? What was not relevant about it? You mentioned trunk-based development, which is relevant. I thought also your other points might have been valid options.

Benjioe • Sep 2 '21 • Edited

Yeah, you true, so that's was :

Maybe you can :

Create a new branch from master when you need unrelated changes, merge it and rebase your feature branch.
Or use Trunk Base Development where your branches lives 1 day max
Ask yourself, do your really need a clean history ? Or good comments are good enough ? (not for code-maat)

View full discussion (15 comments)