I was trying to implement a new feature in a larger, somewhat messy project (RETURNN, but that is not so relevant here).
So I created a new branch, also made a GitHub draft PR (here), and started working on it.
While working on it, it turned out that several other things needed to be fixed or extended first. There was no really clear boundary around these other things (i.e. whether they should be considered totally independent), nor was it really clear from the beginning what exactly was needed. This only became clear more and more along the way.
In other cases, this was still not too much and easy to manage. But for this particular feature, it became quite extreme.
So now I have 90 commits (if you look at the PR at some later point, it might be fewer, because I already cleaned up some of them).
What is a good strategy now to handle them? Some of them can be squashed together. The PR could also be split up, because the commits touch several separate things, so each thing could be moved into its own branch (and PR). But they also partly depend on each other, in that one uses a newly introduced feature/function, or its tests would fail without another change, and so on.
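One way to keep the dependencies between the split-off parts straight is to model them as a small graph and sort it topologically, which gives a valid order for creating and merging the individual PRs. This is just an illustrative sketch with made-up branch names, using Python's standard-library `graphlib`:

```python
from graphlib import TopologicalSorter

# Hypothetical split of the big PR into topic branches.
# An edge means "depends on": e.g. "new-feature" uses a helper
# introduced in "utils-extension", so that PR must be merged first.
deps = {
    "utils-extension": set(),
    "dataset-fix": set(),
    "new-feature": {"utils-extension", "dataset-fix"},
    "tests": {"new-feature"},
}

# A valid order in which to create and merge the split-off PRs:
order = list(TopologicalSorter(deps).static_order())
print(order)
```

If the sorter raises a `CycleError`, two of the parts depend on each other and probably belong in the same PR after all.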
Squashing alone can potentially already clean up a lot, because there are commits which introduce TODO comments that later commits then implement, or cases where I made some change and later rewrote the code to do it differently.
Is there some tool which can automatically suggest which commits are good candidates to squash together, e.g. because they modify code in the same function or in nearby code?
My usual strategy (on smaller PRs) is to first reorder the commits as logically as possible, and then, in a second step, squash related commits together.
When there are other, unrelated changes in the PR, I reorder them to come first. Then I create a new branch (and PR) consisting of this first set of changes, wait for the test suite to pass on the new PR, merge that PR into master, and rebase the main PR onto master. Then I repeat the process. It requires further work whenever the tests do not pass on one of the new individual PRs.
This way of working takes a lot of time, and I often have to wait. The test suite takes about 10 minutes on GitHub CI, so after every step I wait 10 minutes, and most of the time I am just waiting. This often feels very unproductive.
For the reordering, squashing, and potential further changes, I use the Git GUI in PyCharm, interactively rebasing again and again.
Am I doing something suboptimally? How can I do it better?
Are there other tools which can help me somehow?
Are there good strategies in general how to deal with this situation?
Maybe I could have avoided the situation by working differently? Usually the branches (PRs) are much smaller, which makes them much simpler to handle. Should I actively try to keep the changes in a branch minimal, so that I don't end up in such a situation?
I guess one thing which could prevent such a situation is reducing the amount of technical debt in the project. That debt is one of the reasons it was not clear from the beginning what was actually needed.
Actually, after formulating all this, I remembered that I already asked a similar question on StackOverflow: "How to find pairs/groups of most related commits" (my memory is bad...).
And I actually also already implemented a similar script here; the algorithm is described here. The algorithm is quadratic in the number of commits and somewhat slow, so maybe it is already too slow here.
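One way to avoid the fully quadratic pairwise comparison (this is a sketch of a general technique, not of the linked script) is an inverted index from file to the commits touching it, so that only commits sharing at least one file are ever compared:

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(commit_files: dict):
    """Return only those commit pairs that share at least one
    touched file; all other pairs have similarity 0 anyway and
    never need a real comparison."""
    by_file = defaultdict(list)
    for commit, files in commit_files.items():
        for f in files:
            by_file[f].append(commit)
    pairs = set()
    for commits in by_file.values():
        pairs.update(combinations(commits, 2))
    return pairs

history = {
    "a1": {"util.py", "net.py"},
    "b2": {"util.py"},
    "c3": {"dataset.py"},
}
print(candidate_pairs(history))  # only ("a1", "b2") needs a real comparison
```

The worst case (one file touched by every commit) is still quadratic, but typical histories are much sparser, so in practice this prunes most pairs.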
(Note: this is a cross post from Reddit, but so far there has been no really good solution.)
(I asked before whether dev.to is a good fit for a question like this. I usually use StackOverflow, but they usually do not want such opinion-based questions.)