Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
list: improve performance with many topics
On a test repo with 100 topic branches: $ git init test $ cd test $ git commit --allow-empty -m root $ git topics setup $ for i in $(seq 1 100); do git topics start $i; done Before: $ time git topics > /dev/null real 0m2.344s user 0m0.547s sys 0m0.457s After: $ time git topics > /dev/null real 0m0.285s user 0m0.084s sys 0m0.071s There are many subtle things to talk about here... This has long been a pain point for me on certain problematic repos, in particular because `git topics list --all --porcelain` is used by tab completion. While I can easily imagine more "specialized" ways of improving performance for tab completion, the first thing to attack was the `list` command itself. The core problem was that the previous algorithm would loop over every topic branch and invoke `git merge-base` (multiple times!) to determine where that branch was merged. Essentially, this amounted to O(n) calls in the number of topics, and each call is a relatively expensive operation. The fix introduced by this commit instead makes cleverer use of `git for-each-ref` so that the heavy lifting is done by its --merged & --no-merged flags using O(1) calls in the number of topics. While there are certainly fewer overall calls to git subcommands, I'm not 100% clear on why for-each-ref is more efficient than merge-base. Looking at the implementations as of this writing (circa git v2.20.0-rc1), merge-base uses commit-reach.c whereas for-each-ref uses ref-filter.c. The algorithms are hairier than I can be bothered to pick apart right now, but I suspect they're probably quite different. At any rate, the proof is in the pudding, as seen in the above benchmark. Changing this implementation meant reevaluating certain details of the original approach. In particular, the logic formerly contained in the `not_a_topic` function changes in two significant ways: 1. Now that we aren't calling merge-base on each individual branch, we won't be able to recognize the interesting edge case of orphan branches. Using `git checkout --orphan`, it's possible to create an entirely separate root commit such that a branch will have no ancestors in common with 'master' - so it's not *really* a topic branch. This is generally uncommon, though of course git.git exercises such strange cases, as in its 'todo' branch. All this change means is that branches like 'todo' will show up as unmerged topics. I figure that's not too terrible, given the (perceived?) infrequency of this situation. Plus, I'm in good company: even `git branch --no-merged` would have the same output, since it uses for-each-ref & ref-filter.c underneath. One possible workaround is to use the --contains flag to for-each-ref. If you could identify the root commit of the 'master' branch, you could make sure to only list refs that contained that commit. However, the ways I could think to find a suitable commit all seem hackish: * `git rev-list --max-parents=0` gives you *all* the root commits, so you'd still have to figure out which one belongs to 'master'. * `git rev-list --reverse --max-count=1` just gives you the HEAD commit, since the max count is applied *before* reversing the list. I guess you could just pipe it to `tail -1`, but that sort of makes me wrinkle my nose (strong argument, I know). * Might be able to just call --contains with the latest version tag from 'master', thus listing topics forked since the last release. However, this still breaks on the base case, when you haven't done a `git topics release` on the repo yet. Come to think of it, it even breaks if you just started a topic then released without finishing that topic yet. Never mind the fragility introduced by manual tagging and such. So I'll hold off on filtering orphans on the YAGNI assumption. 2. The old implementation had a special `case` clause to filter out refs/*/HEAD. I basically did this because I didn't realize where refs/origin/HEAD was coming from before in my own repos. GitHub sets this when you switch its "default branch", and several of my projects had it set to 'develop'. The real *underlying* thing I think we want to avoid is just any symref in general. We can accomplish this easily using one of the builtin atoms in for-each-ref --format. So really, this is an algorithmic improvement: instead of hard-coding HEAD, we avoid listing any symbolic refs (on the assumption that the concrete ref will be listed regardless). Finally, there has been an interesting performance *regression* when I tried this change out idly on a clone of the git.git repo. Before: $ time git topics list -s - pu real 0m1.059s user 0m0.748s sys 0m0.110s After: $ time git topics list -s - pu - todo real 0m2.822s user 0m2.430s sys 0m0.205s I have yet to be able to reproduce what exactly is causing this. There are few refs to loop through, so it seems to be an interesting case. My first thought was that the commit history is very long on 'master', so checking --merged was maybe slower in that case. However, I have not been able to duplicate this in a vacuum. The moral of the story, as it ever is with performance issues, is that I'll have to keep my eyes peeled for cases that are palpably slow. In the meantime, this commit seems to give a substantial improvement to my current real-world examples.
- Loading branch information