Skip to content

Commit

Permalink
Add experimental 'git survey' builtin (#5174)
Browse files Browse the repository at this point in the history
This introduces `git survey` to Git for Windows ahead of upstream for
the express purpose of getting the path-based analysis in the hands of
more folks.

The inspiration of this builtin is
[`git-sizer`](https://github.com/github/git-sizer), but since that
command relies on `git cat-file --batch` to get the contents of objects,
it has limits to how much information it can provide.

This is mostly a rewrite of the `git survey` builtin that was introduced
into the `microsoft/git` fork in microsoft#667. That version had a
lot more bells and whistles, including an analysis much closer to what
`git-sizer` provides.

The biggest difference in this version is that this one is focused on
using the path-walk API in order to visit batches of objects based on a
common path. This allows identifying, for instance, the path that is
contributing the most to the on-disk size across all versions at that
path.

For example, here are the top ten paths contributing to my local Git
repository (which includes `microsoft/git` and `gitster/git`):

```
TOP FILES BY DISK SIZE
============================================================================
                                    Path | Count | Disk Size | Inflated Size
-----------------------------------------+-------+-----------+--------------
                       whats-cooking.txt |  1373 |  11637459 |      37226854
             t/helper/test-gvfs-protocol |     2 |   6847105 |      17233072
                      git-rebase--helper |     1 |   6027849 |      15269664
                          compat/mingw.c |  6111 |   5194453 |     463466970
             t/helper/test-parse-options |     1 |   3420385 |       8807968
                  t/helper/test-pkt-line |     1 |   3408661 |       8778960
      t/helper/test-dump-untracked-cache |     1 |   3408645 |       8780816
            t/helper/test-dump-fsmonitor |     1 |   3406639 |       8776656
                                po/vi.po |   104 |   1376337 |      51441603
                                po/de.po |   210 |   1360112 |      71198603
```

This kind of analysis has been helpful in identifying the reasons for
growth in a few internal monorepos. Those findings motivated the changes
in #5157 and #5171.

With this early version in Git for Windows, we can expand the reach of
the experimental tool in advance of it being contributed to the upstream
project.

Unfortunately, this will mean that in the next `microsoft/git` rebase,
@jeffhostetler's version will need to be pulled out since there are
enough conflicts. These conflicts include how tables are stored and
generated, as the version in this PR is slightly more general to allow
for different kinds of data.
  • Loading branch information
dscho committed Sep 26, 2024
2 parents 82b9b16 + bf6c6b6 commit 0a67293
Show file tree
Hide file tree
Showing 10 changed files with 1,136 additions and 0 deletions.
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -166,6 +166,7 @@
/git-submodule
/git-submodule--helper
/git-subtree
/git-survey
/git-svn
/git-switch
/git-symbolic-ref
Expand Down
2 changes: 2 additions & 0 deletions Documentation/config.txt
Original file line number Diff line number Diff line change
Expand Up @@ -536,6 +536,8 @@ include::config/status.txt[]

include::config/submodule.txt[]

include::config/survey.txt[]

include::config/tag.txt[]

include::config/tar.txt[]
Expand Down
14 changes: 14 additions & 0 deletions Documentation/config/survey.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
survey.*::
These variables adjust the default behavior of the `git survey`
command. The intention is that this command could be run in the
background with these options.
+
--
verbose::
This boolean value implies the `--[no-]verbose` option.
progress::
This boolean value implies the `--[no-]progress` option.
top::
This integer value implies `--top=<N>`, specifying the
number of entries in the detail tables.
--
83 changes: 83 additions & 0 deletions Documentation/git-survey.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,83 @@
git-survey(1)
=============

NAME
----
git-survey - EXPERIMENTAL: Measure various repository dimensions of scale

SYNOPSIS
--------
[verse]
(EXPERIMENTAL!) `git survey` <options>

DESCRIPTION
-----------

Survey the repository and measure various dimensions of scale.

As repositories grow to "monorepo" size, certain data shapes can cause
performance problems. `git-survey` attempts to measure and report on
known problem areas.

Ref Selection and Reachable Objects
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this first analysis phase, `git survey` will iterate over the set of
requested branches, tags, and other refs and treewalk over all of the
reachable commits, trees, and blobs and generate various statistics.

OPTIONS
-------

--progress::
Show progress. This is automatically enabled when interactive.

Ref Selection
~~~~~~~~~~~~~

The following options control the set of refs that `git survey` will examine.
By default, `git survey` will look at tags, local branches, and remote refs.
If any of the following options are given, the default set is cleared and
only refs for the given options are added.

--all-refs::
Use all refs. This includes local branches, tags, remote refs,
notes, and stashes. This option overrides all of the following.

--branches::
Add local branches (`refs/heads/`) to the set.

--tags::
Add tags (`refs/tags/`) to the set.

--remotes::
Add remote branches (`refs/remote/`) to the set.

--detached::
Add HEAD to the set.

--other::
Add notes (`refs/notes/`) and stashes (`refs/stash/`) to the set.

OUTPUT
------

By default, `git survey` will print information about the repository in a
human-readable format that includes overviews and tables.

References Summary
~~~~~~~~~~~~~~~~~~

The references summary includes a count of each kind of reference,
including branches, remote refs, and tags (split by "all" and
"annotated").

Reachable Object Summary
~~~~~~~~~~~~~~~~~~~~~~~~

The reachable object summary shows the total number of each kind of Git
object, including tags, commits, trees, and blobs.

GIT
---
Part of the linkgit:git[1] suite
1 change: 1 addition & 0 deletions Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -1318,6 +1318,7 @@ BUILTIN_OBJS += builtin/sparse-checkout.o
BUILTIN_OBJS += builtin/stash.o
BUILTIN_OBJS += builtin/stripspace.o
BUILTIN_OBJS += builtin/submodule--helper.o
BUILTIN_OBJS += builtin/survey.o
BUILTIN_OBJS += builtin/symbolic-ref.o
BUILTIN_OBJS += builtin/tag.o
BUILTIN_OBJS += builtin/unpack-file.o
Expand Down
1 change: 1 addition & 0 deletions builtin.h
Original file line number Diff line number Diff line change
Expand Up @@ -231,6 +231,7 @@ int cmd_sparse_checkout(int argc, const char **argv, const char *prefix, struct
int cmd_status(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_stash(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_stripspace(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_survey(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_submodule__helper(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_switch(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_symbolic_ref(int argc, const char **argv, const char *prefix, struct repository *repo);
Expand Down
Loading

0 comments on commit 0a67293

Please sign in to comment.