Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a pre-commit hook to reject large files #2699

Merged
merged 4 commits into from
Jun 11, 2021

Conversation

gerashegalov
Copy link
Collaborator

@gerashegalov gerashegalov commented Jun 10, 2021

Prevents large files as in #2644.

Contributes to #2420, #2674

Signed-off-by: Gera Shegalov gera@apache.org

Signed-off-by: Gera Shegalov <gera@apache.org>
Signed-off-by: Gera Shegalov <gera@apache.org>
@gerashegalov gerashegalov added the build Related to CI / CD or cleanly building label Jun 10, 2021
@gerashegalov gerashegalov added this to the June 7 - June 18 milestone Jun 10, 2021
@gerashegalov gerashegalov self-assigned this Jun 10, 2021
@gerashegalov
Copy link
Collaborator Author

gerashegalov commented Jun 10, 2021

Developer's use case:

$ fallocate -l 51M tools/src/test/resources/bigfile.txt
$ git add tools/src/test/resources/bigfile.txt
$ git commit -m test

Reject files over 50M....................................................Failed
- hook id: check-added-large-files
- exit code: 1

tools/src/test/resources/bigfile.txt (52224 KB) exceeds 51200 KB.

Even if developers don't enable pre-commit checks and accidentally push PRs with large files
e.g. after git commit -a -m test --no-verify, the CI can catch this by utilizing the same config and explicitly running the same hook in shell independently of git with --all-files flag

$ pre-commit run check-added-large-files --all-files
Reject files over 50M....................................................Failed
- hook id: check-added-large-files
- exit code: 1

tools/src/test/resources/bigfile.txt (52224 KB) exceeds 51200 KB.

@abellina
Copy link
Collaborator

This is a nice change, my only question is about the enforce-all argument:

--enforce-all - Check all listed files not just those staged for addition.

Does it mean if I have a file that I didn't run git add for, that it would fail? Wonder if it we want this behavior.

@revans2
Copy link
Collaborator

revans2 commented Jun 10, 2021

I really dislike the 50 MiB limit. We should not be checking anything in that big. I would much rather see a 1.5 MiB limit, and allow the files we already have in place to stay.

@gerashegalov
Copy link
Collaborator Author

I really dislike the 50 MiB limit. We should not be checking anything in that big. I would much rather see a 1.5 MiB limit, and allow the files we already have in place to stay.

@revans2 I agree with you. The number comes from the ticket and the current state of the discussion. My main point is to demonstrate how to reuse the same check on the dev machines and CI. We can converge on the size limit here. The default in this hook is 500KiB. 1.5MiB sounds also reasonable to me.

@gerashegalov
Copy link
Collaborator Author

gerashegalov commented Jun 10, 2021

Does it mean if I have a file that I didn't run git add for, that it would fail? Wonder if it we want this behavior.

@abellina no this is only for the use CI use-case where the hook is run outside git with --all-files flag. If you run just git commit locally it does not affect existing files.

However, if we converge on @revans2 's proposal to lower the limit to 1.5MiB for new files without checking existing files I need to make sure that we touch only some safe suffix of the git log: e,g ccd6018 breaks:

$ pre-commit run check-added-large-files --all-files --from-ref ccd6018ff2ee194313389d375eaedcf3388e9226 --to-ref HEAD
Check for file over 1.5MiB...............................................Failed
- hook id: check-added-large-files
- exit code: 1

tools/src/test/resources/spark-events-qualification/join_missing_sql_end (1583 KB) exceeds 1500 KB.
tools/src/test/resources/spark-events-profiling/rapids_join_eventlog2 (2167 KB) exceeds 1500 KB.

but starting with 2f977dc

$ pre-commit run check-added-large-files --all-files --from-ref 2f977dc7367958a6aec0f0bed279e31f683dbe80 --to-ref HEAD
Check for file over 1.5MiB...............................................Passed

is fine.

@gerashegalov gerashegalov changed the title Add pre-commit hook to reject files over 50M Add pre-commit hook to reject large files Jun 10, 2021
@gerashegalov gerashegalov changed the title Add pre-commit hook to reject large files Add a pre-commit hook to reject large files Jun 10, 2021
@abellina
Copy link
Collaborator

sounds good, I'd argue for a limit that allows our current repo to pass CI without any extra sha restrictions (something like 2.5MB).

@gerashegalov
Copy link
Collaborator Author

sounds good, I'd argue for a limit that allows our current repo to pass CI without any extra sha restrictions (something like 2.5MB).

@abellina unfortunately, 2.5MiB won't do it per @revans2's comment #2674 (comment)

$ git ls-files | xargs ls -alhS | head -3
-rw-rw-r-- 1 gshegalov gshegalov   44M Jun  9 11:00 tools/src/test/resources/spark-events-qualification/nds_q86_fail_test
-rw-rw-r-- 1 gshegalov gshegalov   44M Jun  7 17:09 tools/src/test/resources/spark-events-qualification/nds_q86_test
-rw-rw-r-- 1 gshegalov gshegalov  4.6M Jun  7 17:09 tools/src/test/resources/spark-events-qualification/dsAndDf_eventlog

@revans2
Copy link
Collaborator

revans2 commented Jun 10, 2021

sounds good, I'd argue for a limit that allows our current repo to pass CI without any extra sha restrictions (something like 2.5MB).

What files are that big?

$ find . -type f | xargs ls -lSh | head -10
-rw-rw-r-- 1 user user   44M Jun  9 16:10 ./tools/src/test/resources/spark-events-qualification/nds_q86_fail_test
-rw-rw-r-- 1 user user   44M Jun  9 16:10 ./tools/src/test/resources/spark-events-qualification/nds_q86_test
-rw-rw-r-- 1 user user  4.6M Jun  9 16:10 ./tools/src/test/resources/spark-events-qualification/dsAndDf_eventlog
-rw-rw-r-- 1 user user  2.2M Jun  9 16:10 ./tools/src/test/resources/spark-events-profiling/rapids_join_eventlog
-rw-rw-r-- 1 user user  2.2M Jun 10 13:11 ./tools/src/test/resources/spark-events-profiling/rapids_join_eventlog2
-rw-rw-r-- 1 user user  1.7M Jun  9 16:10 ./tools/src/test/resources/spark-events-profiling/malformed_json_eventlog
-rw-rw-r-- 1 user user  1.7M Jun  9 16:10 ./tools/src/test/resources/spark-events-profiling/rp_sql_eventlog
-rw-rw-r-- 1 user user  1.6M Jun 10 13:11 ./tools/src/test/resources/spark-events-qualification/join_missing_sql_end
-rw-r--r-- 1 user user  1.1M Aug 21  2020 ./integration_tests/src/test/resources/Acquisition_2007Q3.txt
-rw-rw-r-- 1 user user 1003K Jun 10 13:11 ./tools/src/test/resources/spark-events-profiling/tasks_executors_fail_compressed_eventlog.zstd

It is all of the files that were just added in as a part of the tools tests. Everything before that was small, and I would argue we should not have allowed the tools files in. We should have flagged it early on and someone should have looked at it and thought do we really need a 44M MiB file checked into the repo.

@tgravescs
Copy link
Collaborator

those files can be compressed as well or we can decide to move them out

@revans2
Copy link
Collaborator

revans2 commented Jun 10, 2021

those files can be compressed as well or we can decide to move them out

That does not fix the problem. Once they are in the repo unless we do surgery to the repo, like we just did, they will be downloaded every time, because the full history is downloaded for each clone. The point of this patch is not to fix what was already done. The point of this is to stop it from ever happening again in the future and making the problem worse

@abellina
Copy link
Collaborator

Never mind (re 2.5MB). I was assuming the output from an earlier comment was all inclusive.

@gerashegalov
Copy link
Collaborator Author

those files can be compressed as well or we can decide to move them out

That does not fix the problem. Once they are in the repo unless we do surgery to the repo, like we just did, they will be downloaded every time, because the full history is downloaded for each clone. The point of this patch is not to fix what was already done. The point of this is to stop it from ever happening again in the future and making the problem worse

If the CI installs pre-commit and adds

pre-commit run check-added-large-files --all-files --from-ref 2f977dc7367958a6aec0f0bed279e31f683dbe80 --to-ref HEAD

we should be good.

Developers should install pre-commit too which will add automatic copyright updates as a bonus

@gerashegalov
Copy link
Collaborator Author

build

@gerashegalov gerashegalov merged commit 80383bb into NVIDIA:branch-21.08 Jun 11, 2021
@gerashegalov gerashegalov deleted the fileSizePrecommit branch June 11, 2021 15:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build Related to CI / CD or cleanly building
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants