Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cksum: Accept non UTF-8 inputs #6603

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

RenjiSann
Copy link
Contributor

Fixes #6573

@RenjiSann RenjiSann force-pushed the renji/utf8-cksum-comment branch 6 times, most recently from 8111dc4 to 9eecf25 Compare August 2, 2024 08:58
Copy link

github-actions bot commented Aug 2, 2024

GNU testsuite comparison:

Skipping an intermittent issue tests/tail/inotify-dir-recreate (passes in this run but fails in the 'main' branch)

@RenjiSann RenjiSann force-pushed the renji/utf8-cksum-comment branch 2 times, most recently from b4d46d2 to f2abbae Compare August 3, 2024 10:02
@RenjiSann
Copy link
Contributor Author

RenjiSann commented Aug 17, 2024

@BenWiederhake Could you take a look ? :)

if !last.is_ascii_whitespace() {
break;
}
line_trim = &line_trim[..line_trim.len() - 1];
Copy link
Sponsor Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you please document what this line is doing, it isn't obvious

Copy link
Contributor Author

@RenjiSann RenjiSann Aug 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a simple implementation of the trim_ascii() function which is not available for MSRV 1.70.

The behavior is extracted in its own documented function in #6654

@@ -1277,3 +1277,27 @@ fn test_non_utf8_filename() {
.stdout_is_bytes(b"SHA256 (funky\xffname) = e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855\n")
.no_stderr();
}

#[cfg(target_os = "linux")]
Copy link
Sponsor Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why linux only ?

could you please to add a test to inject invalid utf8 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why linux only ?

This is somewhat related to this.

TLDR: With the current MSRV being 1.70, on windows. there is no way to translate between OsStr and &[u8] without going through str (this holds for owned types as well).

could you please to add a test to inject invalid utf8 ?

This is what the test should be doing. Using non UTF-8 characters that would make the old implementation fail because it would try to create Strings from these characters.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no way to translate between OsStr and &[u8]

Yes there is: os_str_as_bytes, just like you also use in the crate itself.

This is what the test should be doing.

No, the test does not check for correct handling of files with non-UTF-8 names.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to be sure I get it correctly. In the added test, I should check that the hashes variable get written in a file with a non-UTF-8 name, is this correct ?

If yes, I might need to refactor slightly the test framework in order to be able to write to files with non-UTF-8 names.

@RenjiSann
Copy link
Contributor Author

@sylvestre all your comments are addressed in #6654. If that's still an issue, please tell me, I will refactor the code accordingly.

Copy link
Collaborator

@BenWiederhake BenWiederhake left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused.

So which PR would you like us to review?

@@ -1277,3 +1277,27 @@ fn test_non_utf8_filename() {
.stdout_is_bytes(b"SHA256 (funky\xffname) = e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855\n")
.no_stderr();
}

#[cfg(target_os = "linux")]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no way to translate between OsStr and &[u8]

Yes there is: os_str_as_bytes, just like you also use in the crate itself.

This is what the test should be doing.

No, the test does not check for correct handling of files with non-UTF-8 names.

@RenjiSann
Copy link
Contributor Author

The other PR doesn't address sylvestre's points either (e.g. the linux-only issue)

The test is linux-only because os_str_as_bytes will automatically fail if given non-UTF-8 characters on windows.
This is due to the fact that there is no safe provided way to convert a OsStr to &[u8] on windows, and we are required to go through an intermediary &str conversion.

Improving the handling of windows filesystem exceptions this might be a further enhancement, but I don't think this should be the main point of this PR.

and seems to be a collection of many different PRs, which makes it unnecessarily difficult to review,

True, #6654 is a really big refactor, which solves many problems at once. I think about re-organizing all the changes in smaller PRs, while trying to keep the reasoning clear.

@RenjiSann
Copy link
Contributor Author

So which PR would you like us to review?

Let's review and merge this one first, I eventually added the changes requested which I originally put in #6654.
Next, I will work on splitting #6654 into reasonably-sized changes, and try to document things correctly

@RenjiSann RenjiSann force-pushed the renji/utf8-cksum-comment branch 3 times, most recently from 8d2ea86 to 3533b6e Compare September 17, 2024 16:07
@RenjiSann
Copy link
Contributor Author

As requested, I changed the code to handle non-UTF8 filenames, and I added a test for it.

Previously, I used String::from_utf8_lossy to display the filename on the terminal when needed. I realized this was wrong, because it actually inserts a REPLACEMENT CHARACTER U+FFFD for any non-UTF-8 sequence, when we actually want to omit them completely.
At first, I tried to copy-paste the implementation of String::from_utf8_lossy and adapt it, but I figured out, because of unstabilized internals, that this could not be done without copy-pasting even more code from the stdlib.

The compromise I went with is to just remove the U+FFFD chars from the output of String::from_utf8_lossy. It might produce unwanted behavior in case the character appears in the original filename, but I consider this to be unlikely enough to wait to fix it when the involved String's API is stable (MSRV 1.79 iirc).

Copy link

GNU testsuite comparison:

Skip an intermittent issue tests/tail/inotify-dir-recreate (fails in this run but passes in the 'main' branch)

- add a test for non UTF-8 chars in comments
- add test for non-UTF-8 chars in filenames
Copy link

GNU testsuite comparison:

Skipping an intermittent issue tests/rm/rm1 (passes in this run but fails in the 'main' branch)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cksum: --check can't handle non-UTF-8 in comments
3 participants