Fix handling of non-ASCII characters in archive entry file names #18448

fmeum · 2023-05-19T08:26:00Z

When creating a PathFragment from a ZIP or TAR entry file name, the raw bytes of the name are now wrapped into a Latin-1 encoded String, which is how Bazel internally represents file paths.

Previously, ZIP entries as well as TAR entries with PAX headers would result in ordinary decoded Java strings, resulting in corrupted file names when passed to Bazel's file system operations.

Fixes #12986
Fixes bazelbuild/rules_go#2771

fmeum · 2023-05-19T08:37:50Z

src/main/java/com/google/devtools/build/lib/bazel/repository/CompressedTarFunction.java

+      // USTAR tar headers use an unspecified encoding whereas PAX tar headers always use UTF-8.
+      // Since this would result in ambiguity with Bazel forcing the default JVM encoding to be
+      // ISO-8859-1, we explicitly specify the encoding as UTF-8.
+      TarArchiveInputStream tarStream = new TarArchiveInputStream(decompressorStream, UTF_8.name());


There doesn't seem to be an easy way to keep reading legacy tars in Latin-1 (that is, raw bytes) while still being able to distinguish them from UTF-8 names.

In order to preserve backwards compatibility with tars that use non-ASCII, non-UTF-8 file names without PAX headers, I could inject a custom Charset via CharsetProvider that somehow marks all strings it produces in a way that lets us distinguish them from those created by the default UTF-8 charset (e.g. via a WeakIdentityHashMap). Since that would definitely be a hack, I didn't want to do it without getting feedback on this point first.

@lberki Can you take a look here? I'm not quite familiar with encoding magic.

cc @tetromino (since he's been thnking about Unicode recently)

My professional opinion is that I don't know. What would his hack be useful for? I thought that what this change was doing was to parse the UTF-8 data into proper Unicode then convert that to Bazel's weird "UTF-8 bytes in ISO8859-1" encoding, but that seems doable without resorting to somehow tagging the Strings produced by it by (essentially) doing "if valid UTF-8, decode it as such, otherwise go directly to the weird Bazel encoding".

It's more messy than that: The Latin-1 trick only works if all strings are encoded in Latin-1 from source to sink, otherwise the "opaque bag of bytes" property is irrecoverably lost.

But commons-compress breaks this invariant: While tar ustar headers respect the default string encoding (which can also be set via the constructor), the more modern tar pax headers always use UTF-8.

Since Bazel can't determine which encoding was used to produce the file names returned by commons-compress, I found the only non-hacky option to be to force UTF-8 for ustar headers as well, decode UTF-8 and then reencode as Latin-1.

As a result, this change would break users of tars with non-UTF-8 ustar headers. Previously, these were handled correctly via the Latin-1 trick (but any pax header with non-ASCII chars were broken), but now they would be decoded as UTF-8 and thus turn into garbage.

Ideally we could tweak commons-compress to return byte arrays or at least tell us which type of header it parser, but I didn't find a way to do this. We could raise an issue with them, hack around this limitation by tracking the strings via a custom charset or accept that life isn't fair and UTF-8 deserves to be the winner. ;⁠-⁠)

Well, actually, it's even more messy than that, if I understand correctly: the nice thing about the Latin-1 hack is that it works as long as the encoding on the input is the same as the encoding on the output. However, apparently, pax headers insist that they are UTF-8, which means that they constrain the output of Bazel to UTF-8 also.

Now I see where you are coming from: commons-compress doesn't tell whether there are pax headers and thus it doesn't tell you whether UTF-8 decoding was required. So what you could do is to keep track of Strings that were emitted by the Charset, thus you know which Strings went through the UTF-8 decoding and thus undo it. Is this correct?

I don't claim any knowledge of either commons-compress or the intricacies of tar files, but TarArchiveEntry seems to have some methods that have Pax in their name. Maybe those could be used to disambiguate instead of the hack you proposed?

@tetromino Friendly ping

@lberki Since I wanted to know whether the hack would work, I went ahead and implemented it. It does seem to work and I didn't even have to use a weak cache.

@lberki @tetromino Friendly ping, I would really like to get this into Bazel 7. We currently have to maintain a number of hacks in rules_go just to be able to extract a Go SDK.

My gut reaction to the encoding hack was pretty negative. Then I reread the code and description a couple times, and decided that the hack is only ugly because the apache api is insufficiently flexible for bad clients like us that want to do weird things with encodings. I'm ok with the hack provided it is sufficiently explained in comments, and doesn't leak outside this file.

I would like us to be pretty explicit in comments about what the format of the input headers and returned string are. I.e., Javadoc for the class CompressedTarFunction or its decompress() method, saying that USTAR headers are interpreted as latin-1, PAX headers are interpetted as UTF-8, and both types are turned into Java Strings holding the UTF-8 code units that represent the string content, in order to match our string behavior elsewhere.

One more question. We eventually want to fix Bazel's decoding of Starlark source files to be UTF-8 instead of latin-1. Does anything in this PR make that harder? It seems to me this PR is just preserving an existing behavior (USTAR) while fixing an erroneous one (PAX).

I added more extensive comments, including Javadocs for the class.

One more question. We eventually want to fix Bazel's decoding of Starlark source files to be UTF-8 instead of latin-1. Does anything in this PR make that harder? It seems to me this PR is just preserving an existing behavior (USTAR) while fixing an erroneous one (PAX).

I don't think that it would make this harder. It does introduce one or two new places in which ISO_8859_1 would need to be replaced with UTF-8, but those are only related to prefix stripping, not the actual entry decoding logic that interfaces with commons-compress and relies on the custom decoder hack.

lberki · 2023-09-25T11:54:14Z

@tetromino is not available these days. I don't have any context on this pull request so I wouldn't be able to give it the attention it deserves, but let me try to find someone who has all the bits swapped into their brain.

src/main/java/com/google/devtools/build/lib/bazel/repository/BUILD

src/test/shell/bazel/bazel_workspaces_test.sh

src/main/java/com/google/devtools/build/lib/bazel/repository/DecompressorValue.java

When creating a `PathFragment` from a ZIP or TAR entry file name, the raw bytes of the name are now wrapped into a Latin-1 encoded String, which is how Bazel internally represents file paths. Previously, ZIP entries as well as TAR entries with PAX headers would result in ordinary decoded Java strings, resulting in corrupted file names when passed to Bazel's file system operations.

brandjon · 2023-09-28T16:25:33Z

Thanks for the additional javadoc/comments.

fmeum · 2023-09-28T16:39:27Z

@bazel-io flag

fmeum · 2023-10-06T09:36:07Z

@brandjon Is this still on track for getting merged?

meteorcloudy · 2023-10-06T15:29:13Z

@brandjon is still fixing some internal breakages.

fmeum · 2023-10-06T15:52:26Z

@brandjon is still fixing some internal breakages.

Thanks for the update. I didn't know this code is used at Google - isn't it only relevant for external repos? Just curious.

meteorcloudy · 2023-10-06T15:57:20Z

Might be just some irrelevant breakages, Jon is OOO until next Tuesday, I can help merge this.

When creating a `PathFragment` from a ZIP or TAR entry file name, the raw bytes of the name are now wrapped into a Latin-1 encoded String, which is how Bazel internally represents file paths. Previously, ZIP entries as well as TAR entries with PAX headers would result in ordinary decoded Java strings, resulting in corrupted file names when passed to Bazel's file system operations. Fixes bazelbuild#12986 Fixes bazelbuild/rules_go#2771 Closes bazelbuild#18448. PiperOrigin-RevId: 571857847 Change-Id: Ie578724e75ddbefbe05255601b0afab706835f89

…mes (#19765) When creating a `PathFragment` from a ZIP or TAR entry file name, the raw bytes of the name are now wrapped into a Latin-1 encoded String, which is how Bazel internally represents file paths. Previously, ZIP entries as well as TAR entries with PAX headers would result in ordinary decoded Java strings, resulting in corrupted file names when passed to Bazel's file system operations. Fixes #12986 Fixes bazelbuild/rules_go#2771 Closes #18448. PiperOrigin-RevId: 571857847 Change-Id: Ie578724e75ddbefbe05255601b0afab706835f89 Fixes #19671

brandjon · 2023-10-11T14:12:33Z

Sorry, internal presubmits failed and then I went OOO. Thanks Yun for completing the merge!

The fix for Latin-1 encoded files was landed in [6.4.0] after being landed on Bazel [master]. We can conditionally use the old, unhermetic, `tar` workaround on modern Bazel versions. [master]: bazelbuild/bazel#18448 [6.4.0]: bazelbuild/bazel#19765

A feature gate for bazelbuild/bazel#18448

A feature gate for bazelbuild/bazel#18448 Required for bazelbuild/rules_go#3872

The fix for Latin-1 encoded files was landed in [6.4.0] after being landed on Bazel [master]. We can conditionally use the old, unhermetic, `tar` workaround on modern Bazel versions. [master]: bazelbuild/bazel#18448 [6.4.0]: bazelbuild/bazel#19765

* Remove Latin-1 workaround on Bazel 6.4.0+ The fix for Latin-1 encoded files was landed in [6.4.0] after being landed on Bazel [master]. We can conditionally use the old, unhermetic, `tar` workaround on modern Bazel versions. [master]: bazelbuild/bazel#18448 [6.4.0]: bazelbuild/bazel#19765 * Use `versions` rather than `bazel_features` to avoid circular load pattern Due to the polyfilling of `bazel_features` to avoid double step loading in `WORKSPACE`.

fmeum commented May 19, 2023

View reviewed changes

fmeum force-pushed the fix-archive-file-names branch from 754a626 to cef4b5a Compare May 19, 2023 09:04

fmeum marked this pull request as ready for review May 19, 2023 09:17

fmeum requested review from Wyverald and meteorcloudy May 19, 2023 09:17

github-actions bot added awaiting-review PR is awaiting review from an assigned reviewer team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. labels May 19, 2023

fmeum mentioned this pull request May 19, 2023

go_download_sdk: apply extraction workaround to zips on non-windows OSs bazelbuild/rules_go#3563

Merged

meteorcloudy requested a review from lberki June 6, 2023 11:42

fmeum force-pushed the fix-archive-file-names branch from cef4b5a to e408e31 Compare September 5, 2023 06:57

fmeum requested a review from tetromino September 5, 2023 11:10

fmeum force-pushed the fix-archive-file-names branch 5 times, most recently from 20e7202 to ebd3f3c Compare September 6, 2023 08:59

brandjon self-requested a review September 26, 2023 20:14

brandjon self-assigned this Sep 26, 2023

brandjon reviewed Sep 27, 2023

View reviewed changes

fmeum force-pushed the fix-archive-file-names branch from ebd3f3c to c10070b Compare September 27, 2023 12:01

Use custom Charset to preserve non-UTF8 PAX headers

fe1d084

fmeum force-pushed the fix-archive-file-names branch from c10070b to fe1d084 Compare September 27, 2023 12:04

fmeum requested a review from brandjon September 27, 2023 12:20

brandjon approved these changes Sep 28, 2023

View reviewed changes

bazel-io added the potential release blocker Flagged by community members using "@bazel-io flag". Should be added to a release blocker milestone label Sep 28, 2023

meteorcloudy removed the awaiting-review PR is awaiting review from an assigned reviewer label Sep 28, 2023

bazel-io removed the potential release blocker Flagged by community members using "@bazel-io flag". Should be added to a release blocker milestone label Sep 28, 2023

bazel-io mentioned this pull request Sep 28, 2023

[6.4.0] Fix handling of non-ASCII characters in archive entry file names #19671

Closed

lberki approved these changes Oct 6, 2023

View reviewed changes

copybara-service bot closed this in 10169bb Oct 9, 2023

github-actions bot removed the awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally label Oct 9, 2023

fmeum deleted the fix-archive-file-names branch October 9, 2023 07:46

fmeum mentioned this pull request Oct 9, 2023

[6.4.0] Fix handling of non-ASCII characters in archive entry file names #19765

Merged

mattyclarkson mentioned this pull request Feb 24, 2024

Remove Latin-1 workaround on Bazel 6.4.0+ bazelbuild/rules_go#3872

Merged

mattyclarkson added a commit to mattyclarkson/bazel_features that referenced this pull request Feb 24, 2024

feat: add extract_supports_unicode_filenames

4394c76

A feature gate for bazelbuild/bazel#18448

mattyclarkson mentioned this pull request Feb 24, 2024

feat: add extract_supports_unicode_filenames bazel-contrib/bazel_features#41

Merged

fmeum pushed a commit to bazel-contrib/bazel_features that referenced this pull request Feb 24, 2024

feat: add extract_supports_unicode_filenames (#41)

61577d7

A feature gate for bazelbuild/bazel#18448 Required for bazelbuild/rules_go#3872

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix handling of non-ASCII characters in archive entry file names #18448

Fix handling of non-ASCII characters in archive entry file names #18448

fmeum commented May 19, 2023 •

edited

Loading

fmeum May 19, 2023 •

edited

Loading

meteorcloudy Jun 6, 2023

lberki Jun 6, 2023

fmeum Jun 6, 2023

lberki Jun 7, 2023

fmeum Sep 4, 2023

fmeum Sep 5, 2023

fmeum Sep 24, 2023

brandjon Sep 27, 2023

fmeum Sep 27, 2023

lberki commented Sep 25, 2023

brandjon commented Sep 28, 2023

fmeum commented Sep 28, 2023

fmeum commented Oct 6, 2023

meteorcloudy commented Oct 6, 2023

fmeum commented Oct 6, 2023 •

edited

Loading

meteorcloudy commented Oct 6, 2023

brandjon commented Oct 11, 2023

Fix handling of non-ASCII characters in archive entry file names #18448

Fix handling of non-ASCII characters in archive entry file names #18448

Conversation

fmeum commented May 19, 2023 • edited Loading

fmeum May 19, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lberki commented Sep 25, 2023

brandjon commented Sep 28, 2023

fmeum commented Sep 28, 2023

fmeum commented Oct 6, 2023

meteorcloudy commented Oct 6, 2023

fmeum commented Oct 6, 2023 • edited Loading

meteorcloudy commented Oct 6, 2023

brandjon commented Oct 11, 2023

fmeum commented May 19, 2023 •

edited

Loading

fmeum May 19, 2023 •

edited

Loading

fmeum commented Oct 6, 2023 •

edited

Loading