TarWriter uses ASCII to write down fields and should use UTF8 instead #75482
Tagging subscribers to this area: @dotnet/area-system-io
Issue Details
All formats use ASCII encoding to write fields, which is unfortunate because a UTF-8 name like "földër" will look garbled when read back.
using System;
using System.Formats.Tar;
using System.IO;
MemoryStream ms = new MemoryStream();
TarWriter writer = new(ms, leaveOpen: true);
GnuTarEntry gnuEntry = new(TarEntryType.Directory, "földër");
writer.WriteEntry(gnuEntry);
writer.Dispose(); // Flushes the archive but leaves ms open.
ms.Position = 0;
TarReader reader = new(ms);
TarEntry readEntry = reader.GetNextEntry();
Console.WriteLine(readEntry.Name); // Prints "f?ld?r".
reader.Dispose();
This is visually mitigated on Pax because UTF-8 encoding is used to write extended attributes and, fortunately, Pax is the default format. The legacy fields on Pax entries do get garbled, but when using .NET APIs we overwrite the legacy fields with the contents of the extended attributes. So AFAIK, the issue in Pax shows only if you look at the bytes of the tar archive:
[Screenshot of the archive bytes in the original post]
cc @carlossanlop @stephentoub @danmoseley @tmds
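For anyone wondering where the "?" characters come from, here is a minimal sketch that is independent of the tar APIs. It assumes the garbling is simply the default replacement behavior of Encoding.ASCII, which substitutes "?" for any char it cannot represent. RoundTrip is a hypothetical helper for illustration, not something in System.Formats.Tar.

```csharp
using System;
using System.Text;

// Hypothetical helper (not part of System.Formats.Tar):
// encodes a string to bytes and decodes it back with the same encoding.
static string RoundTrip(string value, Encoding encoding) =>
    encoding.GetString(encoding.GetBytes(value));

string name = "f\u00F6ld\u00EBr"; // "földër", written with escapes to avoid source-encoding issues

// Encoding.ASCII replaces each non-ASCII char with '?' (0x3F) by default.
string viaAscii = RoundTrip(name, Encoding.ASCII); // "f?ld?r"

// Encoding.UTF8 preserves the original chars.
string viaUtf8 = RoundTrip(name, Encoding.UTF8);   // same as name

Console.WriteLine(viaAscii);
Console.WriteLine(viaUtf8 == name); // True
```

The replacement is lossy: once the name is written as "f?ld?r", no reader can recover the original chars from those bytes.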
|
Should we fix for 7.0? What is the risk of breaking something else if we fix this? |
What is the impact if we don't do this? Does it prevent roundtripping / another tool deserializing tar archives produced by TarWriter? There's a high likelihood file names will contain non-ASCII characters. What do other tools do? Why did we start with ASCII? |
I tried opening an archive created with TarFile.CreateFromDirectory in 7-Zip, and that worked just fine.
I tried BSD tar on WSL Ubuntu with all the supported formats, and all of them encode the name as UTF-8.
For formats other than Pax, containing a name or some other field (such as LinkName) that contains non-ASCII characters, those characters will be garbled as shown in the picture of the original post. |
But they'll be garbled not just if someone inspects the value in a debugger but also if any other non-.NET tool tries to untar them, right? e.g. they'll produce garbage names in the file system?
Sure... and we make it easy to do something other than the default. |
Is there any impact on reading archives using our API, created by other Tar tools, and using non-ASCII metadata? |
This might be as easy as replacing |
Right, even our own |
Then this needs to be fixed for 7.0.
It won't be. Several code paths assume that the input and output lengths will be the same; that's a valid assumption for Encoding.ASCII but not for Encoding.UTF8. |
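To make the length point concrete, here is a small sketch outside the tar code. It assumes the 100-byte size of the legacy name field in the tar header; FitsInField is a hypothetical helper for illustration and is not how TarWriter is implemented.

```csharp
using System;
using System.Text;

// Hypothetical helper: does the encoded value fit in a fixed-size header field?
static bool FitsInField(string value, Encoding encoding, int fieldSize) =>
    encoding.GetByteCount(value) <= fieldSize;

const int NameFieldSize = 100; // size of the legacy name field in a tar header

string name = "f\u00F6ld\u00EBr"; // "földër": 6 chars
int asciiBytes = Encoding.ASCII.GetByteCount(name); // 6: one byte per char
int utf8Bytes = Encoding.UTF8.GetByteCount(name);   // 8: 'ö' and 'ë' take 2 bytes each

Console.WriteLine($"chars={name.Length} ascii={asciiBytes} utf8={utf8Bytes}");

// 60 chars fit a 100-byte field when counted as ASCII, but need 120 bytes as UTF-8.
string longName = new string('\u00E9', 60); // 'é' repeated 60 times
Console.WriteLine(FitsInField(longName, Encoding.ASCII, NameFieldSize)); // True
Console.WriteLine(FitsInField(longName, Encoding.UTF8, NameFieldSize));  // False
```

Any check of the form "string length <= field size" is only correct for ASCII; with UTF-8 the comparison has to use the encoded byte count.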
You'll want to include chars that encode as more than 2 bytes, too, not just
We should have tests for every string in our API, both reading and writing, that use non-ASCII chars. I guess that includes entry name, link name, and extended attributes, and possibly user/group, although those may be limited by POSIX to a subset of ASCII. We'll need to do interop tests for each value in each format. |
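As a rough starting point for such test data, the sketch below lists one sample string per UTF-8 sequence length (1 to 4 bytes). The sample values are my own picks for illustration, not taken from the existing test suite.

```csharp
using System;
using System.Text;

// Hypothetical sample values, one per UTF-8 sequence length.
string[] samples =
{
    "a",          // U+0061: 1 byte
    "\u00F6",     // U+00F6 (ö): 2 bytes
    "\u20AC",     // U+20AC (euro sign): 3 bytes
    "\U0001F600", // U+1F600 (emoji): 4 bytes, and 2 UTF-16 code units
};

foreach (string s in samples)
{
    byte[] bytes = Encoding.UTF8.GetBytes(s);
    bool roundTrips = Encoding.UTF8.GetString(bytes) == s;
    Console.WriteLine($"chars={s.Length} utf8Bytes={bytes.Length} roundTrips={roundTrips}");
}
```

The 4-byte case is worth including because its string Length is 2, so neither the char count nor twice the char count matches the encoded size.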