Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TarWriter uses ASCII to write down fields and should use UTF8 instead #75482

Closed
jozkee opened this issue Sep 12, 2022 · 10 comments · Fixed by #75902
Closed

TarWriter uses ASCII to write down fields and should use UTF8 instead #75482

jozkee opened this issue Sep 12, 2022 · 10 comments · Fixed by #75902

Comments

@jozkee
Copy link
Member

jozkee commented Sep 12, 2022

All formats use ASCII encoding to write down fields, which is unfortunate because a UTF8 name like "földër" will look garbled when read back.

MemoryStream ms = new MemoryStream();
TarWriter writer = new(ms, leaveOpen: true);
            
GnuTarEntry gnuEntry = new(TarEntryType.Directory, "földër");
writer.WriteEntry(gnuEntry);

writer.Dispose();

ms.Position = 0;
TarReader reader = new(ms);
TarEntry readEntry = reader.GetNextEntry();
Console.WriteLine(readEntry.Name); // Prints "f?ld?r".
reader.Dispose();

This is visually mitigated on Pax because UTF8 encoding is used to write down extended attributes and fortunately, that's the default format. However, legacy fields on Pax entries do get garbled but when using .NET APIs, we overwrite the legacy fields with the contents of the extended attributes. So AFAIK, the issue in pax shows only if you look at the bytes of the tar archive:
image

cc @carlossanlop @stephentoub @danmoseley @tmds

@jozkee jozkee added this to the 8.0.0 milestone Sep 12, 2022
@ghost
Copy link

ghost commented Sep 12, 2022

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.

Issue Details

All formats use ASCII encoding to write down fields, which is unfortunate because a UTF8 name like "földër" will look garbled when read back.

MemoryStream ms = new MemoryStream();
TarWriter writer = new(ms, leaveOpen: true);
            
GnuTarEntry gnuEntry = new(TarEntryType.Directory, "földër");
writer.WriteEntry(gnuEntry);

writer.Dispose();

ms.Position = 0;
TarReader reader = new(ms);
TarEntry readEntry = reader.GetNextEntry();
Console.WriteLine(readEntry.Name); // Prints "f?ld?r".
reader.Dispose();

This is visually mitigated on Pax because UTF8 encoding is used to write down extended attributes and fortunately, that's the default format. However, legacy fields on Pax entries do get garbled but when using .NET APIs, we overwrite the legacy fields with the contents of the extended attributes. So AFAIK, the issue in pax shows only if you look at the bytes of the tar archive:
image

cc @carlossanlop @stephentoub @danmoseley @tmds

Author: Jozkee
Assignees: -
Labels:

area-System.IO

Milestone: 8.0.0

@danmoseley
Copy link
Member

Should we fix for 7.0? What is the risk of breaking something else if we fix this?

@stephentoub
Copy link
Member

stephentoub commented Sep 12, 2022

#75373 (comment)

What is the impact if we don't do this? Does it prevent roundtripping / another tool deserializing tar archives produced by TarWriter?

There's a high liklihood file names will contain non-ASCII characters.

What do other tools do? Why did we start with ASCII?

@jozkee
Copy link
Member Author

jozkee commented Sep 12, 2022

Does it prevent roundtripping / another tool deserializing tar archives produced by TarWriter?

I tried opening with 7Zip an archive created with TarFile.CreateFromDirectory and that worked just fine.

What do other tools do?

I tried BDS Tar on WSL Ubuntu with all the supported formats and all of them encode the name as UTF8.

What is the impact if we don't do this?

For formats other than Pax, containing a name or some other field (such as LinkName) that contains non-ASCII characters, those characters will be garbled as shown in the picture of the original post.
Keep in mind that Pax is the default.

@stephentoub
Copy link
Member

stephentoub commented Sep 12, 2022

For formats other than Pax, containing a name or some other field (such as LinkName) that contains non-ASCII characters, those characters will be garbled as shown in the picture of the original post.

But they'll be garbled not just if someone inspects the value in a debugger but also if any other non-.NET tool tries to untar them, right? e.g. they'll produce garbage names in the file system?

Keep in mind that Pax is the default.

Sure... and we make it easy to do something other than the default.

@danmoseley
Copy link
Member

Is there any impact on reading archives using our API, created by other Tar tools, and using non-ASCII metadata?

@jozkee
Copy link
Member Author

jozkee commented Sep 12, 2022

What is the risk of breaking something else if we fix this?

This might be as easy as replacing Encoding.ASCII for Encoding.UTF8 and nothig else.

@jozkee
Copy link
Member Author

jozkee commented Sep 12, 2022

they'll produce garbage names in the file system?

Right, even our own TarReader will read them back as garbled as I show in the snippet in the original post.

@stephentoub
Copy link
Member

they'll produce garbage names in the file system?

Right

Then this needs to be fixed for 7.0.

This might be as easy as replacing Encoding.ASCII for Encoding.UTF8 and nothig else.

It won't be. Several code paths assume that the input and output lengths will be the same; that's a valid assumption for Encoding.ASCII but not for Encoding.UTF8.

@jozkee jozkee modified the milestones: 8.0.0, 7.0.0 Sep 12, 2022
@danmoseley
Copy link
Member

danmoseley commented Sep 12, 2022

You'll want to include chars that encode as more than 2 bytes, too, not just földër (föld€r ? )

We should have tests for every string in our API both reading and writing that use non ASCII chars -- I guess that includes entry name and link name, and extended attributes, possibly user/group although they may be limited by POSIX to a subset of ASCII. We'll need to do interop tests for each value in each format.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Sep 20, 2022
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Sep 28, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Oct 28, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants