TarWriter uses ASCII to write down fields and should use UTF8 instead #75482

jozkee · 2022-09-12T20:45:31Z

All formats use ASCII encoding to write down fields, which is unfortunate because a UTF8 name like "földër" will look garbled when read back.

MemoryStream ms = new MemoryStream();
TarWriter writer = new(ms, leaveOpen: true);
            
GnuTarEntry gnuEntry = new(TarEntryType.Directory, "földër");
writer.WriteEntry(gnuEntry);

writer.Dispose();

ms.Position = 0;
TarReader reader = new(ms);
TarEntry readEntry = reader.GetNextEntry();
Console.WriteLine(readEntry.Name); // Prints "f?ld?r".
reader.Dispose();
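
The "?" characters come from how ASCII encoding treats characters outside its range. A minimal sketch, independent of the Tar APIs, of what happens to the name when it goes through Encoding.ASCII versus Encoding.UTF8 (the variable names are illustrative only):

```csharp
using System;
using System.Text;

// "földër", written with escapes so the source file encoding does not matter.
string name = "f\u00F6ld\u00EBr";

// Encoding.ASCII cannot represent 'ö' or 'ë'. Its default fallback replaces
// each of them with '?' (0x3F), so the original characters are lost.
byte[] asciiBytes = Encoding.ASCII.GetBytes(name);
string asciiRoundTrip = Encoding.ASCII.GetString(asciiBytes);

// Encoding.UTF8 encodes the same characters as multi-byte sequences,
// so decoding gives back the original string.
byte[] utf8Bytes = Encoding.UTF8.GetBytes(name);
string utf8RoundTrip = Encoding.UTF8.GetString(utf8Bytes);

Console.WriteLine(asciiRoundTrip);        // f?ld?r
Console.WriteLine(utf8RoundTrip == name); // True
```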

This is visually mitigated on Pax, which fortunately is the default format, because UTF8 encoding is used to write the extended attributes. The legacy fields on Pax entries do get garbled, but when reading with .NET APIs we overwrite the legacy fields with the contents of the extended attributes. So AFAIK, the issue in Pax shows only if you look at the raw bytes of the tar archive.

cc @carlossanlop @stephentoub @danmoseley @tmds

ghost · 2022-09-12T20:45:44Z

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.


Author:	Jozkee
Assignees:	-
Labels:	`area-System.IO`
Milestone:	8.0.0

danmoseley · 2022-09-12T21:01:24Z

Should we fix for 7.0? What is the risk of breaking something else if we fix this?

stephentoub · 2022-09-12T21:05:00Z

#75373 (comment)

What is the impact if we don't do this? Does it prevent roundtripping / another tool deserializing tar archives produced by TarWriter?

There's a high likelihood file names will contain non-ASCII characters.

What do other tools do? Why did we start with ASCII?

jozkee · 2022-09-12T21:10:52Z

> Does it prevent roundtripping / another tool deserializing tar archives produced by TarWriter?

I tried opening an archive created with TarFile.CreateFromDirectory in 7-Zip, and that worked just fine.

> What do other tools do?

I tried BSD tar on WSL Ubuntu with all the supported formats, and all of them encode the name as UTF8.

> What is the impact if we don't do this?

For formats other than Pax, if a name or some other field (such as LinkName) contains non-ASCII characters, those characters will be garbled, as shown in the picture of the original post.
Keep in mind that Pax is the default.

stephentoub · 2022-09-12T21:13:10Z

> For formats other than Pax, if a name or some other field (such as LinkName) contains non-ASCII characters, those characters will be garbled, as shown in the picture of the original post.

But they'll be garbled not just if someone inspects the value in a debugger but also if any other non-.NET tool tries to untar them, right? e.g. they'll produce garbage names in the file system?

> Keep in mind that Pax is the default.

Sure... and we make it easy to do something other than the default.

danmoseley · 2022-09-12T21:13:45Z

Is there any impact on reading archives using our API, created by other Tar tools, and using non-ASCII metadata?

jozkee · 2022-09-12T21:13:50Z

> What is the risk of breaking something else if we fix this?

This might be as easy as replacing Encoding.ASCII with Encoding.UTF8 and nothing else.

jozkee · 2022-09-12T21:14:48Z

> they'll produce garbage names in the file system?

Right, even our own TarReader will read them back garbled, as the snippet in the original post shows.

stephentoub · 2022-09-12T21:15:37Z

> > they'll produce garbage names in the file system?
>
> Right

Then this needs to be fixed for 7.0.

> This might be as easy as replacing Encoding.ASCII with Encoding.UTF8 and nothing else.

It won't be. Several code paths assume that the input and output lengths will be the same; that's a valid assumption for Encoding.ASCII but not for Encoding.UTF8.
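
A minimal sketch of the length mismatch, plus one way a writer could fit a name into a fixed-size field without cutting a character in half. TruncateToUtf8ByteLimit is a hypothetical helper written for this illustration; it is not the code from the actual fix.

```csharp
using System;
using System.Text;

string name = "f\u00F6ld\u00EBr"; // "földër"

// With ASCII, byte count always equals char count. With UTF8 it may not.
Console.WriteLine(name.Length);                       // 6
Console.WriteLine(Encoding.ASCII.GetByteCount(name)); // 6
Console.WriteLine(Encoding.UTF8.GetByteCount(name));  // 8

// Hypothetical helper: returns the longest prefix of 'text' whose UTF8
// encoding fits in 'maxBytes', without splitting a multi-byte character.
static string TruncateToUtf8ByteLimit(string text, int maxBytes)
{
    int byteCount = 0;
    int charIndex = 0;
    while (charIndex < text.Length)
    {
        // A surrogate pair is two chars that must be kept together.
        int charsInRune = char.IsSurrogatePair(text, charIndex) ? 2 : 1;
        int runeBytes = Encoding.UTF8.GetByteCount(text.AsSpan(charIndex, charsInRune));
        if (byteCount + runeBytes > maxBytes)
        {
            break;
        }
        byteCount += runeBytes;
        charIndex += charsInRune;
    }
    return text.Substring(0, charIndex);
}

// 'f' takes 1 byte and 'ö' takes 2, so a 2-byte budget only fits "f".
Console.WriteLine(TruncateToUtf8ByteLimit(name, 2)); // f
```

Code that sizes a buffer from name.Length, or cuts the string at a char index to fit a 100-byte name field, is correct for ASCII but can overflow or split a character once the encoding is UTF8.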

danmoseley · 2022-09-12T21:42:32Z

You'll want to include chars that encode as more than 2 bytes, too, not just földër (föld€r?)

We should have tests, for both reading and writing, that use non-ASCII chars in every string in our API. I guess that includes entry name and link name, and extended attributes, possibly user/group although they may be limited by POSIX to a subset of ASCII. We'll need to do interop tests for each value in each format.
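
A rough sketch of what such a round-trip check could look like for entry names, using names with 2-, 3- and 4-byte UTF8 characters across the four entry formats. RoundTripName is a hypothetical helper, and this only covers the Name field with our own reader; link names, extended attributes, and interop with other tar tools would need their own checks.

```csharp
using System;
using System.Formats.Tar;
using System.IO;

// Names mixing characters of different UTF8 widths.
string[] names =
{
    "f\u00F6ld\u00EBr",  // földër: 'ö' and 'ë' take 2 bytes each
    "f\u00F6ld\u20ACr",  // föld€r: '€' takes 3 bytes
    "folder\U0001F600",  // an emoji: 4 bytes, a surrogate pair in UTF16
};

// Hypothetical helper: writes one entry to an in-memory archive,
// reads it back with TarReader and returns the name that was read.
static string RoundTripName(TarEntry entry)
{
    using MemoryStream ms = new();
    using (TarWriter writer = new(ms, leaveOpen: true))
    {
        writer.WriteEntry(entry);
    }
    ms.Position = 0;
    using TarReader reader = new(ms);
    return reader.GetNextEntry()!.Name;
}

foreach (string name in names)
{
    TarEntry[] entries =
    {
        new V7TarEntry(TarEntryType.Directory, name),
        new UstarTarEntry(TarEntryType.Directory, name),
        new PaxTarEntry(TarEntryType.Directory, name),
        new GnuTarEntry(TarEntryType.Directory, name),
    };
    foreach (TarEntry entry in entries)
    {
        // Reports whether the name survived the write/read cycle unchanged.
        bool unchanged = RoundTripName(entry) == name;
        Console.WriteLine($"{entry.Format}: {unchanged}");
    }
}
```

With ASCII encoding on the write path, the non-Pax formats would be expected to report False for these names; once the fields are written as UTF8, every line should report True.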

jozkee added the area-System.IO label Sep 12, 2022

jozkee added this to the 8.0.0 milestone Sep 12, 2022

jozkee mentioned this issue Sep 12, 2022

[release/7.0] Fix prefix writing on TarHeaderWrite #75373

Merged

jozkee modified the milestones: 8.0.0, 7.0.0 Sep 12, 2022

jozkee added the blocking-release label Sep 12, 2022

jozkee mentioned this issue Sep 20, 2022

Use UTF8 encoding on Tar string fields #75902

Merged

ghost added the in-pr label (There is an active PR which will close this issue when it is merged) Sep 20, 2022

jozkee closed this as completed in #75902 Sep 28, 2022

ghost removed the in-pr label (There is an active PR which will close this issue when it is merged) Sep 28, 2022

jozkee mentioned this issue Oct 4, 2022

[release/7.0] backport Tar fixes #76322

Merged

ghost locked as resolved and limited conversation to collaborators Oct 28, 2022

carlossanlop added area-System.Formats.Tar and removed area-System.IO labels Nov 22, 2022
