
JPEG schema US-ASCII encoding leads to odd translation of non-ASCII characters #3

Open
lblatchford opened this issue Dec 3, 2020 · 1 comment


@lblatchford

dirtyword5x.jpg has a non-ASCII byte 0xa8 at offset 0x1a1b.
test25.jpg has the same non-ASCII byte at offset 0x17ef.

When these files are parsed and then unparsed, the 0xa8 becomes 0x3f (ASCII '?').
After the parse, the infoset has the 0xa8 translated to U+FFFD, the Unicode replacement character (0xEFBFBD when encoded as UTF-8).
If the encoding in the schema is changed from US-ASCII to UTF-8, 0xa8 is changed to 0xEFBFBD in the final JPEG, which is clearer.
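
The byte substitutions described above can be reproduced outside Daffodil. The following is only a sketch using Python's built-in codecs with `errors="replace"`; it assumes this mirrors a replace-on-error policy and is not Daffodil's own code. The variable names are made up for illustration.

```python
# Illustration with Python's standard codecs, not Daffodil itself.
raw = b"\xa8"  # the non-ASCII byte found in both files

# "Parse" as US-ASCII: the undecodable byte becomes U+FFFD.
parsed_ascii = raw.decode("ascii", errors="replace")

# "Unparse" as US-ASCII: U+FFFD is not encodable, so it becomes '?'.
unparsed_ascii = parsed_ascii.encode("ascii", errors="replace")

# Same round trip with UTF-8: a lone 0xa8 is invalid UTF-8, so it also
# decodes to U+FFFD, which then encodes to the three bytes EF BF BD.
parsed_utf8 = raw.decode("utf-8", errors="replace")
unparsed_utf8 = parsed_utf8.encode("utf-8")

print(unparsed_ascii.hex())  # 3f
print(unparsed_utf8.hex())   # efbfbd
```

In both cases the original 0xa8 is lost; the two encodings differ only in what replaces it.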

Should the encoding be UTF-8 or something else other than US-ASCII?

dirtyword5x
test25

@stevedlawrence
Member

It looks like that is a COM field in the JPEG file, which is used for comments. For this field, the JPEG specification says "the interpretation is left to the application". So it seems there is no standard encoding for this field, and this likely applies to all other string fields in a JPEG file. To preserve the comment and other field data exactly, it probably makes sense to change the encoding to ISO-8859-1. This encoding has no illegal byte values (which both US-ASCII and UTF-8 have), so comment data will never be replaced due to encoding errors. This should allow the data to parse and unparse to exactly the same bytes. It does mean Daffodil won't detect garbage/malicious comment data, but that can be handled outside of Daffodil if it's important to the use case.
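
As a quick sanity check of the "no illegal values" point, here is a small sketch using Python's `iso-8859-1` codec (an assumption standing in for the schema encoding, not Daffodil's implementation). It shows that every byte value from 0x00 to 0xFF survives a decode/encode round trip; the sample comment bytes are hypothetical.

```python
# Illustration with Python's standard codecs, not Daffodil itself.
all_bytes = bytes(range(256))  # every possible byte value

# Strict decoding never fails: each byte maps to exactly one code point.
decoded = all_bytes.decode("iso-8859-1")
round_tripped = decoded.encode("iso-8859-1")

# Hypothetical comment payload containing the problematic 0xa8 byte.
comment = b"comment with \xa8 inside"
comment_round_tripped = comment.decode("iso-8859-1").encode("iso-8859-1")

print(round_tripped == all_bytes)        # True
print(comment_round_tripped == comment)  # True
```

The trade-off is the one noted above: because nothing is rejected, arbitrary binary data in a comment passes through unflagged.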

Would you like to create a pull request switching to ISO-8859-1?


2 participants