[BUG] nds_transcode.py is not handling international characters correctly #170

Closed
jbrennan333 opened this issue Oct 26, 2023 · 1 comment · Fixed by #171

@jbrennan333
Collaborator

Describe the bug
The raw data files generated for TPC-DS are encoded in ISO-8859-1. In this encoding, the Ô character is the single byte 0xd4.
nds_transcode.py reads these raw CSV files with the default encoding of UTF-8, so international characters are not decoded correctly.

Comment from TPC-DS spec:

The data generated by dsdgen includes some international characters. Examples of international
characters are Ô and É. The database must preserve these characters during loading and processing by using a
character encoding such as ISO/IEC 8859-1 that includes these characters

If we do the transcoding with the GPU, the 0xd4 byte is passed through unchanged to the output file, where it is an invalid UTF-8 sequence. I ran into this when comparing CPU-transcoded data (specifically the customer table) to GPU-transcoded data. The CPU path replaces the invalid byte with 0xefbfbd, the UTF-8 encoding of the U+FFFD replacement character, so every row containing one of these international characters differs between the two output files. Both the CPU and GPU outputs are incorrect: neither contains the proper UTF-8 encoding of the original character.
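The three byte sequences discussed here can be reproduced in plain Python. This is only an illustration of the encodings involved, not the code path Spark or the GPU plugin takes; the variable names are made up for the example.

```python
# The string as dsdgen writes it: 'Ô' is the single ISO-8859-1 byte 0xd4.
raw = b"C\xd4TE D'IVOIRE"

# Correct transcode: decode as ISO-8859-1, then encode as UTF-8.
correct = raw.decode("iso-8859-1").encode("utf-8")

# Lossy transcode: decode as UTF-8 and substitute U+FFFD for invalid bytes
# (the kind of result described above for the CPU path).
lossy = raw.decode("utf-8", errors="replace").encode("utf-8")

# Pass-through: bytes copied unchanged (the kind of result described for the GPU path).
passthrough = raw


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


for name, value in [("correct", correct), ("lossy", lossy), ("passthrough", passthrough)]:
    print(f"{name:12s} {value.hex(' ')}  valid_utf8={is_valid_utf8(value)}")
```

Only the first variant carries c3 94 for Ô. The lossy variant is valid UTF-8 but has lost the character, and the pass-through variant is not valid UTF-8 at all.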

If you modify nds_transcode.py by adding .option("encoding", "ISO-8859-1") to the CSV read, the character is correctly transcoded to 0xc394 when the output is written in UTF-8 format.
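A rough sketch of what that change could look like. The helper name `read_raw_csv` and its parameters are hypothetical, and the "|" delimiter is an assumption based on dsdgen's pipe-delimited output; the actual read call in nds_transcode.py may be structured differently. `spark` is expected to be an existing SparkSession.

```python
def read_raw_csv(spark, path, schema=None, encoding="ISO-8859-1"):
    """Hypothetical helper: read a raw dsdgen file with an explicit source encoding.

    `spark` is a SparkSession created elsewhere; nothing is imported here so the
    sketch stays independent of how the session is built.
    """
    reader = (
        spark.read
        .option("delimiter", "|")      # assumption: dsdgen writes '|'-delimited rows
        .option("encoding", encoding)  # the option proposed in this issue
    )
    return reader.csv(path, schema=schema)
```

With the encoding set on the read, Spark decodes 0xd4 as Ô before any UTF-8 output is written, so the CPU and GPU paths should both start from the same correctly decoded strings.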

Steps/Code to reproduce bug
Use the nds_transcode.py script to transcode the customer file to parquet format with no compression.
Use a binary viewer like xxd to examine the output file and check how the character is encoded.
It appears in the string CÔTE D'IVOIRE, so I usually search for VOIR and then look at the encoding for the Ô character. For example:

04199e50: 0043 c394 5445 2044 2749 564f 4952 450d  .C..TE D'IVOIRE.
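As an alternative to eyeballing the xxd output, the same check can be scripted. This is a minimal sketch that assumes the output is uncompressed, so the string bytes appear verbatim in the file; the function names and labels are made up for illustration, and the whole file is read into memory.

```python
from typing import Optional

# Byte patterns for "CÔTE D'IVOIRE" under each outcome described in this issue.
PATTERNS = {
    "utf-8 (correct)": b"C\xc3\x94TE D'IVOIRE",
    "raw ISO-8859-1 byte (invalid UTF-8)": b"C\xd4TE D'IVOIRE",
    "U+FFFD replacement": b"C\xef\xbf\xbdTE D'IVOIRE",
}


def classify_cote_encoding(data: bytes) -> Optional[str]:
    """Return the label of the first pattern found in data, or None if none match."""
    for label, pattern in PATTERNS.items():
        if pattern in data:
            return label
    return None


def classify_file(path: str) -> Optional[str]:
    """Read a file as raw bytes and classify how the Ô character was encoded."""
    with open(path, "rb") as f:
        return classify_cote_encoding(f.read())
```

Running `classify_cote_encoding` on the 16 bytes from the xxd line above should report the correct UTF-8 form.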

Expected behavior
International characters should be transcoded correctly from ISO-8859 to the output encoding.

jbrennan333 added the "? - Needs Triage" and "bug" labels Oct 26, 2023
wjxiz1992 self-assigned this Oct 27, 2023
@wjxiz1992
Collaborator

Thanks for narrowing this down to the root cause. I'll make a fix for it!
