[BUG] nds_transcode.py is not handling international characters correctly #170

jbrennan333 · 2023-10-26T19:41:45Z

Describe the bug
The raw data files generated by TPC-DS are encoded in ISO-8859-1. In this encoding, the Ô character is the single byte 0xd4.
In nds_transcode.py, we read these raw CSV files with the default encoding of UTF-8, so international characters are not handled correctly.

Comment from TPC-DS spec:

The data generated by dsdgen includes some international characters. Examples of international
characters are Ô and É. The database must preserve these characters during loading and processing by using a
character encoding such as ISO/IEC 8859-1 that includes these characters

If we do the transcoding with the GPU, the 0xd4 byte is passed through unchanged to the output file, where it is an invalid UTF-8 sequence. I ran into this when comparing CPU-transcoded data (specifically the customer table) to GPU-transcoded data. On the CPU, the invalid byte is replaced with 0xefbfbd (the UTF-8 encoding of U+FFFD, the replacement character), so if you compare the resulting output files, every row containing one of these international characters differs. Both the CPU and GPU outputs are incorrect, though: in neither file is the international character stored with its correct UTF-8 encoding.
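The byte-level behavior described above can be reproduced in plain Python. This is only an illustration of the encodings involved, not the code path Spark takes on either CPU or GPU; the sample string and the helper name `is_valid_utf8` are made up for the example.

```python
# "CÔTE D'IVOIRE" as it appears in the raw ISO-8859-1 data: Ô is the single byte 0xd4.
raw = b"C\xd4TE D'IVOIRE"

# What the CPU output looks like: the bytes are decoded as UTF-8 and the
# invalid byte is replaced with U+FFFD, which is 0xefbfbd once written as UTF-8.
replaced = raw.decode("utf-8", errors="replace").encode("utf-8")

# What we actually want: decode as ISO-8859-1, then write as UTF-8.
correct = raw.decode("iso-8859-1").encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(replaced.hex(" "))   # contains ef bf bd where Ô should be
print(correct.hex(" "))    # contains c3 94 where Ô should be
print(is_valid_utf8(raw))  # the pass-through (GPU-like) bytes are not valid UTF-8
```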

If you modify nds_transcode.py by adding .option("encoding", "ISO-8859-1") to the CSV read, then Ô is correctly transcoded to 0xc394 when the output is written in UTF-8.
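A minimal sketch of what that change could look like. The function name `read_raw_csv`, its parameters, and the `|` delimiter are assumptions for illustration and are not copied from nds_transcode.py; only the `encoding` option comes from this issue.

```python
def read_raw_csv(spark, path, schema=None, delimiter="|"):
    """Read a raw TPC-DS CSV file, decoding it as ISO-8859-1 instead of UTF-8.

    `spark` is an existing pyspark.sql.SparkSession.
    `delimiter="|"` is an assumption about the raw file layout.
    """
    reader = (
        spark.read
        .option("delimiter", delimiter)
        # The fix proposed in this issue: tell the CSV reader the source encoding.
        .option("encoding", "ISO-8859-1")
    )
    if schema is not None:
        reader = reader.schema(schema)
    return reader.csv(path)
```

Usage would be something like `df = read_raw_csv(spark, "customer.dat")` followed by the usual parquet write, where the file name is a placeholder.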

Steps/Code to reproduce bug
Use the nds_transcode.py script to transcode the customer file to parquet format with no compression.
Use a binary viewer like xxd to examine the output file and verify that the character is correct.
It appears in the string CÔTE D'IVOIRE, so I usually search for VOIR and then look at the encoding for the Ô character. For example:

04199e50: 0043 c394 5445 2044 2749 564f 4952 450d  .C..TE D'IVOIRE.
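As an alternative to eyeballing the xxd output, the same check can be scripted. The sketch below scans the raw bytes of an output file for `TE D'IVOIRE` and classifies whatever precedes it. The function names are hypothetical, and it assumes the strings are stored as plain, uncompressed bytes in the file, as in the dump above.

```python
from pathlib import Path

MARKER = b"TE D'IVOIRE"


def classify_o_circumflex(data: bytes) -> dict:
    """Count how the character before each MARKER occurrence is encoded."""
    counts = {"utf8": 0, "latin1": 0, "replacement": 0, "other": 0}
    start = 0
    while True:
        i = data.find(MARKER, start)
        if i == -1:
            break
        if data.endswith(b"\xc3\x94", 0, i):        # correct UTF-8 for Ô
            counts["utf8"] += 1
        elif data.endswith(b"\xef\xbf\xbd", 0, i):  # U+FFFD replacement character
            counts["replacement"] += 1
        elif data.endswith(b"\xd4", 0, i):          # raw ISO-8859-1 byte passed through
            counts["latin1"] += 1
        else:
            counts["other"] += 1
        start = i + len(MARKER)
    return counts


def classify_file(path: str) -> dict:
    """Convenience wrapper: classify the contents of a file on disk."""
    return classify_o_circumflex(Path(path).read_bytes())
```

A correctly transcoded file should report only `utf8` hits; `replacement` corresponds to the CPU behavior described above and `latin1` to the GPU behavior.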

Expected behavior
International characters should be transcoded correctly from ISO-8859-1 to the output encoding.

wjxiz1992 · 2023-10-27T02:56:19Z

Thanks for narrowing down to the root, I'll make a fix for it!

jbrennan333 added the "? - Needs Triage" and "bug" labels Oct 26, 2023

wjxiz1992 self-assigned this Oct 27, 2023

wjxiz1992 mentioned this issue Oct 27, 2023

Use ISO-8859 codec to load CSV files #171

Merged

This was referenced Oct 27, 2023

[BUG] Invalid characters in CSV are handled differently when reading from GPU NVIDIA/spark-rapids#9560

Open

[FEA] Validate nvcomp-3.0 with spark rapids plugin NVIDIA/spark-rapids#9461

Closed

wjxiz1992 closed this as completed in #171 Nov 1, 2023

mattahrens removed the "? - Needs Triage" label Jan 16, 2024
