[BUG] JSON white space normalization removes too much for unquoted values #15280

revans2 · 2024-03-12T18:56:35Z

Describe the bug
I don't consider this to be too critical, but it is a regression compared to not turning on white space normalization.

An unquoted JSON value is not allowed to have white space in the middle of it, but it looks like the white space normalization is cleaning it up, and converting it from invalid JSON to valid JSON, which makes some of my tests fail.

Steps/Code to reproduce bug

TEST_F(JsonWSNormalizationTest, Unquoted_Bad_Boolean_Spaces)
{
  std::string input  = R"({"A": tr ue })";
  // but it actually is {"A":true}
  std::string output = R"({"A":tr ue})";
  run_test(input, output);
}

TEST_F(JsonWSNormalizationTest, Unquoted_Bad_Float_Spaces)
{
  std::string input  = R"({"A": 1 . 0 })";
  // but it actually is {"A":1.0}
  std::string output = R"({"A":1 . 0})";
  run_test(input, output);
}

TEST_F(JsonWSNormalizationTest, Unquoted_Bad_Sci_Spaces)
{
  std::string input  = R"({"A": 1 E 1 })";
  // but it actually is {"A":1E1}
  std::string output = R"({"A":1 E 1})";
  run_test(input, output);
}

The above tests fail, because there are too many white space values being removed.

The text was updated successfully, but these errors were encountered:

GregoryKimball · 2024-03-12T22:25:37Z

@revans2 thank you for sharing this update - it's hilarious! We will need to brainstorm about our options here.

Let invalid JSON get corrected. @revans2 are you willing to accept this behavior?
libcudf could modify the normalization FST to avoid dropping the whitespace that occurs within a JSON value. We would need to add more states to do this, so it would cost more time and memory. @shrshi what do you think?
Spark-RAPIDS could run a validity check and flag invalid rows, but perhaps this would not work for the cases where Spark needs to recover records from invalid lines (e.g. {"A":1}{"B":1 2 3}). @revans2 what do you think?

revans2 · 2024-06-25T15:37:06Z

Sorry it took me so long to respond to this. I would want the white space removal to follow what CUDF already does for validation with where it ignored white space.

https://www.json.org/json-en.html

Provides some nice flow charts. So for an object white space after a { can be removed. As can any white space before or after a key.

For an array white space after a [ can be removed (but that is probably already covered by value.

For a value any leasing or trailing white space can be removed, but white space within a number, true, false, null, or string cannot be.

Also white space consists of space, linefeed, carriage return or horizontal tab.

Another option would be to remove the white space afterwards, but we would have to have a way to know that a row that was returned came from a nested object and not a regular value, and we would need a way for the DFA to work on a String column instead of a buffer.

… JSONL inputs (#16759) Addresses #15280 Whitespace normalization is expected to remove unquoted whitespace characters in JSON lines inputs. However, in the cases where the JSON line is invalid due to an unquoted whitespace occurring in between numbers or literals, the existing normalization implementation is incorrect since it removes these invalidating whitespaces and makes the line valid. This PR implements the normalization as a post-processing step on only nested columns forced as string columns. Idea: 1. Create a single buffer by concatenating the rows of the string column. Create segment offsets and lengths array for concatenated buffer 2. Run a complementary whitespace normalization FST i.e. NOP for non-whitespace and quoted whitespace characters, and output indices of unquoted whitespace characters 3. Update segment lengths based on the number of output indices between segment offsets 4. Remove characters at output indices from concatenated buffer. 5. Return updated buffer, segment lengths and updated segment offsets Authors: - Shruti Shivakumar (https://github.com/shrshi) - Karthikeyan (https://github.com/karthikeyann) Approvers: - Robert (Bobby) Evans (https://github.com/revans2) - Vukasin Milovanovic (https://github.com/vuule) - Nghia Truong (https://github.com/ttnghia) - Karthikeyan (https://github.com/karthikeyann) URL: #16759

revans2 added bug Something isn't working cuIO cuIO issue Spark Functionality that helps Spark RAPIDS labels Mar 12, 2024

GregoryKimball added this to the Nested JSON reader milestone Mar 12, 2024

GregoryKimball added the libcudf Affects libcudf (C++/CUDA) code. label Mar 12, 2024

GregoryKimball mentioned this issue Mar 12, 2024

[FEA] JSON reader improvements for Spark-RAPIDS #13525

Open

shrshi mentioned this issue Sep 5, 2024

Whitespace normalization of nested column coerced as string column in JSONL inputs #16759

Merged

3 tasks

revans2 mentioned this issue Sep 10, 2024

Enable tests for all JSON white space normalization NVIDIA/spark-rapids#11456

Merged

revans2 closed this as completed in NVIDIA/spark-rapids#11456 Sep 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] JSON white space normalization removes too much for unquoted values #15280

[BUG] JSON white space normalization removes too much for unquoted values #15280

revans2 commented Mar 12, 2024

GregoryKimball commented Mar 12, 2024 •

edited

Loading

revans2 commented Jun 25, 2024

[BUG] JSON white space normalization removes too much for unquoted values #15280

[BUG] JSON white space normalization removes too much for unquoted values #15280

Comments

revans2 commented Mar 12, 2024

GregoryKimball commented Mar 12, 2024 • edited Loading

revans2 commented Jun 25, 2024

GregoryKimball commented Mar 12, 2024 •

edited

Loading