Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] JSON white space normalization removes too much for unquoted values #15280

Closed
revans2 opened this issue Mar 12, 2024 · 2 comments · Fixed by NVIDIA/spark-rapids#11456
Closed
Labels
bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@revans2
Copy link
Contributor

revans2 commented Mar 12, 2024

Describe the bug
I don't consider this to be too critical, but it is a regression compared to not turning on white space normalization.

An unquoted JSON value is not allowed to have white space in the middle of it, but it looks like the white space normalization is cleaning it up, and converting it from invalid JSON to valid JSON, which makes some of my tests fail.

Steps/Code to reproduce bug

TEST_F(JsonWSNormalizationTest, Unquoted_Bad_Boolean_Spaces)
{
  std::string input  = R"({"A": tr ue })";
  // but it actually is {"A":true}
  std::string output = R"({"A":tr ue})";
  run_test(input, output);
}

TEST_F(JsonWSNormalizationTest, Unquoted_Bad_Float_Spaces)
{
  std::string input  = R"({"A": 1 . 0 })";
  // but it actually is {"A":1.0}
  std::string output = R"({"A":1 . 0})";
  run_test(input, output);
}

TEST_F(JsonWSNormalizationTest, Unquoted_Bad_Sci_Spaces)
{
  std::string input  = R"({"A": 1 E 1 })";
  // but it actually is {"A":1E1}
  std::string output = R"({"A":1 E 1})";
  run_test(input, output);
}

The above tests fail, because there are too many white space values being removed.

@revans2 revans2 added bug Something isn't working cuIO cuIO issue Spark Functionality that helps Spark RAPIDS labels Mar 12, 2024
@GregoryKimball
Copy link
Contributor

GregoryKimball commented Mar 12, 2024

@revans2 thank you for sharing this update - it's hilarious! We will need to brainstorm about our options here.

  • Let invalid JSON get corrected. @revans2 are you willing to accept this behavior?
  • libcudf could modify the normalization FST to avoid dropping the whitespace that occurs within a JSON value. We would need to add more states to do this, so it would cost more time and memory. @shrshi what do you think?
  • Spark-RAPIDS could run a validity check and flag invalid rows, but perhaps this would not work for the cases where Spark needs to recover records from invalid lines (e.g. {"A":1}{"B":1 2 3}). @revans2 what do you think?

@revans2
Copy link
Contributor Author

revans2 commented Jun 25, 2024

Sorry it took me so long to respond to this. I would want the white space removal to follow what CUDF already does for validation with where it ignored white space.

https://www.json.org/json-en.html

Provides some nice flow charts. So for an object white space after a { can be removed. As can any white space before or after a key.

For an array white space after a [ can be removed (but that is probably already covered by value.

For a value any leasing or trailing white space can be removed, but white space within a number, true, false, null, or string cannot be.

Also white space consists of space, linefeed, carriage return or horizontal tab.

Another option would be to remove the white space afterwards, but we would have to have a way to know that a row that was returned came from a nested object and not a regular value, and we would need a way for the DFA to work on a String column instead of a buffer.

rapids-bot bot pushed a commit that referenced this issue Sep 19, 2024
… JSONL inputs (#16759)

Addresses #15280 

Whitespace normalization is expected to remove unquoted whitespace characters in JSON lines inputs. However, in the cases where the JSON line is invalid due to an unquoted whitespace occurring in between numbers or literals, the existing normalization implementation is incorrect since it removes these invalidating whitespaces and makes the line valid.

This PR implements the normalization as a post-processing step on only nested columns forced as string columns.
Idea:
1. Create a single buffer by concatenating the rows of the string column. Create segment offsets and lengths array for concatenated buffer
2. Run a complementary whitespace normalization FST i.e. NOP for non-whitespace and quoted whitespace characters, and output indices of unquoted whitespace characters
3. Update segment lengths based on the number of output indices between segment offsets
4. Remove characters at output indices from concatenated buffer.
5. Return updated buffer, segment lengths and updated segment offsets

Authors:
  - Shruti Shivakumar (https://github.com/shrshi)
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Nghia Truong (https://github.com/ttnghia)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #16759
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS
Projects
Status: Done
Status: In progress
Development

Successfully merging a pull request may close this issue.

2 participants