Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add validation for get_json_object #15013

Closed

Conversation

res-life
Copy link
Contributor

@res-life res-life commented Feb 9, 2024

Description

Add a Json parser to do Json validation for Spark get_json_object.
Like Spark get_json_object, Json parser parses by token, the parser can be used when handing other issues of get_json_object
This PR do the validation before actual work of get_json_object.
In future, will move the validation into the main logic of get_json_object.

closes NVIDIA/spark-rapids#10194

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

perf test

validation overhead is minor, after adding the validation, speedup drops from 30.34 x to 29.60 x

todo

  • code format
  • test case compile

Signed-off-by: Chong Gao <res_life@163.com>
Copy link

copy-pr-bot bot commented Feb 9, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Feb 9, 2024
@res-life res-life requested a review from revans2 February 9, 2024 10:21
@res-life res-life changed the title Add validation for get_json_object [WIP]Add validation for get_json_object Feb 9, 2024
@davidwendt
Copy link
Contributor

The get_json_path was removed from strings in 23.12.
Also, cpp/src/strings/json/json_path.cu was moved to cpp/src/json
Likewise, I would recommend moving cpp/include/cudf/strings/detail/json_parser.cuh and it's functions out of the strings namespace. Also, if it is only included by a single .cu it probably does not need to be in a separate header. Or at the very least could be placed in cpp/src/json instead of cpp/include/cudf

@github-actions github-actions bot added the conda label Feb 21, 2024
@github-actions github-actions bot removed the conda label Feb 21, 2024
@GregoryKimball
Copy link
Contributor

Thank you @res-life for suggesting this change. I looked through the diff but I haven't had time to study it carefully.

  • Does the JSON validation apply to the "JSON path" parameter or to the strings column data?
  • Can the validation step be separated from "get_json_object", so that existing users could access the existing (presumably faster) behavior?

@res-life
Copy link
Contributor Author

res-life commented Feb 21, 2024

Does the JSON validation apply to the "JSON path" parameter or to the strings column data?

Only apply to JSON strings, not apply to the "JSON path" parameter.

Can the validation step be separated from "get_json_object", so that existing users could access the existing (presumably faster) behavior?

The validation is already separated from "get_json_object".
It first validate the whole JSON string, if JSON is invalid, then return null, otherwise do the existing logic.

@github-actions github-actions bot added the Java Affects Java cuDF API. label Feb 21, 2024
@res-life
Copy link
Contributor Author

/ok to test

@res-life res-life added bug Something isn't working 2 - In Progress Currently a work in progress non-breaking Non-breaking change labels Feb 22, 2024
@revans2
Copy link
Contributor

revans2 commented Feb 23, 2024

I am not 100% sure of the games that @nvdbaranec played when he first wrote this to make it all work, but his code has some really high limits (I have not hit them yet in my tests). I run into a limit of 254 on the regular JSON parser in CUDF so I would start with 254 for now, and then we can figure out how to deal with something deeper after that.

Copy link
Contributor

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general this looks good. All of the tests I have been working on associated with NVIDIA/spark-rapids#10454 pass with this patch and I only saw a performance regression of 14% with validation enabled. That said I would like to see what it the performance looks like if we have a config to enable/disable validation, I would like to see if there are any cudf performance tests, and also if we should drop some of these configs from get_json_object.

If the long term goal it to eventually use the standard JSON parser in io, then exposing configs that we might have to deprecate feels less than ideal unless we know that CUDF would be interested in supporting them in the future too.

/**
* JSON token enum
*/
enum class json_token {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't totally understand why this file needs to be outside of json_path.cu like was mentioned before #15013 (comment)

I see that we have some unit tests to verify that the code compiles and run correctly on the CPU. So I will leave it up to you and the CUDF team to debate the merits of those tests. I personally am fine either way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree too. This file contains device functions. Unless the device functions in this file are used by public APIs directly, this header should be in src/. For unit tests, this header could be directly included from src/. A few unit tests already do that.

@@ -729,6 +737,15 @@ __device__ parse_result parse_json_path(json_state& j_state,
};
push_context(j_state, commands, false);

// First do JSON validation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I think the long term goal is to use the same JSON parser everywhere and all of this would go away. The JSON input format code does do validation. I see this as a short term solution.

@res-life
Copy link
Contributor Author

/ok to test

@res-life
Copy link
Contributor Author

/ok to test

@res-life
Copy link
Contributor Author

Test cases fixed.
Next steps:

  • Change max JSON nesting depth to 254
  • Move parser code to json_path.cu

@res-life
Copy link
Contributor Author

/ok to test

@res-life
Copy link
Contributor Author

/ok to test

@res-life
Copy link
Contributor Author

/ok to test

@res-life
Copy link
Contributor Author

/ok to test

@res-life
Copy link
Contributor Author

/ok to test

@res-life
Copy link
Contributor Author

/ok to test

@res-life
Copy link
Contributor Author

/ok to test

@res-life
Copy link
Contributor Author

/ok to test

@res-life
Copy link
Contributor Author

/ok to test

@res-life
Copy link
Contributor Author

res-life commented Mar 6, 2024

Moved to spark-rapids-jni repo: NVIDIA/spark-rapids-jni#1836
Closed this.

@res-life res-life closed this Mar 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2 - In Progress Currently a work in progress bug Something isn't working CMake CMake build issue Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] GetJsonObject does not validate the input is JSON in the same way as Spark
6 participants