-
Notifications
You must be signed in to change notification settings - Fork 886
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add validation for get_json_object #15013
Add validation for get_json_object #15013
Conversation
Signed-off-by: Chong Gao <res_life@163.com>
The |
Thank you @res-life for suggesting this change. I looked through the diff but I haven't had time to study it carefully.
|
Only apply to JSON strings, not apply to the "JSON path" parameter.
The validation is already separated from "get_json_object". |
/ok to test |
I am not 100% sure of the games that @nvdbaranec played when he first wrote this to make it all work, but his code has some really high limits (I have not hit them yet in my tests). I run into a limit of 254 on the regular JSON parser in CUDF so I would start with 254 for now, and then we can figure out how to deal with something deeper after that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general this looks good. All of the tests I have been working on associated with NVIDIA/spark-rapids#10454 pass with this patch and I only saw a performance regression of 14% with validation enabled. That said I would like to see what it the performance looks like if we have a config to enable/disable validation, I would like to see if there are any cudf performance tests, and also if we should drop some of these configs from get_json_object.
If the long term goal it to eventually use the standard JSON parser in io, then exposing configs that we might have to deprecate feels less than ideal unless we know that CUDF would be interested in supporting them in the future too.
/** | ||
* JSON token enum | ||
*/ | ||
enum class json_token { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't totally understand why this file needs to be outside of json_path.cu like was mentioned before #15013 (comment)
I see that we have some unit tests to verify that the code compiles and run correctly on the CPU. So I will leave it up to you and the CUDF team to debate the merits of those tests. I personally am fine either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree too. This file contains device functions. Unless the device functions in this file are used by public APIs directly, this header should be in src/. For unit tests, this header could be directly included from src/. A few unit tests already do that.
@@ -729,6 +737,15 @@ __device__ parse_result parse_json_path(json_state& j_state, | |||
}; | |||
push_context(j_state, commands, false); | |||
|
|||
// First do JSON validation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I think the long term goal is to use the same JSON parser everywhere and all of this would go away. The JSON input format code does do validation. I see this as a short term solution.
/ok to test |
/ok to test |
Test cases fixed.
|
/ok to test |
/ok to test |
/ok to test |
/ok to test |
/ok to test |
/ok to test |
/ok to test |
/ok to test |
/ok to test |
Moved to spark-rapids-jni repo: NVIDIA/spark-rapids-jni#1836 |
Description
Add a Json parser to do Json validation for Spark
get_json_object
.Like Spark get_json_object, Json parser parses by token, the parser can be used when handing other issues of
get_json_object
This PR do the validation before actual work of
get_json_object
.In future, will move the validation into the main logic of
get_json_object
.closes NVIDIA/spark-rapids#10194
Checklist
perf test
validation overhead is minor, after adding the validation, speedup drops from 30.34 x to 29.60 x
todo