Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix JSON parsing in json to struct function #8204

Closed
cindyyuanjiang opened this issue Apr 28, 2023 · 7 comments
Closed

Fix JSON parsing in json to struct function #8204

cindyyuanjiang opened this issue Apr 28, 2023 · 7 comments
Assignees
Labels
task Work required that improves the product but is not user facing

Comments

@cindyyuanjiang
Copy link
Collaborator

Is your feature request related to a problem? Please describe.
This is to follow up with json to struct function in #8174. The current JSON parsing in the implementation is having CUDF do name and type inference, which has issue with parsing and error handling. In particular, with parsing ints, if the int is too large or the format is not exactly correct, then CUDF will return a partial value (best effort to get something out) but spark is strict and will return a null.

Describe the solution you'd like
"We have put in a lot of work to make our cast code do the right thing when going from a string to an int, so if we ask CUDF to return Strings values instead of int values and then cast them afterwards we get the parsing that we want."

@cindyyuanjiang cindyyuanjiang added feature request New feature or request ? - Needs Triage Need team to review and classify labels Apr 28, 2023
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label May 2, 2023
@andygrove
Copy link
Contributor

andygrove commented Jan 22, 2024

@cindyyuanjiang @revans2 I think this issue can be closed now? When reading JSON we ask cuDF for strings for all primitive type and then cast to the primitive type in the plugin.

Relevant code is in GpuTextBasedPartitionReader#readToTable.

@revans2
Copy link
Collaborator

revans2 commented Jan 22, 2024

If the tests pass, then I am good with closing this.

@andygrove
Copy link
Contributor

Ok, so there are still differences in how we handle types between from_json and json scan, so this is still valid. I will work on making these consistent.

@andygrove andygrove self-assigned this Jan 22, 2024
@andygrove
Copy link
Contributor

I think we will need this cuDF feature before we can resolve this issue:

rapidsai/cudf#14830

@andygrove
Copy link
Contributor

I propose that we make GpuJsonToStruct consistent with GpuJsonScan by reading top-level primitives as strings, then cast to the expected type, and fall back to CPU if the source JSON contains structs (until rapidsai/cudf#14830 is implemented).

@revans2 Does that sound reasonable to you?

@revans2
Copy link
Collaborator

revans2 commented Jan 22, 2024

That sounds good to me.

@revans2
Copy link
Collaborator

revans2 commented Mar 13, 2024

The main part of the proposed fix is already covered by #10542 and follow on issues have been filed for remaining issues related to this.

@revans2 revans2 closed this as completed Mar 13, 2024
@sameerz sameerz added task Work required that improves the product but is not user facing and removed feature request New feature or request labels Apr 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
task Work required that improves the product but is not user facing
Projects
None yet
Development

No branches or pull requests

5 participants