
[FEA] JSON reader improvements for Spark-RAPIDS #13525

Open
GregoryKimball opened this issue Jun 7, 2023 · 4 comments
Labels
0 - Backlog In queue waiting for assignment cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@GregoryKimball
Contributor

GregoryKimball commented Jun 7, 2023

libcudf includes a GPU-accelerated JSON reader that uses a finite-state transducer parser combined with token-processing tree algorithms to transform character buffers into columnar data. This issue tracks the technical work leading up to the launch of libcudf's JSON reader as a default component of the Spark-RAPIDS plugin. Please also refer to the Nested JSON reader milestone and Spark-RAPIDS JSON epic.
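To give a feel for the finite-state-transducer approach, here is a minimal CPU sketch in Python. It is illustrative only: the state names and token set below are my own, not libcudf's, and the real reader runs an equivalent FST data-parallel on the GPU. The point is that a single pass with three states can emit structural tokens while ignoring punctuation inside quoted strings:

```python
# Illustrative sketch only: a tiny finite-state scanner that emits
# structural tokens from a JSON Lines buffer. State names and the token
# set are hypothetical, not libcudf's actual implementation.
def tokenize(buf: str):
    OUT, STR, ESC = 0, 1, 2  # outside string, inside string, after backslash
    structural = {"{": "ObjBegin", "}": "ObjEnd",
                  "[": "ListBegin", "]": "ListEnd",
                  ":": "Colon", ",": "Comma", "\n": "LineEnd"}
    state, start, tokens = OUT, 0, []
    for i, c in enumerate(buf):
        if state == OUT:
            if c == '"':
                state, start = STR, i
            elif c in structural:
                tokens.append((structural[c], i))
        elif state == STR:
            if c == "\\":
                state = ESC
            elif c == '"':
                tokens.append(("String", start, i))
                state = OUT
        else:  # ESC: the escaped character is consumed without a state change
            state = STR
    return tokens

# Note that the comma inside the quoted value does not produce a Comma token:
toks = tokenize('{"a": "x,\\"y"}\n')
```

The token stream, rather than the raw characters, is then what the tree algorithms consume to build columns.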

Spark compatibility issues: Blockers

| Status | Impact for Spark | Change to libcudf |
| --- | --- | --- |
| #13344 | #12532, Blocker: if any line has an error, libcudf throws an exception | Rework the state machine to include error states and scrub tokens from lines with errors |
| #14252 | #14227, Blocker: incorrect parsing | Fix a bug in the error recovery state transitions |
| #14279 | #14226, Blocker: requests alternate error recovery behavior from #13344, where valid data before an error state is preserved | Changes in the JSON parser pushdown automaton for the JSON_LINES_RECOVER option |
| #14936 | #14288, Blocker: libcudf does not have an efficient representation for Spark map types; modeling map types as structs results in poor performance due to one child column per unique key | Return the struct data that represents map types as strings, so the plugin can use unify_json_strings to parse the tokens |
| #14572 | #14239, Blocker: fields with mixed types raise an exception | Add a libcudf reader option to return mixed types as strings; see also improvements in #15236 and #14939 |
| #14545 | #10004, Blocker: can't parse the single-quote variant of JSON when allowSingleQuotes is enabled in Spark | Introduce a preprocessing function to normalize single quotes to double quotes |
| #15324 | #15303, escaped single quotes have their escapes dropped during quote normalization | Adjust the quote normalization FST |
| 🔄 #15419 | #15390 + #15409, Blocker: race conditions found in the nested JSON reader | Solve synchronization problems in the nested JSON reader |
| | #15260, Blocker: crash in mixed type support | |
| 🔄 | #15278, Blocker: allow the list type to be coerced to string (see also #14239); without this, Spark-RAPIDS will fall back when the user requests a field as "string" | Support coercing list types to string |
| | #15277, Blocker: we need to support multi-line JSON objects (see also #10267) | libcudf is scoping a "multi-object" reader |

Spark compatibility issues: non-blockers

| Status | Impact for Spark | Change to libcudf |
| --- | --- | --- |
| | #15222, compatibility problems with leading zeros, "NAN", and escape options | None for now. This feature should live in Spark-RAPIDS as a post-processing option, based on the approach for get_json_object modeled after Spark CPU code (see NVIDIA/spark-rapids-jni#1836). The plugin can then set to null any entries from objects that Spark would treat as invalid. Later we could give Spark-RAPIDS access to raw tokens to run through a more efficient validator. |
| #15033 | #14865, strip whitespace from JSON inputs; otherwise Spark will have to add this when post-processing the coerced string types | Create a new whitespace normalization pre-processing tool |
| 🔄 #14996 | #13473, performance: only process columns in the schema | Skip parsing and column creation for keys not specified in the schema |
| 🔄 #15124 | Reader option performance is unknown | #15041, add JSON reader option benchmarking |
| | Performance: avoid preprocessing to replace empty lines with {} (see also #5712) | libcudf provides a strings column data source |
| #15280 | Find a solution for when whitespace normalization fixes a line that was originally invalid | We could move whitespace normalization after tokenization. We would also like to address #15277 so that we can remove unquoted newline characters as well. |
| | n/a, Spark-RAPIDS doesn't use byte range reading | #15185, reduce IO overhead in JSON byte range reading |
| | n/a, Spark-RAPIDS doesn't use byte range reading | #15186, address a data loss edge case in byte range reading |
| | Reduce peak memory usage | Add chunking to the JSON reader |
| | #15222, Spark-RAPIDS must return null if any field is invalid | Provide a token stream to Spark-RAPIDS for validation, including checks for leading zeros, special string numbers like NaN, +INF, -INF, and optional limits on which characters can be escaped |
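The whitespace normalization item (#15033/#14865) amounts to dropping spaces and tabs that sit outside quoted strings, so that later coercion of structs/lists to strings yields canonical text. A hedged CPU sketch (libcudf implements this as a pre-processing FST on the GPU; note it deliberately cannot touch unquoted newlines, which delimit records in JSON Lines):

```python
# Illustrative sketch: drop whitespace outside quoted strings, keep it
# inside them. Not libcudf's implementation.
def normalize_whitespace(line: str) -> str:
    out, in_string, escaped = [], False, False
    for c in line:
        if in_string:
            out.append(c)
            if escaped:
                escaped = False
            elif c == "\\":
                escaped = True
            elif c == '"':
                in_string = False
        elif c in " \t":
            continue                     # whitespace outside strings is dropped
        else:
            out.append(c)
            if c == '"':
                in_string = True
    return "".join(out)
```

The #15280 concern is visible here: a pre-pass like this can turn an invalid line into a valid one, which is why moving normalization after tokenization is being considered.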
@GregoryKimball GregoryKimball added feature request New feature or request 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue Spark Functionality that helps Spark RAPIDS labels Jun 7, 2023
@GregoryKimball GregoryKimball added this to the Nested JSON reader milestone Jun 7, 2023
@GregoryKimball GregoryKimball changed the title [FEA] JSON reader improvements for Spark-RAPIDS [FEA] Story - JSON reader improvements for Spark-RAPIDS Jun 7, 2023
@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment and removed 2 - In Progress Currently a work in progress labels Aug 2, 2023
@GregoryKimball GregoryKimball changed the title [FEA] Story - JSON reader improvements for Spark-RAPIDS [FEA] JSON reader improvements for Spark-RAPIDS Mar 12, 2024
@revans2
Contributor

revans2 commented Mar 15, 2024

@GregoryKimball From the Spark perspective, the following are in priority order. This is based mostly on how likely I think it is that a customer would see these problems/limitations, and also on whether we have a workaround that would let us enable the JSON parsing functionality by default without this change, even if it is limited functionality.

Blocker:

  1. [BUG] mixed_type_as_string throws exception for nested data with nested STRING schema request #15260
  2. [FEA] Support casting of LIST type to STRING in JSON #15278
  3. [FEA] Find a way to support String column input/fixup for JSON parsing #15277
  4. [BUG] JSON white space normalization removes too much for unquoted values #15280
  5. [FEA] JSON parsing is not handling escaped single quote the same as Spark #15303
  6. [FEA] Options to validate JSON fields #15222 - This is likely going to need to be broken down into smaller pieces, not all of which are going to be blockers. I also think we need to decide on the best way to support this, because there will be a performance impact for others that don't want validation like this.
  7. [FEA] JSON number normalization when returned as a string #15318 - I don't want to mark this as a blocker, but we have a customer that insists on it. We are in the process of trying to develop normalization code that would work, but much of the problem is how we would integrate it with the existing JSON parsing code.
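For item 6 above, the strictness Spark expects for numbers can be captured by the JSON grammar itself (RFC 8259), which forbids leading zeros and bare tokens like NaN or +INF. The following is a hypothetical post-processing validator sketch, not a proposed cuDF API:

```python
import re

# Hypothetical validator sketched after the behavior described in #15222:
# Spark (with default options) rejects numbers with leading zeros and
# non-standard tokens such as NaN or +INF. This regex is the strict
# RFC 8259 number production.
_JSON_NUMBER = re.compile(r"-?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?\Z")

def is_valid_spark_number(tok: str) -> bool:
    return _JSON_NUMBER.match(tok) is not None
```

Running a check like this over a raw token stream is the kind of validation hook the table above proposes exposing to Spark-RAPIDS.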

Non-Blocker:

  1. [BUG] JSON reader fails to parse files with empty rows #5712 - I think I can work around this, but it will end up being a performance hit if we don't have a better way to deal with it.
  2. [FEA] have an option for the schema to filter the columns read from JSON #14951 / [BUG] JSON reader has no option to return the columns only for the requested schema #13473 - performance optimization (I think these might be dupes of each other)
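The semantics requested in #14951/#13473 are simply that keys absent from the requested schema are never materialized as output columns. A hypothetical CPU stand-in using plain dicts (the function name and signature here are invented for illustration, not a cuDF API):

```python
import json

# Hypothetical illustration of schema-based column pruning: with a schema
# supplied, keys outside it are never materialized; without one, all keys
# found in the data become columns.
def read_json_lines(lines, schema=None):
    rows = [json.loads(line) for line in lines]
    cols = schema if schema is not None else sorted({k for r in rows for k in r})
    return {k: [r.get(k) for r in rows] for k in cols}
```

In libcudf the win is larger than in this sketch, because pruned keys can skip tokenization and column creation entirely rather than being parsed and discarded.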

@GregoryKimball
Contributor Author

Thank you @revans2 for summarizing your investigation. We've been studying these requirements and we would like to continue the discussion with you next week.

libcudf will soon address:
1, 2, 5

libcudf is doing design work on:
emitting raw strings (helps with 6, 7)
moving whitespace normalization after tokenization (helps with 4)

libcudf suggests that 3 is a non-blocker.

@revans2
Contributor

revans2 commented Mar 18, 2024

Like I said, I can work around 3, but I don't know how to make it performant without help from CUDF, and we have seen this in actual customer data. Perhaps I can write a custom kernel myself that looks at quotes and replaces values inside quotes versus outside of quotes as needed. I'll see.

@GregoryKimball
Contributor Author

We had more discussions on the JSON compatibility issues and identified "multi-line" support as a blocker (relates to 3 above). We don't currently have a way to process a strings column as JSON Lines when the rows contain unquoted newline characters. Also our whitespace normalization can't remove unquoted newline characters. (See #10267 and #15277 for related requests)
