[FEA] Support parsing JSON data to include a Map type #14288

Closed
Tracked by #9458
revans2 opened this issue Oct 16, 2023 · 4 comments
Labels
3 - Ready for Review Ready for review by team cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@revans2
Contributor

revans2 commented Oct 16, 2023

Is your feature request related to a problem? Please describe.
CUDF does not support maps directly, so I know this is a bit of a stretch. Parquet and ORC reading and writing in CUDF do support maps, though. When a map is read from a Parquet or ORC file, it is returned as a List of Structs, where the struct has two columns: a key and a value. Writing is similar: it takes a column formatted as a List of Structs with a key and a value column, and inside the metadata for the table write we tag the column as a map so that the Parquet/ORC types reflect that properly.

bool _list_column_is_map = false;
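
To make that layout concrete, here is a minimal host-side sketch of a map column stored as a List of Structs. It is plain C++ rather than libcudf code: `MapColumn`, `append_row`, and `row` are hypothetical names used only for illustration, and the validity masks that real libcudf columns carry are not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical host-side model of a map column stored as List<Struct<key, value>>.
// Row i owns the entries in the half-open range [offsets[i], offsets[i + 1]).
struct MapColumn {
  using Entry = std::pair<std::string, std::optional<std::string>>;

  std::vector<std::size_t> offsets{0};             // list offsets, size = rows + 1
  std::vector<std::string> keys;                   // struct child 0
  std::vector<std::optional<std::string>> values;  // struct child 1 (nullable)
  bool list_column_is_map = true;  // assumption: mirrors the writer metadata tag

  std::size_t num_rows() const { return offsets.size() - 1; }

  // Appends one row; an empty vector produces an empty map for that row.
  void append_row(const std::vector<Entry>& entries) {
    for (const auto& e : entries) {
      keys.push_back(e.first);
      values.push_back(e.second);
    }
    offsets.push_back(keys.size());
  }

  // Rebuilds the key/value pairs that belong to row i.
  std::vector<Entry> row(std::size_t i) const {
    assert(i < num_rows());
    std::vector<Entry> out;
    for (std::size_t j = offsets[i]; j < offsets[i + 1]; ++j) {
      out.emplace_back(keys[j], values[j]);
    }
    return out;
  }
};
```

Because every entry of every row lands in the same `values` child, all values in one map column share a single type, which is why Map<String,String> is the natural first target.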

Spark also supports reading JSON data as a Map. Spark translates a JSON object into a map: the keys of the Map are the names of the fields in the object, and the values are the values of those fields. These Maps could appear at any level in the schema that Spark requests.

Describe the solution you'd like
Talking to @karthikeyann, full support for parsing Maps is very different from what they do anywhere else, so a general solution is likely going to be difficult to do in a single step. The plan is to try to support this in multiple steps.

Step 1. Support Map<String,String> only.

#14239 will give us the ability to get back a JSON value formatted as a String. If we have the ability to request that a specific key, or the top level object, be parsed as a Map<String,String> then we could take the value column and recursively parse it.

So, for example, let's say we have JSON data like the following, and we want to parse it as a Map<String, Map<String, String>>:

{"A":"B", "C": {"C_1": 100, "C_2": [1, 2, 3, 4, 5], "C_3": null}}
{"A": 90, "C": [1, 2, 3]}
{"C": {"C_1": "test", "C_90": "something", "C_3": "else"}}

Spark can do this with something like:

data.withColumn("parsed", from_json(col("json"), MapType(StringType, MapType(StringType, StringType))))

In this case the result would be

null
null
{C -> {C_1 -> test, C_90 -> something, C_3 -> else}}

The nulls are there because the values for "A" in the other rows cannot be parsed as a Map, but we will get to that. So if we have the ability to first parse it just as a Map<String,String> we can recursively call into that parsing to build up what we need.

The first pass would parse the top level as a Map<String,String>, and then we would have to parse the returned values as a Map<String,String> as well. This mostly works, but we would need some cleanup afterwards to replace error columns with nulls at the top level.

data.withColumn("parsed_lvl_1", from_json(col("json"), MapType(StringType, StringType))).withColumn("parsed", transform_values(col("parsed_lvl_1"), (k: Column, v: Column) => from_json(v,MapType(StringType, StringType)))).show(false)
+-----------------------------------------------------------------+-------------------------------------------------------+---------------------------------------------------------------+
|json                                                             |parsed_lvl_1                                           |parsed                                                         |
+-----------------------------------------------------------------+-------------------------------------------------------+---------------------------------------------------------------+
|{"A":"B", "C": {"C_1": 100, "C_2": [1, 2, 3, 4, 5], "C_3": null}}|{A -> B, C -> {"C_1":100,"C_2":[1,2,3,4,5],"C_3":null}}|{A -> null, C -> {C_1 -> 100, C_2 -> [1,2,3,4,5], C_3 -> null}}|
|{"A": 90, "C": [1, 2, 3]}                                        |{A -> 90, C -> [1,2,3]}                                |{A -> null, C -> null}                                         |
|{"C": {"C_1": "test", "C_90": "something", "C_3": "else"}}       |{C -> {"C_1":"test","C_90":"something","C_3":"else"}}  |{C -> {C_1 -> test, C_90 -> something, C_3 -> else}}           |
+-----------------------------------------------------------------+-------------------------------------------------------+---------------------------------------------------------------+
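
The two-pass idea can be sketched on the host in plain C++. This is only an illustration under stated assumptions, not the libcudf reader and not the `map_utils.cu` implementation: `parse_map`, `parse_nested_map`, `StringMap`, and `NestedMap` are hypothetical names, the scanner is lenient rather than validating, escapes are handled in a simplified way, and whitespace outside strings is dropped so nested values resemble the `parsed_lvl_1` column above. The top-level cleanup that turns failed rows into nulls is not modeled.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Ordered key -> nullable string value (a JSON null becomes std::nullopt).
using StringMap = std::vector<std::pair<std::string, std::optional<std::string>>>;
using NestedMap = std::vector<std::pair<std::string, std::optional<StringMap>>>;

namespace detail {
inline void skip_ws(const std::string& s, std::size_t& i) {
  while (i < s.size() && std::isspace(static_cast<unsigned char>(s[i]))) ++i;
}

// Reads a quoted string at s[i] and appends the unquoted text to `out`.
// Simplification: a backslash just makes the next character literal.
inline bool read_string(const std::string& s, std::size_t& i, std::string& out) {
  if (i >= s.size() || s[i] != '"') return false;
  ++i;
  while (i < s.size() && s[i] != '"') {
    if (s[i] == '\\' && i + 1 < s.size()) ++i;
    out.push_back(s[i++]);
  }
  if (i >= s.size()) return false;  // unterminated string
  ++i;                              // consume the closing quote
  return true;
}

// Copies one non-string value as raw text, dropping whitespace outside strings.
inline bool read_raw(const std::string& s, std::size_t& i, std::string& out) {
  int depth = 0;
  bool in_string = false;
  for (; i < s.size(); ++i) {
    char c = s[i];
    if (in_string) {
      out.push_back(c);
      if (c == '\\' && i + 1 < s.size()) out.push_back(s[++i]);
      else if (c == '"') in_string = false;
      continue;
    }
    if (depth == 0 && (c == ',' || c == '}')) break;  // end of a scalar
    if (std::isspace(static_cast<unsigned char>(c))) continue;
    if (c == '"') in_string = true;
    if (c == '{' || c == '[') ++depth;
    if (c == '}' || c == ']') --depth;
    out.push_back(c);
    if (depth == 0 && (c == '}' || c == ']')) { ++i; break; }  // end of nested value
  }
  return depth == 0 && !in_string && !out.empty();
}
}  // namespace detail

// Pass 1: parse the top level as Map<String,String>.
// Returns std::nullopt when the text is not a JSON object.
inline std::optional<StringMap> parse_map(const std::string& json) {
  std::size_t i = 0;
  detail::skip_ws(json, i);
  if (i >= json.size() || json[i] != '{') return std::nullopt;
  ++i;
  StringMap result;
  detail::skip_ws(json, i);
  while (i < json.size() && json[i] != '}') {
    std::string key, value;
    if (!detail::read_string(json, i, key)) return std::nullopt;
    detail::skip_ws(json, i);
    if (i >= json.size() || json[i] != ':') return std::nullopt;
    ++i;
    detail::skip_ws(json, i);
    if (i < json.size() && json[i] == '"') {
      if (!detail::read_string(json, i, value)) return std::nullopt;
      result.emplace_back(key, value);
    } else {
      if (!detail::read_raw(json, i, value)) return std::nullopt;
      if (value == "null") result.emplace_back(key, std::nullopt);
      else result.emplace_back(key, value);
    }
    detail::skip_ws(json, i);
    if (i >= json.size() || json[i] != ',') break;
    ++i;
    detail::skip_ws(json, i);
  }
  if (i >= json.size() || json[i] != '}') return std::nullopt;
  ++i;
  detail::skip_ws(json, i);
  if (i != json.size()) return std::nullopt;
  return result;
}

// Pass 2: re-parse every value of pass 1 as Map<String,String>.
// Values that are null or not JSON objects become null entries.
inline std::optional<NestedMap> parse_nested_map(const std::string& json) {
  auto top = parse_map(json);
  if (!top) return std::nullopt;
  NestedMap out;
  for (const auto& [key, value] : *top) {
    out.emplace_back(key, value ? parse_map(*value) : std::nullopt);
  }
  return out;
}
```

Calling `parse_nested_map` on each input row plays the role of the `transform_values` expression above: a value that is not a JSON object, such as `90` or `[1,2,3]`, comes back as a null entry.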

Step 2 would be to try to expand this so that the Map could be nested, but that would be an optimization over the recursive parsing done after step 1.

Describe alternatives you've considered
We thought about writing our own parser from scratch, and we have done some of that already for Map<String,String> at the top level. I would really rather not have to expand it to support all types, which is what I think we would have to do if we could not get support for maps added to CUDF.

@revans2 revans2 added feature request New feature or request Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels Oct 16, 2023
@revans2
Contributor Author

revans2 commented Oct 16, 2023

Just for reference here is a link to the code we currently have for parsing Map<String,String>

https://github.com/NVIDIA/spark-rapids-jni/blob/branch-23.12/src/main/cpp/src/map_utils.cu

@GregoryKimball GregoryKimball added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue and removed Needs Triage Need team to review and classify labels Nov 9, 2023
@GregoryKimball GregoryKimball added this to the Nested JSON reader milestone Nov 9, 2023
@karthikeyann
Contributor

If the Map type is going to be represented by a struct with a "key" column and a "value" column, then the "value" column cannot hold different types, so all values would be forced to the same type. Is that also the current behavior of the Parquet and ORC map type support?

@karthikeyann karthikeyann changed the title [FEA] Support parsing JSON data to inlcude a Map type [FEA] Support parsing JSON data to include a Map type Feb 6, 2024
@karthikeyann
Contributor

PR #14936 allows a struct to be forced to string through schema (when mixed types as string is enabled).
With #14954, it's possible to input a nested schema. So, it's possible to force any map to be read as a string. This string column could then be decoded to a map type using the existing utility.

@karthikeyann karthikeyann added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Feb 7, 2024
rapids-bot bot pushed a commit that referenced this issue Mar 11, 2024
Addresses part of #14288
Depends on #14939 (mixed type ignore nulls fix)

In the input schema, if a struct column is given as STRING type, it's forced to be a STRING column.
This could be used to support the map type in the Spark JSON reader. (Force a map type to be a STRING, and use a different parser to extract this string column as key and value columns.)
To enable this forcing, mixed type as string should be enabled in json_reader_options.

Authors:
  - Karthikeyan (https://github.com/karthikeyann)
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Andy Grove (https://github.com/andygrove)
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Shruti Shivakumar (https://github.com/shrshi)
  - Bradley Dice (https://github.com/bdice)

URL: #14936
@GregoryKimball
Contributor

I believe this is solved by #14936


3 participants