[FEA] Support parsing JSON data to include a Map type #14288
Comments
Just for reference, here is a link to the code we currently have for parsing Map<String,String>: https://github.com/NVIDIA/spark-rapids-jni/blob/branch-23.12/src/main/cpp/src/map_utils.cu
If the Map type is going to be represented by a struct with a "key" column and a "value" column, then the "value" column cannot hold different types, so all values would be forced to the same type. Is that also currently the case with the Parquet and ORC map type support?
Addresses part of #14288.
Depends on #14939 (mixed type ignore nulls fix).

In the input schema, if a struct column is given as STRING type, it's forced to be a STRING column. This could be used to support the map type in the Spark JSON reader: force a map type to be a STRING, then use a different parser to extract this string column as key and value columns. To enable this forcing, mixed type as string must be enabled in json_reader_options.

Authors:
- Karthikeyan (https://github.com/karthikeyann)
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Andy Grove (https://github.com/andygrove)
- Mike Wilson (https://github.com/hyperbolic2346)
- Shruti Shivakumar (https://github.com/shrshi)
- Bradley Dice (https://github.com/bdice)

URL: #14936
I believe this is solved by #14936
Is your feature request related to a problem? Please describe.
CUDF does not support maps directly, so I know this is a bit of a stretch. Parquet and ORC reading and writing in CUDF do support maps. When a map is read from a Parquet or ORC file, it is returned as a List of Structs, where each struct has two columns: a key and a value. Writing is similar: it takes a column formatted as a List of Structs with a key and a value column, and inside the metadata for the table write we tag it as being a map so that the Parquet/ORC types can reflect that properly.
See cudf/cpp/include/cudf/io/types.hpp, line 570 in d590e0b.
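To make the List-of-Struct layout concrete, here is a minimal sketch in plain Python. It is not CUDF code: `MapColumn`, `from_rows`, and `row` are hypothetical names invented for illustration, and null rows (validity) are left out. It only shows the idea that a map column is a list column whose child is a struct with one key column and one value column, each of a single type.

```python
from dataclasses import dataclass, field


@dataclass
class MapColumn:
    """Hypothetical stand-in for a List<Struct<key, value>> column."""
    # Row i owns the entries in the half-open range offsets[i]:offsets[i + 1].
    offsets: list = field(default_factory=lambda: [0])
    keys: list = field(default_factory=list)    # child "key" column
    values: list = field(default_factory=list)  # child "value" column


def from_rows(rows):
    """Flatten a list of dicts into offsets plus key/value child columns."""
    col = MapColumn()
    for entries in rows:
        for k, v in entries.items():
            col.keys.append(k)
            col.values.append(v)
        col.offsets.append(len(col.keys))
    return col


def row(col, i):
    """Rebuild the (key, value) pairs belonging to row i."""
    lo, hi = col.offsets[i], col.offsets[i + 1]
    return list(zip(col.keys[lo:hi], col.values[lo:hi]))


col = from_rows([{"a": "1", "b": "2"}, {}, {"c": "3"}])
```

Because all values share one child column, every map value in this layout has to be the same type, which is why Map<String,String> is the natural first target.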
Spark also supports reading JSON data as a Map. Spark translates a JSON object into a Map: the keys of the Map are the names of the fields in the object, and the values are the corresponding field values. These Maps could appear at any level in the schema that Spark requests.
Describe the solution you'd like
From talking to @karthikeyann, it sounds like full support for parsing Maps is very different from what they do anywhere else, so a general solution is likely going to be difficult to do in a single step. The plan is to try to support this in multiple steps.
Step 1. Support Map<String,String> only.
#14239 will give us the ability to get back a JSON value formatted as a String. If we have the ability to request that a specific key, or the top level object, be parsed as a Map<String,String> then we could take the value column and recursively parse it.
So, for example, let's say we have JSON data like the following, and we want to parse it as a Map<String, Map<String, String>>
Spark can do this with something like.
In this case the result would be
The nulls are there because the values for "A" in the other rows cannot be parsed as a Map, but we will get to that. So if we have the ability to first parse it just as a Map<String,String> we can recursively call into that parsing to build up what we need.
Step one would be to parse the top level as a map of string to string, and then we would have to parse the returned values as a Map<String,String> as well. This mostly works, but we would need some cleanup afterwards to replace error columns with nulls at the top level.
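The two-pass idea can be sketched on the CPU with Python's standard `json` module. This is only an illustration of the recursion, not the CUDF or Spark implementation: the function names and the sample rows are hypothetical, and `json.dumps` re-serializes values, so the "raw" strings are normalized rather than byte-for-byte copies of the input.

```python
import json


def raw_values(text):
    """Pass 1: map each top-level key to its value kept as JSON text.

    Returns None when the input is not a JSON object.
    """
    if text is None:
        return None
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    return {k: json.dumps(v, separators=(",", ":")) for k, v in obj.items()}


def to_map_str_str(text):
    """Parse one JSON object as Map<String,String>, or None on error."""
    raw = raw_values(text)
    if raw is None:
        return None
    out = {}
    for k, v in raw.items():
        decoded = json.loads(v)
        if decoded is None or isinstance(decoded, str):
            out[k] = decoded  # JSON null stays None, strings are unquoted
        else:
            out[k] = v        # numbers, objects, arrays stay as JSON text
    return out


def to_nested_map(text):
    """Pass 2: re-parse every value as Map<String,String>.

    Values that are not JSON objects become None (the cleanup step).
    """
    raw = raw_values(text)
    if raw is None:
        return None
    return {k: to_map_str_str(v) for k, v in raw.items()}


# Hypothetical sample rows, not the data from the original issue.
rows = ['{"A": {"B": "C", "D": 1}}', '{"A": 5}', 'not json']
results = [to_nested_map(r) for r in rows]
```

In this sketch the second row yields a null for "A" because `5` cannot be parsed as a map, and the third row is null at the top level because it is not a JSON object at all.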
Step 2 would be to try to expand this out so that the Map could be nested. That would be an optimization over the recursive parsing done after step 1.
Describe alternatives you've considered
We thought about writing our own parser from scratch, and we have already done some of that for Map<String,String> at the top level. I would really rather not have to expand it to support all types, which is what I think we would have to do if we could not get support for maps added to CUDF.