[FEA] support json to struct function #8174

cindyyuanjiang · 2023-04-23T23:22:02Z

We add support for from_json to a StructType.

Example is as follows:

scala> val initial_df = Seq((1, """{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}"""))
initial_df: Seq[(Int, String)] = List((1,{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}))

scala> val df = initial_df.toDF("id", "value")
df: org.apache.spark.sql.DataFrame = [id: int, value: string]

scala> df.show(false)
+---+--------------------------------------------------------------------------+
|id |value                                                                     |
+---+--------------------------------------------------------------------------+
|1  |{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}|
+---+--------------------------------------------------------------------------+


scala> import org.apache.spark.sql.types.{StringType, StructType}
import org.apache.spark.sql.types.{StringType, StructType}

scala> val schema = new StructType().add("Zipcode", StringType, true).add("State", StringType, true)
schema: org.apache.spark.sql.types.StructType = StructType(StructField(Zipcode,StringType,true), StructField(State,StringType,true))

scala> val df_struct = df.withColumn("value",from_json(col("value"),schema))
df_struct: org.apache.spark.sql.DataFrame = [id: int, value: struct<Zipcode: string, State: string>]

scala> df_struct.show(false)
+---+---------+
|id |value    |
+---+---------+
|1  |{704, PR}|
+---+---------+

Note: we reuse some mechanism from #6211.
Follow up issue for JSON parsing is tracked here: #8204.

Signed-off-by: Cindy Jiang <cindyj@nvidia.com>

…ation Signed-off-by: Cindy Jiang <cindyj@nvidia.com>

Signed-off-by: Cindy Jiang <cindyj@nvidia.com>

cindyyuanjiang · 2023-04-23T23:33:18Z

build

revans2 · 2023-04-25T21:14:31Z

integration_tests/src/main/python/json_test.py

+                                    StructType([StructField("d", StringType())]),
+                                    StructType([StructField("a", StringType()), StructField("b", StringType())]),
+                                    StructType([StructField("c", LongType()), StructField("a", StringType())]),
+                                    StructType([StructField("a", StringType()), StructField("a", StringType())])


I only see tests for String, Long, and Struct. If we say that we support the other types we really should have tests for them. This needs to include things like STRUCT of STRUCTs and STURCTs of LISTS.

If we say that we support all of the types in our meta object, then we need tests for all of the data types that JSON supports in Spark

https://github.com/apache/spark/blob/4a238cd9d8e80eed06732fc52b1456cb5ece6652/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L193-L385

I personally would rather see us start with a few simple types and add more as we add tests for them. So if we have tests for String, Int, array and struct, then we should only say that we support those types. We can add in support for boolean, byte, short, long, decimal (which needs to include multiple precision and scale types), Float, Double, Timestamp, TimestampNTZ, Date, Binary, CalendarInterval, YearMonthInterval, DayTimeInterval, UDT and NullTypes when a customer/management asks for them or when we have tests that show that they are working correctly.

revans2 · 2023-04-25T21:17:15Z

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala

@@ -3373,14 +3373,15 @@ object GpuOverrides extends Logging {
    expr[JsonToStructs](
      "Returns a struct value with the given `jsonStr` and `schema`",
      ExprChecks.projectOnly(
-        TypeSig.MAP.nested(TypeSig.STRING),
+        (TypeSig.STRUCT + TypeSig.MAP).nested(TypeSig.all),


We need a way so that MAP type is only supported if it is a MAP<STRING, STRING> and only if it is at the top level. Some of this can be done with a change to this line. But we need more than this and ideally have some tests to verify that we do fall back properly.

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuJsonToStructs.scala

revans2 · 2023-04-25T21:29:52Z

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuJsonToStructs.scala

+            val end = combinedHost.getEndListOffset(0)
+            val length = end - start
+
+            withResource(cudf.Table.readJSON(cudf.JSONOptions.DEFAULT, data, start,


This is having CUDF do name and type inference. Is that really what we want? Should we do this like we do for regular JSON parsing? (Never mind turns out we do the same thing in the JSON reader??? What are we doing that? It is a huge waste of memory. Can we please file a follow on issue to fix it both here and in the JSON reader. Bonus points if we can combine the code reader code together.

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuJsonToStructs.scala

…ore tests Signed-off-by: Cindy Jiang <cindyj@nvidia.com>

cindyyuanjiang · 2023-04-27T09:20:50Z

build

revans2 · 2023-04-27T14:36:33Z

integration_tests/src/main/python/json_test.py

+                                    StructType([StructField("d", StringType())]),
+                                    StructType([StructField("a", StringType()), StructField("b", StringType())]),
+                                    StructType([StructField("c", LongType()), StructField("a", StringType())]),
+                                    StructType([StructField("a", StringType()), StructField("a", StringType())])


If we say that we support all of the types in our meta object, then we need tests for all of the data types that JSON supports in Spark

https://github.com/apache/spark/blob/4a238cd9d8e80eed06732fc52b1456cb5ece6652/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L193-L385

I personally would rather see us start with a few simple types and add more as we add tests for them. So if we have tests for String, Int, array and struct, then we should only say that we support those types. We can add in support for boolean, byte, short, long, decimal (which needs to include multiple precision and scale types), Float, Double, Timestamp, TimestampNTZ, Date, Binary, CalendarInterval, YearMonthInterval, DayTimeInterval, UDT and NullTypes when a customer/management asks for them or when we have tests that show that they are working correctly.

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala

revans2 · 2023-04-27T14:43:53Z

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala

@@ -3373,14 +3373,24 @@ object GpuOverrides extends Logging {
    expr[JsonToStructs](
      "Returns a struct value with the given `jsonStr` and `schema`",
      ExprChecks.projectOnly(
-        TypeSig.MAP.nested(TypeSig.STRING),
+        TypeSig.MAP.nested(TypeSig.STRING) + TypeSig.STRUCT.nested(TypeSig.all),


We want a psNote for MAP to indicate what is really supported.

TypeSig.MAP.nested(TypeSig.STRING).withPsNote(TypeEnum.MAP, "MAP only supports keys and values that are of STRING type") + TypeSig.STRUCT.nested(TypeSig.all)

This is because the type checking has limitations. It does not keep the children of the MAP type and the children of a STRUCT type separate. So we want to make sure the auto-generated docs make it clear that we are doing extra checks.

revans2 · 2023-04-27T15:01:52Z

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuJsonToStructs.scala

+              if (i == -1) {
+                GpuColumnVector.columnVectorFromNull(numRows, dtype)
+              } else {
+                rawTable.getColumn(i).castTo(GpuColumnVector.getRapidsType(dtype))


Please use GpuCast instead of castTo. IT does a lot more checking to make sure that we parse things correctly/etc.

To add to this point, should add integration tests to test any failure conditions that are a result of casting. For example, if you pass in JSON that has arbitrary large number that you specify in your schema to use IntegerType, the same overflow exception that Spark would throw should be checked for.

I updated the implementation to use GpuCast.doCast. I still observed incompatible CPU vs GPU behavior with arbitrarily large numbers in input JSON strings with IntegerType schema: CPU implementation returns null, and GPU implementation casts the large numbers to integers. I documented this observation in the compatibility.md file. I have not yet found a way to work around this while not breaking current existing tests. If it is important to be compatible, I will investigate more into this. Thanks!

integration_tests/src/main/python/json_test.py

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuJsonToStructs.scala

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuJsonToStructs.scala

Signed-off-by: Cindy Jiang <cindyj@nvidia.com>

cindyyuanjiang · 2023-04-28T07:32:58Z

build

revans2 · 2023-04-28T15:47:42Z

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuJsonToStructs.scala

 import com.nvidia.spark.rapids.jni.MapUtils

 import org.apache.spark.sql.catalyst.expressions.{ExpectsInputTypes, Expression, NullIntolerant, TimeZoneAwareExpression}
-import org.apache.spark.sql.types.{AbstractDataType, DataType, StringType}
+// import org.apache.spark.sql.types.{AbstractDataType, DataType, MapType, StringType, StructType}


nit: delete commented out code.

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuJsonToStructs.scala

revans2 · 2023-04-28T15:52:55Z

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuJsonToStructs.scala

+    val existingNames = Set[String]()
+    names.foldRight(Seq[(String, DataType)]())((elem, acc) => {
+      val (name, dtype) = elem
+      if (existingNames(name)) (null, dtype)+:acc else {existingNames += name; (name, dtype)+:acc}})


nit: could we make the formatting less dense so it is simpler to read?

names.foldRight(Seq.empty) { (elem, acc) => val (name, dtype) = elem if (existingNames(name)) { (null, dtype) +: acc } else { existingNames += name (name, dtype) +: acc } }

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuJsonToStructs.scala

Signed-off-by: Cindy Jiang <cindyj@nvidia.com>

cindyyuanjiang · 2023-04-28T21:31:25Z

build

gerashegalov

LGTM, pending other reviews

cindyyuanjiang added 5 commits April 17, 2023 10:38

experimenting with json to struct

ab4fd03

Signed-off-by: Cindy Jiang <cindyj@nvidia.com>

added json to struct function implementation and integration tests

c91be11

Signed-off-by: Cindy Jiang <cindyj@nvidia.com>

fixed memory leak in json to struct implementation and added document…

1488f78

…ation Signed-off-by: Cindy Jiang <cindyj@nvidia.com>

Merge branch 'branch-23.06' into gpu-json-to-struct

4abc067

fixed Arm import

0fd5b0b

Signed-off-by: Cindy Jiang <cindyj@nvidia.com>

cindyyuanjiang self-assigned this Apr 23, 2023

cindyyuanjiang added the feature request New feature or request label Apr 23, 2023

cindyyuanjiang linked an issue Apr 23, 2023 that may be closed by this pull request

[FEA] support json to struct function #7867

Closed

cindyyuanjiang marked this pull request as ready for review April 24, 2023 18:00

cindyyuanjiang changed the title ~~WIP: [FEA] support json to struct function~~ [FEA] support json to struct function Apr 24, 2023

revans2 reviewed Apr 25, 2023

View reviewed changes

gerashegalov reviewed Apr 25, 2023

View reviewed changes

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuJsonToStructs.scala Show resolved Hide resolved

updated json to struct implementation to cast return type and added m…

4050d91

…ore tests Signed-off-by: Cindy Jiang <cindyj@nvidia.com>

revans2 reviewed Apr 27, 2023

View reviewed changes

gerashegalov reviewed Apr 27, 2023

View reviewed changes

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala Outdated Show resolved Hide resolved

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuJsonToStructs.scala Show resolved Hide resolved

cindyyuanjiang added 2 commits April 27, 2023 16:27

updated json to struct supported types and documentation

a1ee3d1

Signed-off-by: Cindy Jiang <cindyj@nvidia.com>

updated implementation to use GpuCast and updated tests

de2d793

Signed-off-by: Cindy Jiang <cindyj@nvidia.com>

revans2 reviewed Apr 28, 2023

View reviewed changes

gerashegalov reviewed Apr 28, 2023

View reviewed changes

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuJsonToStructs.scala Outdated Show resolved Hide resolved

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuJsonToStructs.scala Outdated Show resolved Hide resolved

fixed code style and removed unused imports

05e5ba4

Signed-off-by: Cindy Jiang <cindyj@nvidia.com>

cindyyuanjiang mentioned this pull request Apr 28, 2023

Fix JSON parsing in json to struct function #8204

Closed

added some comments for getSparkType function

906b36b

Signed-off-by: Cindy Jiang <cindyj@nvidia.com>

cindyyuanjiang requested review from revans2 and gerashegalov April 28, 2023 21:05

gerashegalov approved these changes Apr 29, 2023

View reviewed changes

revans2 approved these changes May 1, 2023

View reviewed changes

sameerz merged commit 2b2835e into NVIDIA:branch-23.06 May 1, 2023

cindyyuanjiang deleted the gpu-json-to-struct branch May 1, 2023 17:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] support json to struct function #8174

[FEA] support json to struct function #8174

cindyyuanjiang commented Apr 23, 2023 •

edited

Loading

cindyyuanjiang commented Apr 23, 2023

revans2 Apr 25, 2023

revans2 Apr 27, 2023

revans2 Apr 25, 2023

revans2 Apr 25, 2023

cindyyuanjiang commented Apr 27, 2023

revans2 Apr 27, 2023

revans2 Apr 27, 2023

revans2 Apr 27, 2023

NVnavkumar Apr 27, 2023

cindyyuanjiang Apr 28, 2023

cindyyuanjiang commented Apr 28, 2023

revans2 Apr 28, 2023

revans2 Apr 28, 2023

cindyyuanjiang commented Apr 28, 2023

gerashegalov left a comment

[FEA] support json to struct function #8174

[FEA] support json to struct function #8174

Conversation

cindyyuanjiang commented Apr 23, 2023 • edited Loading

cindyyuanjiang commented Apr 23, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cindyyuanjiang commented Apr 27, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cindyyuanjiang commented Apr 28, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cindyyuanjiang commented Apr 28, 2023

gerashegalov left a comment

Choose a reason for hiding this comment

cindyyuanjiang commented Apr 23, 2023 •

edited

Loading