
[FEA] support json to struct function #8174

Merged: 10 commits merged into NVIDIA:branch-23.06 on May 1, 2023

Conversation

@cindyyuanjiang (Collaborator) commented on Apr 23, 2023

We add support for from_json to a StructType.

An example follows:

scala> val initial_df = Seq((1, """{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}"""))
initial_df: Seq[(Int, String)] = List((1,{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}))

scala> val df = initial_df.toDF("id", "value")
df: org.apache.spark.sql.DataFrame = [id: int, value: string]

scala> df.show(false)
+---+--------------------------------------------------------------------------+
|id |value                                                                     |
+---+--------------------------------------------------------------------------+
|1  |{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}|
+---+--------------------------------------------------------------------------+


scala> import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.functions.{col, from_json}

scala> import org.apache.spark.sql.types.{StringType, StructType}
import org.apache.spark.sql.types.{StringType, StructType}

scala> val schema = new StructType().add("Zipcode", StringType, true).add("State", StringType, true)
schema: org.apache.spark.sql.types.StructType = StructType(StructField(Zipcode,StringType,true), StructField(State,StringType,true))

scala> val df_struct = df.withColumn("value",from_json(col("value"),schema))
df_struct: org.apache.spark.sql.DataFrame = [id: int, value: struct<Zipcode: string, State: string>]

scala> df_struct.show(false)
+---+---------+
|id |value    |
+---+---------+
|1  |{704, PR}|
+---+---------+
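
As a hedged aside (not part of the original session), the parsed struct's fields can then be selected directly; with the data above this yields 704 and PR:

scala> df_struct.select(col("value.Zipcode"), col("value.State")).show(false)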

Note: we reuse some of the mechanisms from #6211.
A follow-up issue for JSON parsing is tracked in #8204.

@cindyyuanjiang self-assigned this on Apr 23, 2023
@cindyyuanjiang added the "feature request" (New feature or request) label on Apr 23, 2023
@cindyyuanjiang linked an issue on Apr 23, 2023 that may be closed by this pull request
@cindyyuanjiang (Collaborator, Author) commented:

build

@cindyyuanjiang marked this pull request as ready for review on Apr 24, 2023 18:00
@cindyyuanjiang changed the title from "WIP: [FEA] support json to struct function" to "[FEA] support json to struct function" on Apr 24, 2023
StructType([StructField("d", StringType())]),
StructType([StructField("a", StringType()), StructField("b", StringType())]),
StructType([StructField("c", LongType()), StructField("a", StringType())]),
StructType([StructField("a", StringType()), StructField("a", StringType())])
Collaborator:
I only see tests for String, Long, and Struct. If we say that we support the other types, we really should have tests for them. This needs to include things like STRUCTs of STRUCTs and STRUCTs of LISTs.
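
For illustration, a hedged Scala sketch (not from the PR; names are illustrative) of the nested shapes the comment asks to cover:

import org.apache.spark.sql.types._

// STRUCT of STRUCT, e.g. {"a": {"b": "x"}}
val structOfStruct = new StructType()
  .add("a", new StructType().add("b", StringType))

// STRUCT of LIST, e.g. {"a": ["x", "y"]}
val structOfList = new StructType()
  .add("a", ArrayType(StringType))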

Collaborator:
If we say that we support all of the types in our meta object, then we need tests for all of the data types that JSON supports in Spark:

https://github.com/apache/spark/blob/4a238cd9d8e80eed06732fc52b1456cb5ece6652/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L193-L385

I personally would rather see us start with a few simple types and add more as we add tests for them. So if we have tests for String, Int, array and struct, then we should only say that we support those types. We can add in support for boolean, byte, short, long, decimal (which needs to include multiple precision and scale types), Float, Double, Timestamp, TimestampNTZ, Date, Binary, CalendarInterval, YearMonthInterval, DayTimeInterval, UDT and NullTypes when a customer/management asks for them or when we have tests that show that they are working correctly.
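
As a hedged illustration of the decimal point in particular (not code from the PR), the tests would need schemas at more than one precision and scale:

import org.apache.spark.sql.types._

// Two decimal shapes: one that fits in a 64-bit representation and one that
// requires 128 bits, so both code paths get exercised
val decimalSchemas = Seq(
  new StructType().add("d", DecimalType(7, 2)),
  new StructType().add("d", DecimalType(38, 10)))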

@@ -3373,14 +3373,15 @@ object GpuOverrides extends Logging {
expr[JsonToStructs](
"Returns a struct value with the given `jsonStr` and `schema`",
ExprChecks.projectOnly(
TypeSig.MAP.nested(TypeSig.STRING),
(TypeSig.STRUCT + TypeSig.MAP).nested(TypeSig.all),
Collaborator:
We need a way to ensure that the MAP type is only supported if it is a MAP<STRING, STRING>, and only if it is at the top level. Some of this can be done with a change to this line, but we need more than that, and ideally some tests to verify that we fall back properly.
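
A hypothetical sketch of the kind of extra check meant here, following the plugin's usual meta-tagging pattern (the exact integration point and message are illustrative, not from the PR):

import org.apache.spark.sql.types.{MapType, StringType}

// Inside the expression meta for JsonToStructs: accept only a top-level
// MAP<STRING, STRING>; tag anything else to fall back to the CPU.
e.schema match {
  case MapType(StringType, StringType, _) => // supported map shape
  case _: MapType =>
    willNotWorkOnGpu("from_json on the GPU only supports MAP<STRING, STRING>")
  case _ => // STRUCT schemas are covered by the type signature checks
}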

val end = combinedHost.getEndListOffset(0)
val length = end - start

withResource(cudf.Table.readJSON(cudf.JSONOptions.DEFAULT, data, start,
Collaborator:
This is having cuDF do name and type inference. Is that really what we want? Should we do this the way we do for regular JSON parsing? (Never mind, it turns out we do the same thing in the JSON reader. Why are we doing that? It is a huge waste of memory.) Can we please file a follow-on issue to fix it both here and in the JSON reader? Bonus points if we can combine the reader code.

@cindyyuanjiang (Collaborator, Author) commented:

build


@@ -3373,14 +3373,24 @@ object GpuOverrides extends Logging {
expr[JsonToStructs](
"Returns a struct value with the given `jsonStr` and `schema`",
ExprChecks.projectOnly(
TypeSig.MAP.nested(TypeSig.STRING),
TypeSig.MAP.nested(TypeSig.STRING) + TypeSig.STRUCT.nested(TypeSig.all),
Collaborator:
We want a psNote for MAP to indicate what is really supported.

TypeSig.MAP.nested(TypeSig.STRING).withPsNote(TypeEnum.MAP, "MAP only supports keys and values that are of STRING type") + TypeSig.STRUCT.nested(TypeSig.all)

This is because the type checking has limitations. It does not keep the children of the MAP type and the children of a STRUCT type separate. So we want to make sure the auto-generated docs make it clear that we are doing extra checks.

if (i == -1) {
GpuColumnVector.columnVectorFromNull(numRows, dtype)
} else {
rawTable.getColumn(i).castTo(GpuColumnVector.getRapidsType(dtype))
Collaborator:
Please use GpuCast instead of castTo. It does a lot more checking to make sure that we parse things correctly, etc.

Collaborator:
To add to this point, we should add integration tests for any failure conditions that result from casting. For example, if you pass in JSON that has an arbitrarily large number and your schema specifies IntegerType, the same overflow exception that Spark would throw should be checked for.

Collaborator (Author):
I updated the implementation to use GpuCast.doCast. I still observed incompatible CPU vs GPU behavior for arbitrarily large numbers in the input JSON strings with an IntegerType schema: the CPU implementation returns null, while the GPU implementation casts the large numbers to integers. I documented this observation in compatibility.md. I have not yet found a way to work around this without breaking the existing tests. If compatibility here is important, I will investigate further. Thanks!
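
A minimal spark-shell sketch of the divergence described above (the data and schema here are assumptions; the CPU-null vs GPU-truncation results are as reported in the comment):

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{IntegerType, StructType}

val intSchema = new StructType().add("a", IntegerType)
val overflow = Seq("""{"a": 99999999999999999999}""").toDF("value")
  .select(from_json(col("value"), intSchema).alias("parsed"))
// CPU: parsed is {null}; GPU (per the comment above): a truncated integer
overflow.show(false)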

@cindyyuanjiang (Collaborator, Author) commented:

build

import com.nvidia.spark.rapids.jni.MapUtils

import org.apache.spark.sql.catalyst.expressions.{ExpectsInputTypes, Expression, NullIntolerant, TimeZoneAwareExpression}
import org.apache.spark.sql.types.{AbstractDataType, DataType, StringType}
// import org.apache.spark.sql.types.{AbstractDataType, DataType, MapType, StringType, StructType}
Collaborator:
nit: delete the commented-out code.

val existingNames = Set[String]()
names.foldRight(Seq[(String, DataType)]())((elem, acc) => {
val (name, dtype) = elem
if (existingNames(name)) (null, dtype)+:acc else {existingNames += name; (name, dtype)+:acc}})
Collaborator:
nit: could we make the formatting less dense so it is simpler to read?

names.foldRight(Seq.empty[(String, DataType)]) { (elem, acc) =>
  val (name, dtype) = elem
  if (existingNames(name)) {
    (null, dtype) +: acc
  } else {
    existingNames += name
    (name, dtype) +: acc
  }
}

@cindyyuanjiang (Collaborator, Author) commented:

build

@gerashegalov (Collaborator) left a comment:

LGTM, pending other reviews

@sameerz merged commit 2b2835e into NVIDIA:branch-23.06 on May 1, 2023
@cindyyuanjiang deleted the gpu-json-to-struct branch on May 1, 2023 17:01
Labels: feature request (New feature or request)
Linked issues that may be closed by this pull request: [FEA] support json to struct function
5 participants