[FEA] support json to struct function #8174
Conversation
Signed-off-by: Cindy Jiang <cindyj@nvidia.com>
…ation Signed-off-by: Cindy Jiang <cindyj@nvidia.com>
build
StructType([StructField("d", StringType())]),
StructType([StructField("a", StringType()), StructField("b", StringType())]),
StructType([StructField("c", LongType()), StructField("a", StringType())]),
StructType([StructField("a", StringType()), StructField("a", StringType())])
I only see tests for String, Long, and Struct. If we say that we support the other types, we really should have tests for them. This needs to include things like STRUCTs of STRUCTs and STRUCTs of LISTs.
If we say that we support all of the types in our meta object, then we need tests for all of the data types that JSON supports in Spark
I personally would rather see us start with a few simple types and add more as we add tests for them. So if we have tests for String, Int, array and struct, then we should only say that we support those types. We can add in support for boolean, byte, short, long, decimal (which needs to include multiple precision and scale types), Float, Double, Timestamp, TimestampNTZ, Date, Binary, CalendarInterval, YearMonthInterval, DayTimeInterval, UDT and NullTypes when a customer/management asks for them or when we have tests that show that they are working correctly.
@@ -3373,14 +3373,15 @@ object GpuOverrides extends Logging {
     expr[JsonToStructs](
       "Returns a struct value with the given `jsonStr` and `schema`",
       ExprChecks.projectOnly(
-        TypeSig.MAP.nested(TypeSig.STRING),
+        (TypeSig.STRUCT + TypeSig.MAP).nested(TypeSig.all),
We need a way to ensure that the MAP type is only supported if it is a MAP<STRING, STRING>, and only if it is at the top level. Some of this can be done with a change to this line, but we need more than that, and ideally some tests to verify that we do fall back properly.
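To make the requested rule concrete, here is a minimal pure-Python sketch of the check. The class names are simplified stand-ins for Spark's type objects, not the plugin's actual Scala `TypeSig` machinery; they are only meant to illustrate "MAP allowed at top level only, and only as MAP<STRING, STRING>".

```python
# Simplified stand-ins for Spark's DataType classes (illustrative only).
class StringType:
    pass

class MapType:
    def __init__(self, key_type, value_type):
        self.key_type = key_type
        self.value_type = value_type

class StructType:
    def __init__(self, fields):
        self.fields = fields  # list of (name, dtype) pairs

def is_supported_output(dtype, top_level=True):
    """MAP is only allowed at the top level, and only as MAP<STRING, STRING>."""
    if isinstance(dtype, MapType):
        return (top_level
                and isinstance(dtype.key_type, StringType)
                and isinstance(dtype.value_type, StringType))
    if isinstance(dtype, StructType):
        # Recurse with top_level=False: nested structs are fine, nested maps are not.
        return all(is_supported_output(f, top_level=False)
                   for _, f in dtype.fields)
    return True  # leaf types pass this particular check

# A top-level MAP<STRING, STRING> passes; a MAP nested inside a STRUCT does not.
assert is_supported_output(MapType(StringType(), StringType()))
assert not is_supported_output(
    StructType([("m", MapType(StringType(), StringType()))]))
```

Anything the predicate rejects would fall back to the CPU, which is the behavior the tests should verify.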
val end = combinedHost.getEndListOffset(0)
val length = end - start

withResource(cudf.Table.readJSON(cudf.JSONOptions.DEFAULT, data, start,
This is having cuDF do name and type inference. Is that really what we want? Should we do this like we do for regular JSON parsing? (Never mind, it turns out we do the same thing in the JSON reader. Why are we doing that? It is a huge waste of memory.) Can we please file a follow-on issue to fix it both here and in the JSON reader? Bonus points if we can combine the reader code.
…ore tests Signed-off-by: Cindy Jiang <cindyj@nvidia.com>
build
@@ -3373,14 +3373,24 @@ object GpuOverrides extends Logging {
     expr[JsonToStructs](
       "Returns a struct value with the given `jsonStr` and `schema`",
       ExprChecks.projectOnly(
-        TypeSig.MAP.nested(TypeSig.STRING),
+        TypeSig.MAP.nested(TypeSig.STRING) + TypeSig.STRUCT.nested(TypeSig.all),
We want a psNote for MAP to indicate what is really supported.
TypeSig.MAP.nested(TypeSig.STRING).withPsNote(TypeEnum.MAP, "MAP only supports keys and values that are of STRING type") + TypeSig.STRUCT.nested(TypeSig.all)
This is because the type checking has limitations. It does not keep the children of the MAP type and the children of a STRUCT type separate. So we want to make sure the auto-generated docs make it clear that we are doing extra checks.
if (i == -1) {
  GpuColumnVector.columnVectorFromNull(numRows, dtype)
} else {
  rawTable.getColumn(i).castTo(GpuColumnVector.getRapidsType(dtype))
Please use GpuCast instead of castTo. It does a lot more checking to make sure that we parse things correctly, etc.
To add to this point, we should add integration tests for any failure conditions that result from casting. For example, if you pass in JSON containing an arbitrarily large number for a field whose schema type is IntegerType, we should check for the same overflow exception that Spark would throw.
I updated the implementation to use GpuCast.doCast. I still observed incompatible CPU vs GPU behavior with arbitrarily large numbers in input JSON strings under an IntegerType schema: the CPU implementation returns null, while the GPU implementation casts the large numbers to integers. I documented this observation in the compatibility.md file. I have not yet found a way to work around this without breaking current existing tests. If it is important to be compatible, I will investigate further. Thanks!
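For reference, the CPU behavior described above can be modeled in a few lines of pure Python. This is only an illustration of the semantics the GPU side would need to match (out-of-range values become null), not the plugin's actual code path or Spark's implementation:

```python
import json

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def cpu_style_parse_int_field(json_str, field):
    """Mimic the observed CPU from_json behavior for an IntegerType field:
    a value that does not fit in a 32-bit signed integer yields null (None)
    instead of being cast or wrapped. Illustrative stand-in only."""
    try:
        obj = json.loads(json_str)
    except ValueError:
        return None
    value = obj.get(field)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(value, int) or isinstance(value, bool):
        return None
    if value < INT32_MIN or value > INT32_MAX:
        return None  # overflow -> null, matching the CPU result described above
    return value

assert cpu_style_parse_int_field('{"a": 7}', "a") == 7
assert cpu_style_parse_int_field('{"a": 99999999999999999999}', "a") is None
```

An integration test comparing CPU and GPU output on such inputs would catch the incompatibility automatically.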
Signed-off-by: Cindy Jiang <cindyj@nvidia.com>
build
import com.nvidia.spark.rapids.jni.MapUtils

import org.apache.spark.sql.catalyst.expressions.{ExpectsInputTypes, Expression, NullIntolerant, TimeZoneAwareExpression}
import org.apache.spark.sql.types.{AbstractDataType, DataType, StringType}
// import org.apache.spark.sql.types.{AbstractDataType, DataType, MapType, StringType, StructType}
nit: delete commented out code.
val existingNames = Set[String]()
names.foldRight(Seq[(String, DataType)]())((elem, acc) => {
  val (name, dtype) = elem
  if (existingNames(name)) (null, dtype)+:acc else {existingNames += name; (name, dtype)+:acc}})
nit: could we make the formatting less dense so it is simpler to read?
names.foldRight(Seq.empty[(String, DataType)]) { (elem, acc) =>
  val (name, dtype) = elem
  if (existingNames(name)) {
    (null, dtype) +: acc
  } else {
    existingNames += name
    (name, dtype) +: acc
  }
}
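To make the intent of this fold explicit: because it walks the fields from right to left, the last occurrence of a duplicated field name keeps its name and every earlier occurrence is nulled out. A pure-Python mirror of the same logic (illustrative only, not the plugin code):

```python
def dedupe_field_names(fields):
    """Walk (name, dtype) pairs from right to left so that, for a duplicated
    name, the last occurrence keeps its name and earlier occurrences are
    replaced with None (null). Mirrors the Scala foldRight above."""
    seen = set()
    out = []
    for name, dtype in reversed(fields):
        if name in seen:
            out.append((None, dtype))
        else:
            seen.add(name)
            out.append((name, dtype))
    return list(reversed(out))

# Distinct names pass through; for a duplicate, only the last keeps its name.
assert dedupe_field_names([("a", "string"), ("b", "string")]) == \
    [("a", "string"), ("b", "string")]
assert dedupe_field_names([("a", "string"), ("a", "long")]) == \
    [(None, "string"), ("a", "long")]
```

This matches the duplicate-name schema in the tests, e.g. `StructType([StructField("a", ...), StructField("a", ...)])`.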
Signed-off-by: Cindy Jiang <cindyj@nvidia.com>
build
LGTM, pending other reviews
We add support for `from_json` to a StructType. An example is as follows:
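(The original example did not survive the page capture. As a rough, hypothetical illustration of what `from_json` with a StructType schema computes, here is a pure-Python model: parse the JSON string and project the requested fields, yielding null for missing fields and a null row for malformed input. This is not the Spark API or the plugin implementation.)

```python
import json

def from_json_struct(json_str, field_names):
    """Simplified model of from_json(col, StructType(...)): parse one JSON
    document and project the named fields; absent fields become None, and a
    malformed document yields a null struct (PERMISSIVE-like behavior)."""
    try:
        obj = json.loads(json_str)
    except ValueError:
        return None
    return tuple(obj.get(name) for name in field_names)

# Requesting fields "a" and "b" from a document that only contains "a".
assert from_json_struct('{"a": "hello"}', ["a", "b"]) == ("hello", None)
assert from_json_struct('not json', ["a"]) is None
```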
Note: we reuse some of the mechanism from #6211.
Follow up issue for JSON parsing is tracked here: #8204.