
[BUG] GetJsonObject removes leading space characters #10215

Closed · revans2 opened this issue Jan 18, 2024 · 1 comment · Fixed by #10466
Labels: bug (Something isn't working)

revans2 (Collaborator) commented Jan 18, 2024

Describe the bug
Spark and https://jsonpath.com/ agree that whitespace should not be stripped from a JSON path. Our implementation treats $. a and $.a as the same path, but no other implementation I have found does that. What is more, even when the name is quoted we do the same thing: $[' b'] is treated the same as $['b'].
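
For illustration, here is a minimal spark-shell sketch of the two candidate semantics (the JSON document and key names are made up): if whitespace in the path is significant, $. a and $[' a'] should match the key " a"; if it is stripped, both behave like $.a and match the key "a". The repro in the comment below shows which behavior each implementation actually exhibits.

val json = """{" a":"with-space","a":"no-space"}"""
spark.range(1).selectExpr(
  s"get_json_object('$json', '$$. a') AS dotted",           // key " a" if path spaces are significant
  s"get_json_object('$json', '$$[\\' a\\']') AS bracketed", // quoted form of the same question
  s"get_json_object('$json', '$$.a') AS plain"              // key "a" under either reading
).show(false)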

revans2 added the labels "bug (Something isn't working)" and "? - Needs Triage (Need team to review and classify)" on Jan 18, 2024
revans2 changed the title from "[BUG] GetJsonPath removes leading space characters" to "[BUG] GetJsonObject removes leading space characters" on Jan 23, 2024
mattahrens removed the "? - Needs Triage" label on Jan 30, 2024
thirtiseven self-assigned this on Feb 22, 2024
thirtiseven (Collaborator) commented Feb 22, 2024

I ran some tests on the current plugin with Spark 3.4.1. The GPU's results match https://jsonpath.com/'s, but not the CPU's, which treats $. a and $.a as the same path:

scala> val df = Seq("""{" a":"b","b":"c"}""", """{"a":"b","b":"c"}""", """{"a":"b"," a":"c"}""").toDF("value")
df: org.apache.spark.sql.DataFrame = [value: string]

scala> df.write.mode("overwrite").parquet("json_obj_data")

scala> val df2 = spark.read.parquet("json_obj_data")
df2: org.apache.spark.sql.DataFrame = [value: string]

scala> df2.selectExpr("value", "get_json_object(value, '$[\\\' a\\\']') as ba", "get_json_object(value, '$. a') as sa", "get_json_object(value, '$.a') as a").show()
+------------------+----+----+----+
|             value|  ba|  sa|   a|
+------------------+----+----+----+
|{" a":"b","b":"c"}|null|null|null|
|{"a":"b"," a":"c"}|   b|   b|   b|
| {"a":"b","b":"c"}|   b|   b|   b|
+------------------+----+----+----+


scala> spark.conf.set("spark.rapids.sql.enabled", true)

scala> spark.conf.set("spark.rapids.sql.expression.GetJsonObject", true)

scala> df2.selectExpr("value", "get_json_object(value, '$[\\\' a\\\']') as ba", "get_json_object(value, '$. a') as sa", "get_json_object(value, '$.a') as a").show()
24/02/22 14:50:49 WARN GpuOverrides:
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
  *Exec <ProjectExec> will run on GPU
    *Expression <Alias> get_json_object(value#94, $[' a']) AS ba#133 will run on GPU
      *Expression <GetJsonObject> get_json_object(value#94, $[' a']) will run on GPU
    *Expression <Alias> get_json_object(value#94, $. a) AS sa#134 will run on GPU
      *Expression <GetJsonObject> get_json_object(value#94, $. a) will run on GPU
    *Expression <Alias> get_json_object(value#94, $.a) AS a#135 will run on GPU
      *Expression <GetJsonObject> get_json_object(value#94, $.a) will run on GPU
    *Exec <FileSourceScanExec> will run on GPU

+------------------+----+----+----+
|             value|  ba|  sa|   a|
+------------------+----+----+----+
|{" a":"b","b":"c"}|   b|   b|null|
|{"a":"b"," a":"c"}|   c|   c|   b|
| {"a":"b","b":"c"}|null|null|   b|
+------------------+----+----+----+

Not sure if it's a bug in Spark. We could match the CPU behavior in cuDF as another get_json_object option, or do the pre-processing in the plugin.
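
For the pre-processing option, a rough sketch (not actual plugin or cuDF code; the helper name is made up) of rewriting the path on the plugin side to mimic the CPU's whitespace stripping before handing the path to cuDF:

// Sketch only: a real implementation would want a proper JSONPath parser.
// This regex version ignores escaped quotes and dots inside quoted names.
def normalizePathLikeCpu(path: String): String =
  path
    .replaceAll("""\.\s+""", ".")                            // "$. a"    -> "$.a"
    .replaceAll("""\[\s*'\s*([^']*?)\s*'\s*\]""", "['$1']")  // "$[' a']" -> "$['a']"

normalizePathLikeCpu("$. a")     // "$.a"
normalizePathLikeCpu("$[' a']")  // "$['a']"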
