
[BUG] GetJsonObject removes leading space characters #10215

Closed · revans2 opened this issue Jan 18, 2024 · 1 comment · Fixed by #10466
Labels: bug (Something isn't working)

revans2 (Collaborator) commented Jan 18, 2024

Describe the bug
Spark and https://jsonpath.com/ agree that whitespace should not be stripped from a JSON path. Our implementation treats $. a and $.a as the same path, but no other implementation I have found does that. What is more, even when the name is quoted we do the same thing: $[' b'] is treated the same as $['b'].
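
For illustration, here is a minimal spark-shell sketch of the two candidate semantics (the JSON document and key names are made up): if whitespace in the path is significant, $. a and $[' a'] should match the key " a"; if it is stripped, both behave like $.a and match the key "a". The repro in the comment below shows which behavior each implementation actually exhibits.

val json = """{" a":"with-space","a":"no-space"}"""
spark.range(1).selectExpr(
  s"get_json_object('$json', '$$. a') AS dotted",           // key " a" if path spaces are significant
  s"get_json_object('$json', '$$[\\' a\\']') AS bracketed", // quoted form of the same question
  s"get_json_object('$json', '$$.a') AS plain"              // key "a" under either reading
).show(false)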

revans2 added the labels "bug (Something isn't working)" and "? - Needs Triage (Need team to review and classify)" on Jan 18, 2024
revans2 changed the title from "[BUG] GetJsonPath removes leading space characters" to "[BUG] GetJsonObject removes leading space characters" on Jan 23, 2024
mattahrens removed the "? - Needs Triage" label on Jan 30, 2024
thirtiseven self-assigned this on Feb 22, 2024
thirtiseven (Collaborator) commented Feb 22, 2024

I ran some tests on the current plugin with Spark 3.4.1. The GPU's results match https://jsonpath.com/'s, but not the CPU's, which treats $. a and $.a as the same path:

scala> val df = Seq("""{" a":"b","b":"c"}""", """{"a":"b","b":"c"}""", """{"a":"b"," a":"c"}""").toDF("value")
df: org.apache.spark.sql.DataFrame = [value: string]

scala> df.write.mode("overwrite").parquet("json_obj_data")

scala> val df2 = spark.read.parquet("json_obj_data")
df2: org.apache.spark.sql.DataFrame = [value: string]

scala> df2.selectExpr("value", "get_json_object(value, '$[\\\' a\\\']') as ba", "get_json_object(value, '$. a') as sa", "get_json_object(value, '$.a') as a").show()
+------------------+----+----+----+
|             value|  ba|  sa|   a|
+------------------+----+----+----+
|{" a":"b","b":"c"}|null|null|null|
|{"a":"b"," a":"c"}|   b|   b|   b|
| {"a":"b","b":"c"}|   b|   b|   b|
+------------------+----+----+----+


scala> spark.conf.set("spark.rapids.sql.enabled", true)

scala> spark.conf.set("spark.rapids.sql.expression.GetJsonObject", true)

scala> df2.selectExpr("value", "get_json_object(value, '$[\\\' a\\\']') as ba", "get_json_object(value, '$. a') as sa", "get_json_object(value, '$.a') as a").show()
24/02/22 14:50:49 WARN GpuOverrides:
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
  *Exec <ProjectExec> will run on GPU
    *Expression <Alias> get_json_object(value#94, $[' a']) AS ba#133 will run on GPU
      *Expression <GetJsonObject> get_json_object(value#94, $[' a']) will run on GPU
    *Expression <Alias> get_json_object(value#94, $. a) AS sa#134 will run on GPU
      *Expression <GetJsonObject> get_json_object(value#94, $. a) will run on GPU
    *Expression <Alias> get_json_object(value#94, $.a) AS a#135 will run on GPU
      *Expression <GetJsonObject> get_json_object(value#94, $.a) will run on GPU
    *Exec <FileSourceScanExec> will run on GPU

+------------------+----+----+----+
|             value|  ba|  sa|   a|
+------------------+----+----+----+
|{" a":"b","b":"c"}|   b|   b|null|
|{"a":"b"," a":"c"}|   c|   c|   b|
| {"a":"b","b":"c"}|null|null|   b|
+------------------+----+----+----+

Not sure if it's a bug in Spark. We could match the CPU behavior in cuDF as another get_json_object option, or do the pre-processing in the plugin.
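
For the pre-processing option, a rough sketch (not actual plugin or cuDF code; the helper name is made up) of rewriting the path on the plugin side to mimic the CPU's whitespace stripping before handing the path to cuDF:

// Sketch only: a real implementation would want a proper JSONPath parser.
// This regex version ignores escaped quotes and dots inside quoted names.
def normalizePathLikeCpu(path: String): String =
  path
    .replaceAll("""\.\s+""", ".")                            // "$. a"    -> "$.a"
    .replaceAll("""\[\s*'\s*([^']*?)\s*'\s*\]""", "['$1']")  // "$[' a']" -> "$['a']"

normalizePathLikeCpu("$. a")     // "$.a"
normalizePathLikeCpu("$[' a']")  // "$['a']"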
