
Use new getJsonObject kernel for json_tuple #10635

Merged: 9 commits merged into NVIDIA:branch-24.06 on Apr 25, 2024

Conversation

thirtiseven
Collaborator

@thirtiseven thirtiseven commented Mar 26, 2024

This PR updates json_tuple with new getJsonObject kernel.

All currently xfailed cases now pass:

```
./integration_tests/run_pyspark_from_build.sh -s -k json_tuple
......
============= 16 passed, 13 xpassed, 8 warnings in 49.36s ============
```

I expect the performance will not be great, because this approach calls the getJsonObject kernel once per requested field, and that kernel is not very fast by itself.

With the new json_parser in JNI, I think we can implement a dedicated kernel for json_tuple that extracts all fields in one pass, for much higher performance. So even if this PR gets merged, it is a short-term workaround.
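For context, `json_tuple(col, f1, …, fn)` returns the same values as n separate top-level `get_json_object` calls, which is roughly how this short-term approach maps onto the kernel. An illustrative spark-shell sketch (the column and field names here are hypothetical, not from this PR's tests):

```scala
// json_tuple extracts several top-level fields in one expression...
df.selectExpr("json_tuple(a, 'owner', 'email')")

// ...and is semantically equivalent to one get_json_object call per field
// with a top-level path, which is what this PR issues against the
// getJsonObject kernel today (hence one kernel launch per field).
df.selectExpr(
  "get_json_object(a, '$.owner') AS c0",
  "get_json_object(a, '$.email') AS c1"
)
```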

Depends on NVIDIA/spark-rapids-jni#1893

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven
Collaborator Author

A quick perf test:

```scala
val data = Seq.fill(3000000)("""{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"basket":[[1,2,{"b":"y","a":"x"}],[3,4],[5,6]],"book":[{"author":"Nigel Rees","title":"Sayings of the Century","category":"reference","price":8.95},{"author":"Herman Melville","title":"Moby Dick","category":"fiction","price":8.99,"isbn":"0-553-21311-3"},{"author":"J. R. R. Tolkien","title":"The Lord of the Rings","category":"fiction","reader":[{"age":25,"name":"bob"},{"age":26,"name":"jack"}],"price":22.99,"isbn":"0-395-19395-8"}],"bicycle":{"price":19.95,"color":"red"}},"email":"amy@only_for_json_udf_test.net","owner":"amy","zip code":"94025","fb:testid":"1234"}""")

import spark.implicits._
data.toDF("a").write.mode("overwrite").parquet("JSON")

val df = spark.read.parquet("JSON")

spark.conf.set("spark.rapids.sql.expression.JsonTuple", true)

spark.time(df.selectExpr("json_tuple(a, 'store', 'reader', 'bicycle', 'owner', 'zip code', 'email')").count())
```

6 fields:
CPU: 65216 ms
GPU: 7851 ms

1 field:
CPU: 68050 ms
GPU: 2070 ms

Wow, so it is actually quite fast. Not sure if I tested it right.

@revans2
Collaborator

revans2 commented Mar 26, 2024

> Wow, so it is actually quite fast. Not sure if I tested it right.

A bit of feedback on the quick test:

  1. It looks like you only ran it once. Cold runs are usually a lot slower than hot runs, but even then the speedup is notable.
  2. All of your data is identical, which means there is no thread divergence on the GPU.
  3. You don't mention what CPU/system was used, so it is hard to tell whether it is a fair comparison.
  4. It would be nice to see how fast the parquet read alone is compared to the CPU. All of the performance gains might be in that, just because it is a single string column repeated 3,000,000 times.
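Following the points above, a fairer micro-benchmark would warm up before timing and measure the scan on its own. A hedged spark-shell sketch (`timeIt` is a hypothetical helper, not part of the codebase):

```scala
// Hypothetical helper: run the workload several times so cold-start cost
// (JIT compilation, kernel load, page cache) shows up separately from
// the steady-state hot runs.
def timeIt[A](label: String, runs: Int = 3)(f: => A): Unit =
  (1 to runs).foreach { i =>
    val t0 = System.nanoTime()
    f
    println(f"$label run $i: ${(System.nanoTime() - t0) / 1e6}%.1f ms")
  }

// Baseline: how much of the total time is just the parquet read?
timeIt("read only")(df.count())

// The actual workload, timed over hot runs as well as the cold one.
timeIt("json_tuple")(
  df.selectExpr("json_tuple(a, 'owner', 'email')").count())
```

Comparing the "read only" baseline against the "json_tuple" runs separates the scan cost from the JSON-extraction cost, which addresses point 4 directly.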

revans2
revans2 previously approved these changes Mar 26, 2024
Collaborator

@revans2 revans2 left a comment


Might also be nice to have a follow on issue to see if we can drop the special field name checks.

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven thirtiseven self-assigned this Mar 27, 2024
@thirtiseven thirtiseven marked this pull request as ready for review March 27, 2024 06:34
@thirtiseven
Collaborator Author

thirtiseven commented Mar 27, 2024

> Might also be nice to have a follow on issue to see if we can drop the special field name checks.

Updated; the special field name checks are safe to drop.

@sameerz sameerz added the task Work required that improves the product but is not user facing label Mar 27, 2024
@thirtiseven
Collaborator Author

build

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven
Collaborator Author

build

@thirtiseven thirtiseven changed the base branch from branch-24.04 to branch-24.06 April 2, 2024 02:27
revans2
revans2 previously approved these changes Apr 24, 2024
@thirtiseven
Collaborator Author

build

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven
Collaborator Author

Verified again to generate docs. It seems that ./build/buildall only generates docs per shim, not the main ones.

@thirtiseven
Collaborator Author

build

revans2
revans2 previously approved these changes Apr 24, 2024
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven
Collaborator Author

build

@revans2 revans2 merged commit 63088f1 into NVIDIA:branch-24.06 Apr 25, 2024
43 of 44 checks passed
3 participants