
Enable to_date (via gettimestamp and casting timestamp to date) for non-UTC time zones #10100

Merged (15 commits into NVIDIA:branch-24.02, Jan 4, 2024)

Conversation

@NVnavkumar (Collaborator) commented Dec 27, 2023

Fixes #9927.

This enables to_date for non-UTC time zones. to_date(str, fmt) is actually an alias for cast(gettimestamp(str, fmt) as date), so this enables casting timestamp to date and enables non-UTC time zones for gettimestamp (which is essentially the parent of the same algorithm used in unix_timestamp).
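As a quick illustration of that alias (a minimal sketch; the exact plan text varies by Spark version), the rewritten expression is visible in the query plan:

scala> spark.conf.set("spark.sql.session.timeZone", "Iran")
scala> spark.sql("SELECT to_date('2024-01-04', 'yyyy-MM-dd')").explain(true)
// The analyzed/optimized plan shows something like
//   cast(gettimestamp(2024-01-04, yyyy-MM-dd, ...) as date)
// so supporting to_date in a non-UTC zone requires both gettimestamp and the
// timestamp-to-date cast to be time-zone aware on the GPU.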

@NVnavkumar (Collaborator, Author)

build

@NVnavkumar (Collaborator, Author)

build

@NVnavkumar (Collaborator, Author) commented Dec 28, 2023

Some performance testing results:

    NUM ROWS     GPU (ms)     CPU (ms)              SPEEDUP
        1000          177          250   1.4124293785310735
        2500           38           37   0.9736842105263158
        5000           40           39                0.975
       10000           44           37   0.8409090909090909
       20000           45           27                  0.6
       50000           53           71   1.3396226415094339
      100000           85          168   1.9764705882352942
      250000          134          267    1.992537313432836
      500000          176          463   2.6306818181818183
     1000000          293          955     3.25938566552901

In this case, I set the session time zone to Iran and generated a parquet file with a single column a of randomly generated date strings in the yyyy-MM-dd format within the valid ANSI range (0001-01-01 to 9999-12-31). I then read that parquet file and ran to_date(a, 'yyyy-MM-dd') on that column in a CPU run and a GPU run, using sparkmeasure to measure the elapsed time of the queries. Before running the test I also ran a query using from_utc_timestamp to load the time zone database, since it is still lazily loaded at this point and that would otherwise skew the performance data here.
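Roughly, the harness looked like the sketch below (not the exact script: the parquet path is a placeholder, the CPU/GPU toggle via spark.rapids.sql.enabled happens outside the snippet, and sparkmeasure's StageMetrics is assumed for the elapsed-time measurement):

import ch.cern.sparkmeasure.StageMetrics

spark.conf.set("spark.sql.session.timeZone", "Iran")

// Warm-up: force the time zone database to load so its lazy initialization
// does not skew the measurement below.
spark.sql("SELECT from_utc_timestamp(TIMESTAMP'2023-01-01 00:00:00', 'Iran')").collect()

val stageMetrics = StageMetrics(spark)
stageMetrics.runAndMeasure {
  spark.read.parquet("/tmp/date_strings.parquet")   // placeholder path
    .selectExpr("to_date(a, 'yyyy-MM-dd') as d")
    .write.mode("overwrite").format("noop").save()  // noop sink just forces execution
}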

@NVnavkumar NVnavkumar marked this pull request as ready for review December 28, 2023 06:45
@NVnavkumar (Collaborator, Author)

build

@winningsix (Collaborator)

Do we have any idea why, at row num = 20000, we saw lower perf acceleration? Will this perf result vary across different measurements?

@revans2 (Collaborator) commented Dec 28, 2023

> Do we have any idea why, at row num = 20000, we saw lower perf acceleration? Will this perf result vary across different measurements?

That is way too little data to be a good benchmark. 20,000 rows is less than 300 KB of data. I don't know how many row groups there are, but the overhead of launching kernels is likely where most of the time is being spent here. I would much rather see a test like:

spark.range(0, 400000000L, 1, 12).selectExpr("date_from_unix_date(id % 700000) as d").selectExpr("d", "CAST(d as STRING) as ds").write.mode("overwrite").parquet("./target/TMP")

Then we can try to isolate just the to_date as much as possible with:

spark.time(spark.read.parquet("./target/TMP").selectExpr("to_date(ds, 'yyyy-MM-dd') as date").selectExpr("MAX(date)", "MIN(date)").show())

and

spark.time(spark.read.parquet("./target/TMP").selectExpr("d as date").selectExpr("MAX(date)", "MIN(date)").show())

For UTC on the GPU I get:
to_date => 1459, 1478, 1393, 1420, 1393 - median 1420
just min/max => 811, 831, 783, 753, 764 - median 783, or about 637 ms for just the to_date, which is not too far off from the approximately 1 s op time I see on the GPU for that ProjectExec.

For the CPU I get:
to_date => 30273, 29722, 29750, 29645, 29612 - median 29722
just min/max => 673, 570, 536, 529, 541 - median 541, or about 29,200 ms for just the to_date.

Or we could just compare UTC runs against Iran runs for the GPU, and separately for the CPU, to get an idea of the overhead that Iran adds to the operation.
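Something like the sketch below would do that comparison, reusing the query above and changing only the session time zone between runs:

spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.time(spark.read.parquet("./target/TMP").selectExpr("to_date(ds, 'yyyy-MM-dd') as date").selectExpr("MAX(date)", "MIN(date)").show())

spark.conf.set("spark.sql.session.timeZone", "Iran")
spark.time(spark.read.parquet("./target/TMP").selectExpr("to_date(ds, 'yyyy-MM-dd') as date").selectExpr("MAX(date)", "MIN(date)").show())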

@NVnavkumar (Collaborator, Author) commented Dec 29, 2023

Using the approach suggested by @revans2, I generated the parquet file using

spark.range(0, num_rows, 1, 12).selectExpr("date_from_unix_date(id % 700000) as d").selectExpr("d", "CAST(d as STRING) as ds").write.mode("overwrite").parquet(tmpfile)

and then measured the difference between the total time of just min(date)/max(date) and the total time of to_date plus min/max.
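Concretely, the differencing looks roughly like this (a sketch; tmpfile is the path used above and the timing helper is a stand-in for the measured elapsed times):

// Simple wall-clock helper (stand-in for the real measurement).
def elapsedMs(f: => Unit): Long = {
  val start = System.nanoTime(); f; (System.nanoTime() - start) / 1000000
}

val baselineMs = elapsedMs {
  spark.read.parquet(tmpfile).selectExpr("d as date")
    .selectExpr("MAX(date)", "MIN(date)").show()
}
val withToDateMs = elapsedMs {
  spark.read.parquet(tmpfile).selectExpr("to_date(ds, 'yyyy-MM-dd') as date")
    .selectExpr("MAX(date)", "MIN(date)").show()
}
val toDateCostMs = withToDateMs - baselineMs  // time attributable to to_date itself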

Results:

    NUM ROWS     GPU (ms)     CPU (ms)              SPEEDUP
   100000000          143        14866   103.95804195804196
   200000000          448        28896                 64.5
   400000000          883        58499    66.25028312570781
   800000000         1483       115150    77.64666217127444

I verified the last result in the Spark UI; the 1483 ms is consistent with the 1.4 s shown for the GpuProject operator here:

(Spark UI screenshot, 2023-12-28, showing the GpuProject operator timing)

@NVnavkumar (Collaborator, Author)

Moving this to draft; it looks like there is an issue in casting timestamp to date that causes off-by-one-day results when making it time-zone aware.
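For illustration, this is the kind of case that goes off by one (a minimal sketch; Iran is UTC+03:30, so local times before 03:30 fall on the previous UTC calendar date):

scala> spark.conf.set("spark.sql.session.timeZone", "Iran")
scala> spark.sql("SELECT CAST(TIMESTAMP'2024-01-01 01:00:00' AS DATE)").show()
// Expected: 2024-01-01. The underlying instant is 2023-12-31 21:30 UTC,
// so a cast that ignores the session time zone returns 2023-12-31,
// i.e. one day off.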

@NVnavkumar NVnavkumar marked this pull request as draft December 29, 2023 02:22
@NVnavkumar NVnavkumar marked this pull request as ready for review December 29, 2023 07:17
@NVnavkumar (Collaborator, Author) commented Dec 29, 2023

Updated results after the fix for casting timestamp to date in non-UTC time zones. The time zone tested was Iran:

    NUM ROWS     GPU (ms)     CPU (ms)              SPEEDUP
   100000000          195        14563    74.68205128205128
   200000000          456        27386    60.05701754385965
   400000000          847        52842    62.38724911452184
   800000000         1581       107031    67.69829222011386

Verified the last item (800,000,000 rows) in the Spark UI:

(Spark UI screenshot, 2023-12-29)

@revans2 (Collaborator) commented Dec 29, 2023

build

revans2 previously approved these changes Dec 29, 2023
@NVnavkumar (Collaborator, Author)

build

@sameerz added the feature request label Dec 29, 2023
@NVnavkumar (Collaborator, Author)

Premerge failure due to an issue with the GPU time zone database and unit tests; filed #10129 to track.

@NVnavkumar (Collaborator, Author)

This is blocked until NVIDIA/spark-rapids-jni#1670 is merged.

@NVnavkumar (Collaborator, Author) commented Jan 2, 2024

Also fixes #10006

scala> val df = Seq("2023-12-31 23:59:59").toDF("ts")
scala> spark.conf.set("spark.sql.session.timeZone", "Asia/Shanghai")
scala> df.selectExpr("to_timestamp(ts)").show()
24/01/02 21:30:26 WARN GpuOverrides:
! <LocalTableScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.LocalTableScanExec
  @Expression <AttributeReference> toprettystring(to_timestamp(ts))#9 could run on GPU

+-------------------+
|   to_timestamp(ts)|
+-------------------+
|2023-12-31 23:59:59|
+-------------------+
scala> val df2 = df.selectExpr("to_timestamp(ts)")
scala> spark.conf.set("spark.sql.session.timeZone", "UTC")

scala> df2.show()
24/01/02 21:31:15 WARN GpuOverrides:
! <LocalTableScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.LocalTableScanExec
  @Expression <AttributeReference> toprettystring(to_timestamp(ts))#23 could run on GPU

+-------------------+
|   to_timestamp(ts)|
+-------------------+
|2023-12-31 15:59:59|
+-------------------+

@NVnavkumar linked an issue Jan 2, 2024 that may be closed by this pull request
@NVnavkumar (Collaborator, Author)

build

revans2 previously approved these changes Jan 3, 2024
@NVnavkumar (Collaborator, Author)

build

@NVnavkumar (Collaborator, Author)

build

@revans2 merged commit 240d661 into NVIDIA:branch-24.02 Jan 4, 2024
39 checks passed
Labels: feature request
Projects: none yet
5 participants