
Enable to_date (via gettimestamp and casting timestamp to date) for non-UTC time zones #10100

Merged (15 commits into NVIDIA:branch-24.02, Jan 4, 2024)

Conversation

@NVnavkumar (Collaborator) commented Dec 27, 2023

Fixes #9927.

This enables to_date for non-UTC time zones. to_date(str, fmt) is actually an alias for cast(gettimestamp(str, fmt) as date), so this enables casting timestamp to date and enables non-UTC time zones for gettimestamp (which is essentially the parent of the same algorithm used in unix_timestamp).
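As a quick illustration of that alias (a minimal sketch; the exact plan text varies by Spark version), the rewritten expression is visible in the query plan:

scala> spark.conf.set("spark.sql.session.timeZone", "Iran")
scala> spark.sql("SELECT to_date('2024-01-04', 'yyyy-MM-dd')").explain(true)
// The analyzed/optimized plan shows something like
//   cast(gettimestamp(2024-01-04, yyyy-MM-dd, ...) as date)
// so supporting to_date in a non-UTC zone requires both gettimestamp and the
// timestamp-to-date cast to be time-zone aware on the GPU.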

@NVnavkumar (Collaborator, Author)

build

@NVnavkumar (Collaborator, Author)

build

@NVnavkumar (Collaborator, Author) commented Dec 28, 2023

Some performance testing results:

    NUM ROWS     GPU (ms)     CPU (ms)              SPEEDUP
        1000          177          250   1.4124293785310735
        2500           38           37   0.9736842105263158
        5000           40           39                0.975
       10000           44           37   0.8409090909090909
       20000           45           27                  0.6
       50000           53           71   1.3396226415094339
      100000           85          168   1.9764705882352942
      250000          134          267    1.992537313432836
      500000          176          463   2.6306818181818183
     1000000          293          955     3.25938566552901

In this case, I set the session time zone to Iran and generated a parquet file with a single column a of randomly generated date strings in the yyyy-MM-dd format within the valid ANSI range (0001-01-01 to 9999-12-31). I then read that parquet file and ran to_date(a, 'yyyy-MM-dd') on that column in a CPU run and a GPU run, using sparkmeasure to measure the elapsed time of the queries. Before running the test I also ran a query using from_utc_timestamp to load the time zone database, since it is still lazily loaded at this point and that would otherwise skew the performance data here.
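Roughly, the harness looked like the sketch below (not the exact script: the parquet path is a placeholder, the CPU/GPU toggle via spark.rapids.sql.enabled happens outside the snippet, and sparkmeasure's StageMetrics is assumed for the elapsed-time measurement):

import ch.cern.sparkmeasure.StageMetrics

spark.conf.set("spark.sql.session.timeZone", "Iran")

// Warm-up: force the time zone database to load so its lazy initialization
// does not skew the measurement below.
spark.sql("SELECT from_utc_timestamp(TIMESTAMP'2023-01-01 00:00:00', 'Iran')").collect()

val stageMetrics = StageMetrics(spark)
stageMetrics.runAndMeasure {
  spark.read.parquet("/tmp/date_strings.parquet")   // placeholder path
    .selectExpr("to_date(a, 'yyyy-MM-dd') as d")
    .write.mode("overwrite").format("noop").save()  // noop sink just forces execution
}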

@NVnavkumar NVnavkumar marked this pull request as ready for review December 28, 2023 06:45
@NVnavkumar (Collaborator, Author)

build

@winningsix (Collaborator)

Do we have any idea why, at row num = 20000, we saw lower perf acceleration? Will this perf result vary across different measurements?

@revans2 (Collaborator) commented Dec 28, 2023

> Do we have any idea why, at row num = 20000, we saw lower perf acceleration? Will this perf result vary across different measurements?

That is way too little data to be a good benchmark. 20,000 rows is less than 300 KB of data. I don't know how many row groups there are, but the overhead of launching kernels is likely where most of the time is being spent here. I would much rather see a test like:

spark.range(0, 400000000L, 1, 12).selectExpr("date_from_unix_date(id % 700000) as d").selectExpr("d", "CAST(d as STRING) as ds").write.mode("overwrite").parquet("./target/TMP")

Then we can try to isolate just the to_date as much as possible with:

spark.time(spark.read.parquet("./target/TMP").selectExpr("to_date(ds, 'yyyy-MM-dd') as date").selectExpr("MAX(date)", "MIN(date)").show())

and

spark.time(spark.read.parquet("./target/TMP").selectExpr("d as date").selectExpr("MAX(date)", "MIN(date)").show())

For UTC on the GPU I get:
to_date => 1459, 1478, 1393, 1420, 1393 - median 1420
just min/max => 811, 831, 783, 753, 764 - median 783, or about 637 ms for just the to_date, which is not too far off from the approximately 1 s op time I see on the GPU for that ProjectExec.

For the CPU I get:
to_date => 30273, 29722, 29750, 29645, 29612 - median 29722
just min/max => 673, 570, 536, 529, 541 - median 541, or about 29,200 ms for just the to_date.

Or we could just compare UTC runs against Iran runs for the GPU, and separately for the CPU, to get an idea of the overhead that Iran adds to the operation.
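Something like the sketch below would do that comparison, reusing the query above and changing only the session time zone between runs:

spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.time(spark.read.parquet("./target/TMP").selectExpr("to_date(ds, 'yyyy-MM-dd') as date").selectExpr("MAX(date)", "MIN(date)").show())

spark.conf.set("spark.sql.session.timeZone", "Iran")
spark.time(spark.read.parquet("./target/TMP").selectExpr("to_date(ds, 'yyyy-MM-dd') as date").selectExpr("MAX(date)", "MIN(date)").show())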

@NVnavkumar (Collaborator, Author) commented Dec 29, 2023

Using the approach suggested by @revans2, I generated the parquet file using

spark.range(0, num_rows, 1, 12).selectExpr("date_from_unix_date(id % 700000) as d").selectExpr("d", "CAST(d as STRING) as ds").write.mode("overwrite").parquet(tmpfile)

and then measured the difference between the total time of just min(date)/max(date) and the total time of to_date plus min/max.
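Concretely, the differencing looks roughly like this (a sketch; tmpfile is the path used above and the timing helper is a stand-in for the measured elapsed times):

// Simple wall-clock helper (stand-in for the real measurement).
def elapsedMs(f: => Unit): Long = {
  val start = System.nanoTime(); f; (System.nanoTime() - start) / 1000000
}

val baselineMs = elapsedMs {
  spark.read.parquet(tmpfile).selectExpr("d as date")
    .selectExpr("MAX(date)", "MIN(date)").show()
}
val withToDateMs = elapsedMs {
  spark.read.parquet(tmpfile).selectExpr("to_date(ds, 'yyyy-MM-dd') as date")
    .selectExpr("MAX(date)", "MIN(date)").show()
}
val toDateCostMs = withToDateMs - baselineMs  // time attributable to to_date itself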

Results:

    NUM ROWS     GPU (ms)     CPU (ms)              SPEEDUP
   100000000          143        14866   103.95804195804196
   200000000          448        28896                 64.5
   400000000          883        58499    66.25028312570781
   800000000         1483       115150    77.64666217127444

I verified the last result in the Spark UI; the 1483 ms is consistent with the 1.4 s shown for the GpuProject operator here:

(Spark UI screenshot, 2023-12-28, showing the GpuProject operator timing)

@NVnavkumar (Collaborator, Author)

Moving this to draft; it looks like there is an issue in casting timestamp to date that causes off-by-one-day results when making it time-zone aware.
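For illustration, this is the kind of case that goes off by one (a minimal sketch; Iran is UTC+03:30, so local times before 03:30 fall on the previous UTC calendar date):

scala> spark.conf.set("spark.sql.session.timeZone", "Iran")
scala> spark.sql("SELECT CAST(TIMESTAMP'2024-01-01 01:00:00' AS DATE)").show()
// Expected: 2024-01-01. The underlying instant is 2023-12-31 21:30 UTC,
// so a cast that ignores the session time zone returns 2023-12-31,
// i.e. one day off.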

@NVnavkumar NVnavkumar marked this pull request as draft December 29, 2023 02:22
@NVnavkumar NVnavkumar marked this pull request as ready for review December 29, 2023 07:17
@NVnavkumar (Collaborator, Author) commented Dec 29, 2023

Updated results after the fix for casting timestamp to date in non-UTC time zones. The time zone tested was Iran:

    NUM ROWS     GPU (ms)     CPU (ms)              SPEEDUP
   100000000          195        14563    74.68205128205128
   200000000          456        27386    60.05701754385965
   400000000          847        52842    62.38724911452184
   800000000         1581       107031    67.69829222011386

Verified the last item (800,000,000 rows) in the Spark UI:

(Spark UI screenshot, 2023-12-29)

@revans2 (Collaborator) commented Dec 29, 2023

build

revans2 previously approved these changes Dec 29, 2023
@NVnavkumar (Collaborator, Author)

build

@sameerz added the feature request label Dec 29, 2023
@NVnavkumar (Collaborator, Author)

Premerge failure due to an issue with the GPU time zone database and unit tests; filed #10129 to track.

@NVnavkumar (Collaborator, Author)

This is blocked until NVIDIA/spark-rapids-jni#1670 is merged.

@NVnavkumar (Collaborator, Author) commented Jan 2, 2024

Also fixes #10006

scala> val df = Seq("2023-12-31 23:59:59").toDF("ts")
scala> spark.conf.set("spark.sql.session.timeZone", "Asia/Shanghai")
scala> df.selectExpr("to_timestamp(ts)").show()
24/01/02 21:30:26 WARN GpuOverrides:
! <LocalTableScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.LocalTableScanExec
  @Expression <AttributeReference> toprettystring(to_timestamp(ts))#9 could run on GPU

+-------------------+
|   to_timestamp(ts)|
+-------------------+
|2023-12-31 23:59:59|
+-------------------+
scala> val df2 = df.selectExpr("to_timestamp(ts)")
scala> spark.conf.set("spark.sql.session.timeZone", "UTC")

scala> df2.show()
24/01/02 21:31:15 WARN GpuOverrides:
! <LocalTableScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.LocalTableScanExec
  @Expression <AttributeReference> toprettystring(to_timestamp(ts))#23 could run on GPU

+-------------------+
|   to_timestamp(ts)|
+-------------------+
|2023-12-31 15:59:59|
+-------------------+

@NVnavkumar linked an issue Jan 2, 2024 that may be closed by this pull request
@NVnavkumar (Collaborator, Author)

build

revans2 previously approved these changes Jan 3, 2024
@NVnavkumar (Collaborator, Author)

build

@NVnavkumar (Collaborator, Author)

build

@revans2 merged commit 240d661 into NVIDIA:branch-24.02 Jan 4, 2024
39 checks passed
Labels: feature request
Projects: none yet
5 participants