Fix parquet read tests that fail on Databricks with date/time input [databricks] #9639

ttnghia · 2023-11-06T05:20:06Z

Before #9617, reading Parquet files fell back to the CPU in LEGACY rebase mode only if the input contained date/time columns nested inside other columns. After #9617, reading Parquet files always falls back to the CPU in LEGACY rebase mode if there is any date/time input at any nesting level.

Since Databricks sets the rebase mode to LEGACY by default, some Parquet read tests now fail there. This PR adds an explicit read config to those tests to fix them.
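For illustration only, here is a minimal sketch of what pinning the read rebase mode in a test's Spark conf could look like. The helper name `with_parquet_read_rebase_mode` is hypothetical, and the thread does not show which keys or value the PR actually set; the two keys below are the Spark 3.2+ names for the Parquet read rebase settings (older Spark versions use a `spark.sql.legacy.` prefix), and `CORRECTED` is an assumed choice.

```python
# Hypothetical helper: returns a copy of a Spark conf dict with the Parquet
# read rebase mode set explicitly, so a test does not depend on the
# platform default (which is LEGACY on Databricks, per the description above).

VALID_REBASE_MODES = {"EXCEPTION", "CORRECTED", "LEGACY"}

# Spark 3.2+ config names (assumption: the tests target these names).
DATETIME_REBASE_READ_KEY = "spark.sql.parquet.datetimeRebaseModeInRead"
INT96_REBASE_READ_KEY = "spark.sql.parquet.int96RebaseModeInRead"


def with_parquet_read_rebase_mode(conf=None, mode="CORRECTED"):
    """Return a new conf dict with both read rebase keys set to `mode`."""
    if mode not in VALID_REBASE_MODES:
        raise ValueError(f"unknown rebase mode: {mode!r}")
    merged = dict(conf or {})  # copy so the caller's dict is not mutated
    merged[DATETIME_REBASE_READ_KEY] = mode
    merged[INT96_REBASE_READ_KEY] = mode
    return merged
```

The resulting dict can be applied key by key through `SparkSession.builder.config(key, value)` or passed to whatever conf mechanism the test harness uses.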

Closes #9636.

Signed-off-by: Nghia Truong <nghiat@nvidia.com>

ttnghia · 2023-11-06T05:20:21Z

build

ttnghia · 2023-11-06T07:06:48Z

build

jlowe · 2023-11-06T14:53:52Z

After #9617, reading Parquet files always falls back to the CPU in LEGACY rebase mode if there is any date/time input at any nesting level.

Seems like #9617 is broken then and the proper fix is to revert it? This seems like a regression in functionality vs. what we supported before. Tests for date/timestamps at the top level of the schema were passing on Databricks before.

ttnghia · 2023-11-06T14:59:53Z

I think #9617 is correct (i.e., there was a bug before it): if the read config is set to LEGACY, we must fall back to the CPU (the behavior after #9617) instead of running on the GPU and then throwing an exception if the input needs rebasing (the behavior before #9617).
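To make the two behaviors under discussion concrete, here is a rough sketch of the fallback decision before and after #9617 as described in this thread. It is not the plugin's actual tagging code; the `Field` type and the function names are hypothetical and only model a schema as a tree of typed fields.

```python
from dataclasses import dataclass, field
from typing import List

DATETIME_TYPES = {"date", "timestamp"}


@dataclass
class Field:
    """Hypothetical schema node: a type name plus optional child fields."""
    name: str
    dtype: str
    children: List["Field"] = field(default_factory=list)


def has_datetime(f: Field) -> bool:
    """True if this field or any descendant is a date/timestamp."""
    return f.dtype in DATETIME_TYPES or any(has_datetime(c) for c in f.children)


def has_nested_datetime(f: Field) -> bool:
    """True only if a date/timestamp appears below this field."""
    return any(has_datetime(c) for c in f.children)


def falls_back_before_9617(schema: List[Field], rebase_mode: str) -> bool:
    # Before: CPU fallback in LEGACY mode only for nested date/time columns.
    # Top-level date/time columns still ran on the GPU, which could throw
    # at runtime if the data actually needed rebasing.
    return rebase_mode == "LEGACY" and any(has_nested_datetime(f) for f in schema)


def falls_back_after_9617(schema: List[Field], rebase_mode: str) -> bool:
    # After: CPU fallback in LEGACY mode for date/time at any nesting level.
    return rebase_mode == "LEGACY" and any(has_datetime(f) for f in schema)
```

Under this model, a schema with only a top-level date column in LEGACY mode stays on the GPU before #9617 and falls back to the CPU after it, which is the case the failing Databricks tests exercise.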

jlowe · 2023-11-06T15:35:08Z

before it, there was a bug

It was not a bug but an intentional, calculated risk. In most cases, the query runs as expected, GPU accelerated with no crashes. Yes, sometimes it can crash, but the alternative in #9617 is that now all GPU Parquet reads on Databricks involving dates or timestamps are disabled by default. That will be very impactful for many users, and thus seems like #9617 should not have been committed.

ttnghia · 2023-11-06T15:38:10Z

Okay, then I can revert the GPU tagging logic in it.

Add parquet read config

d48f173

Signed-off-by: Nghia Truong <nghiat@nvidia.com>

ttnghia added the test Only impacts tests label Nov 6, 2023

ttnghia self-assigned this Nov 6, 2023

ttnghia mentioned this pull request Nov 6, 2023

Refactor Parquet readers [databricks] #9631

Merged

ttnghia added 3 commits November 6, 2023 08:19

Revert "Add parquet read config"

673caba

This reverts commit d48f173.

Allow to run on GPU in LEGACY rebase mode

38ca670

Signed-off-by: Nghia Truong <nghiat@nvidia.com>

Remove unused var

e9c204e

Signed-off-by: Nghia Truong <nghiat@nvidia.com>

ttnghia closed this Nov 6, 2023

ttnghia mentioned this pull request Nov 6, 2023

Revert "Support rebase checking for nested dates and timestamps (#9617)" [databricks] #9641

Merged

ttnghia deleted the fix_parquet_read_db branch November 9, 2023 21:41
