Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix parquet read tests that fail on Databricks with date/time input [databricks] #9639

Closed
wants to merge 4 commits into from

Conversation

ttnghia
Copy link
Collaborator

@ttnghia ttnghia commented Nov 6, 2023

Before #9617, reading parquet files fallback to CPU on LEGACY rebase mode only if the input contains date/time columns nested inside other columns. After PR 9617, reading parquet files now always fallback to CPU on LEGACY rebase mode if there is any date/time input at any nested level.

Since Databricks sets the rebase mode to LEGACY by default, some tests for parquet read now fail. This PR adds the explicit read config to fix them.

Closes #9636.

Signed-off-by: Nghia Truong <nghiat@nvidia.com>
@ttnghia ttnghia added the test Only impacts tests label Nov 6, 2023
@ttnghia ttnghia self-assigned this Nov 6, 2023
@ttnghia
Copy link
Collaborator Author

ttnghia commented Nov 6, 2023

build

@ttnghia
Copy link
Collaborator Author

ttnghia commented Nov 6, 2023

build

@jlowe
Copy link
Member

jlowe commented Nov 6, 2023

After PR 9617, reading parquet files now always fallback to CPU on LEGACY rebase mode if there is any date/time input at any nested level.

Seems like #9617 is broken then and the proper fix is to revert it? This seems like a regression in functionality vs. what we supported before. Tests for date/timestamps at the top level of the schema were passing on Databricks before.

@ttnghia
Copy link
Collaborator Author

ttnghia commented Nov 6, 2023

I think #9617 is correct (i.e., before it, there was a bug). Because if the read config is set to LEGACY, we must fallback to the CPU (after #9617), instead of just running on the GPU then throwing exception if the input needs rebase (before #9617).

@jlowe
Copy link
Member

jlowe commented Nov 6, 2023

before it, there was a bug

It was not a bug but an intentional, calculated risk. In most cases, the query runs as expected, GPU accelerated with no crashes. Yes, sometimes it can crash, but the alternative in #9617 is that now all GPU Parquet reads on Databricks involving dates or timestamps are disabled by default. That will be very impactful for many users, and thus seems like #9617 should not have been committed.

@ttnghia
Copy link
Collaborator Author

ttnghia commented Nov 6, 2023

Okay then I can revert the GPU tag logic in it.

Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
test Only impacts tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] All parquet integration tests failed "Part of the plan is not columnar class" in databricks runtimes
2 participants