Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ORC reading over different schemas #5676

Merged
merged 2 commits into from
May 31, 2022

Conversation

sperlingxx
Copy link
Collaborator

@sperlingxx sperlingxx commented May 27, 2022

Fixes #5562

This PR is to fix the bug which fails reading ORC data from multiple file schemas. The root cause of the bug is multi-thread modification on a thread unsafe variable (broadcasted Hadoop Conf).

Signed-off-by: sperlingxx lovedreamf@gmail.com

Signed-off-by: sperlingxx <lovedreamf@gmail.com>
@sperlingxx sperlingxx requested review from wbo4958 and jlowe May 27, 2022 09:52
@sperlingxx
Copy link
Collaborator Author

build

@jlowe jlowe added this to the May 23 - Jun 3 milestone May 27, 2022
Copy link
Member

@jlowe jlowe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it interesting that this PR claims it fixes #5562 but does not use the same repro test from that issue as part of its test case. The test in #5562 seems subtly different than the one used here, as the order of columns in the dataframe is intentionally different, whereas here the test preserves the same order and the extra one is on the end. Have we verified the same test from #5562 now passes with this change?

@sameerz sameerz added the bug Something isn't working label May 27, 2022
@sperlingxx
Copy link
Collaborator Author

I find it interesting that this PR claims it fixes #5562 but does not use the same repro test from that issue as part of its test case. The test in #5562 seems subtly different than the one used here, as the order of columns in the dataframe is intentionally different, whereas here the test preserves the same order and the extra one is on the end. Have we verified the same test from #5562 now passes with this change?

Yes, the case in #5562 can be addressed by this PR. I decided to use current case rather than the original case, because it looks stronger and more representative. In terms of the column order, I shuffled the order of columns to include the factor into the case.

Signed-off-by: sperlingxx <lovedreamf@gmail.com>
@sperlingxx
Copy link
Collaborator Author

build

@tgravescs tgravescs merged commit e31b087 into NVIDIA:branch-22.06 May 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] read ORC file with various file schemas
4 participants