Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Call globStatus directly via PY4J in hdfs_glob to avoid calling hadoop command #10599

Merged
merged 5 commits into from
Mar 19, 2024

Conversation

yinqingh
Copy link
Collaborator

To fix #10586

In rapids_it-EGX-Yarn job, the integration tests run in container which doesn't have hadoop cli installed. To fix this, I update hdfs_glob function to call globStatus directly via PY4J to avoid calling hadoop command

Signed-off-by: Yinqing Hao <haoyinqing@gmail.com>
@NvTimLiu
Copy link
Collaborator

This is also a follow up of 9dc04aa#r1477618915 @gerashegalov , thanks!

@yinqingh
Copy link
Collaborator Author

build

# status.getPath().toString() return string like "hdfs://hostname:8020/src/test/resources/parquet-testing/data/single_nan.parquet"
# but pathlib.Path will remove the first "/" and convert it to "hdfs:/hostname:8020/src/test/resources/parquet-testing/data/single_nan.parquet" and then this path becomes illegal.
# so we need to process the path like this.
p = f'hdfs:{status.getPath().toUri().getPath()}'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can get the scheme "hdfs" from the URI instead of hardcoding it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the hardcoding for scheme "hdfs". Thanks!

Signed-off-by: Yinqing Hao <haoyinqing@gmail.com>
Signed-off-by: Yinqing Hao <haoyinqing@gmail.com>
Signed-off-by: Yinqing Hao <haoyinqing@gmail.com>
@yinqingh
Copy link
Collaborator Author

The github action build failures will be fixed in #10608

Signed-off-by: Yinqing Hao <haoyinqing@gmail.com>
@yinqingh
Copy link
Collaborator Author

build

Copy link
Collaborator

@gerashegalov gerashegalov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

sc = get_spark_i_know_what_i_am_doing().sparkContext
config = sc._jsc.hadoopConfiguration()
fs_path = sc._jvm.org.apache.hadoop.fs.Path(full_pattern)
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(fs_path.toUri(), config)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit. no need to fix. a handy idiom especially if you need to access the fs object only once with the added benefit avoiding accessing sc._jvm

Suggested change
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(fs_path.toUri(), config)
fs = fs_path.getFileSystem(config)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is helpful. Thanks!

:return: generator of matched files
"""
path_str = path.as_posix()
if not path_str.startswith('hdfs:'):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not for this PR, but this should be agnostic of a particular non-local FileSystem path.

@yinqingh yinqingh merged commit 3ed26c2 into NVIDIA:branch-24.04 Mar 19, 2024
43 checks passed
@yinqingh yinqingh deleted the fix-hdfs-glob branch March 19, 2024 01:52
@sameerz sameerz added the test Only impacts tests label Mar 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
test Only impacts tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] YARN EGX IT build failing parquet_testing_test can't find file
5 participants