Call globStatus directly via PY4J in hdfs_glob to avoid calling hadoop command #10599
Conversation
Signed-off-by: Yinqing Hao <haoyinqing@gmail.com>
This is also a follow-up of 9dc04aa#r1477618915 @gerashegalov, thanks!
build
# status.getPath().toString() returns a string like "hdfs://hostname:8020/src/test/resources/parquet-testing/data/single_nan.parquet",
# but pathlib.Path collapses the double slash, turning it into "hdfs:/hostname:8020/src/test/resources/parquet-testing/data/single_nan.parquet",
# which is an invalid path. So we rebuild the path from the URI instead.
p = f'hdfs:{status.getPath().toUri().getPath()}'
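The slash-collapsing behavior described in that comment can be reproduced with plain `pathlib`, no cluster needed; this is a minimal illustration of the problem, not code from the PR:

```python
import pathlib

# pathlib collapses repeated internal slashes, so a URI-style string
# loses its "//" authority separator and becomes an invalid HDFS URI.
original = "hdfs://hostname:8020/data/single_nan.parquet"
mangled = pathlib.PurePosixPath(original).as_posix()
print(mangled)  # hdfs:/hostname:8020/data/single_nan.parquet
```

This is why the fix reconstructs the string from the URI's scheme and path rather than round-tripping it through `pathlib.Path`.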
You can get the scheme "hdfs" from the URI instead of hardcoding it.
Removed the hardcoded "hdfs" scheme. Thanks!
The GitHub Actions build failures will be fixed in #10608
build
LGTM
sc = get_spark_i_know_what_i_am_doing().sparkContext
config = sc._jsc.hadoopConfiguration()
fs_path = sc._jvm.org.apache.hadoop.fs.Path(full_pattern)
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(fs_path.toUri(), config)
Nit, no need to fix: a handy idiom, especially if you need to access the fs object only once, with the added benefit of avoiding sc._jvm.
- fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(fs_path.toUri(), config)
+ fs = fs_path.getFileSystem(config)
This is helpful. Thanks!
:return: generator of matched files
"""
path_str = path.as_posix()
if not path_str.startswith('hdfs:'):
Not for this PR, but this should be agnostic of a particular non-local FileSystem path.
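One way to make the check scheme-agnostic, as the reviewer suggests, is to parse the scheme rather than match a hardcoded `hdfs:` prefix. This is a hypothetical sketch (the function name and exact policy are illustrative, not from the PR):

```python
from urllib.parse import urlparse

def is_non_local(path_str):
    """Return True for any path with a non-local scheme (hdfs, s3a, abfs, ...)."""
    scheme = urlparse(path_str).scheme
    # An empty scheme or an explicit file: scheme means a local path.
    return scheme not in ('', 'file')

print(is_non_local('hdfs://host:8020/data'))  # True
print(is_non_local('s3a://bucket/key'))       # True
print(is_non_local('/local/path'))            # False
```

Hadoop's own `Path.getFileSystem()` resolves the right `FileSystem` implementation from the scheme, so the Python side would not need any per-filesystem special cases.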
To fix #10586
In the rapids_it-EGX-Yarn job, the integration tests run in a container that doesn't have the Hadoop CLI installed. To fix this, I updated the hdfs_glob function to call globStatus directly via Py4J, avoiding the hadoop command.
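Putting the diff fragments above together, a hedged sketch of what the Py4J-based globbing might look like (the function name and exact structure are a reconstruction for illustration, not the PR's final code):

```python
def hdfs_glob_py4j(sc, full_pattern):
    """Yield paths matching full_pattern via Hadoop's FileSystem.globStatus,
    called through Py4J instead of shelling out to the hadoop CLI.

    sc is an active SparkContext; the org.apache.hadoop.fs classes are
    reached through its JVM gateway.
    """
    config = sc._jsc.hadoopConfiguration()
    fs_path = sc._jvm.org.apache.hadoop.fs.Path(full_pattern)
    # Reviewer-suggested idiom: derive the FileSystem from the Path itself.
    fs = fs_path.getFileSystem(config)
    for status in fs.globStatus(fs_path):
        uri = status.getPath().toUri()
        # Rebuild "scheme:/path" to sidestep pathlib's '//' collapsing,
        # without hardcoding the "hdfs" scheme.
        yield f'{uri.getScheme()}:{uri.getPath()}'
```

Because everything goes through the already-running JVM gateway, the container no longer needs a local Hadoop installation for the tests to enumerate remote files.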