Adding HDFS support for data generation #188
Note that beyond the subpar user-perceived delays of shelling out to launch heavy JVMs, we have hit limitations in the past where the hadoop CLI is not available. If we document that this script can be launched via spark-submit, then we can use Py4J instead. NVIDIA/spark-rapids#10599
On the other hand, why do we need to wrap a Java program in a Python CLI to begin with?
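The Py4J route mentioned above could look roughly like the following. This is a hypothetical sketch, not code from the PR: it assumes the script runs under spark-submit (so the driver JVM is reachable via the private but well-known `spark._jvm` gateway), and the function name and table list are illustrative. The Hadoop `FileSystem` handle and `Path` class are passed in explicitly so the logic is testable without a cluster.

```python
# Hypothetical sketch: under spark-submit, Py4J exposes the driver JVM, so the
# Hadoop FileSystem API can be called directly instead of forking a new JVM
# for every `hadoop fs` invocation.

def make_table_dirs(fs, path_cls, base, tables):
    """Create one directory per table via a Hadoop FileSystem handle.

    `fs` and `path_cls` would come from the Py4J gateway, e.g.:
        jvm = spark._jvm
        fs = jvm.org.apache.hadoop.fs.FileSystem.get(
            spark._jsc.hadoopConfiguration())
        path_cls = jvm.org.apache.hadoop.fs.Path
    """
    created = []
    for table in tables:
        target = f"{base}/{table}"
        fs.mkdirs(path_cls(target))  # no per-call JVM fork, unlike `hadoop fs`
        created.append(target)
    return created
```

Keeping the JVM objects as parameters also means the same function works against a local mock in unit tests, which sidesteps the "hadoop CLI not available" problem entirely.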
The Java program is a mapper that is triggered only when generating data for HDFS; for local data generation, the Python wrapper does not trigger a MapReduce job.
For the missing hadoop CLI, there is an up-front check in the Python program that prints an "install hadoop CLI" message, purely for verbosity.
Here the Hadoop job only creates a limited set of directories (8 in total, 1 per NDS-H table) and moves the generated NDS-H data into the required folders.
Currently this is not triggered via spark-submit.
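The shell-out pattern described above can be sketched as below. This is a hypothetical illustration, not the PR's actual code: the function names, paths, and table list are made up, and the exact `hadoop fs` subcommands the PR issues may differ; the commands are built but not executed so the logic can be verified without a Hadoop install.

```python
import shutil

# Hypothetical sketch of the pattern described above: fail fast with a clear
# message if the `hadoop` CLI is missing, then issue one `-mkdir -p` and one
# `-mv` per NDS-H table.

def hadoop_cli_available():
    """Return True if a `hadoop` executable is on PATH."""
    return shutil.which("hadoop") is not None

def build_hdfs_commands(staging_base, hdfs_base, tables):
    """Build the `hadoop fs` command lines (without executing them)."""
    cmds = []
    for table in tables:
        target = f"{hdfs_base}/{table}"
        cmds.append(["hadoop", "fs", "-mkdir", "-p", target])
        cmds.append(["hadoop", "fs", "-mv", f"{staging_base}/{table}",
                     f"{target}/"])
    return cmds
```

A caller would gate on `hadoop_cli_available()` first (printing the "install hadoop CLI" hint otherwise) and then run each command list through `subprocess.run(cmd, check=True)` — each of which, as the reviewer notes, forks a fresh JVM.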
This is more like tech debt. When this project was initialized, the direction was to use Python and avoid the approach that Databricks is using. I argued at the time that we could also use Scala, but failed to convince anyone.
More details as I recall them: to avoid the chained "python → hdfs" calls, the best option was to leverage the dsdgen argument that pipes text output to stdout, then pipe that directly into a Spark DataFrame (this is also what Databricks does). That way we would not need any Hadoop job to help generate the distributed dataset.
Unfortunately, the latest TPC-DS (v3.2.0) disabled this argument, and the direction was to use the latest version and try our best not to modify it. Thus it became what it is now.
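The stdout-pipe approach described above can be sketched as follows. This is a hypothetical illustration under two assumptions: that the generator emits rows on stdout (the flag that enabled this is, per the comment, gone from the latest TPC-DS), and that rows use dsdgen's usual `.dat` format of `|`-delimited fields with a trailing `|`. The parsed rows could then be handed to `spark.createDataFrame(rows, schema)` with no Hadoop job involved.

```python
# Hypothetical sketch: stream pipe-delimited generator output line by line and
# parse it into rows suitable for building a Spark DataFrame directly.

def parse_row(line):
    """Split one pipe-delimited row, dropping newline and trailing '|'."""
    line = line.rstrip("\n")
    if line.endswith("|"):
        line = line[:-1]
    return line.split("|")

def rows_from_stream(stream):
    """Yield parsed rows from an iterable of generator output lines,
    skipping blank lines (e.g. a trailing newline from the subprocess)."""
    for line in stream:
        if line.strip():
            yield parse_row(line)
```

In practice `stream` would be the stdout of a `subprocess.Popen` running the data generator; iterating it lazily keeps memory flat regardless of scale factor.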
As per Allen's comment, I can pick this up as a separate issue later to figure out whether there is an alternative solution that avoids chaining hadoop commands from Python.