Adding HDFS support for data generation #188

Merged
merged 11 commits into from
Jul 3, 2024

Conversation

Collaborator

@bilalbari bilalbari commented Jun 10, 2024

This PR contains the following changes:

  1. Add a DbGen class that runs data generation as part of a mapper
  2. Update the build files accordingly
  3. Update the README
  4. Update the existing Python files to support HDFS data generation

Signed-off-by: Sayed Bilal Bari <sbari@nvidia.com>
nds-h/tpch-gen/pom.xml (review thread outdated and resolved)
@@ -115,16 +116,96 @@ def generate_data_local(args, range_start, range_end, tool_path):
    # show summary
    subprocess.run(['du', '-h', '-d1', data_dir])

def clean_temp_data(temp_data_path):
    cmd = ['hadoop', 'fs', '-rm', '-r', '-skipTrash', temp_data_path]
Collaborator

Note that, beyond the subpar user-perceived delays of shelling out to launch heavy JVMs, we have hit limitations in the past where the hadoop CLI is not available. If we document that this script can be launched via spark-submit, then we can use Py4J (see NVIDIA/spark-rapids#10599).

On the other hand, why do we need to wrap a Java program in a Python CLI to begin with?
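
For illustration only, here is a minimal sketch of what the Py4J route could look like for the cleanup step, assuming the script runs under spark-submit and has a `SparkSession` at hand. The function name `delete_hdfs_path` is hypothetical and not part of this PR, and `_jvm` and `_jsc` are PySpark internals rather than a public API. The sketch calls Hadoop's `FileSystem.delete(path, recursive)`, which removes the path directly instead of moving it to the trash.

```python
def delete_hdfs_path(spark, path):
    """Recursively delete `path` using the JVM that already backs `spark`.

    Hypothetical helper: returns True if something was deleted,
    False if the path did not exist.
    """
    jvm = spark._jvm                         # Py4J view of the driver JVM (PySpark internal)
    conf = spark._jsc.hadoopConfiguration()  # Hadoop Configuration of this session
    hadoop_path = jvm.org.apache.hadoop.fs.Path(path)
    fs = hadoop_path.getFileSystem(conf)     # resolves the scheme (hdfs://, file://, ...)
    if not fs.exists(hadoop_path):
        return False
    # Second argument is `recursive`; no separate `hadoop` process is launched.
    return bool(fs.delete(hadoop_path, True))
```

With something along these lines, no extra JVM is started per call and the hadoop CLI does not need to be on `PATH`, at the cost of requiring a running Spark session.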

Collaborator Author

The Java program is a mapper that is triggered only when generating data for HDFS. For local data generation, the Python wrapper does not trigger a MapReduce job.
As for the missing hadoop CLI, the Python program performs an up-front check and, purely for verbosity, prints a message asking the user to install the hadoop CLI.
Here the hadoop job only creates a limited set of directories (8 in total, one per NDS-H table) and moves the generated NDS-H data into the required folders.
Currently this is not triggered via spark-submit.
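
As a rough sketch of the two steps described above (not the code in this PR), the CLI check and the per-table layout could look like the following. The helper names, the table list constant, and the `<table>.tbl*` file naming are assumptions made for the example; only `hadoop fs -mkdir -p` and `hadoop fs -mv` are taken from the real CLI.

```python
import shutil

# The eight TPC-H tables that NDS-H generates data for.
NDS_H_TABLES = ['customer', 'lineitem', 'nation', 'orders',
                'part', 'partsupp', 'region', 'supplier']


def hadoop_cli_available(which=shutil.which):
    """Return True if a `hadoop` executable can be found on PATH."""
    return which('hadoop') is not None


def build_move_commands(temp_data_path, data_dir):
    """Build (but do not run) the mkdir/mv commands, two per table."""
    cmds = []
    for table in NDS_H_TABLES:
        target = f'{data_dir}/{table}'
        cmds.append(['hadoop', 'fs', '-mkdir', '-p', target])
        # Assumption: the raw files for a table are named '<table>.tbl*'.
        # The glob is expanded by `hadoop fs`, not by a local shell.
        cmds.append(['hadoop', 'fs', '-mv',
                     f'{temp_data_path}/{table}.tbl*', f'{target}/'])
    return cmds
```

Each command could then be passed to `subprocess.run(cmd, check=True)` once `hadoop_cli_available()` has returned True; keeping command construction separate from execution makes the layout easy to inspect without a cluster.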

Collaborator

This is more of a tech debt item. When this project was started, the directive was to use Python and avoid the one that DB is using. I argued at the time that we could also use Scala, but did not succeed.

From what I can recall, to avoid the way we chain calls from Python to hadoop ("python-hdfs"), the best option was to leverage the

_FILTER =  [Y|N]         -- output data to stdout

argument to write text output to stdout and then pipe it directly into a Spark DataFrame. (This is also what DB does.) That way we would not need any hadoop job to help generate a distributed dataset.

Unfortunately, the latest TPC-DS v.3.20 disabled this argument, and the directive was to use the latest version and try our best not to modify it.

That is how it ended up the way it is now.
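
To make the stdout-piping idea concrete, here is a minimal sketch of reading delimited rows from a generator process. It is only an illustration: `stream_rows` is a hypothetical helper, the `|` delimiter with a trailing separator is assumed from the usual TPC text output, and the usage line below uses a stand-in Python command instead of the real generator binary.

```python
import subprocess
import sys


def stream_rows(cmd, delimiter='|'):
    """Yield one list of fields per non-empty line that `cmd` writes to stdout."""
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            line = line.rstrip('\n')
            if not line:
                continue
            # Drop exactly one trailing delimiter so that genuinely
            # empty trailing fields are preserved.
            if line.endswith(delimiter):
                line = line[:-len(delimiter)]
            yield line.split(delimiter)
    # Leaving the `with` block waits for the process to exit.
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)


# Stand-in for a generator that writes rows to stdout.
fake_generator = [sys.executable, '-c', "print('1|AAAA|10.5|'); print('2|BBBB||')"]
rows = list(stream_rows(fake_generator))
```

In a Spark job, rows produced this way inside each task could be turned into a DataFrame with an explicit schema, so no intermediate files or hadoop commands would be involved.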

Collaborator Author

As per Allen's comment, I can pick this up later as a separate issue to figure out whether there is an alternative solution that avoids chaining hadoop commands from Python.

mattahrens previously approved these changes Jun 24, 2024
@bilalbari bilalbari merged commit c41b702 into NVIDIA:dev Jul 3, 2024
2 checks passed
4 participants