Support barrier mode for mapInPandas/mapInArrow #10364

Merged 1 commit on Feb 2, 2024

wbo4958 (Collaborator) commented on Feb 2, 2024

Fixes #10344.

Spark 3.5 introduced barrier-mode support for mapInPandas/mapInArrow; see https://issues.apache.org/jira/browse/SPARK-42896 for details. spark-rapids did not implement this feature, which resulted in unexpected behavior. For example:

spark.range(10).mapInPandas(lambda x: x, "id long", True)

The tasks for the code above run in barrier mode on the CPU, but in non-barrier mode on the GPU with spark-rapids.
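For readers who want to try the call in isolation, here is a minimal sketch, assuming an active SparkSession on Spark 3.5 or later. The names `identity` and `map_in_barrier_mode` are hypothetical and are not part of this PR; the only Spark API used is `DataFrame.mapInPandas` with its `barrier` argument.

```python
from typing import Iterable, Iterator, TypeVar

Batch = TypeVar("Batch")


def identity(batches: Iterable[Batch]) -> Iterator[Batch]:
    """Pass every input batch through unchanged.

    mapInPandas calls this with an iterator of pandas.DataFrame batches
    and expects an iterator of pandas.DataFrame batches back.
    """
    for batch in batches:
        yield batch


def map_in_barrier_mode(spark, barrier: bool = True):
    """Build the query from the description (hypothetical helper).

    Assumption: `spark` is an active SparkSession on Spark 3.5+,
    where mapInPandas accepts the `barrier` keyword argument.
    """
    return spark.range(10).mapInPandas(identity, "id long", barrier=barrier)
```

Calling `map_in_barrier_mode(spark).collect()` with and without the spark-rapids plugin enabled is one way to compare how the stage is scheduled in each case.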

Signed-off-by: Bobby Wang <wbo4958@gmail.com>

wbo4958 (Collaborator, Author) commented on Feb 2, 2024

build

@@ -425,3 +426,41 @@ def test_func(spark):
lambda data: [pd.DataFrame([len(list(data))])], schema="ret:integer")

assert_gpu_and_cpu_are_equal_collect(test_func, conf=arrow_udf_conf)


@pytest.mark.skipif(is_before_spark_350(),
firestarman (Collaborator) commented on Feb 2, 2024

NIT: Better to ignore order for result comparison.

wbo4958 (Collaborator, Author) replied:

No need to do that. Since there is only 1 partition, the results will not come back out of order.
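The trade-off discussed in this thread can be sketched in plain Python. This is only an illustration under stated assumptions: `results_match` is a hypothetical helper, not the comparison function used by the spark-rapids integration tests, and the rows are assumed to be hashable values.

```python
from collections import Counter
from typing import Hashable, Sequence


def results_match(cpu_rows: Sequence[Hashable],
                  gpu_rows: Sequence[Hashable],
                  ignore_order: bool = False) -> bool:
    """Compare two collected results (hypothetical helper).

    With ignore_order=True the rows are compared as multisets, which
    tolerates partitions arriving in a different order. With a single
    partition the row order is already deterministic, so the strict
    comparison is enough.
    """
    if ignore_order:
        return Counter(cpu_rows) == Counter(gpu_rows)
    return list(cpu_rows) == list(gpu_rows)
```

With one partition the strict comparison is the stronger check, since it would also catch an unexpected reordering of rows.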

wbo4958 merged commit cca5955 into NVIDIA:branch-24.04 on Feb 2, 2024
40 of 41 checks passed
wbo4958 deleted the barrier branch on February 2, 2024 06:15
jlowe added a commit to jlowe/spark-rapids that referenced this pull request Feb 2, 2024
This reverts commit cca5955.

Signed-off-by: Jason Lowe <jlowe@nvidia.com>
jlowe added a commit that referenced this pull request Feb 2, 2024
…0369)

This reverts commit cca5955.

Signed-off-by: Jason Lowe <jlowe@nvidia.com>
sameerz added the "task" label (Work required that improves the product but is not user facing) on Feb 4, 2024
wbo4958 added a commit to wbo4958/spark-rapids that referenced this pull request Feb 5, 2024
Signed-off-by: Bobby Wang <wbo4958@gmail.com>
jlowe pushed a commit that referenced this pull request Feb 6, 2024
* Support barrier mode for mapInPandas/mapInArrow (#10364)

Signed-off-by: Bobby Wang <wbo4958@gmail.com>

* support databricks

* license

---------

Signed-off-by: Bobby Wang <wbo4958@gmail.com>