[SPARK-44548][PYTHON] Add support for pandas-on-Spark DataFrame assertDataFrameEqual #42158

Closed
wants to merge 12 commits into from

Conversation

asl3

@asl3 asl3 commented Jul 25, 2023

What changes were proposed in this pull request?

This PR adds pandas-on-Spark DataFrame support to the testing util assertDataFrameEqual.

Why are the changes needed?

The change allows users to call the same PySpark API, assertDataFrameEqual, for both Spark and pandas-on-Spark DataFrames. It also exposes a new user-facing API, assertPandasOnSparkEqual.
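The idea of one assertion entry point serving two DataFrame kinds can be sketched in plain Python. This is only an illustration of type-based dispatch, not the PySpark implementation: `SparkLikeFrame`, `PandasOnSparkLikeFrame`, `assert_frame_equal`, and `assert_pandas_on_spark_like_equal` are hypothetical stand-ins, and the row-order behaviour of each branch is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple

Row = Tuple[Any, ...]


@dataclass
class SparkLikeFrame:
    """Hypothetical stand-in for a Spark DataFrame."""
    rows: List[Row]


@dataclass
class PandasOnSparkLikeFrame:
    """Hypothetical stand-in for a pandas-on-Spark DataFrame."""
    rows: List[Row]


def assert_pandas_on_spark_like_equal(actual, expected) -> None:
    """Dedicated helper for the pandas-like kind (assumed order-sensitive here)."""
    if actual.rows != expected.rows:
        raise AssertionError(f"rows differ: {actual.rows} != {expected.rows}")


def assert_frame_equal(actual, expected) -> None:
    """Single entry point that picks a comparison based on the input type."""
    if type(actual) is not type(expected):
        raise TypeError("actual and expected must be the same kind of frame")
    if isinstance(actual, PandasOnSparkLikeFrame):
        # Delegate to the dedicated helper, which callers can also use directly.
        assert_pandas_on_spark_like_equal(actual, expected)
    elif isinstance(actual, SparkLikeFrame):
        # Assumption of this sketch: row order is ignored for this kind.
        if sorted(actual.rows) != sorted(expected.rows):
            raise AssertionError(f"rows differ: {actual.rows} != {expected.rows}")
    else:
        raise TypeError(f"unsupported frame type: {type(actual).__name__}")


if __name__ == "__main__":
    # Same call site for both kinds of frame.
    assert_frame_equal(SparkLikeFrame([(1, "a"), (2, "b")]),
                       SparkLikeFrame([(2, "b"), (1, "a")]))
    assert_frame_equal(PandasOnSparkLikeFrame([(1, "a")]),
                       PandasOnSparkLikeFrame([(1, "a")]))
```

Callers keep a single call site in their tests, while the dedicated helper remains available to anyone who wants the pandas-style comparison directly.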

Does this PR introduce any user-facing change?

Yes, the PR affects the user-facing util assertDataFrameEqual and exposes a new user-facing API, assertPandasOnSparkEqual.

How was this patch tested?

Added tests to python/pyspark/sql/tests/test_utils.py and python/pyspark/sql/tests/connect/test_utils.py and existing pandas util tests.

@github-actions github-actions bot added the BUILD label Jul 26, 2023
@HyukjinKwon HyukjinKwon changed the title [SPARK-44548] Add support for pandas DataFrame assertDataFrameEqual [SPARK-44548][PYTHON] Add support for pandas DataFrame assertDataFrameEqual Jul 26, 2023
@asl3 asl3 changed the title [SPARK-44548][PYTHON] Add support for pandas DataFrame assertDataFrameEqual [SPARK-44548][PYTHON] Add support for pandas-on-Spark DataFrame assertDataFrameEqual Jul 27, 2023
@HyukjinKwon

@itholic I think you should review this.

python/pyspark/errors/error_classes.py (resolved)
python/pyspark/pandas/tests/test_utils.py (resolved)
python/pyspark/testing/pandasutils.py (outdated, resolved)
python/pyspark/testing/utils.py (resolved)
@asl3 asl3 requested a review from itholic July 27, 2023 18:21

@itholic itholic left a comment

The overall structure looks fine to me, but I left some comments about refactoring the error classes and error messages.

Let's not forget to create the related tickets and resolve these in follow-ups.

python/pyspark/errors/error_classes.py (outdated, resolved)
python/pyspark/testing/pandasutils.py (outdated, resolved)
python/pyspark/testing/pandasutils.py (outdated, resolved)
python/pyspark/testing/pandasutils.py (outdated, resolved)
python/pyspark/testing/pandasutils.py (outdated, resolved)
python/pyspark/testing/pandasutils.py (outdated, resolved)
python/pyspark/testing/pandasutils.py (outdated, resolved)
python/pyspark/testing/utils.py (resolved)
@itholic

itholic commented Jul 28, 2023

Looks pretty good. cc @HyukjinKwon to confirm, now that CI has passed.

@HyukjinKwon

Merged to master and branch-3.5.

HyukjinKwon pushed a commit that referenced this pull request Jul 28, 2023
…tDataFrameEqual

### What changes were proposed in this pull request?
This PR adds support for pandas-on-Spark DataFrame for the testing util, `assertDataFrameEqual`

### Why are the changes needed?
The change allows users to call the same PySpark API for both Spark and pandas DataFrames.

### Does this PR introduce _any_ user-facing change?
Yes, the PR affects the user-facing util `assertDataFrameEqual`

### How was this patch tested?
Added tests to `python/pyspark/sql/tests/test_utils.py` and `python/pyspark/sql/tests/connect/test_utils.py` and existing pandas util tests.

Closes #42158 from asl3/pandas-or-pyspark-df.

Authored-by: Amanda Liu <amanda.liu@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 7c1ad5b)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
ragnarok56 pushed a commit to ragnarok56/spark that referenced this pull request Mar 2, 2024
…tDataFrameEqual

### What changes were proposed in this pull request?
This PR adds support for pandas-on-Spark DataFrame for the testing util, `assertDataFrameEqual`

### Why are the changes needed?
The change allows users to call the same PySpark API for both Spark and pandas DataFrames.

### Does this PR introduce _any_ user-facing change?
Yes, the PR affects the user-facing util `assertDataFrameEqual`

### How was this patch tested?
Added tests to `python/pyspark/sql/tests/test_utils.py` and `python/pyspark/sql/tests/connect/test_utils.py` and existing pandas util tests.

Closes apache#42158 from asl3/pandas-or-pyspark-df.

Authored-by: Amanda Liu <amanda.liu@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
4 participants