
[SPARK-21070][PYSPARK] Upgrade cloudpickle #18282

Closed
wants to merge 1 commit into from

Conversation

rgbkrk
Contributor

@rgbkrk rgbkrk commented Jun 12, 2017

What changes were proposed in this pull request?

This brings in fixes and upgrades from the [cloudpickle](https://github.com/cloudpipe/cloudpickle) module, notably:

* Import submodules accessed by pickled functions (cloudpipe/cloudpickle#80)
* Support recursive functions inside closures (cloudpipe/cloudpickle#89, cloudpipe/cloudpickle#90)
* Fix ResourceWarnings and DeprecationWarnings (cloudpipe/cloudpickle#88)
* Assume modules with a __file__ attribute are not dynamic (cloudpipe/cloudpickle#85)
* Make cloudpickle Python 3.6 compatible (cloudpipe/cloudpickle#72)
* Add the ability to pickle dynamically created modules (cloudpipe/cloudpickle#52)
* Allow pickling of builtin methods (cloudpipe/cloudpickle#57)
* Support method descriptors (cloudpipe/cloudpickle#46)
* No more pickling of closed files, which was broken on Python 3 (cloudpipe/cloudpickle#32)

How was this patch tested?

The cloudpickle project's own test suite covers the internal implementation. PySpark's integration tests exercise the CloudPickleSerializer.
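As context for why Spark bundles cloudpickle at all: the standard-library pickle serializes functions by reference (module plus qualified name), so it refuses locally defined closures, which is exactly what PySpark ships to executors. A minimal stdlib-only illustration (no Spark or cloudpickle required; `make_adder` is a hypothetical example function):

```python
import pickle

def make_adder(n):
    # A nested function closing over n: stdlib pickle cannot serialize
    # this, because "make_adder.<locals>.add" is not importable by name.
    def add(x):
        return x + n
    return add

try:
    pickle.dumps(make_adder(5))
    stdlib_ok = True
except (pickle.PicklingError, AttributeError, TypeError):
    stdlib_ok = False

print(stdlib_ok)  # False -- cloudpickle instead serializes such closures by value
```

cloudpickle sidesteps this by capturing the function's code object and closure cells directly, which is why it can also handle the recursive-closure cases fixed in cloudpipe/cloudpickle#89 and #90.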
@rgbkrk
Contributor Author

rgbkrk commented Jun 12, 2017

This fails one test, looking into it now.

@HyukjinKwon
Member

@rgbkrk, do you know whether there are critical issues here that might block Spark's release, and whether it would be easy to move to an explicit (pinned) dependency on cloudpickle, rather than keeping Spark's local copy of cloudpickle here?

I remember @holdenk raised a similar issue before.

@holdenk
Contributor

holdenk commented Jun 12, 2017

@HyukjinKwon thanks for bringing that up :) We had a PR to consider using py4j as a dependency rather than a copied zip file, and @rxin was against the idea. For now I think simply upgrading cloudpickle as-is makes sense, and we can revisit the dependency management if the support burden of this approach becomes difficult (or as we start taking on additional Python libraries).

@HyukjinKwon
Member

Yea, my only worry is that we already have some customized code in Spark's copy, and I wonder how difficult it will be to resolve the conflicts here (IIRC, it took me a while to check each conflict last time). I am +1 if those changes are easy to carry over. I will do my best to help review.

cc @davies too.

@llllllllll

I am not sure if this affects PySpark, but any function object serialized with cloudpickle 0.2.2 or earlier cannot be deserialized with 0.3.0 or later. I haven't investigated making this compatible because I am not sure it is needed, but I wanted to make sure you are aware.

@holdenk
Contributor

holdenk commented Jun 13, 2017

Thanks for that point, @llllllllll. Since cloudpickle isn't used for any data that is persisted for use between different versions of Spark, we don't need the format to be backwards compatible.

@holdenk
Contributor

holdenk commented Jun 15, 2017

Jenkins, ok to test. Jenkins add to whitelist.

@SparkQA

SparkQA commented Jun 15, 2017

Test build #78102 has finished for PR 18282 at commit b84222b.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class _empty_cell_value(object):

@rgbkrk
Contributor Author

rgbkrk commented Jun 19, 2017

Note: I don't think I'll have time to finish this PR for a few weeks, so it is freely available for someone else to take on.

@HyukjinKwon
Member

@rgbkrk Shall we close this and reopen when it's ready?

@rgbkrk
Contributor Author

rgbkrk commented Jun 26, 2017

Works for me; sorry I don't have cycles at the moment.

@rgbkrk rgbkrk closed this Jun 26, 2017
ghost pushed a commit to dbtsai/spark that referenced this pull request Aug 22, 2017
## What changes were proposed in this pull request?

Based on apache#18282 by @rgbkrk, this PR attempts to update to the current released cloudpickle and to minimize the difference between Spark's cloudpickle and "stock" cloudpickle, with the goal of eventually using stock cloudpickle.

Some notable changes:
* Import submodules accessed by pickled functions (cloudpipe/cloudpickle#80)
* Support recursive functions inside closures (cloudpipe/cloudpickle#89, cloudpipe/cloudpickle#90)
* Fix ResourceWarnings and DeprecationWarnings (cloudpipe/cloudpickle#88)
* Assume modules with a __file__ attribute are not dynamic (cloudpipe/cloudpickle#85)
* Make cloudpickle Python 3.6 compatible (cloudpipe/cloudpickle#72)
* Allow pickling of builtin methods (cloudpipe/cloudpickle#57)
* Add the ability to pickle dynamically created modules (cloudpipe/cloudpickle#52)
* Support method descriptors (cloudpipe/cloudpickle#46)
* No more pickling of closed files, which was broken on Python 3 (cloudpipe/cloudpickle#32)
* **Remove the non-standard __transient__ check (cloudpipe/cloudpickle#110)** -- while we don't use this internally and have no tests or documentation for its use, downstream code may rely on __transient__ even though it has never been part of the API; if we merge this we should include a note about it in the release notes.
* Support for pickling loggers (yay!) (cloudpipe/cloudpickle#96)
* BUG: Fix crash when pickling dynamic class cycles (cloudpipe/cloudpickle#102)

## How was this patch tested?

Existing PySpark unit tests, plus the unit tests from the cloudpickle project run on their own.
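The dynamic-module item above (cloudpipe/cloudpickle#52) is easy to motivate with the standard library alone: stdlib pickle rejects module objects outright, including ones built at runtime, which interactive and notebook code often produces. A small stdlib-only sketch (the module name "scratch" is just an example):

```python
import pickle
import types

# Build a module at runtime, as interactive or notebook code often does.
mod = types.ModuleType("scratch")
mod.answer = 42

try:
    pickle.dumps(mod)
    raised = False
except TypeError:
    raised = True

print(raised)  # True -- stdlib pickle cannot serialize module objects
```

cloudpickle's fix serializes such dynamically created modules by value (essentially their name and `__dict__` contents), so the receiving process can rebuild them without needing the module to be importable.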

Author: Holden Karau <holden@us.ibm.com>
Author: Kyle Kelley <rgbkrk@gmail.com>

Closes apache#18734 from holdenk/holden-rgbkrk-cloudpickle-upgrades.