
Remove non-standard __transient__ support #110

Merged
merged 1 commit into cloudpipe:master on Jul 25, 2017
Conversation

Peque
Member

@Peque Peque commented Jul 24, 2017

The __transient__ dunder attribute is not the standard way to prevent attributes from being pickled. Instead, the standard approach is to use the __getstate__ and __setstate__ magic methods.
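To illustrate the standard approach: a class that holds unpicklable or unwanted state drops it in `__getstate__` and rebuilds it in `__setstate__`. This is a minimal sketch with a hypothetical class (not from cloudpickle itself); the `_socket` attribute stands in for any transient resource:

```python
import pickle

class Connection:
    """Illustrative class: excludes a transient attribute via the
    standard pickle protocol instead of __transient__."""

    def __init__(self, host):
        self.host = host
        self._socket = object()  # stand-in for an unpicklable resource

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["_socket"]  # exclude the transient attribute
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._socket = None  # re-created lazily after unpickling

conn = pickle.loads(pickle.dumps(Connection("example.org")))
```

This works with plain `pickle`, `cloudpickle`, and any other protocol-compliant pickler, which is exactly why the special-cased `__transient__` path is redundant.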

Considering that:

  • This is an old fix, possibly implemented for Python versions that are no longer supported (e.g. Python 2.6).
  • Nobody knows what it is doing there exactly or why it was added.
  • Having this special non-standard case implemented makes the code more complex and may result in unexpected behavior (see The __transient__ attribute. What is it? Could it be thrown away? #108).
  • This code was not even covered by tests, so removing it should increase code coverage and make the module more robust.

I propose removing support for this non-standard approach entirely.

As mentioned in #108, some other projects might be using this attribute. But when looking at those projects:

  • Most simply have copies of the cloudpickle.py file (hence the match when searching for __transient__).
  • Others simply seem to be using __transient__ but without depending on cloudpickle as an external module dependency.
  • Most seem to be of limited relevance (e.g. fewer than 2 stars).

Even though this change may break others' code, I think that is an unlikely scenario. And if it does happen, those projects should be fixing their code rather than making cloudpickle carry ugly workarounds. They can also always pin an older cloudpickle version from PyPI.

Fixes #108.
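For any downstream code that did rely on the attribute, the migration is mechanical. Assuming `__transient__` listed attribute names to drop from the pickled state (its behavior in the old cloudpickle code), the same effect is portable via `__getstate__`. The `Job` class below is hypothetical:

```python
import pickle

# Before (non-standard; only honored by old cloudpickle):
#
#     class Job:
#         __transient__ = ["scratch"]
#
# After (standard pickle protocol, works with any pickler):

class Job:
    _TRANSIENT = ("scratch",)  # attribute names to drop when pickling

    def __init__(self, name):
        self.name = name
        self.scratch = [1, 2, 3]  # temporary working data

    def __getstate__(self):
        # Return a state dict without the transient attributes.
        return {k: v for k, v in self.__dict__.items()
                if k not in self._TRANSIENT}

restored = pickle.loads(pickle.dumps(Job("resize")))
```

Since no `__setstate__` is defined, pickle simply updates `__dict__` from the returned state, so the transient attribute is absent on the restored object.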

@codecov-io

codecov-io commented Jul 24, 2017

Codecov Report

Merging #110 into master will increase coverage by 1.77%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #110      +/-   ##
==========================================
+ Coverage   81.34%   83.12%   +1.77%     
==========================================
  Files           2        2              
  Lines         563      551      -12     
  Branches      113      107       -6     
==========================================
  Hits          458      458              
+ Misses         75       65      -10     
+ Partials       30       28       -2
Impacted Files Coverage Δ
cloudpickle/cloudpickle.py 83.02% <ø> (+1.77%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d986894...bb3708c.

@rgbkrk
Member

rgbkrk commented Jul 24, 2017

I think this is good reasoning, and we'll want to make a major version release for this change if others agree this is a good path forward.

/cc @ogrisel and @mrocklin

@irmen

irmen commented Jul 24, 2017

awesome analysis btw @Peque

@mrocklin
Contributor

I have no objection to removing support for this non-standard protocol

@holdenk

holdenk commented Jul 24, 2017

Anyone mind if I give this a quick spin in Spark to see if we run into any issues first? We don't directly use the library, but I'm eternally hopeful I'll find a way to migrate us to using the common cloudpickle.

@rgbkrk
Member

rgbkrk commented Jul 24, 2017

Anyone mind if I give this a quick spin in Spark to see if we run into any issues first?

Happy to wait to see.

@irmen

irmen commented Jul 24, 2017

@holdenk unrelated to this particular pull request but I'm curious: do you know if this is in any way related to the use of regular Python pickle + my pyrolite library (that provides a java implementation of the pickle protocol), in Spark?

@holdenk

holdenk commented Jul 24, 2017

@irmen I'm not sure I'm resolving the "this" in "do you know if this is in any way related to the use of regular Python pickle" correctly - what are you referencing?

@irmen

irmen commented Jul 24, 2017

@holdenk not really sure myself, tbh... I'm not familiar with [py]Spark. I guess I'm looking for the reason cloudpickle is used in Spark when pickle (+pyrolite for Java connectivity) is used as well. Maybe I'm confusing PySpark and Spark itself.

@holdenk

holdenk commented Jul 24, 2017

So from memory, the CloudPickle serializer is used for "commands" or functions (as well as things included in their scope, like broadcast variables). I think it's just for better handling of serialized functions and their dependencies, whereas for the contents of data we need to be able to parse the keys (which are encoded as long values) in Java - we don't have to understand any part of the function in Java.

Of course, this is based on fuzzy memories of long-ago discussions, so don't hold me to 100% accuracy, and before you make any decisions on this, reach out and we can sort out the real reasons (or a close enough approximation for whatever you need).

@rgbkrk
Member

rgbkrk commented Jul 24, 2017

I got a little curious if it would be helpful if PySpark and Spark (in Scala) used pickles for the interchange format using pyrolite on the JVM side.

@holdenk

holdenk commented Jul 24, 2017

So at the end of the day I think we're going to start seeing different formats being used for interchange in Spark (like Arrow) - so I think the best path forward is seeing what happens there. I'm also going to perhaps take this opportunity to resurrect @rgbkrk's work on getting Spark using a current version of cloudpickle, since I need to do some of that to test this PR with Spark anyway.

@irmen

irmen commented Jul 25, 2017

@rgbkrk as far as I know, Spark already uses pickle + pyrolite.

@rgbkrk
Member

rgbkrk commented Jul 25, 2017

Oh silly me.

@holdenk

holdenk commented Jul 25, 2017

So I'm seeing some failures with this in Spark, but my build was in a bit of a bad state, so I'm going to kick off a fully clean build and see where this takes us. Sorry for my slowness at looking at this on the Spark side; I've been running around a lot the past few days.

@irmen

irmen commented Jul 25, 2017

no worries

@holdenk

holdenk commented Jul 25, 2017

LGTM. There are some outstanding issues, but they aren't impacted by this PR.

@rgbkrk rgbkrk merged commit eb49847 into cloudpipe:master Jul 25, 2017
@rgbkrk
Member

rgbkrk commented Jul 25, 2017

Thanks for checking!

@Peque Peque deleted the transient branch July 26, 2017 08:05
ghost pushed a commit to dbtsai/spark that referenced this pull request Aug 22, 2017
## What changes were proposed in this pull request?

Based on apache#18282 by rgbkrk, this PR attempts to update to the current released cloudpickle and minimize the difference between Spark's cloudpickle and "stock" cloudpickle, with the goal of eventually using the stock cloudpickle.

Some notable changes:
* Import submodules accessed by pickled functions (cloudpipe/cloudpickle#80)
* Support recursive functions inside closures (cloudpipe/cloudpickle#89, cloudpipe/cloudpickle#90)
* Fix ResourceWarnings and DeprecationWarnings (cloudpipe/cloudpickle#88)
* Assume modules with __file__ attribute are not dynamic (cloudpipe/cloudpickle#85)
* Make cloudpickle Python 3.6 compatible (cloudpipe/cloudpickle#72)
* Allow pickling of builtin methods (cloudpipe/cloudpickle#57)
* Add ability to pickle dynamically created modules (cloudpipe/cloudpickle#52)
* Support method descriptor (cloudpipe/cloudpickle#46)
* No more pickling of closed files, was broken on Python 3 (cloudpipe/cloudpickle#32)
* **Remove non-standard __transient__ check (cloudpipe/cloudpickle#110)** -- while we don't use this internally, and have no tests or documentation for its use, downstream code may use __transient__ even though it has never been part of the API; if we merge this we should include a note about it in the release notes.
* Support for pickling loggers (yay!) (cloudpipe/cloudpickle#96)
* BUG: Fix crash when pickling dynamic class cycles. (cloudpipe/cloudpickle#102)

## How was this patch tested?

Existing PySpark unit tests + the unit tests from the cloudpickle project on their own.

Author: Holden Karau <holden@us.ibm.com>
Author: Kyle Kelley <rgbkrk@gmail.com>

Closes apache#18734 from holdenk/holden-rgbkrk-cloudpickle-upgrades.