Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ability to pickle dynamically create modules #52

Merged
merged 1 commit into from
Feb 3, 2016

Conversation

rodrigofarnhamsc
Copy link
Contributor

The old logic treated all modules the same, which would fail when unpickling.
In save_module detect whether the module has been dynamically created by following the chain of imports. Noteworthy is that imp.find_module doesn't work with submodules (example sckit.tree), so we actually have to split the module name and iterate over each piece.

Dynamic modules are saved as dictionaries and reconstituted by dynamic_subimport function. While working on the test cases I discovered NotImplemented and Ellipsis also don't work properly (they are introduced into the test dynamic module by exec). I've also addressed that.

@rgbkrk
Copy link
Member

rgbkrk commented Jan 30, 2016

What do you think @mrocklin? Seems like a good contribution.

_, path, _ = imp.find_module(part, path)
is_dynamic = False
except ImportError:
is_dynamic = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to pull this logic out into a separate function?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree

@ogrisel
Copy link
Contributor

ogrisel commented Jan 31, 2016

LGTM but I think the commits could be all squashed together.

@ygravrand
Copy link
Contributor

Great contribution, thanks !

@jakirkham
Copy link
Member

Sounds like you are already fixing a case that I am interested in. :) Namely handling Ellipsis objects. ( #54 ) Feel free to close it once this is merged.

@rodrigofarnhamsc
Copy link
Contributor Author

I've extracted out the _find_module function and squashed the commits together.

@jakirkham
Copy link
Member

Would it be worth testing the _find_module function directly? Honestly, idk because it seems it is being covered indirectly and you have marked it as an internal detail, but something to think about.

@rodrigofarnhamsc rodrigofarnhamsc force-pushed the dynamicimports branch 2 times, most recently from 3b8b90c to 340d175 Compare February 1, 2016 21:32
@rodrigofarnhamsc
Copy link
Contributor Author

I had forgotten to pass obj to save_reduce which meant dynamic modules were not treated as singletons. That actually raises an interesting question of how to handle unpickling dynamic modules in environment where there is naming collisions. For example:

my_mod = dyn_mod_gen()
m1, m2 = pickle_depickle([my_mod, my_mod])

# m1 is identical to m2, but they are different from my_mod

One possible way to address that issue is to peek at sys.modules and try to match module names with existing modules. But that seems very magical, and imp.new_module doesn't add created modules to sys.modules.

For my use case, Big Query API in python notebooks produces dynamically generated modules which I want to save for later.

@rodrigofarnhamsc
Copy link
Contributor Author

@jakirkham I can add a test for _find_module. I just didn't want to crowd the test file with my so many tests for my feature, considering the other, arguably more important features, are not as densely tested.

self.modules.add(obj)
self.save_reduce(subimport, (obj.__name__,), obj=obj)
if is_dynamic:
self.save_reduce(dynamic_subimport, (obj.__name__, vars(obj)), obj=obj)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

obj=obj here is actually important to enable memoization, but it doesn't ensure that the produced dynamic modules will be singletons. Example:

dynmod.x = 1
m1, m2 = pickle.depickle([dynmod, dynmod])
m1.x = 2  # m1 and m2 are identical, m2.x is also 2
m3 = pickle.depickle(dynmod)
# m3 is a different module altogether

@jakirkham
Copy link
Member

My feeling at least is that having good test coverage of the handling of dynamic modules will help keep things sane. Again they will probably be copy-and-paste versions of your existing tests that are trimmed down a bit. Still, if something goes wrong here, it is nice to be able to rule out some simpler cases and having these already written out will make it easier to construct variants. Again just a thought.

@rodrigofarnhamsc
Copy link
Contributor Author

@jakirkham, I added tests for _find_module. I wish there was a way for me to verify that it correctly finds submodules. I could for example use scipy.io which is a submodule incorrectly handled by imp.find_module. But if scipy were refactored such that scipy.io was no longer a submodule the test would not longer work.

Add test cases, special case Ellipsis and NotImplemented.
Use custom logic in lieu of imp.find_module to properly follow subimports. For example sklearn.tree was spuriously treated as a dynamic module.
@rgbkrk
Copy link
Member

rgbkrk commented Feb 2, 2016

if scipy were refactored such that scipy.io was no longer a submodule the test would not longer work.

we can always change that again later, and it would just be a test/dev dependency

@rodrigofarnhamsc
Copy link
Contributor Author

@rgbkrk the test would have to be of the following form then:

with pytest.raises(ImportError):
  imp.find_module('scipy.io')

find_module('scipy.io')

To ensure that io remains a submodule of scipy.

@jakirkham
Copy link
Member

What about doing something simple like testing cloudpickle.cloudpickle or would that not work?

Alternatively, I suppose you could create a dummy submodule in the tests directory for this purpose. A bit ugly perhaps, but not unreasonable. Something like foo/bar so it would not be picked up by the test runner.

@jakirkham
Copy link
Member

Interesting fact, apparently Python 3.3+ fixed the pickling of NotImplemented and Ellipsis. Though this is not the case for Python 2.7.11. They also showed up in the exact same bug report. ( http://bugs.python.org/issue13842 ) So must just not be backported, but not sure why it hasn't happened yet.

rgbkrk added a commit that referenced this pull request Feb 3, 2016
Add ability to pickle dynamically create modules
@rgbkrk rgbkrk merged commit 7997901 into cloudpipe:master Feb 3, 2016
@rgbkrk
Copy link
Member

rgbkrk commented Feb 3, 2016

Thank you @rodrigofarnhamsc! I finally noticed the squashing of commits, sorry that took a little while to come back to.

@jakirkham
Copy link
Member

Thanks. This is awesome. Any chance we might see this in a release any time soon?

@rgbkrk
Copy link
Member

rgbkrk commented Feb 3, 2016

We were giving @mrocklin some time for #45 (see #46 (comment)), though we can push it out sooner.

@mrocklin
Copy link
Contributor

mrocklin commented Feb 3, 2016

I don't plan to have a solution to that in the near future. I think we should release soon.

@jakirkham
Copy link
Member

No need to rush. I built a development copy of cloudpickle with these changes, which is serving me quite nicely. Just please let me know when it comes out so I can switch over to a proper release.

@rgbkrk
Copy link
Member

rgbkrk commented Feb 3, 2016

@mrocklin sounds good!

@jakirkham
Copy link
Member

Somehow it didn't show @mrocklin's comment while I was writing that. Either way. I'll let you guys decide.

rgbkrk added a commit to rgbkrk/spark that referenced this pull request Jun 12, 2017
This brings in fixes and upgrades from the [cloudpickle](https://github.com/cloudpipe/cloudpickle) module, notably:

* Import submodules accessed by pickled functions (cloudpipe/cloudpickle#80)
* Support recursive functions inside closures (cloudpipe/cloudpickle#89, cloudpipe/cloudpickle#90)
* Fix ResourceWarnings and DeprecationWarnings (cloudpipe/cloudpickle#88)
* Assume modules with __file__ attribute are not dynamic (cloudpipe/cloudpickle#85)
* Make cloudpickle Python 3.6 compatible (cloudpipe/cloudpickle#72)
* Allow pickling of builtin methods (cloudpipe/cloudpickle#57)
* Add ability to pickle dynamically created modules (cloudpipe/cloudpickle#52)
* Support method descriptor (cloudpipe/cloudpickle#46)
* No more pickling of closed files, was broken on Python 3 (cloudpipe/cloudpickle#32)
ghost pushed a commit to dbtsai/spark that referenced this pull request Aug 22, 2017
## What changes were proposed in this pull request?

Based on apache#18282 by rgbkrk this PR attempts to update to the current released cloudpickle and minimize the difference between Spark cloudpickle and "stock" cloud pickle with the goal of eventually using the stock cloud pickle.

Some notable changes:
* Import submodules accessed by pickled functions (cloudpipe/cloudpickle#80)
* Support recursive functions inside closures (cloudpipe/cloudpickle#89, cloudpipe/cloudpickle#90)
* Fix ResourceWarnings and DeprecationWarnings (cloudpipe/cloudpickle#88)
* Assume modules with __file__ attribute are not dynamic (cloudpipe/cloudpickle#85)
* Make cloudpickle Python 3.6 compatible (cloudpipe/cloudpickle#72)
* Allow pickling of builtin methods (cloudpipe/cloudpickle#57)
* Add ability to pickle dynamically created modules (cloudpipe/cloudpickle#52)
* Support method descriptor (cloudpipe/cloudpickle#46)
* No more pickling of closed files, was broken on Python 3 (cloudpipe/cloudpickle#32)
* ** Remove non-standard __transient__check (cloudpipe/cloudpickle#110)** -- while we don't use this internally, and have no tests or documentation for its use, downstream code may use __transient__, although it has never been part of the API, if we merge this we should include a note about this in the release notes.
* Support for pickling loggers (yay!) (cloudpipe/cloudpickle#96)
* BUG: Fix crash when pickling dynamic class cycles. (cloudpipe/cloudpickle#102)

## How was this patch tested?

Existing PySpark unit tests + the unit tests from the cloudpickle project on their own.

Author: Holden Karau <holden@us.ibm.com>
Author: Kyle Kelley <rgbkrk@gmail.com>

Closes apache#18734 from holdenk/holden-rgbkrk-cloudpickle-upgrades.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants