Pickle Support #100

Open
SerialDev opened this issue Dec 27, 2017 · 18 comments
Comments

@SerialDev

SerialDev commented Dec 27, 2017

As of right now, it's not possible to pickle classes created by PyO3.
This feature would be invaluable in situations where some form of persistence is desirable.

Currently, pickling fails after I call

    #[new]
    fn __new__(obj: &PyRawObject) -> PyResult<()>{

Otherwise, the .__dict__ attributes are maintained prior to initialization with __new__.

@rth

rth commented Dec 30, 2018

Would this mean implementing __getstate__ and __setstate__ methods (cf https://docs.python.org/3/library/pickle.html#pickling-class-instances)?

For instance, the way pickling works for,

might provide some examples.
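
To make the question concrete, here is a minimal pure-Python sketch of the protocol described in the linked documentation. The class `Counter` and its single field are hypothetical stand-ins for a PyO3 class; the sketch only shows which hooks pickle calls, not how PyO3 would expose them.

```python
import pickle


class Counter:
    """Hypothetical stand-in for a class whose whole state is one integer."""

    def __init__(self, num=0):
        self.num = num

    def __getstate__(self):
        # Called while pickling: return a picklable description of the state.
        return self.num

    def __setstate__(self, state):
        # Called while unpickling, on a new instance created without __init__.
        self.num = state


restored = pickle.loads(pickle.dumps(Counter(42)))
print(restored.num)
```

Note that the class itself is stored by reference (module and name), so it must be importable when the data is loaded.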

@rth

rth commented May 5, 2019

For instance, if we take the documentation example for MyClass,

use pyo3::prelude::*;
use pyo3::PyRawObject;

#[pyclass]
struct MyClass {
    num: i32,
}

#[pymethods]
impl MyClass {
    #[new]
    fn new(obj: &PyRawObject, num: i32) {
        obj.init(MyClass { num });
    }
}

by default we get the following error when pickling this class,

        obj = MyClass()
    
>       pickle.dumps(obj)
E       TypeError: can't pickle MyClass objects

If we now add the __getstate__/__setstate__ methods,

    fn __getstate__(&self) -> PyResult<i32> {
        Ok(self.num)
    }

    fn __setstate__(&mut self, state: i32) -> PyResult<()> {
        self.num = state;
        Ok(())
    }

we get another exception,

_pickle.PicklingError: Can't pickle <class 'MyClass'>: attribute lookup MyClass on builtins failed

There is some additional step I must be missing here.

@althonos
Member

althonos commented May 6, 2019

@rth: this may be related to the fact that PyO3 exposes all classes as part of the builtins module, because the import mechanism has not been properly implemented. As a result, pickle tries to look up builtins.MyClass and fails with the error you reported.
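
The lookup failure can be reproduced without PyO3. The sketch below uses a hypothetical pure-Python `MyClass` and overwrites its `__module__` to mimic a class that claims to live in `builtins`; the helper `can_pickle` is also made up for illustration.

```python
import pickle


class MyClass:
    """Hypothetical pure-Python class standing in for the PyO3 one."""


def can_pickle(obj):
    # pickle stores classes by reference: it imports obj's module by name
    # and looks the class up there, raising PicklingError if that fails.
    try:
        pickle.dumps(obj)
        return True
    except pickle.PicklingError:
        return False


original_module = MyClass.__module__

# Pretend the class lives in `builtins`, where no such attribute exists.
MyClass.__module__ = "builtins"
broken = can_pickle(MyClass())

# Point __module__ back at the module that really defines the class.
MyClass.__module__ = original_module
fixed = can_pickle(MyClass())

print(broken, fixed)
```

This suggests that anything which makes `__module__` name an importable module containing the class should be enough for pickle to find it.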

@rth

rth commented May 6, 2019

Thanks @althonos! I opened a separate issue about it in #474.

@rth

rth commented May 6, 2019

So by subclassing to set the __module__ correctly, as suggested in #474 (comment), pickling seems to work.

Though, I occasionally get a segfault at exit (it does seem to be random). For instance, when running a pytest session where one test checks pickling,

gdb --args python3.7 -m pytest -k test_pickle
GNU gdb (GDB) CentOS (7.0.1-45.el5.centos)
[...]
Reading symbols from /opt/_internal/cpython-3.7.1/bin/python3.7...(no debugging symbols found)...done.
(gdb) run
Starting program: /opt/_internal/cpython-3.7.1/bin/python3.7 -m pytest -k test_pickle
warning: Error disabling address space randomization: Operation not permitted
============================================================= test session starts =============================================================
platform linux -- Python 3.7.1, pytest-4.4.1, py-1.8.0, pluggy-0.9.0 -- /opt/_internal/cpython-3.7.1/bin/python3.7
cachedir: .pytest_cache
rootdir: /src/python
collected 1 items  / 1 selected                                                                                               

my_module/test_pickle.py::test_pickle PASSED

=================================================== 1 passed  in 0.13 seconds ===================================================
During startup program terminated with signal SIGSEGV, Segmentation fault.
(gdb) bt
No stack.

and there is no backtrace. Will try to investigate it later.

@konstin
Member

konstin commented May 6, 2019

The segfault likely occurs because subclassing is broken

@gilescope
Contributor

How about trying dill? Pickle can't handle lots of pure python serialisation cases.
https://pypi.org/project/dill/

@davidhewitt
Member

Not sure if it's interesting; this snippet just got shared on gitter. https://gist.github.com/ethanhs/fd4123487974c91c7e5960acc9aa2a77

@shaolo1

shaolo1 commented Oct 19, 2020

I've got a simple struct that I need to deepcopy. I'm trying to figure out how to pickle my struct (after getting the TypeError: cannot pickle error). The gist above shows how to do it for a single member, but I'm too much of a newb to see how to do this with multiple members.

I tried

pub fn __getstate__(&self, py: Python) -> PyResult<PyObject> {
        Ok(PyBytes::new(py, &serialize(&self.foo).unwrap()).to_object(py))
        Ok(PyBytes::new(py, &serialize(&self.bar).unwrap()).to_object(py))
    }

...but I get the error "expected one of ., ;, ?, }, or an operator" after the first Ok.

@davidhewitt
Member

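@shaolo1 I would just return the tuple of members:

    // Assumes `use pyo3::exceptions::PyValueError;` and that the error returned
    // by `serialize` implements `Display` (bincode's does).
    pub fn __getstate__(&self, py: Python) -> PyResult<PyObject> {
        let foo = serialize(&self.foo).map_err(|e| PyValueError::new_err(e.to_string()))?;
        let bar = serialize(&self.bar).map_err(|e| PyValueError::new_err(e.to_string()))?;
        Ok((PyBytes::new(py, &foo), PyBytes::new(py, &bar)).to_object(py))
    }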

@shaolo1

shaolo1 commented Oct 24, 2020

@davidhewitt Thanks. I'll try that if I encounter it again. I got around the problem by implementing deepcopy in the parent object and handling the copy there, so that pickle support was not needed in my Rust object.

@kylecarow

kylecarow commented Aug 18, 2022

I was able to enable pickling by writing the __getstate__, __setstate__, and __getnewargs__ magic methods in pymethods for a pure Rust project using bincode::{deserialize, serialize}. In __getnewargs__ you need to return a tuple of all the arguments __new__ will use on deserialization. Otherwise you'll see something like TypeError: MyStruct.__new__() missing 2 required positional arguments: 'my_first_arg' and 'my_second_arg'.

Here is a generic example:

pub fn __setstate__(&mut self, state: Vec<u8>) -> PyResult<()> {
    *self = deserialize(&state).unwrap();
    Ok(())
}
pub fn __getstate__(&self) -> PyResult<Vec<u8>> {
    Ok(serialize(&self).unwrap())
}
pub fn __getnewargs__(&self) -> PyResult<(f64, f64)> {
    Ok((self.my_first_arg, self.my_second_arg))
}
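
The role of __getnewargs__ can be seen in plain Python as well. The following is a rough sketch with hypothetical classes `NoNewArgs` and `WithNewArgs`, whose `__new__` takes two required arguments in the same way a `#[new]` with parameters does.

```python
import pickle


class NoNewArgs:
    """Hypothetical class whose __new__ requires two arguments."""

    def __new__(cls, first, second):
        self = super().__new__(cls)
        self.first, self.second = first, second
        return self


class WithNewArgs(NoNewArgs):
    def __getnewargs__(self):
        # pickle stores this tuple and passes it to __new__ when loading.
        return (self.first, self.second)


def roundtrip(obj):
    return pickle.loads(pickle.dumps(obj))


try:
    roundtrip(NoNewArgs(1.0, 2.0))
    error = None
except TypeError as exc:
    # Loading calls __new__ with no arguments, so it fails here.
    error = exc

restored = roundtrip(WithNewArgs(1.0, 2.0))
print(type(error).__name__, restored.first, restored.second)
```

Dumping succeeds in both cases; only loading the first class fails, which matches the point at which the TypeError above shows up.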

Also, here is a code example for the workaround @shaolo1 mentioned. Cloning for deepcopy may be faster than serializing & deserializing (which I guess is how Python deepcopies normally?), but I haven't tested that.

pub fn copy(&self) -> Self {self.clone()}
pub fn __copy__(&self) -> Self {self.clone()}
pub fn __deepcopy__(&self, _memo: &PyDict) -> Self {self.clone()}

That'll allow you to return a clone using copy.copy(), copy.deepcopy(), or by calling the .copy() method.
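
On the guess about how Python deep-copies by default: without a __deepcopy__ method, copy.deepcopy falls back to the same reduce hooks that pickle uses, although it rebuilds the object directly rather than going through a byte string. A small sketch with hypothetical classes `ViaReduce` and `ViaDeepcopy` that records which hook gets called:

```python
import copy


class ViaReduce:
    """Hypothetical class that only implements the pickle protocol."""

    calls = []  # shared log of which hook was invoked

    def __init__(self, value):
        self.value = value

    def __reduce__(self):
        ViaReduce.calls.append("__reduce__")
        return (ViaReduce, (self.value,))


class ViaDeepcopy(ViaReduce):
    def __deepcopy__(self, memo):
        # When present, this takes priority and __reduce__ is never consulted.
        ViaReduce.calls.append("__deepcopy__")
        return ViaDeepcopy(self.value)


first = copy.deepcopy(ViaReduce(1))
second = copy.deepcopy(ViaDeepcopy(2))
print(ViaReduce.calls)
```

So a __deepcopy__ that clones directly skips the reduce step entirely, which is consistent with it being the cheaper path.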

Edits:

  • Also important to note: I needed to change #[pyclass] to #[pyclass(module = "mymodulename")]
  • It seems like bincode is performing rather slowly, so I'm trying to figure out how to use serde_bytes to speed things up. Maybe in conjunction with PyBytes? Though I want to avoid the GIL wherever I possibly can.

@davidhewitt
Member

Yes, returning Vec<u8> converts the data into a Python list, one element per byte. I think you do need to use PyBytes here. Wanting to avoid the GIL is not relevant in this case, because these are Python methods you're implementing, so the GIL is already held whenever they are called.

I think you want something like this:

pub fn __setstate__(&mut self, state: &PyBytes) -> PyResult<()> {
    *self = deserialize(state.as_bytes()).unwrap();
    Ok(())
}
pub fn __getstate__<'py>(&self, py: Python<'py>) -> PyResult<&'py PyBytes> {
    Ok(PyBytes::new(py, serialize(&self).unwrap()))
}
pub fn __getnewargs__(&self) -> PyResult<(f64, f64)> {
    Ok((self.my_first_arg, self.my_second_arg))
}

I would also strongly recommend you replace .unwrap() with conversion to actual PyResult errors :)
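
The difference between a list of integers and a bytes object is also visible on the Python side alone. This sketch only compares what pickle has to store in each case; the payload is made up, and it says nothing about the PyO3 conversion cost itself.

```python
import pickle

# 1024 bytes of hypothetical serialized state.
payload = bytes(range(256)) * 4

# A list of ints is pickled element by element.
as_list = pickle.dumps(list(payload))

# A bytes object is written as one contiguous block.
as_bytes = pickle.dumps(payload)

print(len(as_list), len(as_bytes))
```

The list form is larger and has to be processed one element at a time, which points in the same direction as the advice to return PyBytes.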

@kylecarow

kylecarow commented Aug 22, 2022

Woah, yeah that sped up my round trip serializing and deserializing benchmark by 100x. And thanks for the tip about PyResult errors. I did have to modify __getstate__ ever so slightly to add a reference:

pub fn __getstate__<'py>(&self, py: Python<'py>) -> PyResult<&'py PyBytes> {
    Ok(PyBytes::new(py, &serialize(&self).unwrap()))
}

I also did some benchmarking with my structs regarding the performance of cloning vs. roundtrip pickling and bincode serde, which might be useful to someone:

  • Having a __deepcopy__ pymethod that calls .clone() is by far the fastest way I've found of copying a PyO3 object. My benchmark took 1.38 usec.
  • The next best thing is having bincode serde methods, which took 15.6 usec roundtrip (before the PyBytes change it took 1.28 msec):
    pub fn to_bincode<'py>(&self, py: Python<'py>) -> PyResult<&'py PyBytes> {
        Ok(PyBytes::new(py, &serialize(&self).unwrap()))
    }
    #[classmethod]
    pub fn from_bincode(_cls: &PyType, encoded: &PyBytes) -> PyResult<Self> {
        Ok(deserialize(encoded.as_bytes()).unwrap())
    }
  • The least performant is pickling, as expected. I guess Python has a lot more overhead here. It took 439 usec roundtrip.

@lycantropos
Contributor

lycantropos commented Jun 13, 2023

Since __setstate__ requires a mutable reference, is it possible to have pickle support for a #[pyclass(frozen)] class?

Never mind, I've switched to the __reduce__ method:

https://github.com/lycantropos/rithm/blob/765d1990800d47e169f84912b16a9857c0575fff/src/lib.rs#L441-L449

@davidhewitt
Member

You can also use __getnewargs__ or __getnewargs_ex__, which is the simplest option if you can pass all your state directly back to #[new] when unpickling (I would guess this is true for most frozen classes).
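
For readers unfamiliar with these hooks, here is a pure-Python sketch of pickling an immutable object without __setstate__. The `Ratio` class is hypothetical and only mimics a frozen class by rejecting attribute assignment; it rebuilds itself through its constructor via __reduce__.

```python
import pickle


class Ratio:
    """Hypothetical immutable value type, standing in for a frozen class."""

    __slots__ = ("_num", "_den")

    def __init__(self, num, den):
        # Bypass our own __setattr__ once, during construction only.
        object.__setattr__(self, "_num", num)
        object.__setattr__(self, "_den", den)

    def __setattr__(self, name, value):
        raise AttributeError("Ratio is immutable")

    def __reduce__(self):
        # Rebuild through the constructor, so no mutation is needed on load.
        return (Ratio, (self._num, self._den))


restored = pickle.loads(pickle.dumps(Ratio(3, 4)))
print(restored._num, restored._den)
```

Because all state travels through the constructor arguments, nothing ever needs a mutable reference to the restored object.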

@Stargateur

Is there no way to avoid constructing the object first? I can't understand why it is required to construct the object and THEN set it to a specific state. https://peps.python.org/pep-0307 is very hard to read.

@Stargateur

If, like me, you have trouble using __reduce__, here is a very simple example:

#[staticmethod]
pub fn deserialize(data: Vec<u8>) -> Self {
    Foo {
        inner: rmp_serde::from_slice(&data).unwrap(),
    }
}

pub fn __reduce__(&self) -> (PyObject, PyObject) {
    Python::with_gil(|py| {
        py.run_bound("import mylib", None, None).unwrap();
        let deserialize = py.eval_bound("mylib.Foo.deserialize", None, None).unwrap();
        let data = rmp_serde::to_vec(&self.inner).unwrap();
        (deserialize.to_object(py), (data,).to_object(py))
    })
}

If someone has a way to avoid run_bound() and eval_bound(), I'm all ears. It looks like https://docs.rs/pyo3/0.22.2/pyo3/types/struct.PyCFunction.html#method.new_with_keywords_bound could be used, but I don't know how.
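
One possible direction, sketched in pure Python only: in Python, the callable returned from __reduce__ can be fetched as an attribute of the object's own type, with no import or eval of a dotted name. The `Foo` class below is a hypothetical Python mirror of the Rust one. Whether the equivalent attribute lookup on the PyO3 type object pickles cleanly is an assumption that is not verified here.

```python
import pickle


class Foo:
    """Hypothetical Python mirror of the Rust `Foo` above."""

    def __init__(self, inner):
        self.inner = inner

    @staticmethod
    def deserialize(data):
        return Foo(list(data))

    def __reduce__(self):
        # Fetch the constructor function from the type itself instead of
        # importing the module and evaluating "mylib.Foo.deserialize".
        return (type(self).deserialize, (bytes(self.inner),))


restored = pickle.loads(pickle.dumps(Foo([1, 2, 3])))
print(restored.inner)
```

pickle still stores the function by its qualified name, so the class must be reachable from an importable module when the data is loaded.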
