
Make PyDict iterator compatible with free-threaded build #4439

Open · wants to merge 5 commits into main from bschoenmaeckers:PyDict_next_lock

Conversation

@bschoenmaeckers (Contributor) commented Aug 14, 2024

This pulls in the PyCriticalSection_Begin & PyCriticalSection_End functions, new in Python 3.13, and uses them to lock the PyDict iterators as described here. I'm not sure about the PyCriticalSection struct definition: we cannot use the opaque_struct! macro to define this struct, because we have to allocate enough space on the stack to pass the uninitialized pointer to PyCriticalSection_Begin. So some help would be appreciated!

depends on #4421
related to #4265

@bschoenmaeckers bschoenmaeckers force-pushed the PyDict_next_lock branch 2 times, most recently from d90dc42 to c0136f7 Compare August 14, 2024 13:57
@ngoldbaum (Contributor)

I actually have a branch with these changes (more or less) that I was planning to do separately from that PR. Unfortunately the deadlock I found is caused by something else.

If you're planning to work on this stuff I'd appreciate it if you could comment on the tracking issue so we can coordinate work and avoid duplication.

@mejrs (Member) left a comment

This use of the critical section API seems unwise: it allows users to create several critical sections and, worse, to release them in arbitrary order. I don't claim to understand the critical section API well, but this seems guaranteed to cause issues.

I can see two obvious solutions:

  1. Replace the implementation with PyObject_GetIter and PyIter_Next (slow?)
  2. Implement some form of internal iteration:
impl PyDict{
    pub fn traverse<B>(&self, f: &mut impl FnMut(Bound<'py, PyAny>, Bound<'py, PyAny>) -> ControlFlow<B>) -> ControlFlow<B> {
        struct Guard { .. };
        impl Drop for Guard { ..release critical section }
        
        let mut cs = std::mem::MaybeUninit::zeroed();
        ffi::PyCriticalSection_Begin(cs.as_mut_ptr(), dict.as_ptr());
        let mut ma_used = ..;
        let mut di_used = ..;
        let key = ...;
        let value = ..;
        
        while PyDict_Next(...) != 0{
           f(key, value)?;
        }
        ControlFlow::Continue(())
    }
}
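Option 2 can be sketched at the Python level. This is a hypothetical analogue (names and types are illustrative, not from the PR), with threading.Lock standing in for the critical section and None/non-None return values standing in for ControlFlow:

```python
import threading

def traverse(d, lock, f):
    """Hypothetical internal iteration: the lock is held for the whole
    traversal and released on exit, even if the callback raises."""
    with lock:  # analogue of PyCriticalSection_Begin / PyCriticalSection_End
        for key, value in d.items():
            result = f(key, value)
            if result is not None:  # analogue of ControlFlow::Break(value)
                return result
    return None  # analogue of ControlFlow::Continue(())

lock = threading.Lock()
d = {"a": 1, "b": 2, "c": 3}
assert traverse(d, lock, lambda k, v: k if v == 2 else None) == "b"
assert traverse(d, lock, lambda k, v: None) is None
```

Because the caller never holds the lock directly, it cannot leak a critical section or release two of them out of order.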

Comment on lines 559 to 563
let cs = unsafe {
    let mut cs = std::mem::MaybeUninit::zeroed();
    ffi::PyCriticalSection_Begin(cs.as_mut_ptr(), dict.as_ptr());
    cs.assume_init()
};
(Member)

This can just assume_init immediately, because the type is zero-valid. Deferring assume_init would only be necessary if you used MaybeUninit::uninit().

Suggested change
let cs = unsafe {
    let mut cs = std::mem::MaybeUninit::zeroed();
    ffi::PyCriticalSection_Begin(cs.as_mut_ptr(), dict.as_ptr());
    cs.assume_init()
};
let mut cs: ffi::PyCriticalSection = unsafe { std::mem::MaybeUninit::zeroed().assume_init() };
unsafe { ffi::PyCriticalSection_Begin(&mut cs, dict.as_ptr()) };

Comment on lines 545 to 552
#[cfg(Py_GIL_DISABLED)]
impl Drop for BorrowedDictIter<'_, '_> {
fn drop(&mut self) {
unsafe {
ffi::PyCriticalSection_End(&mut self.cs);
}
}
}
(Member)

It should probably implement Drop unconditionally (or not at all).

Suggested change
#[cfg(Py_GIL_DISABLED)]
impl Drop for BorrowedDictIter<'_, '_> {
    fn drop(&mut self) {
        unsafe {
            ffi::PyCriticalSection_End(&mut self.cs);
        }
    }
}

impl Drop for BorrowedDictIter<'_, '_> {
    fn drop(&mut self) {
        #[cfg(Py_GIL_DISABLED)]
        unsafe {
            ffi::PyCriticalSection_End(&mut self.cs);
        }
    }
}

@davidhewitt (Member)

> Replace the implementation with PyObject_GetIter and PyIter_Next (slow?)

I think we should seriously consider going this way and benchmarking whether it's actually a performance concern. We already made the same change for sets a couple of releases back, and it wasn't a major performance impact there compared to the wins from the Bound API. Two reasons why we did it for sets:

  • _PySet_Next (or whatever the API was called) was a private API
  • It doesn't do the right thing for subclasses of sets with custom __iter__ functions

Similarly, our current implementation doesn't respect dict subclasses with custom __iter__ functions. Should it? Probably, in which case we might just want to switch to PyObject_GetIter anyway.
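The subclass concern is observable from plain Python: C-level code that reads the dict's internal table directly (as PyDict_Next does) bypasses an overridden __iter__, while the generic iteration protocol honours it. A small illustration:

```python
class LoudDict(dict):
    def __iter__(self):
        # a dict subclass is free to yield something entirely different
        yield "overridden"

d = LoudDict(a=1)
# Python-level iteration goes through the overridden __iter__ ...
assert list(d) == ["overridden"]
# ... but dict(d) copies via the C-level table and ignores the override
assert dict(d) == {"a": 1}
```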

@ngoldbaum (Contributor)

I opened #4477 with a different implementation of the FFI bindings.

@bschoenmaeckers (Contributor Author)

> I opened #4477 with a different implementation of the FFI bindings.

Sorry for the late reply. Your implementation looks good, and the opaque_type! use is exactly what I was looking for. I will update my PR after the weekend.

@bschoenmaeckers (Contributor Author)

> This use of the critical section api seems unwise. […] I can see two obvious solutions: 1. Replace the implementation with PyObject_GetIter and PyIter_Next (slow?) 2. Implement some form of internal iteration […]

Interesting solutions 👀. I will try to implement the first one and measure the performance hit afterwards.


codspeed-hq bot commented Aug 28, 2024

CodSpeed Performance Report

Merging #4439 will not alter performance

Comparing bschoenmaeckers:PyDict_next_lock (3ace91a) with main (3cfa04f)

Summary

✅ 81 untouched benchmarks

@ngoldbaum (Contributor)

Ouch, that does seem to be a big perf hit.

@bschoenmaeckers (Contributor Author) commented Aug 28, 2024

Yeah, this is really bad, but kind of expected, as dict.items() creates a copy of the iterable and saves it into a PyList.

https://github.com/python/cpython/blob/main/Objects/dictobject.c#L3381-L3432
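For comparison, in Python 3 d.items() itself is a cheap view; the expensive part is the list materialisation, which is what the linked C-level implementation does. The difference is visible from Python:

```python
d = {i: i for i in range(3)}
view = d.items()            # dict_items view: no copy, reflects later changes
snapshot = list(d.items())  # materialised list, like the C-level PyDict_Items
d[3] = 3
assert len(view) == 4       # the view sees the insertion
assert len(snapshot) == 3   # the snapshot does not
```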

@bschoenmaeckers (Contributor Author)

I've also looked into iterating the raw dict, but that only yields the keys, so it does not protect against modification of the values before they are fetched on the next call.
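The same is true at the Python level: a dict iterator yields keys only, so the value lookup happens later and can observe a concurrent modification. A sketch:

```python
d = {"k": 1}
it = iter(d)        # iterating a dict yields keys, not (key, value) pairs
key = next(it)
d[key] = 2          # value replaced between yielding the key and fetching the value
assert d[key] == 2  # the lookup sees the new value, not the one at yield time
```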

@ngoldbaum (Contributor)

I wonder if the critical section API is actually problematic in practice. You could try iterating over the same dict in many threads on the free-threaded build as a stress test. I'm not sure if there are other usage patterns that @mejrs might be concerned about.

It would be nice if we could still keep the fast path for dicts and then only degrade to the slow path if we're not handed an instance of PyDict_Type.

@davidhewitt (Member)

> Yea this is really bad, but kind of expected as dict.items() creates a copy of the iterable and saves it into a PyList.
> https://github.com/python/cpython/blob/main/Objects/dictobject.c#L3381-L3432

Our dict.items() is equivalent to the Python 2 semantics, where .items() in Python did create a new list. Is perf any better if you try dict.call_method0("items") to get an iterable items view?

Comment on lines 385 to 387
let tuple = pair.downcast::<PyTuple>().unwrap();
let key = tuple.get_item(0).unwrap();
let value = tuple.get_item(1).unwrap();
@bschoenmaeckers (Contributor Author)

Is it wise to use the unchecked variants here instead of unwrap?

@bschoenmaeckers (Contributor Author) commented Aug 28, 2024

Now that I think about it, this is probably not safe, because the items() method can return an arbitrary object when overridden in Python code.
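The hazard is easy to reproduce from Python: a subclass may override items() to return anything at all, so the downcast to a tuple has to stay checked. For instance:

```python
class EvilDict(dict):
    def items(self):
        # items() is not obliged to return (key, value) tuples
        return ["not-a-pair", 42]

pairs = list(EvilDict(a=1).items())
assert pairs == ["not-a-pair", 42]
# an unchecked downcast of these elements to tuples would be unsound;
# a checked downcast (downcast::<PyTuple>) fails cleanly instead
assert not any(isinstance(p, tuple) for p in pairs)
```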

@bschoenmaeckers (Contributor Author) commented Aug 28, 2024

> dict.items() is equivalent to the Python 2 semantics where .items() in Python did create a new list. Is perf any better if you try dict.call_method0("items") to get an iterable items view?

I didn't know that this was different; learning something new every day! It is indeed somewhat faster: we went from a ~87% slowdown to ~63%.

@bschoenmaeckers bschoenmaeckers changed the title Add PyCriticalSection lock to Dict iterator Make PyDict iterator compatible with free-threaded build Aug 29, 2024
@bschoenmaeckers (Contributor Author)

> It would be nice if we could still keep the fast path for dicts and then only degrade to the slow path if we're not handed an instance of PyDict_Type.

I made the previous fast path available on non-free-threaded builds when the dict is not a subclass of PyDict. This gives us minimal performance regression on existing code.
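The exact-type test this relies on is the distinction between PyDict_CheckExact and PyDict_Check in C; from Python the same distinction looks like:

```python
class MyDict(dict):
    pass

# exact dict: the PyDict_Next fast path is safe
assert type({}) is dict
# subclass: isinstance still matches, but type() does not,
# so iteration must fall back to the generic protocol
assert isinstance(MyDict(), dict)
assert type(MyDict()) is not dict
```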

@bschoenmaeckers bschoenmaeckers force-pushed the PyDict_next_lock branch 3 times, most recently from 0b8f4c6 to ae0ee72 Compare August 29, 2024 16:53
Comment on lines +440 to +454

if unsafe { ffi::PyDict_Next(dict.as_ptr(), ppos, &mut key, &mut value) } != 0 {
        *remaining -= 1;
        let py = dict.py();
        // Safety:
        // - `PyDict_Next` returns borrowed values
        // - we have already checked that `PyDict_Next` succeeded, so we can assume these to be non-null
        Some((
            unsafe { key.assume_borrowed_unchecked(py) }.to_owned(),
            unsafe { value.assume_borrowed_unchecked(py) }.to_owned(),
        ))
    } else {
        None
    }
}
(Member)

Is there an alternative implementation here where we add a critical section internally, just around the call to PyDict_Next? It means each iteration has to lock/unlock a mutex, which might also be terrible for performance, but it'd be interesting to try. (If it performs acceptably, we could then also ask free-threaded CPython experts whether it is sound. My hunch is that it would be.)

@bschoenmaeckers (Contributor Author)

I also thought of this implementation, but per the following issue it is not sufficient:

python/cpython#120858

@davidhewitt (Member)

If I'm reading correctly, isn't the point of that issue precisely that it permits us to add locking here around each call to PyDict_Next if we so wanted? The concern about borrowed references is not relevant here because we immediately incref them, and we can do that before releasing the critical section. Cc @colesbury

@colesbury

@davidhewitt is right that the borrowed references issue is not relevant here, because PyO3 would be doing its own locking around PyDict_Next() with the incref inside the lock.

That's still not ideal, but it might be a reasonable starting point. It's much better to lock around the entire loop, both because of the performance issue and because you then see a consistent view of the dict. Locking only around PyDict_Next() allows concurrent modifications between calls, so you're going to see more panics due to concurrent modification that would have been prevented by the GIL or a loop-scoped lock.

Another alternative is to copy the dict inside the iterator and iterate over the copy. It's probably cheaper than locking around each call.
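The trade-off is visible in pure Python: mutating a dict while iterating it directly raises, while iterating a shallow copy tolerates mutation of the original. A sketch:

```python
d = {i: i for i in range(4)}
err = None
try:
    for k in d:        # direct iteration: inserting keys mid-loop is an error
        d[k + 10] = 0
except RuntimeError as e:
    err = e
assert "changed size" in str(err)

d2 = {i: i for i in range(4)}
for k in dict(d2):     # iterate a shallow copy: the original may be mutated freely
    d2[k + 10] = 0
assert len(d2) == 8
```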

@bschoenmaeckers (Contributor Author)

Thanks for clearing this up! Copying the dict sounds like the easiest solution for now. Once we finalize a critical section API, we can consider moving the responsibility for locking the dict (for the whole iteration) on free-threaded builds to the user, and then remove the copy() and the panic on concurrent modification.

(Member)

That said, I think if the loop executes arbitrary Python code then it is still possible for the dict to be modified during iteration even under a critical section, because the section may be suspended by a nested critical section which then modifies the dict.

I feel like users are better placed to know, for their use case, whether copying or locking per iteration is more acceptable. I wonder if we need to split .iter() into multiple methods?

(Contributor)

I opened #4571 to suggest a way forward on this.

@bschoenmaeckers bschoenmaeckers marked this pull request as ready for review September 3, 2024 18:09