-
Notifications
You must be signed in to change notification settings - Fork 382
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Out-of-bounds memory access in UsedChunkList #2311
Comments
Are you accessing the endpoint (publisher, subscriber, client, server) from multiple threads? I thinks this is the only way for the scenario you describe. The endpoints are not thread safe and you need to use e.g. a |
@elBoberido I don't think so. We don't interact with the subscription directly at all, just through the ROS executor. And even the multi-threaded ROS executor seems to meticulously return the loaned message right after the callback ran (i.e., while still under the callback-group protection against concurrent runs). Looking at it from that lens, do the two accesses really need to run in parallel? Or is it enough to run in quick succession from different CPU cores? In that case one load of I don't fully understand the intricacies of the memory fences in this code, so maybe I'm wrong, but |
@tobiasstarkwayve if the enpoint is properly synchronized by e.g. a mutex, cache synchronization is taken care of. If the endpoint is used only in a single thread and the OS moves the thread to another CPU, the caches are also synchronized. I didn't have a deep look into the code but was relying on your investigation with |
@tobiasstarkwayve oh, can you print the thread ID in the |
@tobiasstarkwayve I had a closer look at the code and I agree with your assessment. The only way this could happen is when the else branch is taken in the initial loop iteration. The only way this can happen is if there is an unsynchronized access to the |
Yes, that's true. But that's not quite what I meant. What I meant is a subtly different situation: there are two threads, both access the endpoint at different times but are synchronised only by an atomic variable. Here is the code for the multi-threaded ROS executor (https://github.com/ros2/rclcpp/blob/c743c173e68d92af872cf163e10721a8dbe51dd0/rclcpp/src/rclcpp/executor.cpp). The multiple threads of that multi-threaded executor basically run So the scenario I'm thinking of is:
Unfortunately I'm not enough of an expert on memory models to know whether this is actually plausible. But I definitely don't see a clear synchronisation point between the two. Granted, the store to
This is the stacktrace we observed when the test assertion failed:
So I think that is the untyped API, but I'm not 100% sure. |
The If it is Cyclone DDS it is the untyped API indeed. This looks more and more like it is related to how iceoryx is integrated into Cyclone DDS and this is something where I do not have much knowledge about. |
Hmm, I think that can't really happen. The samples are always released when the user callback completes, and the user callback doesn't move threads while it runs. I will try your proposed experiment, though, let's see what comes up.
Do you have any recommendations who to ask instead? Is this a question for the CycloneDDS project? |
@tobiasstarkwayve I have too little knowledge about the ROS executor but everything points to an unsynchronized access @eboasson did you experience something similar to what Tobias describes with Cyclone DDS and the multi-threaded ROS executor? |
Required information
Operating system:
Compiler version:
Clang 16
Eclipse iceoryx version:
v. 2.0.3 (but code is unchanged since then)
Observed result or behaviour:
As we were tracking down a segmentation fault we added the following assertions into the used-chunks list:
Diff
Running with this patch, we observed a violation of the check ahead of the
m_listIndices[previous] = m_listIndices[current];
line. This seems like a clear bug to me, as it implies an out-of-bounds access on the array. Do you agree?Furthermore, it looks to me like this assertion can be triggered only if the else branch is taken in the initial iteration of the loop (as any subsequent iteration will have previous < INVALID_INDEX
as ensured by the loop condition). The only way I can see this happening is if m_usedListHead is modified between the start of the loop and the later if check. Is that possible? If so, does the code properly handle that case? Wouldn't this need some sort of retry-loop?
Expected result or behaviour:
No out-of-bounds accesses
Conditions where it occurred / Performed steps:
The bug is hard to reproduce, but on our system it appears reliably after a few thousand runs of our periodic pipeline. Changing the size of the images being processed affects whether the problem occurs (?), which we assume is due to higher load on the system
Additional helpful information
The text was updated successfully, but these errors were encountered: