
Out-of-bounds memory access in UsedChunkList #2311

Open
tobiasstarkwayve opened this issue Jul 3, 2024 · 9 comments

@tobiasstarkwayve
Contributor

tobiasstarkwayve commented Jul 3, 2024

Required information

Operating system:

$ uname -a
Linux 5.10.104-l4t-r35.3.ga+g1ef0a8bc637b #1 SMP PREEMPT Wed Jun 5 17:27:33 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

Compiler version:
Clang 16

Eclipse iceoryx version:
v2.0.3 (but the relevant code is unchanged since then)

Observed result or behaviour:
As we were tracking down a segmentation fault, we added the following assertions to the used-chunk list:

Diff

$ git diff
diff --git a/iceoryx_posh/include/iceoryx_posh/internal/popo/building_blocks/chunk_receiver_data.hpp b/iceoryx_posh/include/iceoryx_posh/internal/popo/building_blocks/chunk_receiver_data.hpp
index 01e1c9ac4..0bcb35975 100644
--- a/iceoryx_posh/include/iceoryx_posh/internal/popo/building_blocks/chunk_receiver_data.hpp
+++ b/iceoryx_posh/include/iceoryx_posh/internal/popo/building_blocks/chunk_receiver_data.hpp
@@ -44,6 +44,9 @@ struct ChunkReceiverData : public ChunkQueueDataType
     /// has to return one to not brake the contract. This is aligned with AUTOSAR Adaptive ara::com
     static constexpr uint32_t MAX_CHUNKS_IN_USE = MaxChunksHeldSimultaneously + 1U;
     UsedChunkList<MAX_CHUNKS_IN_USE> m_chunksInUse;
+
+private:
+    void self_check();
 };
 
 } // namespace popo
diff --git a/iceoryx_posh/include/iceoryx_posh/internal/popo/used_chunk_list.inl b/iceoryx_posh/include/iceoryx_posh/internal/popo/used_chunk_list.inl
index eb867880c..310317610 100644
--- a/iceoryx_posh/include/iceoryx_posh/internal/popo/used_chunk_list.inl
+++ b/iceoryx_posh/include/iceoryx_posh/internal/popo/used_chunk_list.inl
@@ -38,9 +38,11 @@ UsedChunkList<Capacity>::UsedChunkList() noexcept
 template <uint32_t Capacity>
 bool UsedChunkList<Capacity>::insert(mepoo::SharedChunk chunk) noexcept
 {
+    self_check();
     auto hasFreeSpace = m_freeListHead != INVALID_INDEX;
     if (hasFreeSpace)
     {
+        iox::cxx::Ensures(m_freeListHead < INVALID_INDEX);
         // get next free entry after freelistHead
         auto nextFree = m_listIndices[m_freeListHead];
 
@@ -48,17 +50,20 @@ bool UsedChunkList<Capacity>::insert(mepoo::SharedChunk chunk) noexcept
         m_listIndices[m_freeListHead] = m_usedListHead;
         m_usedListHead = m_freeListHead;
 
+        iox::cxx::Ensures(m_usedListHead < INVALID_INDEX);
         m_listData[m_usedListHead] = DataElement_t(chunk);
 
         // set freeListHead to the next free entry
         m_freeListHead = nextFree;
 
+        self_check();
         /// @todo can we do this cheaper with a global fence in cleanup?
         m_synchronizer.clear(std::memory_order_release);
         return true;
     }
     else
     {
+        self_check();
         return false;
     }
 }
@@ -67,10 +72,11 @@ template <uint32_t Capacity>
 bool UsedChunkList<Capacity>::remove(const mepoo::ChunkHeader* chunkHeader, mepoo::SharedChunk& chunk) noexcept
 {
     auto previous = INVALID_INDEX;
-
+    self_check();
     // go through usedList with stored chunks
     for (auto current = m_usedListHead; current != INVALID_INDEX; current = m_listIndices[current])
     {
+        iox::cxx::Ensures(current < INVALID_INDEX);
         if (!m_listData[current].isLogicalNullptr())
         {
             // does the entry match the one we want to remove?
@@ -85,6 +91,7 @@ bool UsedChunkList<Capacity>::remove(const mepoo::ChunkHeader* chunkHeader, mepo
                 }
                 else
                 {
+                    iox::cxx::Ensures(previous < INVALID_INDEX);
                     m_listIndices[previous] = m_listIndices[current];
                 }
 
@@ -94,11 +101,13 @@ bool UsedChunkList<Capacity>::remove(const mepoo::ChunkHeader* chunkHeader, mepo
 
                 /// @todo can we do this cheaper with a global fence in cleanup?
                 m_synchronizer.clear(std::memory_order_release);
+                self_check();
                 return true;
             }
         }
         previous = current;
     }
+    self_check();
     return false;
 }
 
@@ -115,7 +124,7 @@ void UsedChunkList<Capacity>::cleanup() noexcept
             data.releaseToSharedChunk();
         }
     }
-
+    self_check();
     init(); // just to save us from the future self
 }
 
@@ -147,9 +156,19 @@ void UsedChunkList<Capacity>::init() noexcept
         data.releaseToSharedChunk();
     }
 
+    self_check();
     m_synchronizer.clear(std::memory_order_release);
 }
 
+template <uint32_t Capacity>
+void UsedChunkList<Capacity>::self_check() noexcept {
+    for (uint32_t i = 0U; i < Capacity; ++i) {
+        iox::cxx::Ensures(m_listIndices[i] <= INVALID_INDEX);
+    }
+    iox::cxx::Ensures(m_freeListHead <= INVALID_INDEX);
+    iox::cxx::Ensures(m_usedListHead <= INVALID_INDEX);    
+}
+
 } // namespace popo
 } // namespace iox

Running with this patch, we observed a violation of the check ahead of the m_listIndices[previous] = m_listIndices[current]; line. This seems like a clear bug to me, as it implies an out-of-bounds access on the array. Do you agree?

Furthermore, it looks to me like this assertion can be triggered only if the else branch is taken in the initial iteration of the loop (since any subsequent iteration will have previous < INVALID_INDEX, as ensured by the loop condition). The only way I can see this happening is if m_usedListHead is modified between the start of the loop and the later if check. Is that possible? If so, does the code handle that case properly? Wouldn't this need some sort of retry loop?
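
For reference, a condensed paraphrase of the remove() path in question (simplified from used_chunk_list.inl, not the verbatim upstream source); matchesChunkHeader() is a stand-in for the upstream comparison of the stored entry against the chunk header to remove:

// Simplified sketch of UsedChunkList<Capacity>::remove(); member names follow the upstream code.
auto previous = INVALID_INDEX;
for (auto current = m_usedListHead; current != INVALID_INDEX; current = m_listIndices[current])
{
    if (!m_listData[current].isLogicalNullptr() && matchesChunkHeader(current, chunkHeader))
    {
        if (current == m_usedListHead)
        {
            // the first iteration normally lands here, because 'current' was
            // just initialized from m_usedListHead
            m_usedListHead = m_listIndices[current];
        }
        else
        {
            // reachable in the first iteration only if m_usedListHead changed
            // between the loop initialization and the comparison above; in that
            // case 'previous' is still INVALID_INDEX and this indexes out of bounds
            m_listIndices[previous] = m_listIndices[current];
        }
        // ... release the chunk, clear the synchronizer, return true ...
    }
    previous = current;
}
return false;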

Expected result or behaviour:
No out-of-bounds accesses

Conditions where it occurred / Performed steps:

The bug is hard to reproduce, but on our system it appears reliably after a few thousand runs of our periodic pipeline. Changing the size of the images being processed appears to affect whether the problem occurs, which we assume is due to higher load on the system.

Additional helpful information

@elBoberido
Member

The only way I can see this happening is if m_usedListHead is modified between the start of the loop and the later if check. Is that possible?

Are you accessing the endpoint (publisher, subscriber, client, server) from multiple threads? I think this is the only way for the scenario you describe to occur. The endpoints are not thread-safe and you need to use e.g. a smart_lock with them. You can use multiple endpoints with the same ServiceDescription in multiple threads, but not the same instance of an endpoint across multiple threads without a mutex.
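
To make "guarded by a mutex" concrete, a minimal sketch of one way to wrap an endpoint so that every access goes through the same lock; GuardedEndpoint and access are illustrative names, not iceoryx API (iceoryx's smart_lock helper serves the same purpose):

#include <mutex>
#include <utility>

// Wraps a non-thread-safe endpoint (publisher, subscriber, ...) so that every
// operation on it is serialized by a single mutex.
template <typename Endpoint>
class GuardedEndpoint
{
  public:
    explicit GuardedEndpoint(Endpoint endpoint)
        : m_endpoint(std::move(endpoint))
    {
    }

    // run any operation (take, release, ...) on the endpoint while holding the lock
    template <typename Callable>
    auto access(Callable&& callable)
    {
        std::lock_guard<std::mutex> guard(m_mutex);
        return std::forward<Callable>(callable)(m_endpoint);
    }

  private:
    std::mutex m_mutex;
    Endpoint m_endpoint;
};

// usage (hypothetical):
//   GuardedEndpoint<iox::popo::UntypedSubscriber> subscriber{/* ... */};
//   subscriber.access([](auto& sub) { /* take or release a chunk */ });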

@tobiasstarkwayve
Contributor Author

tobiasstarkwayve commented Jul 4, 2024

The only way I can see this happening is if m_usedListHead is modified between the start of the loop and the later if check. Is that possible?

Are you accessing the endpoint (publisher, subscriber, client, server) from multiple threads? I think this is the only way for the scenario you describe to occur. The endpoints are not thread-safe and you need to use e.g. a smart_lock with them. You can use multiple endpoints with the same ServiceDescription in multiple threads, but not the same instance of an endpoint across multiple threads without a mutex.

@elBoberido I don't think so. We don't interact with the subscription directly at all, just through the ROS executor. And even the multi-threaded ROS executor seems to meticulously return the loaned message right after the callback ran (i.e., while still under the callback-group protection against concurrent runs).

Looking at it through that lens, do the two accesses really need to run in parallel? Or is it enough for them to run in quick succession on different CPU cores? In that case, one load of m_usedListHead might read a stale value from the CPU cache while the next load reads the updated value from memory.

I don't fully understand the intricacies of the memory fences in this code, so maybe I'm wrong, but without any acquire fence on the cleanup/insert path I think this would be possible, right? (edit: on further investigation I'm not convinced an acquire fence would even help here. Anyway, let's first identify the problem and then speculate about solutions.)

@elBoberido
Member

@tobiasstarkwayve if the endpoint is properly synchronized by e.g. a mutex, cache synchronization is taken care of. If the endpoint is used only in a single thread and the OS moves the thread to another CPU, the caches are also synchronized.

I didn't have a deep look into the code but was relying on your investigation regarding "modified between the start of the loop and the later if check. Is that possible?". This is not possible if the endpoint is used in the intended way, i.e. only single-threaded or with all accesses guarded by the same mutex. The fence in the UsedChunkList is not intended to make it thread-safe and accessible from multiple threads, but only to have the memory synchronized when the application dies and RouDi does the cleanup, i.e. no concurrent access but a transfer of single-threaded ownership to another participant.

@elBoberido
Member

elBoberido commented Jul 4, 2024

@tobiasstarkwayve oh, can you print the thread ID in the insert and release methods? One thing to keep in mind is that the samples also access the endpoint when they are released. So if you run in a multi-threaded environment and release samples without holding the mutex, you will also observe the issue you describe.
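
A minimal sketch of such instrumentation, assuming it is added to used_chunk_list.inl; traceChunkAccess is a hypothetical helper, and the getChunkHeader() call on the inserted chunk is an assumption about the SharedChunk interface:

#include <iostream>
#include <thread>

// Hypothetical tracing helper: prints which thread touches which chunk so that
// cross-thread interleavings on the same UsedChunkList become visible in the log.
inline void traceChunkAccess(const char* operation, const void* chunkHeader) noexcept
{
    std::cerr << operation << " tid=" << std::this_thread::get_id()
              << " chunkHeader=" << chunkHeader << std::endl;
}

// called at the top of UsedChunkList<Capacity>::remove():
//   traceChunkAccess("remove", chunkHeader);
// and at the top of UsedChunkList<Capacity>::insert():
//   traceChunkAccess("insert", chunk.getChunkHeader());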

@elBoberido
Member

@tobiasstarkwayve I had a closer look at the code and I agree with your assessment. The only way this could happen is when the else branch is taken in the initial loop iteration, and the only way that can happen is if there is unsynchronized access to the UsedChunkList. My bet would be the release of the sample. Do you use the typed or the untyped API?

@tobiasstarkwayve
Contributor Author

tobiasstarkwayve commented Jul 4, 2024

@tobiasstarkwayve if the endpoint is properly synchronized by e.g. a mutex, cache synchronization is taken care of. If the endpoint is used only in a single thread and the OS moves the thread to another CPU, the caches are also synchronized.

Yes, that's true. But that's not quite what I meant. What I meant is a subtly different situation: there are two threads, both access the endpoint at different times but are synchronised only by an atomic variable.

Here is the code for the multi-threaded ROS executor (https://github.com/ros2/rclcpp/blob/c743c173e68d92af872cf163e10721a8dbe51dd0/rclcpp/src/rclcpp/executor.cpp). The multiple threads of that executor basically run get_next_executable and execute_any_executable in an endless loop. When get_next_executable picks up a subscription, it marks the can_be_taken_from atomic bool as false (preventing the subscription from being picked up again), and when execute_any_executable returns, it sets it back to true.

So the scenario I'm thinking of is:

  1. subscription is picked up by thread 1
  2. everything runs, samples are released, etc.
  3. can_be_taken_from is set to true
  4. same subscription is picked up by thread 2 on a different core
  5. everything runs, but there was no synchronisation point so this run is racy with step (2)

Unfortunately I'm not enough of an expert on memory models to know whether this is actually plausible. But I definitely don't see a clear synchronisation point between the two. Granted, the store to can_be_taken_from is sequentially consistent, but I'm not sure the modifications in the usedChunkList "happen before" the store to can_be_taken_from.
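
A minimal, self-contained sketch of the hand-off pattern in question (not rclcpp or iceoryx code): if every modification of the endpoint is sequenced before the store to the flag, and the next thread loads the flag before touching the endpoint, the seq_cst store/load pair does establish the required happens-before:

#include <atomic>
#include <cassert>
#include <thread>

// stand-ins for the real objects: the flag mirrors can_be_taken_from, the int
// mirrors the endpoint's non-thread-safe internal state
std::atomic<bool> can_be_taken_from{false};
int endpoint_state{0};

void first_worker()
{
    endpoint_state = 42;           // take/release samples, mutate the list
    can_be_taken_from.store(true); // hand ownership back (seq_cst store)
}

void next_worker()
{
    while (!can_be_taken_from.load()) // seq_cst load; proceed only once released
    {
    }
    // the store/load pair orders the write above before this read
    assert(endpoint_state == 42);
}

int main()
{
    std::thread t1(first_worker);
    std::thread t2(next_worker);
    t1.join();
    t2.join();
    return 0;
}

If, however, a sample were released from the previous thread after the flag has already been set back to true, that release would no longer be ordered by this hand-off.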

Do you use the typed or the untyped API?

This is the stacktrace we observed when the test assertion failed:

Condition: previous < INVALID_INDEX in bool iox::popo::UsedChunkList<257>::remove(const mepoo::ChunkHeader *, mepoo::SharedChunk &) [Capacity = 257] is violated. (iceoryx/iceoryx_posh/include/iceoryx_posh/internal/popo/used_chunk_list.inl:94)
2024-07-02 21:45:34.180 [ Error ]: ICEORYX error! EXPECTS_ENSURES_FAILED
Unhandled C++ exception:
librmw_cyclonedds.so(iox::errorHandler(iox::Error, std::function<void ()> const&, iox::ErrorLevel)+0x68) [0xfffef4cd401c]
librmw_cyclonedds.so(iox::cxx::internal::Require(bool, char const*, int, char const*, char const*)+0x1d4) [0xfffef4cd3ae8]
librmw_cyclonedds.so(iox::popo::UsedChunkList<257u>::remove(iox::mepoo::ChunkHeader const*, iox::mepoo::SharedChunk&)+0x268) [0xfffef4cb4ac0]
librmw_cyclonedds.so(iox::popo::SubscriberPortUser::releaseChunk(iox::mepoo::ChunkHeader const*)+0x3c) [0xfffef4cb3cc4]
librmw_cyclonedds.so(iox_sub_release_chunk+0x60) [0xfffef4ca8a54]
librmw_cyclonedds.so(dds_data_allocator_free+0xc4) [0xfffef4c976b0]
librmw_cyclonedds.so(+0x1b261c) [0xfffef4bd261c]
librmw_cyclonedds.so(+0x198280) [0xfffef4bb8280]
librmw_cyclonedds.so(rmw_return_loaned_message_from_subscription+0x20) [0xfffef4bb7f8c]
our_binary(rcl_return_loaned_message_from_subscription+0x80) [0xaaaaab8528ac]
our_binary(rclcpp::Executor::execute_subscription(std::shared_ptr<rclcpp::SubscriptionBase>)+0x218) [0xaaaaab79a8cc]
our_binary(rclcpp::Executor::execute_any_executable(rclcpp::AnyExecutable&)+0x278) [0xaaaaab799e18]
our_binary(rclcpp::executors::MultiThreadedExecutor::run(unsigned long)+0x2dc) [0xaaaaab7a3820]

So I think that is the untyped API, but I'm not 100% sure.

@elBoberido
Member

The can_be_taken_from is used with std::memory_order_seq_cst in the store as well as in the load. Without digging too deep into the code, this should be sufficient to hand over ownership between threads. But I think the issue is not getting the samples but releasing them. Let's assume the executor moves the subscription to another thread but there are still samples somewhere which are accessed and released from the initial thread. Now one has concurrent access to a data structure which is not thread-safe, and data races are introduced. You can test my theory by printing the thread ID and the chunkHeader pointer to the terminal. If I'm correct with my assessment, you will see e.g. an insert with tid 1 and chunkHeader 0x42, followed by an insert with tid 2 and chunkHeader 0x13, and then a remove with tid 1 and chunkHeader 0x42 ... the tid might even be something totally different.

If it is Cyclone DDS, it is indeed the untyped API. This looks more and more like it is related to how iceoryx is integrated into Cyclone DDS, and that is something I do not have much knowledge about.

@tobiasstarkwayve
Contributor Author

The can_be_taken_from is used with std::memory_order_seq_cst in the store as well as in the load. Without digging too deep into the code, this should be sufficient to hand over ownership between threads. But I think the issue is not getting the samples but releasing them. Let's assume the executor moves the subscription to another thread but there are still samples somewhere which are accessed and released from the initial thread. Now one has concurrent access to a data structure which is not thread-safe, and data races are introduced. You can test my theory by printing the thread ID and the chunkHeader pointer to the terminal. If I'm correct with my assessment, you will see e.g. an insert with tid 1 and chunkHeader 0x42, followed by an insert with tid 2 and chunkHeader 0x13, and then a remove with tid 1 and chunkHeader 0x42 ... the tid might even be something totally different.

If it is Cyclone DDS, it is indeed the untyped API. This looks more and more like it is related to how iceoryx is integrated into Cyclone DDS, and that is something I do not have much knowledge about.

Hmm, I think that can't really happen. The samples are always released when the user callback completes, and the user callback doesn't move between threads while it runs. I will try your proposed experiment, though; let's see what comes up.

This looks more and more like it is related to how iceoryx is integrated into Cyclone DDS and this is something where I do not have much knowledge about.

Do you have any recommendation on whom to ask instead? Is this a question for the CycloneDDS project?

@elBoberido
Member

@tobiasstarkwayve I have too little knowledge about the ROS executor, but everything points to unsynchronized access.

@eboasson did you experience something similar to what Tobias describes with Cyclone DDS and the multi-threaded ROS executor?
