Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] A thread is stuck waiting to close an SRT socket due to an internal SRT mutex lock. #2944

Open
tvaknin opened this issue Apr 25, 2024 · 21 comments
Labels
[core] Area: Changes in SRT library core Type: Bug Indicates an unexpected problem or unintended behavior
Milestone

Comments

@tvaknin
Copy link

tvaknin commented Apr 25, 2024

Hi everyone,
Potential issue: A thread is stuck waiting to close an SRT socket due to an internal SRT mutex lock.

The Backtrace of the thread being blocked:

#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103 #1 0x00007f6ad3751411 in __GI___pthread_mutex_lock (mutex=0x7f6ad4c1e480 <srt::CUDT::uglobal()::instance+96>) at ../nptl/pthread_mutex_lock.c:80 #2 0x00007f6ad4b53b33 in srt::CUDTUnited::locateSocket(int, srt::CUDTUnited::ErrorHandling) () from ../libs/libsrt.so.1.5 #3 0x00007f6ad4b5bc58 in srt::CUDTUnited::close(int) () from ../libs/libsrt.so.1.5 #4 0x00007f6ad4b5bcec in srt::CUDT::close(int) () from ../libs/libsrt.so.1.5 #5 0x00000000004c1e78 in SrtConnectionManager::CheckForConnection (this=0x10ecea0 <SrtConnectionManager::GetInstance()::inst>) at muxer/SrtConnectionManager.cpp:244

I have a few more threads ~6 that are also stuck on this Mutex. e.g.

#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103 #1 0x00007f954a5e0411 in __GI___pthread_mutex_lock (mutex=0x7f954baad480 <srt::CUDT::uglobal()::instance+96>) at ../nptl/pthread_mutex_lock.c:80 #2 0x00007f954b9e2b33 in srt::CUDTUnited::locateSocket(int, srt::CUDTUnited::ErrorHandling) () from../libs/libsrt.so.1.5 #3 0x00007f954b9e623c in srt::CUDT::sendmsg2(int, char const*, int, SRT_MsgCtrl_&) () from ../libs/libsrt.so.1.5 #4 0x00007f954b9e629f in srt::CUDT::send(int, char const*, int, int) () from ../libs/libsrt.so.1.5 #5 0x00000000004dad3d in OutSrtSocket::Send (successCnt=0x2f8d484, this=0x7f950194eeb0) at Revioly/OutSrtSocket.cpp:32

I also found that the SRT thread "SRT:GC" holds it too:
41 Thread 0x7f94feffd700 (LWP 32188) "SRT:GC" __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103

#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103 #1 0x00007f954a5e0411 in __GI___pthread_mutex_lock (mutex=0x7f954baad480 <srt::CUDT::uglobal()::instance+96>) at ../nptl/pthread_mutex_lock.c:80 #2 0x00007f954b9ed00e in srt::CUDTUnited::checkBrokenSockets() () from ../libs/libsrt.so.1.5 #3 0x00007f954b9ed6a8 in srt::CUDTUnited::garbageCollect(void*) () from ../libs/libsrt.so.1.5 #4 0x00007f954a5ddefc in start_thread (arg=<optimized out>) at pthread_create.c:479 #5 0x00007f9549ec122f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

SRT internal sockets thread example:

#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103 #1 0x00007f954a5e0411 in __GI___pthread_mutex_lock (mutex=0x7f954baad480 <srt::CUDT::uglobal()::instance+96>) at ../nptl/pthread_mutex_lock.c:80 #2 0x00007f954b9ebfb8 in srt::CUDTUnited::newConnection(int, srt::sockaddr_any const&, srt::CPacket const&, srt::CHandShake&, int&, srt::CUDT*&) () from ../libs/libsrt.so.1.5 #3 0x00007f954ba1948d in srt::CUDT::processConnectRequest(srt::sockaddr_any const&, srt::CPacket&) () from ../libs/libsrt.so.1.5 #4 0x00007f954ba60156 in srt::CRcvQueue::worker_ProcessConnectionRequest(srt::CUnit*, srt::sockaddr_any const&) () from ../libs/libsrt.so.1.5 #5 0x00007f954ba60ec4 in srt::CRcvQueue::worker(void*) () from ../libs/libsrt.so.1.5 #6 0x00007f954a5ddefc in start_thread (arg=<optimized out>) at pthread_create.c:479 #7 0x00007f9549ec122f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

To Reproduce
Steps to reproduce the behavior:

  1. No problem in SRT versions 1.5.1 or 1.5.2 (I didn't check 1.5.2) But, I found the issue in version 1.5.3.

  2. My setup has 70 SRT sockets (Senders - listeners), setting the SRT latency to 6000ms.
    The trouble starts in the Sender-Listener device right after the downstream device (Receiver-Caller) restarts or loses connection.

Desktop

  • OS: Linux
  • SRT Version / 1.5.3:
@tvaknin tvaknin added the Type: Bug Indicates an unexpected problem or unintended behavior label Apr 25, 2024
@tvaknin tvaknin changed the title [BUG] [BUG] A thread is stuck waiting to close an SRT socket due to an internal SRT mutex lock. Apr 25, 2024
@maxsharabayko maxsharabayko added this to the v1.5.4 milestone Apr 25, 2024
@maxsharabayko maxsharabayko added the [core] Area: Changes in SRT library core label Apr 25, 2024
@maxsharabayko
Copy link
Collaborator

From the provided stack traces it is not clear who holds the m_GlobControlLock. All the threads are trying to lock it. But there should be a thread that is holding it, and is probably trying to lock some other socket, that has been already locked. My gut feeling is that the other lock is m_LSLock. Maybe there is one more thread trying to close a socket, locking this mutex.
CUDTUnited::checkBrokenSockets() locks m_GlobControlLock and eventually calls CUDTUnited::removeSocket(..) -> CUDT::closeInternal() -> m_pRcvQueue->removeListener(this); -> lock m_LSLock. For example here the order would be the opposite

#2  0x00007f954b9ebfb8 in srt::CUDTUnited::newConnection(..)  -> lock m_GlobControlLock
#3  0x00007f954ba1948d in srt::CUDT::processConnectRequest(srt::sockaddr_any const&, srt::CPacket&)
#4  0x00007f954ba60156 in srt::CRcvQueue::worker_ProcessConnectionRequest(..) -> lock m_LSLock
#5  0x00007f954ba60ec4 in srt::CRcvQueue::worker(void*) () from ../libs/libsrt.so.1.5 

But from the state of the GC thread that above described case is not the one happening here.

#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
#1  0x00007f6ad3751411 in __GI___pthread_mutex_lock (mutex=0x7f6ad4c1e480 <srt::CUDT::uglobal()::instance+96>)
#2  0x00007f6ad4b53b33 in srt::CUDTUnited::locateSocket(int, srt::CUDTUnited::ErrorHandling) -> lock m_GlobControlLock
#3  0x00007f6ad4b5bc58 in srt::CUDTUnited::close(int) () from ../libs/libsrt.so.1.5
#4  0x00007f6ad4b5bcec in srt::CUDT::close(int) () from ../libs/libsrt.so.1.5
#5  0x00000000004c1e78 in SrtConnectionManager::CheckForConnection (this=0x10ecea0 <SrtConnectionManager::GetInstance()::inst>)
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
#1  0x00007f954a5e0411 in __GI___pthread_mutex_lock (mutex=0x7f954baad480 <srt::CUDT::uglobal()::instance+96>)
#2  0x00007f954b9e2b33 in srt::CUDTUnited::locateSocket(int, srt::CUDTUnited::ErrorHandling) -> lock m_GlobControlLock
#3  0x00007f954b9e623c in srt::CUDT::sendmsg2(int, char const*, int, SRT_MsgCtrl_&) ()
#4  0x00007f954b9e629f in srt::CUDT::send(int, char const*, int, int)
#5  0x00000000004dad3d in OutSrtSocket::Send (successCnt=0x2f8d484, this=0x7f950194eeb0)
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103 
#1  0x00007f954a5e0411 in __GI___pthread_mutex_lock (mutex=0x7f954baad480 <srt::CUDT::uglobal()::instance+96>)
#2  0x00007f954b9ed00e in srt::CUDTUnited::checkBrokenSockets() () from ../libs/libsrt.so.1.5  -> lock m_GlobControlLock
#3  0x00007f954b9ed6a8 in srt::CUDTUnited::garbageCollect(void*) () -> lock m_GCStopLock
#4  0x00007f954a5ddefc in start_thread (arg=<optimized out>) at pthread_create.c:479 
#5  0x00007f9549ec122f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103 
#1  0x00007f954a5e0411 in __GI___pthread_mutex_lock (mutex=0x7f954baad480 <srt::CUDT::uglobal()::instance+96>)
#2  0x00007f954b9ebfb8 in srt::CUDTUnited::newConnection(..)  -> lock m_GlobControlLock
#3  0x00007f954ba1948d in srt::CUDT::processConnectRequest(srt::sockaddr_any const&, srt::CPacket&)
#4  0x00007f954ba60156 in srt::CRcvQueue::worker_ProcessConnectionRequest(..) -> lock m_LSLock
#5  0x00007f954ba60ec4 in srt::CRcvQueue::worker(void*) () from ../libs/libsrt.so.1.5 
#6  0x00007f954a5ddefc in start_thread (arg=<optimized out>) at pthread_create.c:479 
#7  0x00007f9549ec122f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

@ethouris
Copy link
Collaborator

ethouris commented May 7, 2024

Could this problem be reproduced with having #2893 added?

@maxsharabayko
Copy link
Collaborator

Thread sanitizer shows a potential deadlock around those functions:

  1. GC thread violates the mutex lock order by locking m_GlobControlLock (checkBrokenSockets()) first, and then m_ConnectionLock (CUDT::closeInternal()).
  2. The CRcvQueue::worker thread locks m_LSLock first, then m_GlobControlLock. The order also does not look correct.
  3. The CUDTUnited::listen function locks m_ConnectionLock, then m_LSLock.

Full Threat Report

Click to expand
==================
WARNING: ThreadSanitizer: lock-order-inversion (potential deadlock) (pid=2531)
  Cycle in lock order graph: M0 (0x7ba800005160) => M1 (0x7b4800000048) => M2 (0x5594096b2408) => M0
  M0 m_ConnectionLock
  M1 m_LSLock
  M2 m_GlobControlLock

  Mutex M1 acquired here while holding mutex M0 in main thread: m_ConnectionLock -> m_LSLock
    #0 pthread_mutex_lock ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:4481 (libtsan.so.2+0x4ea9f) (BuildId: 5e151fac359cd43a07192270fb85f74e380f2fc8)
    #1 __gthread_mutex_lock /usr/include/x86_64-linux-gnu/c++/12/bits/gthr-default.h:749 (srt-xtransmit+0xded4f) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #2 std::mutex::lock() /usr/include/c++/12/bits/std_mutex.h:100 (srt-xtransmit+0xdedd8) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #3 std::lock_guard<std::mutex>::lock_guard(std::mutex&) /usr/include/c++/12/bits/std_mutex.h:229 (srt-xtransmit+0xdfcfa) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #4 srt::CRcvQueue::setListener(srt::CUDT*) /srtcore/queue.cpp:1693 (srt-xtransmit+0x3277ad) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #5 srt::CUDT::setListenState() /srtcore/core.cpp:1018 (srt-xtransmit+0x29a94c) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #6 srt::CUDTUnited::listen(int, int) /srtcore/api.cpp:1012 (srt-xtransmit+0x25254f) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #7 srt::CUDT::listen(int, int) /srtcore/api.cpp:3571 (srt-xtransmit+0x25aea4) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #8 srt_listen /srtcore/srt_c_api.cpp:114 (srt-xtransmit+0x33f0bd) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #9 xtransmit::socket::srt::listen() /mnt/d/Projects/srt/srt-xtransmit/xtransmit/srt_socket.cpp:99 (srt-xtransmit+0x10b9a1) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #10 xtransmit::create_connection(std::vector<UriParser, std::allocator<UriParser> > const&, std::shared_ptr<xtransmit::socket::isocket>&) /mnt/d/Projects/srt/srt-xtransmit/xtransmit/misc.cpp:63 (srt-xtransmit+0xd5698) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #11 xtransmit::common_run(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, xtransmit::stats_config const&, bool, bool, std::atomic<bool> const&, std::function<void (std::shared_ptr<xtransmit::socket::isocket>, std::atomic<bool> const&)>&) /mnt/d/Projects/srt/srt-xtransmit/xtransmit/misc.cpp:132 (srt-xtransmit+0xd5c98) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #12 xtransmit::receive::run(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, xtransmit::receive::config const&, std::atomic<bool> const&) /mnt/d/Projects/srt/srt-xtransmit/xtransmit/receive.cpp:179 (srt-xtransmit+0xdd471) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #13 main /mnt/d/Projects/srt/srt-xtransmit/xtransmit/xtransmit-app.cpp:231 (srt-xtransmit+0x142732) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)

    Hint: use TSAN_OPTIONS=second_deadlock_stack=1 to get more informative warning message

  Mutex M2 acquired here while holding mutex M1 in thread T3: m_LSLock -> m_GlobControlLock
    #0 pthread_mutex_lock ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:4481 (libtsan.so.2+0x4ea9f) (BuildId: 5e151fac359cd43a07192270fb85f74e380f2fc8)
    #1 __gthread_mutex_lock /usr/include/x86_64-linux-gnu/c++/12/bits/gthr-default.h:749 (srt-xtransmit+0xded4f) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #2 std::mutex::lock() /usr/include/c++/12/bits/std_mutex.h:100 (srt-xtransmit+0xdedd8) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #3 std::lock_guard<std::mutex>::lock_guard(std::mutex&) /usr/include/c++/12/bits/std_mutex.h:229 (srt-xtransmit+0xdfcfa) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #4 srt::CUDTUnited::locateSocket(int, srt::CUDTUnited::ErrorHandling) /srtcore/api.cpp:2488 (srt-xtransmit+0x24e1d6) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #5 srt::CUDTUnited::newConnection(int, srt::sockaddr_any const&, srt::CPacket const&, srt::CHandShake&, int&, srt::CUDT*&) /srtcore/api.cpp:486 (srt-xtransmit+0x24b8c4) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #6 srt::CUDT::processConnectRequest(srt::sockaddr_any const&, srt::CPacket&) /srtcore/core.cpp:11101 (srt-xtransmit+0x2cfe39) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #7 srt::CRcvQueue::worker_ProcessConnectionRequest(srt::CUnit*, srt::sockaddr_any const&) /srtcore/queue.cpp:1411 (srt-xtransmit+0x325f6e) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #8 srt::CRcvQueue::worker(void*) /srtcore/queue.cpp:1247 (srt-xtransmit+0x324fa5) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #9 void* std::__invoke_impl<void*, void* (*)(void*), void*>(std::__invoke_other, void* (*&&)(void*), void*&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/invoke.h:61 (srt-xtransmit+0x3444aa) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #10 std::__invoke_result<void* (*)(void*), void*>::type std::__invoke<void* (*)(void*), void*>(void* (*&&)(void*), void*&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/invoke.h:96 (srt-xtransmit+0x34435a) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #11 void* std::thread::_Invoker<std::tuple<void* (*)(void*), void*> >::_M_invoke<0ul, 1ul>(std::_Index_tuple<0ul, 1ul>) /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/std_thread.h:279 (srt-xtransmit+0x3442e2) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #12 std::thread::_Invoker<std::tuple<void* (*)(void*), void*> >::operator()() /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/std_thread.h:286 (srt-xtransmit+0x344265) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #13 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void* (*)(void*), void*> > >::_M_run() /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/std_thread.h:231 (srt-xtransmit+0x343fa9) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #14 <null> <null> (libstdc++.so.6+0xe62b2) (BuildId: 2db998bd67acbfb235c464c0275d4070061695fb)

  Mutex M0 acquired here while holding mutex M2 in thread T1: m_GlobControlLock -> m_ConnectionLock
    #0 pthread_mutex_lock ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:4481 (libtsan.so.2+0x4ea9f) (BuildId: 5e151fac359cd43a07192270fb85f74e380f2fc8)
    #1 __gthread_mutex_lock /usr/include/x86_64-linux-gnu/c++/12/bits/gthr-default.h:749 (srt-xtransmit+0xded4f) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #2 std::mutex::lock() /usr/include/c++/12/bits/std_mutex.h:100 (srt-xtransmit+0xdedd8) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #3 std::lock_guard<std::mutex>::lock_guard(std::mutex&) /usr/include/c++/12/bits/std_mutex.h:229 (srt-xtransmit+0xdfcfa) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #4 srt::CUDT::closeInternal() srtcore/core.cpp:6255 locks m_ConnectionLock
    #5 srt::CUDTUnited::removeSocket(int) /srtcore/api.cpp:2789 (srt-xtransmit+0x259389) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #6 srt::CUDTUnited::checkBrokenSockets() srtcore/api.cpp:2705 locks m_GlobControlLock
    #7 srt::CUDTUnited::garbageCollect(void*) /srtcore/api.cpp:3339 (srt-xtransmit+0x24abb2) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #8 void* std::__invoke_impl<void*, void* (*)(void*), void*>(std::__invoke_other, void* (*&&)(void*), void*&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/invoke.h:61 (srt-xtransmit+0x3444aa) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #9 std::__invoke_result<void* (*)(void*), void*>::type std::__invoke<void* (*)(void*), void*>(void* (*&&)(void*), void*&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/invoke.h:96 (srt-xtransmit+0x34435a) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #10 void* std::thread::_Invoker<std::tuple<void* (*)(void*), void*> >::_M_invoke<0ul, 1ul>(std::_Index_tuple<0ul, 1ul>) /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/std_thread.h:279 (srt-xtransmit+0x3442e2) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #11 std::thread::_Invoker<std::tuple<void* (*)(void*), void*> >::operator()() /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/std_thread.h:286 (srt-xtransmit+0x344265) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #12 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void* (*)(void*), void*> > >::_M_run() /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/std_thread.h:231 (srt-xtransmit+0x343fa9) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #13 <null> <null> (libstdc++.so.6+0xe62b2) (BuildId: 2db998bd67acbfb235c464c0275d4070061695fb)

  Thread T3 'SRT:RcvQ:w1' (tid=2535, running) created by main thread at:
    #0 pthread_create ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1036 (libtsan.so.2+0x3d179) (BuildId: 5e151fac359cd43a07192270fb85f74e380f2fc8)
    #1 std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) <null> (libstdc++.so.6+0xe6388) (BuildId: 2db998bd67acbfb235c464c0275d4070061695fb)
    #2 srt::sync::StartThread(std::thread&, void* (&)(void*), void*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /srtcore/sync.cpp:91 (srt-xtransmit+0x342387) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #3 srt::CRcvQueue::init(int, unsigned long, int, int, srt::CChannel*, srt::sync::CTimer*) /srtcore/queue.cpp:1199 (srt-xtransmit+0x324c8c) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #4 srt::CUDTUnited::updateMux(srt::CUDTSocket*, srt::sockaddr_any const&, int const*) /srtcore/api.cpp:3163 (srt-xtransmit+0x251b7c) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #5 srt::CUDTUnited::bind(srt::CUDTSocket*, srt::sockaddr_any const&) /srtcore/api.cpp:935 (srt-xtransmit+0x24f393) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #6 srt::CUDT::bind(int, sockaddr const*, int) /srtcore/api.cpp:3525 (srt-xtransmit+0x25a777) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #7 srt_bind /srtcore/srt_c_api.cpp:112 (srt-xtransmit+0x33effd) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #8 operator() /mnt/d/Projects/srt/srt-xtransmit/xtransmit/srt_socket.cpp:433 (srt-xtransmit+0x10e8d4) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #9 xtransmit::socket::srt::handle_hosts() /mnt/d/Projects/srt/srt-xtransmit/xtransmit/srt_socket.cpp:493 (srt-xtransmit+0x10efe8) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #10 xtransmit::socket::srt::srt(UriParser const&) /mnt/d/Projects/srt/srt-xtransmit/xtransmit/srt_socket.cpp:67 (srt-xtransmit+0x10b254) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #11 void std::_Construct<xtransmit::socket::srt, UriParser const&>(xtransmit::socket::srt*, UriParser const&) <null> (srt-xtransmit+0xdb0f1) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #12 void std::allocator_traits<std::allocator<void> >::construct<xtransmit::socket::srt, UriParser const&>(std::allocator<void>&, xtransmit::socket::srt*, UriParser const&) <null> (srt-xtransmit+0xdaafb) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #13 std::_Sp_counted_ptr_inplace<xtransmit::socket::srt, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_Sp_counted_ptr_inplace<UriParser const&>(std::allocator<void>, UriParser const&) <null> (srt-xtransmit+0xda2bd) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #14 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<xtransmit::socket::srt, std::allocator<void>, UriParser const&>(xtransmit::socket::srt*&, std::_Sp_alloc_shared_tag<std::allocator<void> >, UriParser const&) /usr/include/c++/12/bits/shared_ptr_base.h:971 (srt-xtransmit+0xd9832) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #15 std::__shared_ptr<xtransmit::socket::srt, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<void>, UriParser const&>(std::_Sp_alloc_shared_tag<std::allocator<void> >, UriParser const&) /usr/include/c++/12/bits/shared_ptr_base.h:1712 (srt-xtransmit+0xd8be9) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #16 std::shared_ptr<xtransmit::socket::srt>::shared_ptr<std::allocator<void>, UriParser const&>(std::_Sp_alloc_shared_tag<std::allocator<void> >, UriParser const&) /usr/include/c++/12/bits/shared_ptr.h:464 (srt-xtransmit+0xd7e38) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #17 std::shared_ptr<std::enable_if<!std::is_array<xtransmit::socket::srt>::value, xtransmit::socket::srt>::type> std::make_shared<xtransmit::socket::srt, UriParser const&>(UriParser const&) <null> (srt-xtransmit+0xd71db) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #18 xtransmit::create_connection(std::vector<UriParser, std::allocator<UriParser> > const&, std::shared_ptr<xtransmit::socket::isocket>&) /mnt/d/Projects/srt/srt-xtransmit/xtransmit/misc.cpp:59 (srt-xtransmit+0xd55f9) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #19 xtransmit::common_run(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, xtransmit::stats_config const&, bool, bool, std::atomic<bool> const&, std::function<void (std::shared_ptr<xtransmit::socket::isocket>, std::atomic<bool> const&)>&) /mnt/d/Projects/srt/srt-xtransmit/xtransmit/misc.cpp:132 (srt-xtransmit+0xd5c98) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #20 xtransmit::receive::run(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, xtransmit::receive::config const&, std::atomic<bool> const&) /mnt/d/Projects/srt/srt-xtransmit/xtransmit/receive.cpp:179 (srt-xtransmit+0xdd471) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #21 main /mnt/d/Projects/srt/srt-xtransmit/xtransmit/xtransmit-app.cpp:231 (srt-xtransmit+0x142732) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)

  Thread T1 (tid=2533, running) created by main thread at:
    #0 pthread_create ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1036 (libtsan.so.2+0x3d179) (BuildId: 5e151fac359cd43a07192270fb85f74e380f2fc8)
    #1 std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) <null> (libstdc++.so.6+0xe6388) (BuildId: 2db998bd67acbfb235c464c0275d4070061695fb)
    #2 srt::sync::StartThread(std::thread&, void* (&)(void*), void*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /srtcore/sync.cpp:91 (srt-xtransmit+0x342387) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #3 srt::CUDTUnited::startup() /srtcore/api.cpp:266 (srt-xtransmit+0x24a654) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #4 srt::CUDT::startup() /srtcore/api.cpp:3360 (srt-xtransmit+0x259c27) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #5 srt_startup /srtcore/srt_c_api.cpp:32 (srt-xtransmit+0x33eb55) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #6 NetworkInit::NetworkInit() /mnt/d/Projects/srt/srt-xtransmit/xtransmit/xtransmit-app.cpp:48 (srt-xtransmit+0x15340c) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)
    #7 main /mnt/d/Projects/srt/srt-xtransmit/xtransmit/xtransmit-app.cpp:220 (srt-xtransmit+0x1426c2) (BuildId: 61e8e3fcf6378d7f53d4d27765628463dfc8bed6)

SUMMARY: ThreadSanitizer: lock-order-inversion (potential deadlock) /usr/include/x86_64-linux-gnu/c++/12/bits/gthr-default.h:749 in __gthread_mutex_lock
==================

@ethouris
Copy link
Collaborator

ethouris commented May 7, 2024

There was some PR where m_ConnectionLock was taken out for the call to closeInternal(), but I don't remember the details.

The inversion between m_LSLock and m_ConnectionLock is known, but it was later proven that these two things can never happen simultaneously - it's just detected by the thread sanitizer that these two cases may happen in theory. It would be nice to dissolve them, but no sensible way to fix it was found.

@tvaknin
Copy link
Author

tvaknin commented May 8, 2024

Could this problem be reproduced with having #2893 added?

I can try this next week

@tvaknin
Copy link
Author

tvaknin commented May 15, 2024

I tried the fix using the branch "dev-add-socket-busy-counter". I can reproduce the same problem. let me know if you need more info.

@maxsharabayko
Copy link
Collaborator

We need somehow to identify who is holding the mutex (see my comment above). Can you check other threads? There should be some trying to lock another mutex while m_GlobControlLock is locked.

@ethouris
Copy link
Collaborator

You may try the fix in #2032. I can't find any PR with that fix which does unlocking for the time of closeInternal().

@tvaknin
Copy link
Author

tvaknin commented May 16, 2024

We need somehow to identify who is holding the mutex (see my comment above). Can you check other threads? There should be some trying to lock another mutex while m_GlobControlLock is locked.

Attached full BT of all threads.

np_backtrace.log

@maxsharabayko
Copy link
Collaborator

maxsharabayko commented May 17, 2024

I see 7 threads calling srt_sendmsg and trying to lock m_GlobControlLock to lockate the socket, one GC thread, and one thread calling getsockstate. One srt_sendmsg thread (Thread 26) is releasing the mutex, so it does not look like a deadlock, just too many threads waiting for m_GlobControlLock.
It feels like the m_GlobControlLock should at least become a shared mutex to simplify access without modification.
See also #2393.

This issue is likely similar to #2252.

@ethouris
Copy link
Collaborator

Just took a quick look - m_GlobControlLock occurs in this log 10 times, there are many threads that are trying to lock this mutex and one - OutWorker_4 - is at this moment unlocking the mutex. Definitely not this mutex is a reason of a deadlock. Must be something else, likely one of the CVs.

@ethouris
Copy link
Collaborator

And the only other thing I can see where any of the SRT threads is standing is CSndUList::waitNonEmpty() call, which in general is waiting until there's anything scheduled for sending over the channel, for which this worker is running. Should that be a point of deadlock by some reason, we might add a possibility of waiting for a limited time and on timeout we can check if the condition is somehow satisfied to detect waiting failures (that waiting part is likely the unchanged legacy code and I have doubts as to whether it should be done this way because the first condition check should be done AFTER LOCKING, and only then if not satisfied the waiting function should be executed).

@maxsharabayko
Copy link
Collaborator

CSndUList::waitNonEmpty() gets interrupted either on a new socket added in the heap, or if the CSndQueue is being destroyed.
@ethouris you suggest it may be causing a deadlock in the situation when m_bClosing is being set to true right after the check

if (!self->m_bClosing)
{
    self->m_pSndUList->waitNonEmpty();
    IF_DEBUG_HIGHRATE(self->m_WorkerStats.lCondWait++);
}

while this notification is also being missed

srt::CSndQueue::~CSndQueue()
{
    m_bClosing = true;

    if (m_pTimer != NULL)
    {
        m_pTimer->interrupt();
    }

    // Unblock CSndQueue worker thread if it is waiting.
    m_pSndUList->signalInterrupt();

@tvaknin
Copy link
Author

tvaknin commented Jun 2, 2024

Hi, let me know if I can help from my side or if some info is needed.

@maxsharabayko
Copy link
Collaborator

@tvaknin
We still only speculate on the possible issue here. Backtraces do not show a deadlock: there is a thread releasing the mutex.
Is it a deadlock or #2252?

@tvaknin
Copy link
Author

tvaknin commented Jun 3, 2024

@maxsharabayko Can you confirm srt_sendmsg uses the same mutex that is used for configuration? - because my case is that the neptonManager thread is stuck forever.

@ethouris
Copy link
Collaborator

ethouris commented Jun 3, 2024

What do you mean by "for configuration"?

@tvaknin
Copy link
Author

tvaknin commented Jun 3, 2024

For example, look at this bt:

A call to the SRT API of srt_getsockstate is used (I called it configuration - ignore this name):

Thread 39 (Thread 0x7f3876ffd700 (LWP 46389) "neptonManager"): #0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103 #1 0x00007f38c168f411 in __GI___pthread_mutex_lock (mutex=0x7f38c2b5c480 <srt::CUDT::uglobal()::instance+96>) at ../nptl/pthread_mutex_lock.c:80 #2 0x00007f38c2a9168e in srt::CUDTUnited::getStatus(int) () from /opt/erola/smu/Packagenepton/../libs/libsrt.so.1.5 #3 0x00007f38c2a95a0f in srt::CUDT::getsockstate(int) () from /opt/erola/smu/Packagenepton/../libs/libsrt.so.1.5 #4 0x00000000004c203f in SrtConnectionManager::CheckForConnection (this=0x10ecea0 <SrtConnectionManager::GetInstance()::inst>) at muxer/SrtConnectionManager.cpp:201 #5 0x000000000043c901 in Muxer::UpdateSrtStatus (this=<optimized out>) at muxer/Muxer.cpp:331 #6 0x0000000000474a85 in neptonManager::RunProtobufServer (this=this@entry=0x7f38bc538010) at manager/neptonManager.cpp:509 #7 0x00000000004751bb in neptonManager::run (this=0x7f38bc538010) at /opt/omneon/cross_x86_64_v10d/include/c++/12.2.0/bits/basic_string.h:233 #8 0x0000000000b5412f in blue::kernel::Thread::platform_info::threadFunction (data=0x7f38bc538010) at Src/Blue/Kernel/Thread.cpp:173 #9 0x00007f38c168cefc in start_thread (arg=<optimized out>) at pthread_create.c:479 #10 0x00007f38c0f7022f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Versus "srt_sendmsg" calls.

@maxsharabayko
Copy link
Collaborator

maxsharabayko commented Jun 5, 2024

@tvaknin

@maxsharabayko Can you confirm srt_sendmsg uses the same mutex that is used for configuration? - because my case is that the neptonManager thread is stuck forever.

From your last backtrace the neptonManager thread 39 is waiting to acquire the m_GlobControlLock:

#1  0x00007f38c168f411 in __GI___pthread_mutex_lock (mutex=0x7f38c2b5c480 <srt::CUDT::uglobal()::instance+96>) at ../nptl/pthread_mutex_lock.c:80
#2  0x00007f38c2a9168e in srt::CUDTUnited::getStatus(int) () from /opt/erola/smu/Packagenepton/../libs/libsrt.so.1.5
#3  0x00007f38c2a95a0f in srt::CUDT::getsockstate(int) () from /opt/erola/smu/Packagenepton/../libs/libsrt.so.1.5

Thread 26 is just about to release it:

#0  __lll_unlock_wake () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:339
#1  0x00007f38c169095e in __pthread_mutex_unlock_usercnt (decr=1, mutex=0x7f38c2b5c480 <srt::CUDT::uglobal()::instance+96>) at pthread_mutex_unlock.c:56
#2  __GI___pthread_mutex_unlock (mutex=0x7f38c2b5c480 <srt::CUDT::uglobal()::instance+96>) at pthread_mutex_unlock.c:356
#3  0x00007f38c2b2e9bc in srt::sync::ScopedLock::~ScopedLock() () from /opt/erola/smu/Packagenepton/../libs/libsrt.so.1.5
#4  0x00007f38c2a77ff1 in srt::CUDTUnited::locateSocket(int, srt::CUDTUnited::ErrorHandling) [clone .cold] () from /opt/erola/smu/Packagenepton/../libs/libsrt.so.1.5
#5  0x00007f38c2a9523c in srt::CUDT::sendmsg2(int, char const*, int, SRT_MsgCtrl_&) () from /opt/erola/smu/Packagenepton/../libs/libsrt.so.1.5

But there are 7 other sendmsg threads waiting to acquire the m_GlobControlLock as well, so the "neptonManager" thread may not have a chance to acquire it in a reasonable time.

If it is the case, then the problem is in m_GlobControlLock taking too much responsibility (#2393). The logic must be redesigned.

@tvaknin
Copy link
Author

tvaknin commented Jun 5, 2024

Understood. I see that the neptonManager thread 39 is stuck waiting to acquire the m_GlobControlLock indefinitely, or at least for the several minutes I observed during debugging. Could this be due to a few threads sending packets (sendmsg) and holding the m_GlobControlLock for too long?

By changing the logic, do you mean modifying how the SRT library uses m_GlobControlLock and making the changes described in this issue: (#2393)?

@maxsharabayko
Copy link
Collaborator

m_GlobControlLock guards all containers in CUDTUnited. One of the first steps could be changing its type from a regular mutex to a readers-writer lock. Then several threads could access containers for reading simultaneously.
We can't use the std::shared_mutex directly because we have to support C++03. A wrapper has to be added to the srt::sync module.

@maxsharabayko maxsharabayko modified the milestones: v1.5.4, v1.6.0 Aug 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[core] Area: Changes in SRT library core Type: Bug Indicates an unexpected problem or unintended behavior
Projects
None yet
Development

No branches or pull requests

3 participants