-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix race condition in TransferHandle::WaitUntilFinished() deadlocking #511
Conversation
Any thing I can do to help get this merged? |
I will follow up on this with you |
Hi ferrouswheel, Did the deadlock you encountered occur when transfer error occurred? I encountered a deadlock (from time to time), on transfer errors, and suggested a fix here: #520 If your case was on failure, I think my fix is better. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good will merge shortly
@Eliyahu-Machluf afaikl the deadlock happened even on success. The file would be available on s3, but our processes would just hang in |
OK. so the fix of #520 is not relevant for the problem you encountered, as it fixes deadlock on transfer error. |
@ferrouswheel After applying your changes, it's still possible to incur deadlock. Thinking about this sequence: A better way is to apply you changes and move lock inside WaitUntilFinished ahead of If condition |
@singku Yes - I see what you mean. I've made this change. |
if (!IsFinishedStatus(static_cast<TransferStatus>(m_status.load())) || HasPendingParts()) | ||
{ | ||
std::unique_lock<std::mutex> semaphoreLock(m_statusLock); | ||
m_waitUntilFinishedSignal.wait(semaphoreLock, [this]() | ||
{ return IsFinishedStatus(static_cast<TransferStatus>(m_status.load())) && !HasPendingParts(); }); | ||
semaphoreLock.unlock(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary to unlock explicitly? Shouldn't the system unlock it when lock get destructed?
Merged |
I was debugging a rare deadlock condition we would observe on long running jobs processing data from s3. When I attached to the process it appears
TransferHandle::WaitUntilFinished()
would wait indefinitely.From reading the code, it appears possible that
m_waitUntilFinishedSignal.notify_all();
may be called beforem_waitUntilFinishedSignal.wait(...);
if theWaitUntilFinished
thread loses context after the conditional check onm_status.load()
.The fix (I think) is to broaden the scope of the lock to encompass setting the new status.
I'm not sure if the lock in
UpdateStatus
should actually be placed in a block to ensurem_statusLock
is immediate available whennotify_all()
is called?