Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race condition in TransferHandle::WaitUntilFinished() deadlocking #511

Closed
wants to merge 2 commits into from
Closed

Fix race condition in TransferHandle::WaitUntilFinished() deadlocking #511

wants to merge 2 commits into from

Conversation

ferrouswheel
Copy link

@ferrouswheel ferrouswheel commented Apr 24, 2017

I was debugging a rare deadlock condition we would observe on long running jobs processing data from s3. When I attached to the process it appears TransferHandle::WaitUntilFinished() would wait indefinitely.

From reading the code, it appears possible that m_waitUntilFinishedSignal.notify_all(); may be called before m_waitUntilFinishedSignal.wait(...); if the WaitUntilFinished thread loses context after the conditional check on m_status.load().

The fix (I think) is to broaden the scope of the lock to encompass setting the new status.

I'm not sure if the lock in UpdateStatus should actually be placed in a block to ensure m_statusLock is immediate available when notify_all() is called?

@ferrouswheel
Copy link
Author

Any thing I can do to help get this merged?

@singku
Copy link
Contributor

singku commented May 3, 2017

I will follow up on this with you

@Eliyahu-Machluf
Copy link

Hi ferrouswheel,

Did the deadlock you encountered occur when transfer error occurred? I encountered a deadlock (from time to time), on transfer errors, and suggested a fix here: #520

If your case was on failure, I think my fix is better.

Copy link
Contributor

@JonathanHenson JonathanHenson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good will merge shortly

@ferrouswheel
Copy link
Author

@Eliyahu-Machluf afaikl the deadlock happened even on success. The file would be available on s3, but our processes would just hang in WaitUntilFinished()

@Eliyahu-Machluf
Copy link

OK. so the fix of #520 is not relevant for the problem you encountered, as it fixes deadlock on transfer error.

@singku
Copy link
Contributor

singku commented May 11, 2017

@ferrouswheel After applying your changes, it's still possible to incur deadlock. Thinking about this sequence:
Thread A: run line 249 (If condition succeed then ready to run line 251-252, lock & wait)
Thread B: run line 232 to line 242 (update status then lock and notify all)
Thread A: run line 251-252 (lock & wait) , now notification from Thread B has lost.

A better way is to apply you changes and move lock inside WaitUntilFinished ahead of If condition

@ferrouswheel
Copy link
Author

@singku Yes - I see what you mean. I've made this change.

if (!IsFinishedStatus(static_cast<TransferStatus>(m_status.load())) || HasPendingParts())
{
std::unique_lock<std::mutex> semaphoreLock(m_statusLock);
m_waitUntilFinishedSignal.wait(semaphoreLock, [this]()
{ return IsFinishedStatus(static_cast<TransferStatus>(m_status.load())) && !HasPendingParts(); });
semaphoreLock.unlock();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary to unlock explicitly? Shouldn't the system unlock it when lock get destructed?

JonathanHenson pushed a commit that referenced this pull request May 12, 2017
* Fix race condition in TransferHandle::WaitUntilFinished() deadlocking
#511 #520
@singku
Copy link
Contributor

singku commented May 12, 2017

Merged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants