
fix for mutex slim being released too early during flush operation #1585

Merged: 4 commits, Nov 13, 2020

Conversation

@arsnyder16 commented Oct 14, 2020

Fixes #1438

When flushing during FlushSync we specify a timeout on the Task wait, but when this timeout is hit the underlying task is still running. The thread that invoked FlushSync continues and releases the MutexSlim protecting the io pipe, allowing another thread to obtain the MutexSlim and run operations on the pipe concurrently.

To fix this, use the cancellation token that can be passed to PipeWriter.FlushAsync.

An alternative would be to not specify a timeout at all. There are two additional locations that flush the pipe; neither specifies a timeout, they simply await the Task. Generally, the timeouts are driven by how long an operation waits to obtain the MutexSlim.
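To illustrate the hazard and the fix described above, here is a minimal Python sketch. The library itself is C#; the lock, the `flush` function, and all names below are hypothetical stand-ins, not StackExchange.Redis code. Timing out the *wait* leaves the flush running; signalling cancellation stops the flush itself before the lock is released.

```python
# Hypothetical stand-ins for the MutexSlim and the pipe flush; not library code.
import concurrent.futures as cf
import threading
import time

pipe_lock = threading.Lock()             # plays the role of the MutexSlim
flush_outlived_wait = threading.Event()  # set only if the flush ran to the end
                                         # after the timed-out wait gave up on it

def flush(cancel: threading.Event) -> str:
    """Simulate a slow pipe flush that honors a cancellation signal."""
    for _ in range(10):
        if cancel.is_set():
            return "cancelled"
        time.sleep(0.05)
    flush_outlived_wait.set()
    return "flushed"

with cf.ThreadPoolExecutor(max_workers=1) as pool:
    cancel = threading.Event()
    with pipe_lock:
        future = pool.submit(flush, cancel)
        try:
            # Buggy pattern: timing out the wait does not stop the flush,
            # which would keep touching the pipe after the lock is released.
            future.result(timeout=0.1)
        except cf.TimeoutError:
            # Fixed pattern: cancel the flush itself, then wait for it to
            # stop before releasing the lock that protects the pipe.
            cancel.set()
            future.result()
    print(flush_outlived_wait.is_set())  # → False: flush ended before release
```

In the actual C# fix, the role of the `cancel` event is played by a CancellationToken passed to PipeWriter.FlushAsync.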

@arsnyder16 (Author)

Looks like the CI checks are failing for an unrelated issue; the other PRs seem to have the same failure.

@mgravell (Collaborator)

mgravell commented Oct 14, 2020 via email

@arsnyder16 (Author)

Thanks, I would be interested in hearing your thoughts, since I am not overly familiar with the library. Feel free to adjust it in another way; this approach seemed to make the most sense to me.

@ghost commented Oct 14, 2020

I can't speak to the exact fix, but the described behavior matches exactly the symptoms we've seen provoke the issue.

@ghost commented Oct 14, 2020

Okay, I've walked through the precise fixes. Normally I'd like to see a continuation bound for the on-cancel case, but that's not really the model or style of the lib, I think?

@Plasma commented Oct 17, 2020

I agree the timeout can cause the lock to be released (incorrectly) while the task is still making progress in the background, so I think this change is correct.

@NickCraver (Collaborator)

Merging main in here for updated build checks

@JKurzer commented Nov 3, 2020

Do we have a timeline for accepting the PR? This issue continues to block us from rolling our version forward.

@mgravell (Collaborator) commented Nov 4, 2020

Sorry, I had planned to already look at it, but things got ... busy. I hope to look tomorrow (although I appreciate that I've said that before)

@JKurzer commented Nov 4, 2020

No worries, we're in a stable-ish state on an older version, and life is pretty complicated for all of us right now.

@mgravell (Collaborator) commented Nov 5, 2020

I concur with the problem, but I'm very dubious of allocating a CTS with a scheduled timeout on every call; I wonder instead if we could allocate and reuse a CTS, and only schedule the timeout (dooming it for reuse) when we get a non-sync response; something like this: e1e7189

(I recommend viewing without whitespace deltas: e1e7189?w=1)

thoughts? if you agree that this minor tweak makes sense, feel free to cherry-pick or merge it into your branch, and we can get this merged.

A very nuanced spot - congrats for finding this and offering a fix.

@mgravell mgravell merged commit 3d43209 into StackExchange:main Nov 13, 2020
@mgravell (Collaborator)

merged with the CTS tweak as proposed

Successfully merging this pull request may close these issues: Get & Set Failing After Error Cascade

5 participants