Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race condition around supervisor's Commander #291

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

tpaschalis
Copy link
Member

@tpaschalis tpaschalis commented Jul 1, 2024

From what I'd see under normal circumstances (and also the example), starting up the supervisor which also launches the commander would happen in a separate goroutine than the one that inspects its state or calls Stop.

The current implementation has a race condition on the cmd *exec.Cmd field, using it both for figuring out whether the commander is running and also actually executing a new Agent process, which could potentially lead to a panic.

This PR fixes a small race condition around the supervisor's Commander by utilizing an existing running field and turning it into a atomic.Bool to store the commander's state.

Fixes #290

Signed-off-by: Paschalis Tsilias <paschalis.tsilias@grafana.com>
@tpaschalis tpaschalis requested a review from a team July 1, 2024 08:24
@@ -102,21 +102,21 @@ func (c *Commander) Pid() int {

// ExitCode returns Agent process exit code if it exited or 0 if it is not.
func (c *Commander) ExitCode() int {
if c.cmd == nil || c.cmd.ProcessState == nil {
if c.IsRunning() {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I've actually flipped the condition from what it was before, as per the method's docstring.

Copy link
Member

@srikanthccv srikanthccv Jul 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this potentially leave the agent process running in some cases? The NewSupervisor method starts a goroutine runAgentProcess to launch the agent process. There's a possibility that Shutdown might be called before c.cmd.Start() finishes. If c.cmd.Start() completes successfully, the process would keep running, which isn't expected.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, let me think about this (and if the previous code prevented that from happening). We could potentially store the running value a little earlier to avoid that IIUC.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there, sorry for taking so long to get back to you. As far as I can tell from the code, this was already the case and it isn't a regression from this PR.

To avoid this, we'd make it the caller's responsibility to call only call Stop() after IsRunning() returns true (for example having a backoff mechanism for properly cleaning up resources), or add more explicit synchronization like we do in clientcommon

// True when stopping is in progress.
isStoppingFlag bool
isStoppingMutex sync.RWMutex

IMO this is a separate issue than the race condition the one's fixed here, so tackling the cleanup of resoueces might be a new discussion. WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think it was possible to leave the process running in the background because we are checking if c.cmd is non-nil (which is the source of the race condition) instead of running status. I think the Stop should be updated to handle this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay.

Here is the case I think this still doesn't address

  1. Start() is called
  2. The process begins starting up but hasn't yet reached the point where c.running.Store(true) is executed
  3. Shutdown() is called during this window
  4. Since !c.IsRunning() evaluates to true at this point, Shutdown() returns immediately
  5. This could result in the process continuing to run

Does that make sense?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI I've pushed another commit in the Stop method to set the isStoppingFlag immediately.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, let me try to get this right.

The commander's Start is called only in two cases
a) In the Supervisor's startAgent.
b) During a restart (not pertinent to our discussion)

The startAgent method is only called via the runAgentProcess; in turn, this is only called as a new goroutine in the NewSupervisor function.

In the Shutdown method, the Commander's Stop sets the isStopping flag.

So let's say that NewSupervisor starts, and the new goroutine is launched.

Immediately, the commander's Shutdown method is called, to call the commander's Stop and set the isStoppingFlag. When the call stack gets to the commander's Start it will immediately exit with the IsStopping method without having a chance to launch any process.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @srikanthccv I've put out a test in f135e3b to try and verify the behavior we want.

I couldn't find any more elegant way than the ugly DelayLogger, but I'd like to see what you think. (This test fails in main, without my fix here).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So let's say that NewSupervisor starts, and the new goroutine is launched.

Immediately, the commander's Shutdown method is called, to call the commander's Stop and set the isStoppingFlag. When the call stack gets to the commander's Start it will immediately exit with the IsStopping method without having a chance to launch any process.

Take this case

NewSupervisor starts, and the goroutine runAgentProcess is launched, -> startAgent -> s.commander.Start -> the execution reaches c.cmd.Start

if err := c.cmd.Start(); err != nil {

And say cmd.Start didn't return yet

Now, assume the commander's Shutdown method is called (c.cmd.Start still didn't return) so c.running is false. In this flow isStoppingFlag is not helping because we would still return as IsRunning evaluates to true.

I think if we add a waitgroup before the c.cmd.Start and wait in shutdown should solve the issue.

Signed-off-by: Paschalis Tsilias <paschalis.tsilias@grafana.com>
Copy link

codecov bot commented Jul 9, 2024

Codecov Report

Attention: Patch coverage is 0% with 4 lines in your changes missing coverage. Please review.

Project coverage is 77.06%. Comparing base (ed38d5f) to head (f135e3b).
Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
internal/noplogger.go 0.00% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #291      +/-   ##
==========================================
+ Coverage   76.32%   77.06%   +0.74%     
==========================================
  Files          25       25              
  Lines        1681     1696      +15     
==========================================
+ Hits         1283     1307      +24     
+ Misses        291      286       -5     
+ Partials      107      103       -4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@tigrannajaryan
Copy link
Member

@tpaschalis @srikanthccv is this still in progress?

Signed-off-by: Paschalis Tsilias <paschalis.tsilias@grafana.com>
@tpaschalis
Copy link
Member Author

@tpaschalis @srikanthccv is this still in progress?

Srikanth had suggested using more explicit synchronization to avoid potentially leaving a dangling process; if the Shutdown method of the supervisor is called, then the isStoppingFlag will be set and the call to the commander's Start will abort before starting the new process.

I'm looking forward to feedback!

Signed-off-by: Paschalis Tsilias <paschalis.tsilias@grafana.com>
Signed-off-by: Paschalis Tsilias <paschalis.tsilias@grafana.com>
@tigrannajaryan
Copy link
Member

@tpaschalis @srikanthccv I assume you are working on this PR. Please ping me when it is ready.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TestNewSupervisor has a flaky race condition
3 participants