Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The UpdaterService test is flaky #46255

Closed
jaredpar opened this issue Jul 23, 2020 · 7 comments · Fixed by #46346
Closed

The UpdaterService test is flaky #46255

jaredpar opened this issue Jul 23, 2020 · 7 comments · Fixed by #46346
Assignees
Labels
Area-IDE Disabled Test Flaky Resolution-Fixed The bug has been fixed and/or the requested behavior has been implemented Urgency-Now
Milestone

Comments

@jaredpar
Copy link
Member

Builds

Build Pull Request Test Failure Count
#741836 Rolling 1
#742047 #46210 1
#742073 #46229 1
#742078 Rolling 1
#742108 #46233 1
#742242 Rolling 2
#742430 #46234 1
#742466 #46026 1
#742550 Rolling 1
#742551 Rolling 1
#742560 #44447 1
#742614 #45348 2
#742716 #45348 1
#742765 Rolling 1
#742784 #45993 2
#743055 #44447 1

Configurations

  • Windows Desktop Debug Test32
  • Windows Desktop Debug Test64
  • Windows Desktop Release Test32
  • Windows Desktop Release Test64
  • Windows Desktop Spanish
@jaredpar jaredpar added this to the 16.8 milestone Jul 23, 2020
@jaredpar
Copy link
Member Author

@sharwell, @JoeRobich fyi

@sharwell
Copy link
Member

sharwell commented Jul 23, 2020

I'm fairly sure that these issues have the same root cause:

#45877
#46230
#46255 (this issue)

So far I have not been able to find a root cause.

@333fred
Copy link
Member

333fred commented Jul 24, 2020

This has failed 42 times in the last 14 days and has a pass rate of 95.66%.

@sharwell
Copy link
Member

@tmat This test is being disabled in #46310 so it needs to be fixed ASAP

@sharwell
Copy link
Member

This failure is caused by a race condition between four operations:

  1. The "Update" operation, where _workspace.CurrentSolution is changed
  2. The "Enqueue" operation, where SolutionChecksumUpdater handles the WorkspaceChanged event
  3. The "Wait" operation, where SolutionChecksumUpdater waits for work
  4. The "Execute" operation, where SolutionChecksumUpdater processes work

I derived a sequence for deterministic failure by working back from the following observations:

  1. The call to WaitAsync in IdleProcessor.ProcessAsync is not located within an async operation. Specifically there is an unprotected region of code between these two lines:

    await WaitAsync(CancellationToken).ConfigureAwait(false);
    using (Listener.BeginAsyncOperation("ProcessAsync"))

  2. The "Execute" operation always calls _workspace.CurrentSolution.State.GetChecksumAsync, which means the final "Execute" operation in the failed sequence completed before the last "Update" operation.

From (2), we know that the fourth "Enqueue" operation completed before the third "Wait" operation (i.e. the fourth enqueue operation returned early from the _event.CurrentCount check. We also know that the third "Execute" operation was never performed, which means the third "Enqueue" operation failed to assign a new token to _lastToken.

This leads to the following sequence:

  1. Second "Wait" operation (ensures _event.CurrentCount is 0 and _lastToken is not null)
  2. Third "Enqueue" operation (requires _event.CurrentCount is 0, ensures _lastToken is not null and _event.CurrentCount is 1)
  3. Second "Execute" operation (requires _lastToken is not null, ensures _lastToken is null)
  4. Fourth "Update" operation (ensures _workspace.CurrentSolution.State.GetChecksumAsync has not been called)
  5. Fourth "Enqueue" operation (requires _event.CurrentCount is 1, takes no action)
  6. Third "Wait" operation (requires _lastToken is null)

During the final step of this sequence, there is no pending work for either the Workspace or SolutionChecksumUpdater waiters, so the test proceeds to completion without reaching the required call to GetChecksumAsync.

@CyrusNajmabadi
Copy link
Member

WHat are you feelings on just burning this to the ground and replacing with something much simpler, versus patching in place like you've done? This seems liek a super complex beast for what is conceptually such a simple component (i.e. a throttled, pausable, sequence of work).

@sharwell
Copy link
Member

I feel confident the PR I submitted resolves the race condition here, and time would be better spent on resolving the other issues from #46255 (comment) than rewriting this.

sharwell added a commit to sharwell/roslyn that referenced this issue Jul 28, 2020
@sharwell sharwell added the Resolution-Fixed The bug has been fixed and/or the requested behavior has been implemented label Jul 29, 2020
@sharwell sharwell assigned sharwell and unassigned tmat Jul 29, 2020
@sharwell sharwell modified the milestones: 16.8, Next Jul 29, 2020
@RikkiGibson RikkiGibson removed this from the Next milestone Aug 11, 2020
@RikkiGibson RikkiGibson added this to the 16.8.P2 milestone Aug 11, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-IDE Disabled Test Flaky Resolution-Fixed The bug has been fixed and/or the requested behavior has been implemented Urgency-Now
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants