[BUG] [Microsoft.Azure.ServiceBus] Closing MessageReceiver Does not Always Close inner ReceivingAmqpLink #16994

Closed
paulsavides opened this issue Nov 16, 2020 · 13 comments
Labels
Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team question The issue doesn't require a change to the product in order to be resolved. Most issues start as that Service Attention Workflow: This issue is responsible by Azure service team. Service Bus

Comments

@paulsavides

Describe the bug
During periods with a large number of server-side errors, we sometimes see messages get "stuck" in queues: they sit in the queue for the configured message lock timeout before being redelivered.

After some attempts to reproduce with a smaller example, I have found that, in certain scenarios, calling MessageReceiver.CloseAsync() does not actually close the inner ReceivingAmqpLink. The link then sits in the background and keeps picking up messages.

The only way I was able to reproduce this is by closing the receiver and opening a new one when an error arrives on the ExceptionHandler. My best guess as to why this occurs: when the inner link faults, it auto-recovers in OnReceiveAsync(), so there may be a race condition between auto-recovery and closing the receiver at around the same time.

Of course, there may also be something completely off with my usage of the SDK here.

Expected behavior
In general, I would expect that CloseAsync() would always close the inner ReceivingAmqpLink.

To Reproduce
Reproduction Repo = https://github.com/paulsavides/ServiceBusTesting

ReproProject is the project that reproduces this issue. If the code is doing something extremely incorrect, please let me know. We are actually using the MassTransit library to interact with AzureServiceBus so I had to recreate a bit of what it was doing that reproduces the error.

  1. Open the solution from the reproduction repo
  2. Set ReproProject as Startup project
  3. Fill in Endpoint & Shared Access Key Signature in Program.cs
  4. Run the project
  5. While the project is running, open the Queue in the Azure UI & continually update the Auto-delete after idle setting.
  6. Eventually, in the console output, you will see errors coming through the exception handler & the receiver will 'recycle' some number of times
  7. After recycle, you should start seeing message sends & receives being mismatched
    • if not, go back to step 5
  8. Press d to print out diagnostics on all of the links from "closed" receivers that are still open & the number of unsettled messages from those links

Environment:

  • Microsoft.Azure.ServiceBus 5.0.0
  • .NET SDK 3.1.102, Microsoft.NETCore.App 3.1.9
  • Visual Studio 16.8.1
  • Verified that the issue occurs on the Azure Service Bus standard tier; I believe I have seen it on the premium tier as well.

Please let me know if you require any clarification from me.

Thank you for taking the time to look into this,
Paul Savides

@ghost ghost added needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. customer-reported Issues that are reported by GitHub users external to the Azure organization. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Nov 16, 2020
@jsquire jsquire added Client This issue points to a problem in the data-plane of the library. needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team Service Attention Workflow: This issue is responsible by Azure service team. Service Bus labels Nov 17, 2020
@ghost ghost removed the needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. label Nov 17, 2020
@ghost

ghost commented Nov 17, 2020

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @axisc.

@jsquire
Member

jsquire commented Nov 17, 2020

Thank you for your feedback. Tagging and routing to the team best able to assist.

@jsquire
Member

jsquire commented Nov 17, 2020

//fyi: @JoshLove-msft

@paulsavides
Author

paulsavides commented Nov 19, 2020

I highly suspect this issue is due to the Microsoft.Azure.Amqp Singleton (and thus the FaultTolerantAmqpObject) class not properly handling concurrent closes and openings.

In my linked repo you can see the (failing) test here >> https://github.com/paulsavides/ServiceBusTesting/blob/4cf80ba2f8e2d725b2d1923600d4aba3f26ee9b3/InterestingTests/SingletonTests.cs#L12-L38

Although the singleton is closed & subsequent GetOrCreateAsync calls will fail, the internal value is not closed.
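To illustrate the symptom without pulling in the AMQP library, here is a minimal, self-contained sketch of how a lazily created value can outlive its holder. `FakeLink`, `NaiveSingleton` and `RaceDemo` are hypothetical names made up for this example; this is not the Microsoft.Azure.Amqp Singleton code, only a deliberately naive model of the interleaving I suspect.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for an AMQP link; not a type from Microsoft.Azure.Amqp.
public sealed class FakeLink
{
    public bool IsClosed { get; private set; }
    public void Close() => IsClosed = true;
}

// Deliberately naive holder that models the suspected race (assumption, not library code).
public sealed class NaiveSingleton
{
    private readonly Func<Task<FakeLink>> factory;
    private readonly object gate = new object();
    private FakeLink value;
    private bool closed;

    public NaiveSingleton(Func<Task<FakeLink>> factory) => this.factory = factory;

    public async Task<FakeLink> GetOrCreateAsync()
    {
        lock (gate)
        {
            if (closed) throw new ObjectDisposedException(nameof(NaiveSingleton));
            if (value != null) return value;
        }

        // Creation runs outside the lock, so Close() can interleave here.
        FakeLink created = await factory().ConfigureAwait(false);

        lock (gate)
        {
            // Bug: 'closed' is not re-checked, so a link created after Close() stays open.
            value = created;
            return created;
        }
    }

    public void Close()
    {
        lock (gate)
        {
            closed = true;
            value?.Close(); // value is still null while creation is in flight
        }
    }
}

public static class RaceDemo
{
    // Returns true when the link is still open after the holder was closed.
    public static async Task<bool> LinkIsOrphanedAsync()
    {
        var creationStarted = new TaskCompletionSource<bool>();
        var allowCreation = new TaskCompletionSource<bool>();

        var singleton = new NaiveSingleton(async () =>
        {
            creationStarted.SetResult(true);
            await allowCreation.Task.ConfigureAwait(false);
            return new FakeLink();
        });

        Task<FakeLink> pending = singleton.GetOrCreateAsync();
        await creationStarted.Task.ConfigureAwait(false);
        singleton.Close();             // close while creation is in flight
        allowCreation.SetResult(true); // creation completes afterwards

        FakeLink link = await pending.ConfigureAwait(false);
        return !link.IsClosed;
    }
}
```

In this model, later GetOrCreateAsync calls throw because the holder is closed, yet the link handed back from the in-flight creation is never closed, which matches what the failing test observes.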

In the usage here, I believe the following two sections are being run in such an order as to cause the issue I've replicated in that unit test.

During periods of transient errors, this section may enter the workflow to recreate the message receiver.

if(!this.ReceiveLinkManager.TryGetOpenedObject(out receiveLink))
{
    MessagingEventSource.Log.CreatingNewLink(this.ClientId, this.isSessionReceiver, this.SessionIdInternal, false, this.LinkException);
    receiveLink = await this.ReceiveLinkManager.GetOrCreateAsync(timeoutHelper.RemainingTime()).ConfigureAwait(false);
}

If the receiver.CloseAsync() is called after entering the callback here >>

async Task<ReceivingAmqpLink> CreateLinkAsync(TimeSpan timeout)

It seems like it would be possible to end up with an orphaned receiver.
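One way such an orphan could be avoided is to re-check the closed flag once creation completes and close the freshly created link if the holder was closed in the meantime. The sketch below shows that idea on a hypothetical `GuardedSingleton`; `DemoLink` and `GuardedDemo` are also made-up names, and I am not claiming this is how Microsoft.Azure.Amqp handles it. It also ignores the case of several callers creating concurrently.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for an AMQP link; not a type from Microsoft.Azure.Amqp.
public sealed class DemoLink
{
    public bool IsClosed { get; private set; }
    public void Close() => IsClosed = true;
}

// Sketch of a holder that re-checks its state after creation (assumption, not library code).
public sealed class GuardedSingleton
{
    private readonly Func<Task<DemoLink>> factory;
    private readonly object gate = new object();
    private DemoLink value;
    private bool closed;

    public GuardedSingleton(Func<Task<DemoLink>> factory) => this.factory = factory;

    public async Task<DemoLink> GetOrCreateAsync()
    {
        lock (gate)
        {
            if (closed) throw new ObjectDisposedException(nameof(GuardedSingleton));
            if (value != null) return value;
        }

        DemoLink created = await factory().ConfigureAwait(false);

        lock (gate)
        {
            if (!closed)
            {
                value = created;
                return created;
            }
        }

        // Close() ran while creation was in flight: close the new link instead of leaking it.
        created.Close();
        throw new ObjectDisposedException(nameof(GuardedSingleton));
    }

    public void Close()
    {
        lock (gate)
        {
            closed = true;
            value?.Close();
        }
    }
}

public static class GuardedDemo
{
    // Returns true when the link created during the race ends up closed.
    public static async Task<bool> LinkIsClosedAfterRaceAsync()
    {
        DemoLink createdLink = null;
        var creationStarted = new TaskCompletionSource<bool>();
        var allowCreation = new TaskCompletionSource<bool>();

        var singleton = new GuardedSingleton(async () =>
        {
            creationStarted.SetResult(true);
            await allowCreation.Task.ConfigureAwait(false);
            createdLink = new DemoLink();
            return createdLink;
        });

        Task<DemoLink> pending = singleton.GetOrCreateAsync();
        await creationStarted.Task.ConfigureAwait(false);
        singleton.Close();             // close while creation is in flight
        allowCreation.SetResult(true); // creation completes afterwards

        try
        {
            await pending.ConfigureAwait(false);
        }
        catch (ObjectDisposedException)
        {
            // Expected: the holder was closed before creation finished.
        }

        return createdLink != null && createdLink.IsClosed;
    }
}
```

The caller sees an ObjectDisposedException rather than a live link, so nothing is left receiving in the background after the close.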

@JoshLove-msft
Member

/cc @xinchen10 to confirm the issue in the AMQP library.

@paulsavides
Author

Hello, after upgrading my reproduction projects to Microsoft.Azure.Amqp 2.4.9 I could no longer reproduce the issue. See https://github.com/Azure/azure-amqp/releases/tag/v2.4.9

Upgrading that library should be enough to close this issue I believe.
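For anyone who wants the same setup in their own project, this is roughly what the upgrade looks like in a csproj. It is a minimal sketch assuming an SDK-style project that already references Microsoft.Azure.ServiceBus 5.0.0; adding a direct reference to Microsoft.Azure.Amqp lifts the transitive dependency to the newer version.

```xml
<ItemGroup>
  <PackageReference Include="Microsoft.Azure.ServiceBus" Version="5.0.0" />
  <!-- Direct reference raises the transitively resolved Microsoft.Azure.Amqp version. -->
  <PackageReference Include="Microsoft.Azure.Amqp" Version="2.4.9" />
</ItemGroup>
```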

@axisc

axisc commented Dec 2, 2020

@paulsavides Thanks for confirming this. I'm closing the issue for now.

@axisc axisc closed this as completed Dec 2, 2020
@paulsavides
Author

@axisc as it stands, the Microsoft.Azure.ServiceBus package should still exhibit the issue. Updating the Microsoft.Azure.Amqp dependency to 2.4.9 in the azure-sdk repo would make sure it is fixed here as well.

@JoshLove-msft
Member

Updating the repo dependency to 2.4.9 in #17290. Whenever the Service Bus library is next released it will contain the updated dependency.

@paulsavides
Author

Thank you Josh!

@JoshLove-msft
Member

> Thank you Josh!

Thank you for finding and fixing this issue!

@DorothySun216
Contributor

@paulsavides we have released a new NuGet package, version 5.1.1, with the updated dependency: https://www.nuget.org/packages/Microsoft.Azure.ServiceBus/. Can you check whether the issue goes away?

@paulsavides
Author

Hello @DorothySun216,

I have updated my reproduction repo listed above to version 5.1.1 and was no longer able to reproduce the issue. Additionally, we have been directly using Microsoft.Azure.Amqp v2.4.9 in our services for around two months now and have not seen the issue reproduce.

Thank you, have a wonderful day!

@github-actions github-actions bot locked and limited conversation to collaborators Mar 28, 2023