Dropping small amount of spans on master #2042

Closed
gouthamve opened this issue Jan 27, 2020 · 7 comments · Fixed by #2044

@gouthamve
Contributor

Requirement - what kind of business use case are you trying to solve?

Trying to run master and checking if everything works.

Problem - what in Jaeger blocks you from solving the requirement?

Dropping a small number of spans

Any open questions to address

Here's more info:
We are currently running Jaeger with about 10K spans/sec on a host, but we noticed that rate(jaeger_collector_spans_dropped_total[5m]) is ~1 after upgrading to master. That shouldn't happen, because we have a very large queue capacity (300K).

So I put in some debug logging:

diff --git a/pkg/queue/bounded_queue.go b/pkg/queue/bounded_queue.go
index 077e7af..bbbabb7 100644
--- a/pkg/queue/bounded_queue.go
+++ b/pkg/queue/bounded_queue.go
@@ -16,6 +16,7 @@
 package queue
 import (
+       "fmt"
        "sync"
        "sync/atomic"
        "time"
@@ -99,8 +100,11 @@ func (q *BoundedQueue) Produce(item interface{}) bool {
        // we might have two concurrent backing queues at the moment
        // their combined size is stored in q.size, and their combined capacity
        // should match the capacity of the new queue
-       if q.Size() >= q.Capacity() {
+       size := q.Size()
+       capacity := q.Capacity()
+       if size >= capacity {
                // note that all items will be dropped if the capacity is 0
+               fmt.Println("size is bigger than capacity! dropping span!", size, capacity)
                q.onDroppedItem(item)
                return false
        }
@@ -111,6 +115,7 @@ func (q *BoundedQueue) Produce(item interface{}) bool {
                return true
        default:
                // should not happen, as overflows should have been captured earlier
+               fmt.Println("dropping span on overflow!")
                if q.onDroppedItem != nil {
                        q.onDroppedItem(item)
                }

And saw this in the logs:

size is bigger than capacity! dropping span! 4294967295 300000
size is bigger than capacity! dropping span! 4294967295 300000
size is bigger than capacity! dropping span! 4294967295 300000
size is bigger than capacity! dropping span! 4294967295 300000
size is bigger than capacity! dropping span! 4294967295 300000
size is bigger than capacity! dropping span! 4294967295 300000
size is bigger than capacity! dropping span! 4294967295 300000
size is bigger than capacity! dropping span! 4294967295 300000
size is bigger than capacity! dropping span! 4294967295 300000

I see that the queue size handling recently moved to Uber's atomic package, and from int32 to uint32. I think the size used to occasionally dip to -1 and things still worked, but with an unsigned type that same dip now wraps around to 4294967295 (math.MaxUint32), which is larger than any capacity, so the span gets dropped.
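
For illustration, a minimal standalone snippet (not the collector code, just go.uber.org/atomic in isolation) reproduces exactly the 4294967295 from the logs when an unsigned counter is decremented below zero:

package main

import (
	"fmt"
	"math"

	"go.uber.org/atomic"
)

func main() {
	size := atomic.NewUint32(0)
	// Decrementing an unsigned counter that is already at zero does not
	// produce -1; it wraps around to the maximum uint32 value.
	size.Sub(1)
	fmt.Println(size.Load())                   // 4294967295
	fmt.Println(size.Load() == math.MaxUint32) // true
}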

@joe-elliott
Member

joe-elliott commented Jan 27, 2020

Is there a race condition here:

	case *q.items <- item:
		q.size.Add(1)

If the producing goroutine is swapped off the thread between these two lines, and another goroutine consumes the queue item in the meantime, the consumer could subtract from the size before the producer adds to it.

	case item, ok := <-queue:
		if ok {
			q.size.Sub(1)

Is this a correct analysis?
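
If it is, one way to close that window (a sketch only, using my own simplified types, and not necessarily identical to whatever the eventual fix does) is to reserve a slot by incrementing the size before the channel send and rolling the reservation back if the send fails:

package queue

import "go.uber.org/atomic"

// boundedQueue is a hypothetical, trimmed-down stand-in for the real
// pkg/queue.BoundedQueue, only here to illustrate the ordering change.
type boundedQueue struct {
	size          *atomic.Uint32
	capacity      uint32
	items         chan interface{}
	onDroppedItem func(item interface{})
}

// Produce increments size before the send, so a consumer can never observe
// (and subtract for) an item whose Add hasn't happened yet.
func (q *boundedQueue) Produce(item interface{}) bool {
	if q.size.Load() >= q.capacity {
		q.onDroppedItem(item)
		return false
	}
	q.size.Add(1) // reserve a slot before handing the item to the channel
	select {
	case q.items <- item:
		return true
	default:
		q.size.Sub(1) // channel full: roll the reservation back
		if q.onDroppedItem != nil {
			q.onDroppedItem(item)
		}
		return false
	}
}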

@jpkrohling
Contributor

I talked with @gouthamve on Gitter about this. We definitely need some protection so the size never goes negative, but he's going to do some debugging to try to find out why it goes negative in the first place.

@jpkrohling
Contributor

Is this a correct analysis?

@joe-elliott: thanks for the hint! It might be. Since you have both the code and an environment where this reproduces, would you be able to give it a try and validate the theory?

@joe-elliott
Member

@jpkrohling I submitted a fix that's working in our environment: #2044

Let me know if you'd rather approach it a different way.

@yurishkuro
Member

What is actually going negative, the queue size gauge? If so, (a) I wouldn't worry about it, or (b) just clamp negative values to zero where the gauge value is recorded. That gauge only matters when it becomes very large; minor fluctuations are irrelevant and mostly random due to the arbitrary reporting interval.
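
For option (b), a clamp at the point where the gauge is reported would be enough. A hypothetical sketch, assuming the jaeger-lib metrics.Gauge interface (Update(int64)); this is not the actual collector code:

package queue

import "github.com/uber/jaeger-lib/metrics"

// reportQueueLength clamps a possibly-negative reading to zero before
// updating the gauge, so transient underflows never show up in metrics.
func reportQueueLength(gauge metrics.Gauge, size int64) {
	if size < 0 {
		size = 0
	}
	gauge.Update(size)
}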

@jpkrohling
Contributor

Yes, it's the queue size counter (*uatomic.Uint32) that is going negative (strictly speaking, wrapping around, since it's unsigned), but I think this is a simpler bug rather than a concurrency data race. If @joe-elliott and @gouthamve have high confidence that the linked PR works in their highly concurrent environment, I think it is the right solution and we won't need a special safeguard against negative values.

@gouthamve
Contributor Author

This works at 10K spans/sec on a single collector :)
