Improve logging at Arrow stream shutdown; avoid the explicit Canceled message at stream lifetime #170
Conversation
	return str.anyStreamClient, nil
	}
}

// healthyTestChannel accepts the connection and returns an OK status immediately.
type healthyTestChannel struct {
	lock sync.Mutex
This lock is needed to avoid a race because CloseSend() now closes the sent channel, which some former tests were doing manually. Now CloseSend() is always called and the closed channel serves to assist with some tests.
	case <-e.shutdown:
		err = context.Canceled
The test flake was related to this code here. Shutdown does not cancel the request context, so we needed another signal to cleanly exit the exporter.
	// This (client == nil) signals the controller to
	// downgrade when all streams have returned in that
	// status.
	//
	// This is a special case because we reset s.client,
	// which sets up a downgrade after the streams return.
	s.client = nil
	s.telemetry.Logger.Info("arrow is not supported",
		zap.String("message", status.Message()),
	)
Note this is a special case carried over from the old code. The rest of the rewrite of this function body is a major simplification.
	defer func() {
		s.client.CloseSend()
	}()
This is as much about cleaning up tests as it is about being consistent when the stream is shutting down. Always closing the send channel appears to be working and it's simpler.
		return status.Error(codes.Canceled, "client stream shutdown")
	} else if errors.Is(err, context.Canceled) {
		return status.Error(codes.Canceled, "server stream shutdown")
This is the big change. Instead of returning a plain error, we return a status.Error() carrying the Canceled code, so the exporter now sees that cancellation in its reader and shuts down cleanly.
My manual test consisted of three collectors with two arrow exporter/receiver pairs connecting an ordinary receiver and a debug exporter. There are no log messages at Info level despite stream recycling, and shutdown happens cleanly when the next hop is functioning.
collector/exporter/otelarrowexporter/internal/arrow/exporter.go
LGTM
…t.go Co-authored-by: Laurent Quérel <laurent.querel@gmail.com>
Fixes #160.
Fixes a test-flake discovered while preparing open-telemetry/opentelemetry-collector-contrib#31996
For the test flake, the test is rewritten. The old test relied on the shutdown behavior of a single stream, whereas its intent was to test the seamlessness of stream restart. The new test covers the intended condition better: it runs a stream with a short lifetime, runs the test long enough to span 5 stream lifetimes, then checks that no logs were observed.
I carried out manual testing of the shutdown process to help with #160. The problem was that the exporter receives an error from both its writer and its reader goroutines, and it logs both. There is new, consistent handling of the EOF and Canceled error states, which are very similar: EOF is what happens when the server responds to the client's CloseSend(), while Canceled is generated internally by gRPC.
Both exporter and receiver have similar error-logging logic now. Both consider EOF and Canceled the same condition, and will Debug-log a "shutdown" message when this happens. All other error conditions are logged as errors.
This removes the former use of the protocol's StatusCode called Canceled; it is no longer necessary after switching the code to use a Status-wrapped gRPC Canceled code instead. If receivers are updated ahead of exporters, good. If exporters are upgraded ahead of receivers, they will log spurious errors at stream shutdown (but, as #160 shows, they were already issuing spurious logs). The code remains defined in the protocol, as it is mapped to the gRPC code space, and remains handled by the code as an error condition.