Improve logging at Arrow stream shutdown; avoid the explicit Canceled message at stream lifetime #170
Conversation
	return str.anyStreamClient, nil
	}
}

// healthyTestChannel accepts the connection and returns an OK status immediately.
type healthyTestChannel struct {
	lock sync.Mutex
This lock is needed to avoid a race because CloseSend() now closes the sent channel, which some former tests were doing manually. Now CloseSend() is always called and the closed channel serves to assist with some tests.
	case <-e.shutdown:
		err = context.Canceled
The test flake was related to this code here. Shutdown does not cancel the request context, so we needed another signal to cleanly exit the exporter.
	// This (client == nil) signals the controller to
	// downgrade when all streams have returned in that
	// status.
	//
	// This is a special case because we reset s.client,
	// which sets up a downgrade after the streams return.
	s.client = nil
	s.telemetry.Logger.Info("arrow is not supported",
		zap.String("message", status.Message()),
	)
Note this is a special case carried over from the old code. The rest of the rewrite of this function body is a major simplification.
	defer func() {
		s.client.CloseSend()
	}()
This is as much about cleaning up tests as it is about being consistent when the stream is shutting down. Always closing the send channel appears to be working and it's simpler.
		return status.Error(codes.Canceled, "client stream shutdown")
	} else if errors.Is(err, context.Canceled) {
		return status.Error(codes.Canceled, "server stream shutdown")
This is the big change. Instead of returning a plain error, we return a status.Error() carrying the Canceled code, so the exporter now sees that cancellation in its reader and shuts down cleanly.
My manual test consisted of three collectors with two arrow exporter/receiver pairs connecting an ordinary receiver and a debug exporter. There are no log messages at Info level despite stream recycling, and shutdown happens cleanly when the next hop is functioning.
collector/exporter/otelarrowexporter/internal/arrow/exporter.go
LGTM
…t.go Co-authored-by: Laurent Quérel <laurent.querel@gmail.com>
Fixes #160.
Fixes a test-flake discovered while preparing open-telemetry/opentelemetry-collector-contrib#31996
For the test flake, the test is rewritten. The old test relied on the shutdown behavior of a single stream, whereas its intent was to test the seamlessness of stream restart. The new test covers the intended condition better: it runs a stream with a short lifetime, runs the test long enough to span 5 stream lifetimes, then checks that no logs were observed.
I carried out manual testing of the shutdown process to help with #160. The problem was that the exporter receives an error from both its writer and its reader goroutines, and it logs both. There is new, consistent handling of the EOF and Canceled error states, which are very similar: EOF is what happens when the server responds to the client's CloseSend(), while Canceled is generated internally by gRPC.
Both exporter and receiver have similar error-logging logic now. Both consider EOF and Canceled the same condition, and will Debug-log a "shutdown" message when this happens. All other error conditions are logged as errors.
This removes the former use of the protocol's StatusCode called Canceled; it is no longer necessary after switching the code to use a Status-wrapped gRPC Canceled code instead. If receivers are updated ahead of exporters, good. If exporters are upgraded ahead of receivers, they will log spurious errors at stream shutdown (but, as #160 shows, they were already issuing spurious logs). The code remains defined in the protocol, as it is mapped to the gRPC code space, and remains handled by the code as an error condition.