Reporting fatal error during shutdown process (even from async goroutine) deadlocks #9824

Open
bogdandrutu opened this issue Mar 22, 2024 · 1 comment
Labels
area:componentstatus bug Something isn't working

Comments

@bogdandrutu
Member

The problem is that during shutdown otelcol no longer reads from the asyncErrorChannel, so any component that tries to write to that channel will deadlock.

func (r *receiver) Start(_ context.Context, _ component.Host) error {
	r.server = &http.Server{
		Addr:           ":8080",
		Handler:        myHandler,
	}
	go func() {
		if err := r.server.ListenAndServe(); err != nil {
			// This is a small bug since err == http.ErrServerClosed is expected.
			r.settings.ReportStatus(component.NewFatalErrorEvent(err))
		}
	}()
	return nil
}

func (r *receiver) Shutdown(context.Context) error {
	if r.server == nil {
		return nil
	}
	return r.server.Close()
}
bogdandrutu added the bug label Mar 22, 2024
MovieStoreGuy pushed a commit to open-telemetry/opentelemetry-collector-contrib that referenced this issue May 22, 2024
**Description:**
The Kafka receiver shuts down by canceling the context of a running
sub-goroutine. However, a small bug caused a fatal error to be reported
during shutdown when this expected condition was hit. Reporting that
fatal error during shutdown in turn triggered another bug,
open-telemetry/opentelemetry-collector#9824.

This fix means that shutdown won't be blocked in expected shutdown
conditions, but the `core` bug referenced above means shutdown will
still be blocked in unexpected error situations.

This fix is being taken from a comment made by @Dennis8274 on the issue.

**Link to tracking Issue:**
Fixes #30789

**Testing:**
Stepped through `TestTracesReceiverStart` in a debugger before the
change to see the fatal status being reported. It was no longer reported
after applying the fix. Manually tested running the collector with a
Kafka receiver and saw that before the fix it was indeed blocked on a
normal shutdown, but after the fix it shut down as expected.
@spiffyy99

Pointing out that this can also happen during startup or an update.

Basically, whenever the control loop isn't actively running, components have free rein to call ReportStatus, which will deadlock the calling goroutine.
