[cmd/opampsupervisor] Supervisor fails healthcheck with bootstrap config #31897

Closed
BinaryFissionGames opened this issue Mar 21, 2024 · 6 comments
Labels
bug Something isn't working cmd/opampsupervisor

Comments

@BinaryFissionGames
Contributor

BinaryFissionGames commented Mar 21, 2024

Component(s)

cmd/opampsupervisor

What happened?

Description

When running with the bootstrap config, the agent has no health check extension configured. After bootstrapping, the collector keeps using the bootstrap config, so the supervisor's health check fails on every health check interval.

Steps to Reproduce

Start the supervisor and wait for a health check to occur.

Expected Result

Agent is considered healthy.

Actual Result

2024-03-20T16:18:56.705-0400    ERROR   supervisor/supervisor.go:673    Agent is not healthy    {"error": "Get \"http://localhost:57806\": dial tcp [::1]:57806: connect: connection refused"}
github.com/open-telemetry/opentelemetry-collector-contrib/cmd/opampsupervisor/supervisor.(*Supervisor).healthCheck
        /Users/brandonjohnson/git_repos/opentelemetry-collector-contrib/cmd/opampsupervisor/supervisor/supervisor.go:673
github.com/open-telemetry/opentelemetry-collector-contrib/cmd/opampsupervisor/supervisor.(*Supervisor).runAgentProcess
        /Users/brandonjohnson/git_repos/opentelemetry-collector-contrib/cmd/opampsupervisor/supervisor/supervisor.go:733
github.com/open-telemetry/opentelemetry-collector-contrib/cmd/opampsupervisor/supervisor.NewSupervisor.func1
        /Users/brandonjohnson/git_repos/opentelemetry-collector-contrib/cmd/opampsupervisor/supervisor/supervisor.go:169

Collector version

5bf424d

Environment information

No response

OpenTelemetry Collector configuration

No response

Log output

No response

Additional context

No response

@BinaryFissionGames BinaryFissionGames added bug Something isn't working needs triage New item requiring triage labels Mar 21, 2024
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@Frapschen
Contributor

It appears that there is a network error. Can you confirm whether your agent is exposing port 57806?

@BinaryFissionGames
Contributor Author

BinaryFissionGames commented Mar 22, 2024

Well, that's exactly it. The agent does not expose that port, but it should, since the config is generated and run by the supervisor.

receivers:
  otlp:
    protocols:
      http:
        endpoint: "localhost:58262"
exporters:
  debug:
    verbosity: basic

extensions:
  opamp:
    instance_uid: "01HSERAG6T06AFVGQT5ZYC0GEK"
    server:
      ws:
        endpoint: "ws://localhost:58263/v1/opamp"
        tls:
          insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug]
  extensions: [opamp]

I think the "no-op" config/bootstrap config that the collector runs just needs the health extension config merged into it.
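As a rough sketch of what such a merged bootstrap config might look like, the config above could gain a `health_check` extension entry. This assumes the collector binary includes the `health_check` extension and that the supervisor probes it over HTTP. The endpoint shown is only a placeholder (13133 is the extension's default port); in practice it would have to match whichever port the supervisor actually checks.

```yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: "localhost:58262"
exporters:
  debug:
    verbosity: basic

extensions:
  # Assumption: health_check is the extension the supervisor probes.
  # The port below is a placeholder, not the value the supervisor generates.
  health_check:
    endpoint: "localhost:13133"
  opamp:
    instance_uid: "01HSERAG6T06AFVGQT5ZYC0GEK"
    server:
      ws:
        endpoint: "ws://localhost:58263/v1/opamp"
        tls:
          insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug]
  # The extension must also be listed here, or the collector will not start it.
  extensions: [health_check, opamp]
```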

I feel like part of the problem is that the bootstrap config overwrites the collector config (they use the same file), so this is always the initial config for the collector.

Maybe that should be its own separate issue, though.

@atoulme atoulme removed the needs triage New item requiring triage label Mar 30, 2024
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.


@github-actions github-actions bot added the Stale label May 30, 2024
@cforce

cforce commented Jul 2, 2024

It is a bit frustrating that the supervisor generates an effective.yaml before it connects to the OpAMP backend, after it connects, or both, at a point when no config has been assigned yet. This embedded configuration lacks at least a proper health check, so the supervisor does not recognize that the collector is up and running properly. As a result, it causes continuous error logs or even restarts of the collector. In offline mode (when the supervisor is not connected to any OpAMP server), the supervisor should come with a basic working collector configuration.

I tried to work around this by pointing the supervisor's YAML file at a collector configuration, with a correct path, that merged in the missing health check, but this does not work either.
My expectation is that if the configuration path is not set, the supervisor should at least create an effective.yaml that meets its own requirements, based on a static embedded default config.
And, as requested as an (optional) feature, the collector should not be started at all until there is either a local config to merge or a config update from the OpAMP backend that overwrites the default (no-op) config. In that case the flood of error logs would not be triggered either; see my feature request #33680.

@github-actions github-actions bot removed the Stale label Jul 3, 2024
@BinaryFissionGames
Contributor Author

I believe this ended up being fixed in #34159.


4 participants