
Rename to management.Manager, add UpdateStatus to Manager interface. #19114

Merged 4 commits into elastic:master on Jun 12, 2020

Conversation

blakerouse (Contributor) commented Jun 10, 2020

What does this PR do?

Renames management.ConfigManager to management.Manager, since it now does more than just manage configurations. Adds UpdateStatus to management.Manager to allow a beat to update its status.

When using FleetManager, the status is propagated over the elastic-agent-client to the controlling Agent.

Why is it important?

So status can be reported both to the controlling Agent and up to Fleet for observability of the beats that Agent is running.

Reporting a Failed status to Agent will result in the Agent stopping and restarting the beat.
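
For illustration, a minimal sketch of how a beat might report status through the new interface; the connect step is hypothetical, and the exact constants are assumed from the statuses discussed in the review below:

func reportOutputHealth(mgr management.Manager, connect func() error) {
    if err := connect(); err != nil {
        // Degraded keeps the beat running; the Agent does not restart it
        // (hypothetical example, not code from this PR).
        mgr.UpdateStatus(management.Degraded, "output unreachable: "+err.Error())
        return
    }
    mgr.UpdateStatus(management.Running, "output connected")
}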

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

  • [ ]

How to test this PR locally

Related issues

Use cases

Screenshots

Logs

@blakerouse requested a review from a team as a code owner on June 10, 2020 15:52
@botelastic added the needs_team label (indicates that the issue/PR needs a Team:* label) on Jun 10, 2020
@blakerouse added the Team:Ingest Management and Team:Services (Deprecated) labels on Jun 10, 2020
elasticmachine (Collaborator):

Pinging @elastic/ingest-management (Team:Ingest Management)

@botelastic removed the needs_team label on Jun 10, 2020
@blakerouse self-assigned this on Jun 10, 2020
elasticmachine (Collaborator):

Pinging @elastic/integrations-services (Team:Services)

elasticmachine (Collaborator) commented Jun 10, 2020

💚 Build Succeeded


Build stats

  • Build Cause: [Pull request #19114 updated]

  • Start Time: 2020-06-11T12:41:04.433+0000

  • Duration: 81 min 18 sec

Test stats 🧪

  • Failed: 0
  • Passed: 9437
  • Skipped: 1574
  • Total: 11011

Steps errors


  • Name: Report to Codecov
    • Description:

        curl -sSLo codecov https://codecov.io/bash
        for i in auditbeat filebeat heartbeat libbeat metricbeat packetbeat winlogbeat journalbeat
        do
          FILE="${i}/build/coverage/full.cov"
          if [ -f "${FILE}" ]; then
            bash codecov -f "${FILE}"
          fi
        done

    • Duration: 2 min 22 sec

    • Start Time: 2020-06-11T13:11:54.685+0000

@ph requested reviews from michalpristas and urso on June 10, 2020 17:09
func (*nilManager) Stop() {}
func (*nilManager) CheckRawConfig(cfg *common.Config) error { return nil }
func (n *nilManager) UpdateStatus(status Status, msg string) {
n.lock.Lock()
Contributor:

Do we need this here if this is the nil manager, i.e. a no-op manager?

Contributor Author:

I do it only for logging the status changes. I think it is good to show status updates in the logger as well, so it's clear what is happening in the beat, even if it's not controlled by Elastic Agent.
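
For reference, a sketch of what the no-op manager's UpdateStatus might look like under this reasoning; the logger, status, and msg fields are assumed:

func (n *nilManager) UpdateStatus(status Status, msg string) {
    n.lock.Lock()
    defer n.lock.Unlock()
    if n.status != status || n.msg != msg {
        // Track the last status purely so changes can be logged, even
        // when no Elastic Agent is attached.
        n.status = status
        n.msg = msg
        n.logger.Infof("Status change to %s: %s", status, msg)
    }
}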

func (cm *Manager) OnConfig(s string) {
cm.client.Status(proto.StateObserved_CONFIGURING, "Updating configuration")
cm.UpdateStatus(management.Configuring, "Updating configuration")
Contributor:

Should we switch back after the config is applied, or do we let the beat do that?

Contributor Author:

It does switch back to Running at the end of OnConfig.
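
The status flow through OnConfig described here is roughly the following (a simplified sketch; applyConfig stands in for the real config-block conversion and reload):

func (cm *Manager) OnConfig(s string) {
    cm.UpdateStatus(management.Configuring, "Updating configuration")

    if err := cm.applyConfig(s); err != nil { // applyConfig is a stand-in
        // A failure while applying configuration is reported as Failed.
        cm.UpdateStatus(management.Failed, err.Error())
        return
    }

    // Switch back to Running once the new configuration is applied.
    cm.UpdateStatus(management.Running, "Running")
}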

Contributor:

You're right, it was hidden so I overlooked it.

Reviewer:

If I read this right, OnConfig sets the status to 'Healthy' at the end. Are we sure this is the correct state to report? Do we 'forget' the internal state here?

How about removing the 'Configuring' status? The fewer states (and protocol behaviors) we have, the less we can get wrong.

Contributor Author:

I personally do not like the idea of setting the status in the OnConfig handler; I feel that should really only be done by the beat itself. So if OnConfig passes it to the reloader, then the reloader should really set the status.

At the moment I keep it this way so it works, and once status is being reported correctly by the beat, this should be removed (probably at the same time).

I do not think we should remove Configuring; I think it is a key state for the observability of the beat and for propagating that information to Fleet.

It is good to know that going Healthy -> Configuring -> Failed means everything was good, and then updating the configuration caused it to fail.

Reviewer:

> I personally do not like the idea of setting the status in the OnConfig handler; I feel that should really only be done by the beat itself. So if OnConfig passes it to the reloader, then the reloader should really set the status.
>
> At the moment I keep it this way so it works, and once status is being reported correctly by the beat, this should be removed (probably at the same time).

+1
Is this required change/cleanup documented in an issue?

@@ -285,3 +303,25 @@ func (cm *Manager) toConfigBlocks(cfg common.MapStr) (api.ConfigBlocks, error) {

return res, nil
}

func statusToProtoStatus(status management.Status) proto.StateObserved_Status {
Contributor:

Can we add a unit test for switching statuses back and forth with some no-op client?

Contributor Author:

At the moment I don't believe the client from elastic-agent-client is exposed as an interface, which would be needed to completely replace it with a no-op client.
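
The merged change did add tests for the pure mapping function (see the final commit message below). A hypothetical table-driven version, with the exact mapping assumed rather than taken from the merged code:

func TestStatusToProtoStatus(t *testing.T) {
    // Assumed mapping between management statuses and proto statuses.
    cases := map[management.Status]proto.StateObserved_Status{
        management.Configuring: proto.StateObserved_CONFIGURING,
        management.Running:     proto.StateObserved_HEALTHY,
        management.Degraded:    proto.StateObserved_DEGRADED,
        management.Failed:      proto.StateObserved_FAILED,
    }
    for in, want := range cases {
        if got := statusToProtoStatus(in); got != want {
            t.Errorf("statusToProtoStatus(%v) = %v, want %v", in, got, want)
        }
    }
}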

if cm.status != status || cm.msg != msg {
cm.status = status
cm.msg = msg
cm.lock.Unlock()
Contributor:

We should test this with concurrent requests. What I fear may occur with many concurrent status changes is that we end up with the internal state differing from the reported state. Do you see pros/cons of doing the reporting through a channel versus holding the lock for the full flow?

Reviewer:

We can also try to 'forget' outdated status updates if we get new ones while we are waiting. This could be implemented using sync.Cond.
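
A minimal sketch of that coalescing idea, assuming a simplified Status type and a caller-supplied send function; only the newest pending update is forwarded, and anything older is silently dropped:

package management

import "sync"

type Status int

// coalescingReporter forwards only the most recent status update; updates
// that arrive while a send is in flight overwrite the pending one.
// Note: no shutdown path; a real implementation would need one.
type coalescingReporter struct {
    mu      sync.Mutex
    cond    *sync.Cond
    status  Status
    msg     string
    pending bool
}

func newCoalescingReporter(send func(Status, string)) *coalescingReporter {
    r := &coalescingReporter{}
    r.cond = sync.NewCond(&r.mu)
    go func() {
        r.mu.Lock()
        for {
            for !r.pending {
                r.cond.Wait() // releases mu while waiting
            }
            status, msg := r.status, r.msg
            r.pending = false
            r.mu.Unlock()
            send(status, msg) // the slow call happens outside the lock
            r.mu.Lock()
        }
    }()
    return r
}

func (r *coalescingReporter) UpdateStatus(status Status, msg string) {
    r.mu.Lock()
    r.status, r.msg = status, msg
    r.pending = true
    r.mu.Unlock()
    r.cond.Signal()
}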

Contributor Author:

My only reason to track the state at this level is so it can be logged when it changes. We could remove the internal tracking and locking in the Fleet manager if we just log every time UpdateStatus is called; then the locking inside the client will handle reporting the correct state to the manager.

Or we could place the call to the client inside the lock to ensure the two are always consistent.

Contributor:

I think the client option is also OK. I think logging inconsistent states will cause us more trouble than help during SDH triaging.

Contributor Author:

I went with moving the logging and the client call into the lock.
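
Per the final commit ("wrap client.Status in lock" in the commit list below), the resulting shape is roughly the following; a sketch reconstructed from the snippets in this thread, not the literal merged code:

func (cm *Manager) UpdateStatus(status management.Status, msg string) {
    cm.lock.Lock()
    defer cm.lock.Unlock()
    if cm.status != status || cm.msg != msg {
        cm.status = status
        cm.msg = msg
        // Both the client call and the log line happen under the lock,
        // so reported state and internal state cannot diverge.
        cm.client.Status(statusToProtoStatus(status), msg)
        cm.logger.Infof("Status change to %s: %s", status, msg)
    }
}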

CheckRawConfig(cfg *common.Config) error

// UpdateStatus called when the status of the beat has changed.
UpdateStatus(status Status, msg string)
Reviewer:

How about moving the UpdateStatus method into its own interface? I'd like to push only 'status' reporting to the inputs/outputs and provide a set of helper functions that wrap a 'StatusReporter', merging status changes from multiple subsystems. But in the end we can also introduce the minimal interface in another package when we need it.

Contributor Author:

I can move it to its own interface. Would it also still be in the management.Manager interface?

type Manager interface {
    StatusReporter

    ...
}
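
The merged change did introduce this interface (see the final commit message below); a minimal sketch based on the UpdateStatus signature shown earlier:

type StatusReporter interface {
    // UpdateStatus called when the status of the beat has changed.
    UpdateStatus(status Status, msg string)
}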

cm.logger.Infof("Status change to %s: %s", status, msg)
return
}
cm.lock.Unlock()
Reviewer:

Why is the call to client.Status outside the lock? If we have a race between the unlock and the status being sent by two goroutines, how can you tell which one is the newer status update?

Contributor Author:

Yeah, I was commenting on that as well; I can move the client.Status call inside the lock. I think that would be best.

blakerouse (Contributor Author):

@michalpristas @urso This is ready for another review.

@ph requested a review from urso on June 12, 2020 12:15

cm.status = status
cm.msg = msg
cm.lock.Unlock()
cm.client.Status(statusToProtoStatus(status), msg)
michalpristas (Contributor) commented Jun 12, 2020:

Here we update the status on the client, which will result in some action either during check-in or in the watcher after some timeout. If somebody reports a failure, I think (correct me if my memory fails) we talked about restarting the application immediately; can you make sure this flow is valid?

Contributor Author:

This flow is valid from the side of the Elastic Agent: reporting Failed will cause the application to be killed and restarted. This is fully covered in the Elastic Agent unit tests, both in the case where the application returns Failed over the protocol and where the application just crashes.

return
}

blocks, err := cm.toConfigBlocks(configMap)
if err != nil {
err = errors.Wrap(err, "could not apply the configuration")
cm.logger.Error(err)
cm.client.Status(proto.StateObserved_FAILED, err.Error())
cm.UpdateStatus(management.Failed, err.Error())
urso commented Jun 12, 2020:

Given all these errors, I wonder what 'Failed' means. Is the beat really 'failed', or are we (the overall application with Agent) running in a degraded mode because not all configurations could be applied? Even if we fail here, the beat continues to publish events for existing inputs, right?

Are we mixing the status of the 'operation' with the beat health here? If the operation fails, are we required to restart the beat?

Contributor Author:

As stated in the list of statuses, there is Failed and Degraded:

    // Degraded is status describing application is degraded.
    Degraded
    // Failed is status describing application is failed. This status should
    // only be used in the case the beat should stop running as the failure
    // cannot be recovered.
    Failed

In the Degraded state the beat can keep sending events and the Agent will not stop the process. Failed means failed: I have completely failed, restart me.

Reviewer:

Yeah, I've seen these status types. Is this method run on startup only, or also while the beat is already active? In the latter case the beat will continue with the old state if there was an error decoding the configuration. Do we want to enforce a 'restart' if the config update operation failed? Will Failed trigger a restart by the Agent?

Are we mixing the status of the update 'operation' with the beat health here? If the operation fails, are we required to restart the beat?

I'm more or less wondering about the overall semantics of OnConfig and the individual errors we can get.

Contributor Author:

It will run both on startup and while the beat is running.

The beat doesn't need to worry about handling the restart on Failed status; the Agent will do that work for it, restarting it and bringing it back up.

michalpristas (Contributor) left a review:

I think it's good for a first iteration, and the exact form will crystallize once we have at least some health check implemented.

@blakerouse merged commit 05c9065 into elastic:master on Jun 12, 2020
@blakerouse deleted the libbeat-manager-status branch on June 12, 2020 17:22
blakerouse added a commit to blakerouse/beats that referenced this pull request Jun 12, 2020
…lastic#19114)

* Rename management.ConfigManager to management.Manager, add UpdateStatus to Manager interface.

* Update docstring for Failed status.

* Add to developer changelog.

* Add StatusReporter interface, wrap client.Status in lock. Add tests for statusToProtoStatus.

(cherry picked from commit 05c9065)
blakerouse added a commit that referenced this pull request Jun 12, 2020
…19114) (#19172)

* Rename management.ConfigManager to management.Manager, add UpdateStatus to Manager interface.

* Update docstring for Failed status.

* Add to developer changelog.

* Add StatusReporter interface, wrap client.Status in lock. Add tests for statusToProtoStatus.

(cherry picked from commit 05c9065)
v1v added a commit to v1v/beats that referenced this pull request Jun 15, 2020
…ngs-archive

* upstream/master:
  Fix minor spelling error in Jenkinsfile (elastic#19153)
  [CI] fail if not possible to install python3 (elastic#19164)
  [Elastic Agent] Improved mage demo experience (elastic#18445)
  [yum] Clear cached data and add retry loop (elastic#19182)
  fix lint job by updating NOTICE (elastic#19161)
  Fix tags for coredns/envoyproxy (elastic#19134)
  Disable host.* fields by default for CrowdStrike module (elastic#19132)
  Allow host.* fields to be disabled in Zeek module (elastic#19113)
  Rename to management.Manager, add UpdateStatus to Manager interface. (elastic#19114)
  Edit Elastic Agent docs (elastic#19146)
  [JJBB] create job definition for the golang-crossbuild project (elastic#19162)
  Fix incorrect usage of hints builder when exposed port is a substring of the hint (elastic#19052)
  Add basic cloudfoundry integration tests (elastic#19018)
melchiormoulin pushed a commit to melchiormoulin/beats that referenced this pull request Oct 14, 2020
…lastic#19114)

* Rename management.ConfigManager to management.Manager, add UpdateStatus to Manager interface.

* Update docstring for Failed status.

* Add to developer changelog.

* Add StatusReporter interface, wrap client.Status in lock. Add tests for statusToProtoStatus.
Labels: Team:Services (Deprecated), v7.9.0

Development: Successfully merging this pull request may close these issues: [Agent] Implement an HealthCheck function in every beats.

4 participants