-
Notifications
You must be signed in to change notification settings - Fork 136
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support scheduled actions and cancellation #419
Conversation
Support scheduled actions by adding a new queue that actions will be added to/removed from before they are sent to the dispatcher. The queue is a priority queue (ordered by start_time). fleet_gateway is responsible for syncing the queue to storage. Cancellation of an action will be handled by a new action dispatcher that will remove actions from the queue (if any) and update the targetID action status. TODO - cancel handler - action expiration - fleet_gateway tests
This pull request does not have a backport label. Could you fix it @michel-laterman? 🙏
NOTE: |
|
||
// FleetAction represents an action from fleet-server. | ||
// should copy the action definition in fleet-server/model/schema.json | ||
type FleetAction struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to separate what we receive from fleet as far as check ins go instead of using ActionApp
to better delineate concerns
errMsg = fmt.Sprintf("failed to persist action_queue, error: %s", err) | ||
f.log.Error(errMsg) | ||
f.statusReporter.Update(state.Failed, errMsg, nil) | ||
// TODO should we handle this failure differently? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to do anything differently if we are unable to save state here?
return nil | ||
} | ||
h.log.Info("Cancel action id: %s target id: %s removed %d action(s) from queue.", action.ActionID, action.TargetID, n) | ||
// TODO ack action.TargetID as failed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do we want to call when we have cancelled a queued action successfully?
bc04209
to
175973c
Compare
175973c
to
5ec423e
Compare
🌐 Coverage report
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@michel-laterman I answered a few inline question here, Can I do an early review in draft or should I wait?
/test |
This pull request is now in conflicts. Could you fix it? 🙏
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any blockers, but I left some suggestions.
} | ||
// Dispatch cancel actions | ||
if len(cancelActions) > 0 { | ||
if err := f.dispatcher.Dispatch(context.Background(), f.acker, cancelActions...); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Question | Suggestion]
Do we expect to have a context to pass to Dispatch
at some point? If so, I'd suggest to use context.TODO()
to indicate that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ensuring proper contexts is a larger issue then what I would like to do for this PR. I've made #464 to track it
assert.Equal(t, "testid", action.ActionID) | ||
assert.Equal(t, ActionTypePolicyChange, action.ActionType) | ||
assert.NotNil(t, action.Policy) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Suggestion]
Check the return of StartTime
and Expiration
to ensure the given start_time
and expiration
are ignored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
turns out actions will have expiration
times, however that is only used (by elastic agent) when searching ES for actions
This pull request is now in conflicts. Could you fix it? 🙏
|
Co-authored-by: Anderson Queiroz <me@andersonq.me>
i've ignored the linter in order to get it in for 8.3 |
What does this PR do?
Support scheduled actions by adding a new queue that actions will be
added to/removed from before they are sent to the dispatcher. The queue
is a priority queue (ordered by start_time). fleet_gateway is
responsible for syncing the queue to storage. Cancellation of an action
will be handled by a new action dispatcher that will remove actions from
the queue (if any) and update the targetID action status.
Why is it important?
Fleet-server should be able to inform agents of actions that are scheduled to start at a later time.
Currently the only supported action is an upgrade. This is to allow users to schedule upgrades during maintenance windows.
Checklist
CHANGELOG.next.asciidoc
orCHANGELOG-developer.next.asciidoc
.How to test this PR locally
Note: For testing I altered some logging statements locally to better surface queue interactions (and scheduling in fleet-server.
Setup
Running Elasticsearch, Kibana, and elastic-agent/fleet-server instance locally.
Create a role in Elasticsearch that allowed a user to insert actions into
.fleet-actions
:Queue Success
Using a user with the
fleet-access
role, index a new action for the elastic-agent:Validate
fleet-server
was propagating start-timeValidate elastic-agent was queuing it locally:
The checkin after
start_time
will execute a queued action:Canceling an action
Note: agent/fleet-server was reinstalled for this test
Created an upgrade request for a future time:
Ensure the agent queued it:
Ensure it is persisted in
state.yml
:Send a
CANCEL
action:Note that fleet-server queries for actions include an expiration range. All actions must have a valid expiration
Ensure the agent cancels the action:
Check
state.yml
, theaction_queue
attribute is no longer present.Action expiration
Create an upgrade action with expiration < start time, and expiration at a future date:
Validate that the action is queued:
Validate the action is cancelled after
start_time
:Related issues