Support scheduled actions and cancellation #419

Merged
merged 7 commits into from
May 24, 2022

@michel-laterman michel-laterman commented May 11, 2022

What does this PR do?

Support scheduled actions by adding a new queue that actions will be
added to/removed from before they are sent to the dispatcher. The queue
is a priority queue (ordered by start_time). fleet_gateway is
responsible for syncing the queue to storage. Cancellation of an action
will be handled by a new action dispatcher that will remove actions from
the queue (if any) and update the targetID action status.
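
As a rough illustration of the idea (not the code in this PR), a queue ordered by start_time can be built on Go's container/heap. The scheduledAction type, its fields, and exampleOrder below are hypothetical names chosen for the sketch; storage syncing is left out.

```go
package main

import (
	"container/heap"
	"fmt"
	"time"
)

// scheduledAction is a hypothetical stand-in for a queued fleet action.
type scheduledAction struct {
	ID        string
	StartTime time.Time
}

// actionHeap implements heap.Interface; the earliest StartTime is popped first.
type actionHeap []scheduledAction

func (h actionHeap) Len() int            { return len(h) }
func (h actionHeap) Less(i, j int) bool  { return h[i].StartTime.Before(h[j].StartTime) }
func (h actionHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *actionHeap) Push(x interface{}) { *h = append(*h, x.(scheduledAction)) }
func (h *actionHeap) Pop() interface{} {
	old := *h
	n := len(old)
	item := old[n-1]
	*h = old[:n-1]
	return item
}

// exampleOrder pushes three actions out of order and returns the IDs
// in the order the heap yields them.
func exampleOrder() []string {
	base := time.Date(2022, 5, 20, 16, 0, 0, 0, time.UTC)
	h := &actionHeap{}
	heap.Push(h, scheduledAction{ID: "late", StartTime: base.Add(2 * time.Hour)})
	heap.Push(h, scheduledAction{ID: "early", StartTime: base})
	heap.Push(h, scheduledAction{ID: "mid", StartTime: base.Add(time.Hour)})

	ids := make([]string, 0, h.Len())
	for h.Len() > 0 {
		ids = append(ids, heap.Pop(h).(scheduledAction).ID)
	}
	return ids
}

func main() {
	fmt.Println(exampleOrder())
}
```

A heap keeps the next action to start at the front without re-sorting the whole queue on every insert.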

Why is it important?

Fleet-server should be able to inform agents of actions that are scheduled to start at a later time.
Currently the only supported action is an upgrade. This is to allow users to schedule upgrades during maintenance windows.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

Note: For testing I altered some logging statements locally to better surface queue interactions (and scheduling in fleet-server).

Setup

Run Elasticsearch, Kibana, and an elastic-agent/fleet-server instance locally.

Create a role in Elasticsearch that allows a user to insert actions into .fleet-actions:

curl -XPOST "https://192.168.68.100:9200/_security/role/fleet-access" -H "kbn-xsrf: reporting" -H "Content-Type: application/json" -d'
{
  "indices": [
    {
      "names": [
        ".fleet*"
      ],
      "privileges": [
        "create_doc",
        "create",
        "delete",
        "index",
        "write",
        "all"
      ],
      "allow_restricted_indices": true
    }
  ]
}'

Queue Success

Using a user with the fleet-access role, index a new action for the elastic-agent:

curl -k -XPOST "https://localhost:9200/.fleet-actions/_doc" -H "X-elastic-product-origin: fleet" -H "kbn-xsrf: reporting" -H "Content-Type: application/json" -d'
{
  "agents": ["1d729367-9215-4af4-9547-339b3c1dde55"],
  "type": "UPGRADE",
  "expiration": "2022-05-20T16:55:00.000Z",
  "start_time": "2022-05-20T16:50:00.000Z",
  "data": {
    "version": "8.3.0-SNAPSHOT"
  },
  "action_id": "test-up"
}'
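
For reference, here is a minimal sketch of how a document shaped like the one above could be decoded in Go. The actionDoc struct and parseAction helper are hypothetical and only mirror the fields in this request; they are not the FleetAction type from this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// actionDoc is a hypothetical struct mirroring the fields of the request body.
type actionDoc struct {
	Agents     []string        `json:"agents"`
	Type       string          `json:"type"`
	Expiration time.Time       `json:"expiration"`
	StartTime  time.Time       `json:"start_time"`
	Data       json.RawMessage `json:"data"`
	ActionID   string          `json:"action_id"`
}

// sample is the request body used in the "Queue Success" test.
const sample = `{
  "agents": ["1d729367-9215-4af4-9547-339b3c1dde55"],
  "type": "UPGRADE",
  "expiration": "2022-05-20T16:55:00.000Z",
  "start_time": "2022-05-20T16:50:00.000Z",
  "data": {"version": "8.3.0-SNAPSHOT"},
  "action_id": "test-up"
}`

// parseAction decodes a raw JSON action document.
// A missing start_time leaves StartTime as the zero time.
func parseAction(raw string) (actionDoc, error) {
	var doc actionDoc
	err := json.Unmarshal([]byte(raw), &doc)
	return doc, err
}

func main() {
	doc, err := parseAction(sample)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(doc.ActionID, "scheduled:", !doc.StartTime.IsZero(),
		"window:", doc.Expiration.Sub(doc.StartTime))
}
```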

Validate that fleet-server propagates start_time:

{"log.level":"info","ecs.version":"1.6.0","service.name":"fleet-server","@timestamp":"2022-05-20T16:46:32.938Z","message":"QUEUE offset  start time 2022-05-20T16:50:00Z"}
{"log.level":"info","ecs.version":"1.6.0","service.name":"fleet-server","fleet.agent.id":"1d729367-9215-4af4-9547-339b3c1dde55","http.request.id":"01G3H5PSERQ1V1VTT3EZ1Q20EY","fleet.access.apikey.id":"IYdN4oABBxHoCwmrIomQ","ackToken":"SYdd4oABBxHoCwmrEpbU","createdAt":"","id":"test-up","type":"UPGRADE","inputType":"","timeout":0,"@timestamp":"2022-05-20T16:46:32.938Z","message":"Action delivered to agent on checkin"}

Validate that elastic-agent queues it locally:

{"log.level":"info","@timestamp":"2022-05-20T16:46:32.979Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":233},"message":"QUEUE Adding action id: test-up to queue.","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-05-20T16:46:32.979Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":192},"message":"QUEUE Gathered 0 actions from queue, 0 actions expired","ecs.version":"1.6.0"}

The checkin after start_time will execute a queued action:

{"log.level":"info","@timestamp":"2022-05-20T16:49:20.820Z","log.logger":"composable.providers.docker","log.origin":{"file.name":"docker/watcher.go","file.line":309},"message":"No events received within 10m0s, restarting watch call","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-05-20T16:51:18.742Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":192},"message":"QUEUE Gathered 1 actions from queue, 0 actions expired","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-05-20T16:51:18.782Z","log.origin":{"file.name":"log/reporter.go","file.line":40},"message":"2022-05-20T09:51:18-07:00 - message: Application: [1d729367-9215-4af4-9547-339b3c1dde55]: State changed to UPDATING: Update to version '8.3.0-SNAPSHOT' started - type: 'STATE' - sub_type: 'UPDATING'","ecs.version":"1.6.0"}

Canceling an action

Note: agent/fleet-server was reinstalled for this test.
Create an upgrade request for a future time:

curl -k -XPOST "https://localhost:9200/.fleet-actions/_doc" -H "X-elastic-product-origin: fleet" -H "kbn-xsrf: reporting" -H "Content-Type: application/json" -d'
{
  "agents": ["bed233e3-a447-4005-aa1f-950b9aae1ec6"],
  "type": "UPGRADE",
  "expiration": "2022-05-20T18:50:00.000Z",
  "start_time": "2022-05-20T18:40:00.000Z",
  "data": {
    "version": "8.3.0-SNAPSHOT"
  },
  "action_id": "test-up-c"
}'

Ensure the agent queued it:

{"log.level":"info","@timestamp":"2022-05-20T18:03:38.517Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":233},"message":"QUEUE Adding action id: test-up-c to queue.","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-05-20T18:03:38.517Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":192},"message":"QUEUE Gathered 0 actions from queue, 0 actions expired","ecs.version":"1.6.0"}

Ensure it is persisted in state.yml:

...
action_queue:
- action_id: test-up-c
  type: UPGRADE
  start_time: "2022-05-20T18:40:00Z"
  expiration: "2022-05-20T18:50:00.000Z"
  version: 8.3.0-SNAPSHOT

Send a CANCEL action:

curl -k -XPOST "https://localhost:9200/.fleet-actions/_doc" -H "X-elastic-product-origin: fleet" -H "kbn-xsrf: reporting" -H "Content-Type: application/json" -d'
{
  "agents": ["bed233e3-a447-4005-aa1f-950b9aae1ec6"],
  "type": "CANCEL",
  "expiration": "2022-05-20T18:50:00.000Z",
  "data": {
    "target_id": "test-up-c"
  },
  "action_id": "test-up-cancel-exp"
}'

Note that fleet-server queries for actions include an expiration range, so all actions must have a valid expiration.
Ensure the agent cancels the action:

{"log.level":"info","@timestamp":"2022-05-20T18:05:53.131Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":252},"message":"QUEUE Action type CANCEL","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-05-20T18:05:53.131Z","log.origin":{"file.name":"handlers/handler_action_cancel.go","file.line":45},"message":"Cancel action id: %s target id: %s removed %d action(s) from queue.test-up-cancel-exptest-up-c1","ecs.version":"1.6.0"}
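
The cancel message above still contains the raw %s and %d verbs, followed by the arguments run together. That is the shape fmt.Sprint-style joining produces, which suggests (as an assumption, not verified against the logger) that the format string went to a non-formatting log method. A small sketch of the difference, using hypothetical helper names:

```go
package main

import "fmt"

// cancelFormat is the message template seen in the log line above.
const cancelFormat = "Cancel action id: %s target id: %s removed %d action(s) from queue."

// joinLikeSprint mimics a logger method that concatenates its arguments
// the way fmt.Sprint does, without expanding format verbs.
func joinLikeSprint(format, actionID, targetID string, removed int) string {
	args := []interface{}{format, actionID, targetID, removed}
	return fmt.Sprint(args...)
}

// expandLikeSprintf mimics a logger method that expands format verbs
// the way fmt.Sprintf does.
func expandLikeSprintf(format, actionID, targetID string, removed int) string {
	return fmt.Sprintf(format, actionID, targetID, removed)
}

func main() {
	fmt.Println(joinLikeSprint(cancelFormat, "test-up-cancel-exp", "test-up-c", 1))
	fmt.Println(expandLikeSprintf(cancelFormat, "test-up-cancel-exp", "test-up-c", 1))
}
```

fmt.Sprint only inserts a space between two operands when neither is a string, which is why the IDs and the count end up glued to the end of the template.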

Check state.yml, the action_queue attribute is no longer present.
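
Conceptually, the cancel step removes every queued action whose ID matches target_id and reports how many were removed. A simplified sketch over a plain slice (the queuedAction type and cancelByID helper are hypothetical, and the real queue is a priority queue rather than a slice):

```go
package main

import "fmt"

// queuedAction is a hypothetical, trimmed-down queued action.
type queuedAction struct {
	ID string
}

// cancelByID returns the queue without any action whose ID equals targetID,
// plus the number of actions that were removed.
func cancelByID(queue []queuedAction, targetID string) ([]queuedAction, int) {
	kept := make([]queuedAction, 0, len(queue))
	for _, a := range queue {
		if a.ID != targetID {
			kept = append(kept, a)
		}
	}
	return kept, len(queue) - len(kept)
}

func main() {
	queue := []queuedAction{{ID: "test-up-c"}, {ID: "other"}}
	queue, removed := cancelByID(queue, "test-up-c")
	fmt.Printf("removed %d action(s), %d left\n", removed, len(queue))
}
```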

Action expiration

Create an upgrade action whose expiration is earlier than its start_time, with the expiration still in the future:

curl -k -XPOST "https://localhost:9200/.fleet-actions/_doc" -H "X-elastic-product-origin: fleet" -H "kbn-xsrf: reporting" -H "Content-Type: application/json" -u user:changeme -d'
{
  "agents": ["bed233e3-a447-4005-aa1f-950b9aae1ec6"],
  "type": "UPGRADE",
  "expiration": "2022-05-20T18:30:00.000Z",
  "start_time": "2022-05-20T18:40:00.000Z",
  "data": {
    "version": "8.3.0-SNAPSHOT"
  },
  "action_id": "test-up-exp"
}'

Validate that the action is queued:

{"log.level":"info","@timestamp":"2022-05-20T18:22:33.054Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":233},"message":"QUEUE Adding action id: test-up-exp to queue.","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-05-20T18:22:33.054Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":192},"message":"QUEUE Gathered 0 actions from queue, 0 actions expired","ecs.version":"1.6.0"}

Validate that the action is reported as expired (not executed) on the first checkin after start_time:

{"log.level":"info","@timestamp":"2022-05-20T18:41:50.337Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":192},"message":"QUEUE Gathered 0 actions from queue, 1 actions expired","ecs.version":"1.6.0"}
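
The two log lines in this test are consistent with a gather step that only looks at actions whose start_time has passed and then drops those already past their expiration. The sketch below is one way to express that under this assumption; queuedAction, gather, and replay are hypothetical names, and the real implementation may differ.

```go
package main

import (
	"fmt"
	"time"
)

// queuedAction is a hypothetical queued action with its scheduling window.
type queuedAction struct {
	ID         string
	StartTime  time.Time
	Expiration time.Time
}

// gather splits the queue at now: actions that have not started stay queued,
// started actions past their expiration are counted as expired,
// and the rest are returned as due.
func gather(queue []queuedAction, now time.Time) (due, remaining []queuedAction, expired int) {
	for _, a := range queue {
		switch {
		case a.StartTime.After(now):
			remaining = append(remaining, a)
		case !a.Expiration.IsZero() && a.Expiration.Before(now):
			expired++
		default:
			due = append(due, a)
		}
	}
	return due, remaining, expired
}

// replay runs gather at the two checkin timestamps from the logs above.
func replay() (dueEarly, expiredEarly, dueLate, expiredLate int) {
	start := time.Date(2022, 5, 20, 18, 40, 0, 0, time.UTC)
	q := []queuedAction{{ID: "test-up-exp", StartTime: start, Expiration: start.Add(-10 * time.Minute)}}

	d, q, e := gather(q, time.Date(2022, 5, 20, 18, 22, 33, 0, time.UTC))
	dueEarly, expiredEarly = len(d), e

	d, _, e = gather(q, time.Date(2022, 5, 20, 18, 41, 50, 0, time.UTC))
	dueLate, expiredLate = len(d), e
	return dueEarly, expiredEarly, dueLate, expiredLate
}

func main() {
	fmt.Println(replay())
}
```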

Related issues


TODO
- cancel handler
- action expiration
- fleet_gateway tests
@michel-laterman michel-laterman added enhancement New feature or request Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels May 11, 2022

mergify bot commented May 11, 2022

This pull request does not have a backport label. Could you fix it @michel-laterman? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v./d./d./d is the label to automatically backport to the 8./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.


elasticmachine commented May 11, 2022

💚 Build Succeeded


Build stats

  • Start Time: 2022-05-24T02:24:45.639+0000

  • Duration: 16 min 45 sec

Test stats 🧪

Test Results

  • Failed: 0

  • Passed: 5973

  • Skipped: 23

  • Total: 5996

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages.

  • run integration tests : Run the Elastic Agent Integration tests.

  • run end-to-end tests : Generate the packages and run the E2E Tests.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)


// FleetAction represents an action from fleet-server.
// should copy the action definition in fleet-server/model/schema.json
type FleetAction struct {
Contributor Author

I wanted a separate type for what we receive from fleet on check-in, instead of reusing ActionApp, to better delineate concerns.

internal/pkg/fleetapi/action.go Outdated
errMsg = fmt.Sprintf("failed to persist action_queue, error: %s", err)
f.log.Error(errMsg)
f.statusReporter.Update(state.Failed, errMsg, nil)
// TODO should we handle this failure differently?
Contributor Author

Do we want to do anything differently if we are unable to save state here?

return nil
}
h.log.Infof("Cancel action id: %s target id: %s removed %d action(s) from queue.", action.ActionID, action.TargetID, n)
// TODO ack action.TargetID as failed
Contributor Author

What do we want to call when we have cancelled a queued action successfully?


elasticmachine commented May 11, 2022

🌐 Coverage report

Name Metrics % (covered/total) Diff
Packages 97.222% (70/72) 👍 0.121
Files 68.919% (153/222) 👍 0.227
Classes 68.694% (305/444) 👍 0.089
Methods 52.752% (853/1617) 👍 0.346
Lines 38.709% (9042/23359) 👍 0.03
Conditionals 100.0% (0/0) 💚

@ph ph left a comment

@michel-laterman I answered a few inline questions here. Can I do an early review while this is in draft, or should I wait?

@michel-laterman

/test


mergify bot commented May 20, 2022

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b queue-actions upstream/queue-actions
git merge upstream/main
git push upstream queue-actions

@AndersonQ AndersonQ left a comment

I don't see any blockers, but I left some suggestions.

}
// Dispatch cancel actions
if len(cancelActions) > 0 {
if err := f.dispatcher.Dispatch(context.Background(), f.acker, cancelActions...); err != nil {
Member

[Question | Suggestion]
Do we expect to have a context to pass to Dispatch at some point? If so, I'd suggest to use context.TODO() to indicate that.

Contributor Author

Ensuring proper contexts is a larger issue than what I would like to take on in this PR. I've made #464 to track it.
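
For readers following this thread: context.TODO() and context.Background() both return empty contexts that are never cancelled, so swapping one for the other only changes what the code signals to readers. A minimal sketch, where dispatch is a hypothetical stand-in and not the dispatcher from this PR:

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// dispatch is a hypothetical stand-in for a dispatcher call; it only checks
// whether the context has already been cancelled before doing any work.
func dispatch(ctx context.Context, actionIDs ...string) error {
	if err := ctx.Err(); err != nil {
		return fmt.Errorf("dispatch aborted: %w", err)
	}
	return nil
}

// compare reports whether dispatch succeeds with a placeholder context and
// whether it surfaces context.Canceled once a real context is cancelled.
func compare() (todoOK bool, cancelledSeen bool) {
	todoOK = dispatch(context.TODO(), "test-up-cancel-exp") == nil

	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	cancelledSeen = errors.Is(dispatch(ctx, "test-up-cancel-exp"), context.Canceled)
	return todoOK, cancelledSeen
}

func main() {
	fmt.Println(compare())
}
```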

Comment on lines +133 to +135
assert.Equal(t, "testid", action.ActionID)
assert.Equal(t, ActionTypePolicyChange, action.ActionType)
assert.NotNil(t, action.Policy)
Member

[Suggestion]
Check the return of StartTime and Expiration to ensure the given start_time and expiration are ignored.

Contributor Author

Turns out actions will have expiration times; however, that is only used (by elastic agent) when searching ES for actions.

internal/pkg/queue/actionqueue.go Outdated
internal/pkg/queue/actionqueue.go Outdated
internal/pkg/queue/actionqueue.go Outdated
internal/pkg/queue/actionqueue.go Outdated

mergify bot commented May 24, 2022

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b queue-actions upstream/queue-actions
git merge upstream/main
git push upstream queue-actions

Co-authored-by: Anderson Queiroz <me@andersonq.me>
@michel-laterman michel-laterman merged commit ffe77e8 into elastic:main May 24, 2022
@michel-laterman

I've ignored the linter in order to get this in for 8.3.

Labels
backport-skip enhancement New feature or request Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

support scheduled actions
5 participants