Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Alerting] Explanation and approach to migrating and removing "siem-detection-engine-rule-actions" saved object #112327

Closed
FrankHassanabad opened this issue Sep 15, 2021 · 9 comments
Assignees
Labels
discuss Team:Detections and Resp Security Detection Response Team

Comments

@FrankHassanabad
Copy link
Contributor

FrankHassanabad commented Sep 15, 2021

We removed the legacy notification system during part 1 here:
#109722

Which included the code for the alerting side car saved object:

siem-detection-engine-rule-actions

You can see these saved object type through the query:

GET .kibana/_search
{
  "query": {
    "term": {
      "type": {
        "value": "siem-detection-engine-rule-actions"
      }
    }
  }
}

The contents of this saved object "side car" is near duplication of actions that should now be directly on the rules/alerts themselves and should no longer exist as a separate saved object:

{
  "siem-detection-engine-rule-actions" : {
     "ruleAlertId" : "fb1046a0-0452-11ec-9b15-d13d79d162f3", <--- Reference to SO id of rule "fb1046a0-0452-11ec-9b15-d13d79d162f3"
     "actions" : [
       {
         "action_type_id" : ".slack",
         "id" : "f6e64c00-0452-11ec-9b15-d13d79d162f3", <--- Reference to SO id of action "f6e64c00-0452-11ec-9b15-d13d79d162f3"
         "params" : {
           "message" : "Hourly\nRule {{context.rule.name}} generated {{state.signals_count}} alerts"
         },
         "group" : "default"
       }
     ],
     "ruleThrottle" : "1h",
     "alertThrottle" : "1h"
   },
   "type" : "siem-detection-engine-rule-actions",
   "references" : [ ],
   "migrationVersion" : {
     "siem-detection-engine-rule-actions" : "7.11.2"
   },
   "coreMigrationVersion" : "7.14.0",
   "updated_at" : "2021-08-26T21:56:30.233Z"
  }
}

As you can see we have two other saved object id's within our object which are:

"siem-detection-engine-rule-actions.ruleAlertId" : "fb1046a0-0452-11ec-9b15-d13d79d162f3"
"siem-detection-engine-rule-actions.actions.id" : "f6e64c00-0452-11ec-9b15-d13d79d162f3"

These are both saved object id's that are references to two different SO types of action and alert you can query against like so:

GET .kibana/_doc/action:21f1c6e0-09d5-11ec-908c-57a10fa3ee91

And then:

GET .kibana/_doc/alert:fb1046a0-0452-11ec-9b15-d13d79d162f3

Note: These do not have proper reference id's within them. This doesn't mean we can't add a saved object migration to move them before 7.16.0 or in conjunction with our plans which we might do depending on how we decide to migrate these. However, we have decided to do this migration before 7.16.0 in a small PR to eliminate this question or any issues around the SO id's being re-generated.

Second note is that we have this duplicated field:

"ruleThrottle" : "1h",
"alertThrottle" : "1h"

Which looking through the history seems to have been there for over a year and both are filled out.

What our goals are is to migrate all of the siem-detection-engine-rule-actions by:

  • Iterating through each each siem-detection-engine-rule-actions and moving their "actions" into the "alert" SO. Note we have to transform some data since the side car of "siem-detection-engine-rule-actions" has mismatched labels such as it using snake_case "action_type_id" vs. the alert using camel case.
  • Change the notifyWhen to be onThrottleInterval on the alert during migration
  • Change the throttle to be the value of ruleThrottle if available, otherwise alertThrottle, otherwise we can default to 1h to be safe.
  • Change muteAll to false
  • Delete the siem-detection-engine-rule-actions

This will then be a migration to where existing alerts/rules which had this "side car model" will now operate with the newer mechanisms and all the additional SO will be removed.

Approaches:
👎 1. If we do nothing then we have left over saved objects and the user will have to manually check all their notifications and re-adjust them. Users will have N number of left over saved objects and we run the risk of later migration issues down the road. We could optionally just delete these/remove them and do no migrations. Other risks is the user never notices that they suddenly do not have notifications if we just put in release notes they have to check for notifications that are unset during the upgrade.
🤷 2. We could write a REST interface which uses the alerting client API and put something such as a banner up and require users to click on the banner to migrate these by clicking on the button. Or we try to auto-migrate these. Risks are that the API keys will be changed out for the new user, or if they never visit the page then the risk is that they suddenly do not have notifications.
😁 3. We could at startup run a migration process either shortly after regular Kibana migration to do all the heavy lifting of manually migrating these with the same API key. Risks are that the code has to be maintained for a while and the alerting team could change their format of their API keys, etc... If we could have an API on startup to use with this approach that wouldn't change the API key that would be great and would mitigate this risk more. Draft PR for this is up as a proof of concept so far. Since this requires a "join" we cannot use the normal Kibana startup migration so far. Discussions are here, here, and here.

Conclusions and decisions:

So far for approach 1, no one has expressed they would like to do that approach. We do not have explicit stake holder/business/product telling us "no" on approach 1 but several people have voiced concerns over a do-nothing approach.

@FrankHassanabad FrankHassanabad self-assigned this Sep 15, 2021
@botelastic botelastic bot added the needs-team Issues missing a team label label Sep 15, 2021
@FrankHassanabad FrankHassanabad added the Team:Detections and Resp Security Detection Response Team label Sep 15, 2021
@elasticmachine
Copy link
Contributor

Pinging @elastic/security-detections-response (Team:Detections and Resp)

@stacey-gammon
Copy link
Contributor

If you go with 3, wouldn't you still have the possibility of losing notifications because you can't guarantee the migration is complete by the time the Kibana server is up and running? I think that option sounds risky because you'd be running migrations while users are allowed to interact with their alerts (if I am understanding correctly). Would there be a possibility of a race condition? For example, a user adds an action which gets put onto rule.actions but then the migration runs and overwrites it with siem-detection-engine-rule-actions.actions? Disclaimer: I'm still trying to understand how it all works so apologies if I am off base!

The Kibana App team ran into the same requirement and managed to work around it by leveraging ad-hoc migrations when loading and saving the saved objects. I wonder if something similar could be achieved? (cc @flash1293 if you have any words of advice for how that went!).

  • Update siem-detection-engine-rule-actions to correctly use the references object so the ids are correct after the saved object id migration.
  • Migrate/delete when loading and saving the rule. Use an additional parameter so you only need run the extra query and migration if needed.
  • Leverage telemetry to track how many siem-detection-engine-rule-actions objects are left. Once it gets below a certain threshold, go with 1.?

@flash1293
Copy link
Contributor

The way we solved this for Graph is doing the reference lookup on read in the UI - this works for workspaces because they are only ever consumed this way, but it's not nice for a bunch of reasons:

  • It's possible the ad-hoc migration doesn't ever run because nobody is opening a workspace in the UI (e.g. exporting would not be able to export related saved objects)
  • We can never remove the code from deep within the app (because it's always possible there's an unmigrated workspace SO lingering)

I just skimmed the problem statement above but as far as I understood the reference ids are in the saved object, they are just not correctly placed in the references array, right? If I understood https://www.elastic.co/guide/en/kibana/master/sharing-saved-objects.html correctly, 7.16 is the last change to correctly place the references, so a simple migration moving these reference ids from ruleAlertId to the references array should suffice.

@mikecote
Copy link
Contributor

Update siem-detection-engine-rule-actions to correctly use the references object so the ids are correct after the saved object id migration.

++ without this, I don't think there's a way to guarantee siem-detection-engine-rule-actions.ruleAlertId and
siem-detection-engine-rule-actions.actions.id will point to the newly generated ID. An ad-hoc migration running after Kibana start could be migrating SOs that are already migrated to, say, 8.7 (long after the 8.0 migration generated new IDs).

@stacey-gammon
Copy link
Contributor

Update siem-detection-engine-rule-actions to correctly use the references object so the ids are correct after the saved object id migration.

Yea, I think that only solves half the problem though. The ids will be correct, but they still won't be able to migrate the content of siem-detection-engine-rule-actions into the alerting rule Saved Object because it would require a look-up during Saved Object Migrations. My understanding is that they want something like this:

migrateRule(rule) {
  const sideCar = findSavedObjectsWithReference('siem-detection-engine-rule-actions', rule.id);
  rule.actions += sideCar.actions;
  deleteSavedObject(sideCar);
  return rule;
}

Or the other way:

migrateSideCar(sideCar) {
  const rule = loadSavedObject('rule', rule.id);
  rule.actions += sideCar.actions;
  deleteSavedObject(sideCar);
  return undefined; // ??
}

As far as I understood from Core, the SO migration system doesn't support this kind of lookup during migrations.

@FrankHassanabad
Copy link
Contributor Author

FrankHassanabad commented Sep 16, 2021

I just skimmed the problem statement above but as far as I understood the reference ids are in the saved object, they are just not correctly placed in the references array, right?

That is being changed in a small upcoming PR to eliminate any questions or issues with any approach we choose and to make it so we don't have to carry baggage of doing saved object resolves at all.

Edited the ticket with these words to reflect this now:

Note: These do not have proper reference id's within them. This doesn't mean we can't add a saved object migration to move them before 7.16.0 or in conjunction with our plans which we might do depending on how we decide to migrate these. However, we have decided to do this migration before 7.16.0 in a small PR to eliminate this question or any issues around the SO id's being re-generated.

@flash1293
Copy link
Contributor

Ah, got it, thanks for the clarification. We discussed lookups during migration with the core team and decided against it for performance reasons.

Not super happy with the ad-hoc migration in there, but the plan with leveraging telemetry sounds good for phasing it out.

@lukeelmers
Copy link
Member

We discussed lookups during migration with the core team and decided against it for performance reasons.

More context on that here: #34996 (comment)

Not super happy with the ad-hoc migration in there, but the plan with leveraging telemetry sounds good for phasing it out.

I agree that, assuming I understand everything correctly, this currently sounds like the most viable (and least risky) path forward.

@rylnd
Copy link
Contributor

rylnd commented Sep 21, 2021

Following some offline meetings/discussion, we've come up with an option that should satisfy all parties for 7.16. Please review and let me know if you have questions/concerns:

The core ideas are to support both versions of actions, to funnel users through our own custom import/export APIs, and to check/migrate rules' actions when the rule is touched (create/import/save). There are a few pieces of work involved with this:

  1. Reverting the removal of sidecar (read: old) actions and supporting both types
  2. Adding code to perform the migration from sidecar actions to new actions on rule save
  3. Enabling the export of all associated rule SOs in our export API:
    • actions, connectors
    • exception lists, exceptions
    • notifications
  4. Telemetry to report how many users still have sidecar actions

Users can then update their rules ad hoc by touching them, or they can update all rules at once by exporting and reimporting (via our custom import/export APIs). In 8.x, once we have confidence (via telemetry), we can remove support for sidecar actions and remove our own import/export in favor of the SOM. Until then, we will discourage/prevent Detections users from managing rules there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discuss Team:Detections and Resp Security Detection Response Team
Projects
None yet
Development

No branches or pull requests

7 participants