[alerting] decrypt errors during migration yield unmigrated alert saved objects #101582

Closed
pmuellr opened this issue Jun 8, 2021 · 22 comments · Fixed by #105968
Assignees: ymao1
Labels: bug, estimate:small, Feature:Actions, Feature:Actions/Framework, Feature:Alerting, Feature:Alerting/RulesFramework, research, Team:ResponseOps

Comments

pmuellr (Member) commented Jun 8, 2021

This code in the alerting migration looks a bit suspect to me:

```ts
function executeMigrationWithErrorHandling(
  migrationFunc: SavedObjectMigrationFn<RawAlert, RawAlert>,
  version: string
) {
  return (doc: SavedObjectUnsanitizedDoc<RawAlert>, context: SavedObjectMigrationContext) => {
    try {
      return migrationFunc(doc, context);
    } catch (ex) {
      context.log.error<AlertLogMeta>(
        `encryptedSavedObject ${version} migration failed for alert ${doc.id} with error: ${ex.message}`,
        {
          migrations: {
            alertDocument: doc,
          },
        }
      );
    }
    return doc;
  };
}
```

It appears that if an ESO decrypt error occurs during the rule migration, the code falls into the catch logic: the error is logged, but the document is returned unchanged - not actually migrated. We've now seen that logged message in a user's deployment.

I believe the proper course of action in this case is to "fix" the decryption error by removing the API key, essentially disabling the rule, which should allow it to be migrated. This has the unfortunate effect of disabling a rule that was by definition not disabled before - but then presumably it wasn't running either, since it has a decrypt error. The other option is to not migrate it during normal migration, but instead migrate it later when it's actually needed - but that seems a lot harder.
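
To make that concrete, here is a minimal sketch of what that remediation could look like, assuming a hypothetical wrapper where the decryption failure is caught and the document can be sanitized before the shape migration runs (the types below are simplified stand-ins for RawAlert and friends, not the real alerting types, and the wiring into the ESO wrapper is imaginary):

```ts
// Sketch only: on decryption failure, drop the API key and disable the rule,
// then run the shape migration on the sanitized document.
interface RuleAttributes {
  enabled: boolean;
  apiKey?: string | null;
  [key: string]: unknown;
}

interface RuleDoc {
  id?: string;
  attributes: RuleAttributes;
}

type RuleMigrationFn = (doc: RuleDoc) => RuleDoc;

function migrateOrDisable(migrate: RuleMigrationFn) {
  return (doc: RuleDoc): RuleDoc => {
    try {
      return migrate(doc);
    } catch (err) {
      // "Fix" the decryption error: remove the API key (effectively disabling
      // the rule) so the document can still be migrated to the new shape.
      const sanitized: RuleDoc = {
        ...doc,
        attributes: { ...doc.attributes, apiKey: null, enabled: false },
      };
      return migrate(sanitized);
    }
  };
}
```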

I guess the other case where this can happen is if the encryption key is changed at the same time the migration takes place, in which case every rule and connector would be broken because of decryption errors. That almost seems worthy of failing the migration completely, but it's also not clear how we recognize that state compared to just a handful of "broken" ESOs.

Note that we can't simply replace the API key in the rule, since that requires a user context to create the API key, and during migration, we won't have one. We'd have to remove it, by "disabling" the rule.

The actions migration is very similar:

```ts
function executeMigrationWithErrorHandling(
  migrationFunc: SavedObjectMigrationFn<RawAction, RawAction>,
  version: string
) {
  return (doc: SavedObjectUnsanitizedDoc<RawAction>, context: SavedObjectMigrationContext) => {
    try {
      return migrationFunc(doc, context);
    } catch (ex) {
      context.log.error<ActionsLogMeta>(
        `encryptedSavedObject ${version} migration failed for action ${doc.id} with error: ${ex.message}`,
        {
          migrations: {
            actionDocument: doc,
          },
        }
      );
    }
    return doc;
  };
}
```

Remediation is worse though, since we'll lose the secrets. But the good news is we do have a story for "actions that are missing secrets via export/import" - which is presumably how we'd set these up after the migration completes.
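
A similarly hedged sketch for connectors, assuming the existing isMissingSecrets flag (the one used for export/import) is an acceptable marker; the types are simplified stand-ins, not the real RawAction shape:

```ts
// Sketch only: on decryption failure, clear the secrets and flag the
// connector so the existing "missing secrets" flow (today used for
// export/import) can prompt the user to re-enter them after the upgrade.
interface ConnectorDoc {
  id?: string;
  attributes: {
    secrets?: Record<string, unknown> | null;
    isMissingSecrets?: boolean;
    [key: string]: unknown;
  };
}

function stripSecrets(doc: ConnectorDoc): ConnectorDoc {
  return {
    ...doc,
    attributes: { ...doc.attributes, secrets: null, isMissingSecrets: true },
  };
}
```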

One thing I'm not sure about, is how to get this to work with the current ESO migration. It's a framework call, we just pass in functions to morph the raw saved object shapes, we don't have any way of handling decrypt errors differently than anything else (and what do we do when "anything else" errors occur!?). Feels like we'll need to beef up the ESO createMigration() function with some kind of error handling function, that would let us do some shape changes on the raw objects (to disable the rule, or mark the connector as "needs secrets"), and continue with the migration instead of failing.
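
For illustration only, such a beefed-up createMigration() could accept an error-handling callback along these lines; none of this exists today, and the names and signature are made up:

```ts
// Hypothetical shape for an ESO createMigration() that lets consumers handle
// decryption failures: the callback returns a sanitized/adjusted document
// (e.g. rule disabled, connector marked "needs secrets") and the migration
// continues instead of failing.
type SavedObjectDoc<T> = { id?: string; attributes: T };

interface CreateEsoMigrationParams<InputAttrs, OutputAttrs> {
  migration: (doc: SavedObjectDoc<InputAttrs>) => SavedObjectDoc<OutputAttrs>;
  // Imaginary option: called when decryption of `doc` fails.
  onDecryptionFailure?: (
    doc: SavedObjectDoc<InputAttrs>,
    error: Error
  ) => SavedObjectDoc<InputAttrs>;
}
```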

@pmuellr pmuellr added bug Fixes for quality problems that affect the customer experience Feature:Alerting Feature:Actions Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Jun 8, 2021
elasticmachine (Contributor) commented:

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

pmuellr (Member, Author) commented Jun 9, 2021

During conversation on this issue yesterday, I recalled seeing Kibana logs that seemed to indicate it was possible to have saved objects migrated "as needed" somehow. Not sure if this is still something we can do - I'm not seeing the message right now in my dev env, and can't remember exactly what it said - but it was something like "not doing migrations now, will migrate saved objects on read". But if this capability does exist, perhaps we don't really have an issue?

I suspect even if it is true, there will be some gotcha - we still can't really update the encryption key during one of these "as needed" migrations (I'm guessing here). But if there is such a capability, perhaps we could make use of it in whatever the solution for this is - like the migration could leave it in a state where an "as needed" migration would be possible.

mikecote (Contributor) commented Jun 9, 2021

Adding to To-Do so we can better understand the issue, what can be done to fix it, and whether we should prioritize further work to fix the problem.

pmuellr (Member, Author) commented Jun 9, 2021

I just added a research tag - I think we need to understand whether this is really a problem, how we plan on fixing it (it may require changes to the ESO migration code), and whether there are workarounds to prevent this from happening, or to fix things after it breaks something.

@ymao1 ymao1 self-assigned this Jun 29, 2021
@gmmorris gmmorris added Feature:Actions/Framework Issues related to the Actions Framework Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework labels Jul 1, 2021
@gmmorris gmmorris added the loe:medium Medium Level of Effort label Jul 13, 2021
ymao1 (Contributor) commented Jul 13, 2021

After some research, it seems that the original decision to add the try/catch block in the migrations code stemmed from the span of time when alerting functionality was in beta and made use of the ephemeral encryption key. The thought was that we did not want to penalize users who were trying out a beta functionality by stopping Kibana upgrades due to failure to migrate their beta rules (which they may or may not actually be using) that used an ephemeral encryption key.

Edit After a discussion with @mikecote, it looks like the fear of having users that created rules with an ephemeral encryption key was unfounded as the PR to block this ability was merged before alerting beta was released.

OTOH, the recommended behavior from @elastic/kibana-core, as stated in this comment on the Saved Objects Migration RFC and in offline discussions, is that a migration failure should cause a failed upgrade, allowing users to immediately roll back to the previous version. This is considered more desirable than having the upgrade succeed with unmigrated saved objects that may cause other, harder-to-trace issues at a future time.

I see three options here, although I don't think (1) is the right approach, so for the purposes of this discussion, I will only be focusing on (2) and (3).

(1) Disable saved object when decryption errors occur

Instead of returning the un-migrated document on decryption error, we can disable the rule/connector and apply the migration function to the disabled saved object. I think this is the least desirable option because for rules, if we want to fully and correctly disable a rule on failed migration without leaving any dangling references, we should also be removing the scheduled task associated with the rule and invalidating the API key. It feels kind of weird and wrong to be doing these things during a migration.

(2) Fail the upgrade when decryption errors occur

The upgrade fails with properly logged decryption errors. Log messages can be informative and tell users to check their encryption keys. Users are able to roll back to the previous Kibana version and fix their encryption key mismatch.

(3) Apply the migration on the unencrypted document

Even if decryption fails, apply the migration function. This would result in a correctly transformed document that will error when it runs post-migration due to broken AAD.
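
Roughly, option (3) means the ESO migration wrapper would fall back to transforming the document without decrypting it, something like the sketch below (simplified stand-in types; not the actual ESO implementation):

```ts
// Sketch of option (3): if decryption fails, still apply the shape migration
// to the document as-is (encrypted fields left untouched) rather than
// returning the old, unmigrated document.
type Doc = { id?: string; attributes: Record<string, unknown> };
type MigrateFn = (doc: Doc) => Doc;

function migrateEvenIfDecryptionFails(
  decryptMigrateAndReencrypt: MigrateFn, // normal path
  migrateShapeOnly: MigrateFn // fallback: no decryption attempted
) {
  return (doc: Doc): Doc => {
    try {
      return decryptMigrateAndReencrypt(doc);
    } catch (err) {
      // The document gets the new shape, but its encrypted payload is now
      // known to be undecryptable and will error at runtime (broken AAD).
      return migrateShapeOnly(doc);
    }
  };
}
```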


I see the following possible scenarios that might lead to decryption failures during migrations:

User unintentionally connects a Kibana with a different encryption key to a cluster AND upgrades this spurious Kibana, triggering a migration.

With option (2), the spurious Kibana would attempt to migrate rules and connectors and fail. Normal cluster would be unaffected.

With option (3), the spurious Kibana would attempt to migrate rules and connectors and succeed. Rules and connectors on the normal cluster would start failing due to broken AAD. Errors would show up in the rule management UI but action execution errors would only show up in the logs. It would likely be unintuitive for the user to track down why their rules and connectors were erroring. The user would need to roll back their accidental upgrade. There's also the question of whether the cluster's Kibana would work at all, since all saved objects, not just rules & connectors, would have been migrated by the spurious Kibana.

Encryption key changed during upgrade (accidentally or intentionally).

With option (2), the upgrade would fail with log messages indicating to the admin/user to verify that the correct encryption key was used and/or add the old encryption key to the decryption-only config. Once the configs are fixed, the upgrade can be tried again.

With option (3), the upgrade would succeed but all rules and connectors on the normal cluster would start failing due to broken AAD. Errors would show up in the rule management UI but action execution errors would only show up in the logs. It would likely be unintuitive for the user to track down why their rules and connectors were erroring. The user would need to either roll back their upgrade and redo it with a fixed encryption key, or disable/re-enable all their rules and re-enter secrets for all their connectors.

User changed encryption key during upgrade and did not back up the key

With option (2), the upgrade would fail with log messages indicating to the admin/user to verify that the correct encryption key was used and/or add the old encryption key to the decryption-only config. If the user did not back up the key, they cannot complete the upgrade unless they delete all the rules and connectors.

With option (3), the upgrade would succeed but all rules and connectors would start failing due to broken AAD. Errors would show up in the rule management UI but action execution errors would only show up in the logs. It would likely be unintuitive for the user to track down why their rules and connectors were erroring. The user would need to disable/re-enable all their rules and re-enter secrets for all their connectors.

User has old rules and connectors using outdated encryption keys

This could happen if user tried out alerting but isn't actively using any rules or connectors. There could be rules running and continuously erroring, but they don't care and haven't cleaned them up.

With option (2), the upgrade would fail and the user might be confused as to why rule and connector migrations are blocking the upgrade when they don't use rules or connectors. Hopefully the log messages would provide enough context for them to delete these rules and connectors that were erroring anyway. However, since we've been catching decryption errors since 7.9 and returning unmigrated documents, it is possible that this will cause many, many upgrade failures in the first version after we change this behavior. All of the user's old rules/connectors could come back to haunt them.

With option (3), the migration would succeed. Rules and connectors that were erroring before would continue to error. User would never know and never need to know.

User has active working rules AND old rules and connectors using outdated encryption keys

User only uses security rules in the security UI but also has some older rules and connectors using outdated encryption keys that they never see errors for because they don't use the stack rules management page.

With option (2), the upgrade would fail and the user might be confused as to why. Hopefully the log messages should provide enough context for them to delete these rules and connectors that were erroring anyway. In the worst case, they might think that the migration failures are on their active rules/connectors and delete all rules and connectors.

With option (3), the migration would succeed. Active rules and connectors would continue to work and rules that were erroring before would continue to error.


ymao1 (Contributor) commented Jul 13, 2021

It seems to me that the worst-case scenario for option (2) is a user who is actively using rules and tries to upgrade and rotate encryption keys without backing up their old key. In this scenario, they would have to delete their rules and connectors in order to finish the upgrade.

The worst-case scenario for option (3) seems to be a user who changes their encryption key during the upgrade (accidentally, or by forgetting to set the decryption-only config): all their rules and connectors would be migrated but broken due to AAD. It may take some time for these broken rules/connectors to get noticed, and more time to figure out why they are broken, and then the user either needs to revert the upgrade and redo it with the correct encryption key, or spend time disabling, re-enabling, and re-entering secrets. This could be very tedious if they have a lot of rules and connectors.

I think at this time I'm leaning toward option (3) as it is the option that would not lead to potential data loss, but I'm interested to know what others on @elastic/kibana-alerting-services think

gmmorris (Contributor) commented Jul 14, 2021

First of all - I LOVE the level of detail, thanks @ymao1 :elasticheart:

I have a couple of questions before I can choose a preferred approach.

  1. regarding option (2): You wrote "they would have to delete their rules and connectors in order to finish the upgrade." - do they actually have this option in the Upgrade Assistant? How do they do this?
  2. regarding option (3): ESO migrations isn't an alerting feature, but rather a @elastic/kibana-security feature. Do they agree that this is a valid option? Would this decision happen in ESO code or Alerting code?
  3. another regarding option (3): This would mean we cannot rely on migrations to reliably reflect whether encrypted attributes have been migrated, but rather only the unencrypted ones. Can we use TypeScript to make this clear? Perhaps limit the input/output of a migration on an ESO so that it only includes the non-encrypted fields, so developers don't try to migrate the encrypted fields? (A sketch of this idea follows below.)
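
Regarding question 3, a minimal sketch of that type-level guard, with illustrative field names (not the real RawAlert shape):

```ts
// Sketch: constrain ESO migration functions so they can only see and return
// the non-encrypted attributes, making it explicit that encrypted fields are
// never transformed when decryption is skipped.
interface RuleAttributes {
  name: string;
  enabled: boolean;
  apiKey?: string; // encrypted field
}

type EncryptedFields = 'apiKey';
type MigratableAttributes = Omit<RuleAttributes, EncryptedFields>;

type EsoMigrationFn = (doc: {
  id?: string;
  attributes: MigratableAttributes;
}) => { id?: string; attributes: MigratableAttributes };

// A migration written against this type cannot read or set `apiKey`
// without a compile error; it can only touch the plaintext fields.
const exampleMigration: EsoMigrationFn = (doc) => ({
  ...doc,
  attributes: { ...doc.attributes, name: doc.attributes.name.trim() },
});
```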

ymao1 (Contributor) commented Jul 14, 2021

  1. regarding option (2): You wrote "they would have to delete their rules and connectors in order to finish the upgrade." - do they actually have this option in the Upgrade Assistant? How do they do this?

I am unfamiliar with Upgrade Assistant. I have seen other (non alerting) cases where failed migrations cause upgrade failures and the suggestion is to delete the bad SO. Is that typically done through Upgrade Assistant?

  2. regarding option (3): ESO migrations isn't an alerting feature, but rather a @elastic/kibana-security feature. Do they agree that this is a valid option? Would this decision happen in ESO code or Alerting code?

Very interested in opinions from @elastic/kibana-security on this issue! I think the work would happen in the Alerting code, where we catch the decryption error. Currently, in the catch block in the alerting code, we return doc, where doc is the unmigrated saved object. I imagine any updates would happen in this part of the code, not the ESO code.

mikecote (Contributor) commented:

Writing some thoughts below on what my current thinking is.

User unintentionally connects a Kibana with a different encryption key to a cluster AND upgrades this spurious Kibana, triggering a migration.

Encryption key changed during upgrade (accidentally or intentionally).

User changed encryption key during upgrade and did not back up the key

I realized we do want to prevent Kibana from starting in these kinds of situations with #92654. My thinking here is that option (2) is similar to that direction. Especially since the check created in #92654 would only happen after an upgrade.

User has old rules and connectors using outdated encryption keys

User has active working rules AND old rules and connectors using outdated encryption keys

If we go with option (2), the remediation step for this scenario would be to fix or delete the broken rules/connectors and try upgrading again. If we go with option (3), we'd be moving corrupted data into the upgraded Kibana and rule executions would start (or continue) failing without the admin knowing.

By going with option (2), we may see a surge in upgrade failures. A single rule or connector with broken AAD / encrypted with a different encryption key would block the entire upgrade, but I think this is a better option than potentially breaking the AAD on rules/connectors and hoping admins read the Kibana logs, and/or having rules that stop working after upgrade without notice. The person looking into these kinds of issues would most likely be the Kibana administrator rather than the rule author, so presenting the errors at that layer (to the Kibana admin during upgrade) seems like the better option.

I would also like to validate the thinking and options with the tech leads (@kobelb, @stacey-gammon).

legrego (Member) commented Jul 14, 2021

As of now, the upgrade assistant is only used when upgrading to the next major version. The upgrade assistant isn't available or useful when upgrading to the next minor or patch version.

Speaking more generally about encrypted saved objects, it seems strange that we'd want to migrate the plaintext portion of an ESO when we know that the encrypted payload is corrupt or unrecoverable.

As part of the security on by default initiative, we are introducing an interactive setup mode, which we think will lend itself nicely to future enhancements, such as interactive migrations (#100685), where we could have the user help us with the migration, or at least provide a friendlier migration experience. With that in mind, I could foresee giving the administrator a choice, where we can show them the alerts that can't be decrypted, and have them decide how to proceed. This does nothing to solve the current problem, but if the primary concern is around failed migrations, then we have a potential solution in the future to alleviate some of the user frustration.

pmuellr (Member, Author) commented Jul 14, 2021

we are introducing an interactive setup mode, which we think will lend itself nicely to future enhancements, such as interactive migrations (#100685)

Nice! But I wonder how that will work out in the cloud.

Speaking more generally about encrypted saved objects, it seems strange that we'd want to migrate the plaintext portion of an ESO when we know that the encrypted payload is corrupt or unrecoverable.

In general, it does seem strange. However, alerting rules only encrypt the API key, which we already have ways of "fixing" (disable/re-enable, or use the explicit API to regenerate the API key), and connector actions only encrypt the "secrets" object field, and we kinda have a way of dealing with that as well (the isMissingSecrets field added for export/import, where we remove the secrets). So, for us, this is almost workable.

I think the problem with disabling alerts (which should also include then deleting any task objects, but I'll ignore that for the moment) and marking actions as needing secrets, is that I don't think we have a great way of informing the user these things happened after the migration. Something like Notification Center messages could potentially be the answer for that, or some additional views in the alerting UI, but we'll be waiting a while for those. So, even though we could log that we left these objects in an "you must fix it" state, I'm sure lots of folks would miss that, and then at the same time their rules / actions wouldn't actually be running.

Failing the migration in all cases, with some reasonable messaging around the remediations (encryption key changed or delete/update/disable rules/connectors) still seems like the simplest / most direct thing to do, but again ... cloud. Do customers see these messages? What if there are 1000's of them?

Would it make any sense to make this configurable? So, user tries to migrate, it fails, but rather than fix a bunch of SO's, she's willing to have them migrated disabled / without secrets, so sets a config value and does the migration again. Additional complication, user choices, and probably not a setting you want to be persistent - but who would remember to delete it when you were finished?

kobelb (Contributor) commented Jul 14, 2021

Ignoring technical feasibility, in an ideal world, I think we should do the following:

  1. Fail the upgrade (option 2) if the new version of Kibana that is doing the upgrade has the wrong encryption key and no encrypted saved objects can be decrypted
  2. Let the upgrade proceed (option 3) if there are just a few encrypted saved objects that can't be decrypted

If the new version of Kibana that is performing the upgrade has the wrong encryption key, it's going to be a painful experience for our users to realize this after the upgrade has successfully completed and users have already begun making changes to Kibana. Having to tell users to restore from a snapshot and blow away any modifications that have been made since the upgrade makes me cringe.

However, I've heard so many anecdotal stories of our users starting up a rogue instance of Kibana and connecting it to a centralized deployment of Kibana and Elasticsearch, that it's led me to think this is incredibly common. Blocking upgrades in these scenarios will also be painful for our users because they might just have a singular alert that is misbehaving.

Is there anything we can do to come close to what I consider to be the ideal solution?

lukeelmers (Member) commented:

we are introducing an interactive setup mode, which we think will lend itself nicely to future enhancements, such as interactive migrations (#100685)

Nice! But I wonder how that will work out in the cloud.

We are in the early stages of discussion on this and there are still lots of questions, but suffice it to say the goal is to have interactive migrations work on Cloud 🙂

ymao1 (Contributor) commented Jul 15, 2021

Ignoring technical feasibility, in an ideal world, I think we should do the following:

  1. Fail the upgrade (option 2) if the new version of Kibana that is doing the upgrade has the wrong encryption key and no encrypted saved objects can be decrypted
  2. Let the upgrade proceed (option 3) if there are just a few encrypted saved objects that can't be decrypted

Is there anything we can do to come close to what I consider to be the ideal solution?

@kobelb Thanks for weighing in! From the alerting level, we don't get a holistic picture of the number of rule/connector saved objects that have failed to migrate vs the number that have succeeded, so we wouldn't have the information necessary to decide whether to proceed or fail. That would likely require a change to the migrations algorithm as a whole to retry the migrations based on whether a %-succeeded threshold is met, just for specific SO types ... which sounds messy 😬
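
For concreteness, the kind of aggregate decision being discussed might look like the sketch below; the counts and threshold are purely illustrative, and nothing in the migration framework aggregates these numbers today:

```ts
// Sketch: fail the upgrade only when (nearly) all encrypted saved objects
// fail to decrypt, which strongly suggests a wrong encryption key, and let
// it proceed when only a handful are broken. The threshold is arbitrary.
function shouldFailUpgrade(decryptFailures: number, totalEncryptedObjects: number): boolean {
  if (totalEncryptedObjects === 0) {
    return false;
  }
  return decryptFailures / totalEncryptedObjects > 0.9;
}
```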

ymao1 (Contributor) commented Jul 15, 2021

Failing the migration in all cases, with some reasonable messaging around the remediations (encryption key changed or delete/update/disable rules/connectors) still seems like the simplest / most direct thing to do, but again ... cloud. Do customers see these messages? What if there are 1000's of them?

@pmuellr The failure messages show up in the Kibana logs. Do cloud customers have access to the Kibana logs?

legrego (Member) commented Jul 15, 2021

Ignoring technical feasibility, in an ideal world, I think we should do the following:

1. Fail the upgrade (option 2) if the new version of Kibana that is doing the upgrade has the wrong encryption key and no encrypted saved objects can be decrypted

2. Let the upgrade proceed (option 3) if there are just a few encrypted saved objects that can't be decrypted

If the new version of Kibana that is performing the upgrade has the wrong encryption key, it's going to be a painful experience for our users to realize this after the upgrade has successfully completed and users have already begun making changes to Kibana. Having to tell users to restore from a snapshot and blow away any modifications that have been made since the upgrade makes me cringe.

However, I've heard so many anecdotal stories of our users starting up a rogue instance of Kibana and connecting it to a centralized deployment of Kibana and Elasticsearch, that it's led me to think this is incredibly common. Blocking upgrades in these scenarios will also be painful for our users because they might just have a singular alert that is misbehaving.

Is there anything we can do to come close to what I consider to be the ideal solution?

This wouldn't do exactly what you're asking for, but something akin to #92654 might help: If we can detect that the Kibana instance that's performing the migration has an encryption key that differs from what we expect, then we can choose to abort the migration. This is still pretty high-level, and I expect we'd need to introduce a new hook or concept into core to allow the ESO plugin to perform this check before/during the SO migration process.
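
One way such a pre-flight check could be shaped, assuming a hook that tries to decrypt a known "canary" encrypted saved object before migrations start (the hook is imaginary, and the error message wording is only illustrative):

```ts
// Hypothetical pre-flight check (akin to #92654): attempt to decrypt a known
// encrypted saved object before migrations run, and abort with an actionable
// message if it fails. `decryptCanary` is an imaginary hook, not a real ESO API.
async function assertEncryptionKeyMatches(decryptCanary: () => Promise<void>): Promise<void> {
  try {
    await decryptCanary();
  } catch (err) {
    throw new Error(
      'Unable to decrypt a known encrypted saved object. The configured ' +
        'xpack.encryptedSavedObjects.encryptionKey likely differs from the key used to ' +
        'create existing objects. Fix the key (or add the previous key to the ' +
        'decryption-only keys config) and retry the upgrade.'
    );
  }
}
```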

kobelb (Contributor) commented Jul 15, 2021

Until we have #92654, or some other way to approximate what I consider to be the ideal solution, my preference is that we implement solution 3 and log a warning every time that decryption fails. I'm hesitant for us to completely fail the upgrade just because a single alert can't be decrypted properly.

pmuellr (Member, Author) commented Jul 15, 2021

The failure messages show up in the Kibana logs. Do cloud customers have access to the Kibana logs?

Only if they have set that up - meaning, created a separate monitoring cluster, and then told cloud to ship the logs there. Otherwise, I don't think they do.

If we go down the path of logging warnings instead of failing, it sure would be nice to collect all those messages to present back to the user when the newly migrated deployment starts back up. Notification center again, I suppose ... Or I wonder if we could collect the messages from the migration run and add them to the "activity" section of the cloud UI (which shows you the history of changes to your deployment).

The upgrade assistant isn't available or useful when upgrading to the next minor or patch version.

It would be good, I think, if it could work for minor/patch versions. The scope is a little different; I sense the main idea for such a thing is to look for SOs that aren't structured correctly. So, maybe a different tool. Maybe the flow would be: migrations always fail, the remediation is to run an "SO consistency app" that guides the fixes to get things migratable, and then you can try the migration again. I believe that for rules and connectors, even if they fail decryption, you can still bring them up in the UI, update them, and they'll end up getting fixed on the update.

Won't help with the "migrating but also inadvertently changed the encryption key". There's also a "no-migration" version of this problem, where a user restores from one deployment to another, same version, but different encryption key. I think I saw one of those ...

my preference is that we implement solution 3 and log a warning every time that decryption fails. I'm hesitant for us to completely fail the upgrade just because a single alert can't be decrypted properly.

Going all the way back to the top of this issue, I'm wondering if we need a change to the ESO migration to actually do the plain old SO part of the migration, even if decrypt fails. Today, we return the original doc, and I'm not sure what state that leaves us in. If the migration removed a field, but we return the original doc that still has the old field ... just garbage? I'm not sure whether we have strict mappings on for our object type fields, where that would cause a failure on the index operation.

That would imply a change to the ESO migration to allow the document to be migrated, since currently it throws an error. I was thinking "maybe this will get fixed up later via migration-on-read", but it doesn't seem like that would be the case; it's presumably still running through this same code, throwing an error, returning the old document. So maybe it needs to take an "error" migrator that would let us return an acceptable shape for the SO. And we could set a new field failedMigration or such, to help with subsequent diagnosis.
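
If createMigration() did take such an "error" migrator, the consumer side might look like this sketch; both the hook and the failedMigration field are hypothetical:

```ts
// Sketch of the "error" migrator idea: when the ESO wrapper reports a
// decryption failure, return an acceptable shape for the SO and tag it so
// the failure can be diagnosed later.
type RawRuleDoc = {
  id?: string;
  attributes: Record<string, unknown> & { failedMigration?: boolean };
};

const onEsoMigrationError = (doc: RawRuleDoc): RawRuleDoc => ({
  ...doc,
  attributes: {
    ...doc.attributes,
    failedMigration: true, // breadcrumb for subsequent diagnosis/support
  },
});
```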

kobelb (Contributor) commented Jul 15, 2021

The failure messages show up in the Kibana logs. Do cloud customers have access to the Kibana logs?

ESS is in control of the ESO encryption keys, right? If so, we shouldn't have any situations where the encryption key was changed, Kibana was updated, and the saved-object migrations corrupt all of the alerts. However, if they're allowlisted in ESS for users to change them, this could theoretically happen.

Going all the way back to the top of this issue, I'm wondering if we need a change to the ESO migration to actually do the plain old SO part of the migration, even if decrypt fails. Today, we return the original doc, and I'm not sure what state that leaves us in. If the migration removed a field, but we return the original doc that still has the old field, ... just garbage? Not sure where we might have strict on for our object type fields where that would cause a failure on the index operation.

👍

ymao1 (Contributor) commented Jul 15, 2021

Thanks for all the input everyone! It sounds like there are a lot of great plans in the future, with the detection of encryption key mismatches and the notification center, that can lead to a better upgrade experience. But for now, the solution we can live with is to migrate the document whether or not it was successfully decrypted, and to not block Kibana upgrades on decryption errors.

pmuellr (Member, Author) commented Jul 16, 2021

ESS is in control of the ESO encryption keys, right? If so, we shouldn't have any situations where the encryption key was changed, Kibana was updated, and the saved-object migrations corrupt all of the alerts. However, if they're allowlisted in ESS for users to change them, this could theoretically happen.

They aren't allow-listed in ESS, and you are correct: a normal upgrade should have no problems, and I've never seen this happen in practice.

I'm not sure if it's possible for an ESS user to restore a .kibana index from some other deployment (perhaps on-prem) into ESS - I think there are likely other issues with that beyond just this one. And it looks like ESS prevents system indices from being restored, near as I can tell from trying to do that :-)

I'm less concerned with that aspect (if it's even possible, and since human intervention will be required at that point) than I am with a remediation of "messages will be logged", since some users will have difficulty getting access to the logs. I think we need to figure out what to do mid- to long-term here, and I don't see a great short-term alternative to logging. Something like "tagging" these mal-migrated SOs with a field indicating their state could be nice from a diagnostic/support angle, but it's not really an actionable remediation.

kobelb (Contributor) commented Jul 16, 2021

I'm less concerned with that aspect (if it's even possible, and since human intervention will be required at that point) than I am with a remediation of "messages will be logged", since some users will have difficulty getting access to the logs

It's an entirely valid concern. I'm assuming that we have a fair number of operators who are under the impression that as long as Kibana starts up correctly after an upgrade, the upgrade was successful. That's why I'd prefer that we explore options in the mid- to long-term that prevent this from happening when the encryption key used for the upgrade is wrong.

The operator performing the upgrade will most likely have access to the logs as well, but it's wishful thinking that they will read those logs during the upgrade.
