[Alerting] 7.8 cloud returning 500 response for event log queries #68265

Closed
pmuellr opened this issue Jun 4, 2020 · 10 comments · Fixed by #68331
Labels
bug, Feature:Alerting, Team:ResponseOps

Comments

@pmuellr
Member

pmuellr commented Jun 4, 2020

I can query the event log via its HTTP endpoint in 7.8 when running locally, but when I run in the cloud I see the following error logged, and a 500 is returned from the HTTP request:

Error: querying for Event Log by for type "alert" and id "86...34" failed with: 
   [query_shard_exception] failed to create query: 
   [nested] nested object under path [kibana.saved_objects] is not of nested type, with { 
      index_uuid="tfIsEHFSQm-Sec_337bljw" & index=".kibana-event-log-7.8.0" 
   }
   at ClusterClientAdapter.queryEventsBySavedObject (../kibana/x-pack/plugins/event_log/server/es/cluster_client_adapter.js:203:13)
    at process._tickCallback (internal/process/next_tick.js:68:7)

For 7.8 we have exposed the HTTP endpoint for event log queries but are not actively using it, so this is not a blocker for 7.8.0, but we should try to get a fix in for 7.8.1.

@pmuellr pmuellr added the bug, Feature:Alerting, and Team:ResponseOps labels Jun 4, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@pmuellr
Member Author

pmuellr commented Jun 4, 2020

So, it gets a little worse - it appears the mappings that got created were "dynamic" - the index didn't use the mappings we set up in the index template. On further inspection, the index is actually named .kibana-event-log-7.8.0 and not .kibana-event-log-7.8.0-00001 as expected - the former should be the alias to the latter, but apparently there's no alias. Looks like none of the event log setup ran!

This is probably still ok for 7.8, since we don't use the event log to query, but means the 7.8.x event logs are kind of hosed. No ILM.
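
To make the "dynamic mappings" symptom concrete, here is a minimal sketch of how one might check whether a field ended up mapped as `nested`, given the `mappings` object for an index. The `FieldMapping` type, the `isNestedField` helper, and both sample mappings are assumptions made up for illustration; they are not taken from the event log plugin or its index template.

```typescript
// Hypothetical type: a minimal subset of an Elasticsearch field mapping.
interface FieldMapping {
  type?: string;
  properties?: Record<string, FieldMapping>;
}

// Returns true only if the field at the dot-separated `path` is mapped as `nested`.
function isNestedField(root: FieldMapping, path: string): boolean {
  let current: FieldMapping = root;
  for (const segment of path.split('.')) {
    const next = current.properties?.[segment];
    if (next === undefined) return false; // path does not exist in the mapping
    current = next;
  }
  return current.type === 'nested';
}

// Assumed shape of a template-provided mapping: saved_objects is declared nested.
const templated: FieldMapping = {
  properties: {
    kibana: {
      properties: {
        saved_objects: { type: 'nested', properties: { id: { type: 'keyword' } } },
      },
    },
  },
};

// Assumed shape of a dynamically inferred mapping: a plain object, no nested type.
const dynamic: FieldMapping = {
  properties: {
    kibana: {
      properties: {
        saved_objects: { properties: { id: { type: 'text' } } },
      },
    },
  },
};

console.log(isNestedField(templated, 'kibana.saved_objects'));
console.log(isNestedField(dynamic, 'kibana.saved_objects'));
```

A nested query against the second shape would be rejected, which is consistent with the `query_shard_exception` reported at the top of this issue.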

Looking in the logs at the first start of this Kibana instance, I can see the following with tag eventLog:

error initializing elasticsearch resources: error checking existance of ilm policy: Bad Request

That code is here:

public async doesIlmPolicyExist(policyName: string): Promise<boolean> {
  const request = {
    method: 'GET',
    path: `_ilm/policy/${policyName}`,
  };
  try {
    await this.callEs('transport.request', request);
  } catch (err) {
    if (err.statusCode === 404) return false;
    throw new Error(`error checking existance of ilm policy: ${err.message}`);
  }
  return true;
}

However, I can curl that API manually with the elastic user:

$ curl $ES_URL/_ilm/policy/foo | json
{
  "error": {
    "root_cause": [
      {
        "type": "resource_not_found_exception",
        "reason": "Lifecycle policy not found: foo"
      }
    ],
    "type": "resource_not_found_exception",
    "reason": "Lifecycle policy not found: foo"
  },
  "status": 404
}

So, seems like a permissions problem perhaps?

@pmuellr
Member Author

pmuellr commented Jun 4, 2020

Hmm ... for a newly created user, doing the ILM check, I get a 403, with no message of "Bad Request" that I can see ... I wonder if it's transport.request that's causing the issue ...

@pmuellr
Member Author

pmuellr commented Jun 4, 2020

I believe we structured this so that if there's any error during the event log (EL) initialization, we basically disable the event log, but it looks like that's not quite working, and I'm not sure why. Looks like all the errors are being caught in the right places in the following files:

The problem is that if the init fails, the writes will end up creating a new index with the non-ILM'd name (because we write to the alias, but with no alias, a new index with that name will be created), with no template, ILM policy, etc., which will be hard to convert back to its ILM-able version.

@pmuellr
Member Author

pmuellr commented Jun 4, 2020

Remembering now that we renamed the ILM policy to remove the leading ., so it should be named kibana-event-log-policy. ILM policies aren't indices, so there's no index pattern matching privilege thing that should be happening AFAIK.

@mikecote
Contributor

mikecote commented Jun 4, 2020

Regarding the missing permissions, it's very likely because this change in Elasticsearch was required: elastic/elasticsearch#46894.

Cloud has a different list of permissions that it grants to the Kibana user (for now), and granting privileges to Kibana on cloud is a manual step. We had a similar issue with API keys, where the kibana user in cloud didn't automatically get the privileges after a PR was merged in Elasticsearch.

Pinging @nachogiljaldo to see if elastic/elasticsearch#49451 is reflected on cloud?

EDIT: The permissions are granted on cloud.

It also makes sense that we should stop the event log from logging events to Elasticsearch when it fails to initialize.

@pmuellr
Member Author

pmuellr commented Jun 4, 2020

It looks like we've only done half the work on "disabling" the event log in the case of initialization errors, which is unrelated to whatever the problem is here. Because the event log isn't "disabled", we end up writing to an index with the name of the alias we wanted to use, meaning we can't create the alias later. If we disable the event log, at least we won't be creating the index, and the next time we try to initialize, perhaps it will work (for example, after the customer changes configuration so that it can work).

I opened #68309 for this.
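
As a rough illustration of the "disable on init failure" idea described above, here is a minimal sketch of a gate that drops events unless initialization succeeded. `EventLogGate` and its methods are hypothetical names invented for this example; they are not the plugin's actual classes, and the real fix in #68309 may look quite different.

```typescript
// Hypothetical sketch: EventLogGate and its methods are illustrative names,
// not classes from the Kibana event log plugin.
type EventDoc = Record<string, unknown>;

class EventLogGate {
  private enabled = false;
  private readonly queue: EventDoc[] = [];

  // Call once initialization (ILM policy, index template, alias) has finished.
  // Writes are enabled only if every step succeeded.
  setInitialized(succeeded: boolean): void {
    this.enabled = succeeded;
  }

  // Queue a document for indexing. While the log is disabled the document is
  // dropped, so no index can be auto-created under the alias name.
  logEvent(doc: EventDoc): boolean {
    if (!this.enabled) return false;
    this.queue.push(doc);
    return true;
  }

  // Number of documents waiting to be written.
  pending(): number {
    return this.queue.length;
  }
}

const exampleGate = new EventLogGate();
exampleGate.setInitialized(false); // simulate a failed ILM policy check
console.log(exampleGate.logEvent({ action: 'execute' })); // dropped, nothing queued
```

The gate defaults to disabled, so events logged before initialization completes are also dropped rather than written to a not-yet-existing alias.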

pmuellr added a commit to pmuellr/kibana that referenced this issue Jun 4, 2020
resolves elastic#68265

This changes the ILM requests made by the eventLog from relative to absolute
URLs.  These requests test the existence of and create ILM policies, and are
made with a cluster client using `transport.request`.  Relative URLs work fine
locally and in CI, however do not work on the cloud.
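
A minimal sketch of what the relative-to-absolute change could look like, assuming "absolute" here means a request path with a leading slash. The `toAbsolutePath` and `ilmPolicyRequest` helpers are hypothetical and exist only for this example; only the `_ilm/policy/${policyName}` path and the `kibana-event-log-policy` name come from this thread.

```typescript
// Hypothetical helper: ensures a request path starts with a single leading slash.
function toAbsolutePath(path: string): string {
  return path.startsWith('/') ? path : `/${path}`;
}

// Hypothetical helper: builds the request object passed to `transport.request`
// when checking for an ILM policy, using an absolute path.
function ilmPolicyRequest(policyName: string): { method: 'GET'; path: string } {
  return {
    method: 'GET',
    path: toAbsolutePath(`_ilm/policy/${policyName}`),
  };
}

console.log(ilmPolicyRequest('kibana-event-log-policy').path);
```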
@nachogiljaldo

@mikecote just to confirm, anything to do from our side here?

@mikecote
Contributor

mikecote commented Jun 5, 2020

@nachogiljaldo Nothing to confirm anymore, I answered my own question. Sorry for the ping.

@pmuellr
Member Author

pmuellr commented Jun 5, 2020

Hopefully PR #68331 will resolve this.

pmuellr added a commit that referenced this issue Jun 5, 2020
resolves #68265

This changes the ILM requests made by the eventLog from relative to absolute
URLs.  These requests test the existence of and create ILM policies, and are
made with a cluster client using `transport.request`.  Relative URLs work fine
locally and in CI, however do not work on the cloud.
pmuellr added a commit that referenced this issue Jun 9, 2020
…8391)

resolves #68265

This changes the ILM requests made by the eventLog from relative to absolute
URLs.  These requests test the existence of and create ILM policies, and are
made with a cluster client using `transport.request`.  Relative URLs work fine
locally and in CI, however do not work on the cloud.
pmuellr added a commit that referenced this issue Jun 9, 2020
…8390)

resolves #68265

This changes the ILM requests made by the eventLog from relative to absolute
URLs.  These requests test the existence of and create ILM policies, and are
made with a cluster client using `transport.request`.  Relative URLs work fine
locally and in CI, however do not work on the cloud.
@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022
5 participants