Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Elasticsearch self repair and handling of "valid" error responses #1471

Closed
svenoe opened this issue Dec 1, 2021 · 14 comments
Closed

Elasticsearch self repair and handling of "valid" error responses #1471

svenoe opened this issue Dec 1, 2021 · 14 comments
Assignees
Labels
enhancement New feature or request
Milestone

Comments

@svenoe
Copy link
Contributor

svenoe commented Dec 1, 2021

If for some reason the creation of a ticket in Elasticsearch fails, all subsequent article creations and other ticket updates will fail. We receive a clear error response though, and should use it to try to create the ticket a second time. While at it, we could filter out errors from not being able to handle specific files (like encrypted files, etc.) from the ingest plugin, and maybe put an info text to the log, but not more.

For both to work, first the invokers must be extended to handle errors (task five of #772). The important places for this would be:

  • Kernel/GenericInterface/Requester.pm line 324: $FunctionResult = $TransportObject->RequesterPerformRequest(
  • Kernel/GenericInterface/Transport/HTTP/REST.pm line 872: $RestClient->$RestCommand(@RequestParam);
    I'm not sure how to do this in a nice way, without looking further into this. In principle we don't want the error handling right after the second, but the communication with the invoker usually is done in the first (e.g. line 206: my $FunctionResult = $InvokerObject->PrepareRequest( and especially interesting since we would probably need something similar line 226: elsif ( $FunctionResult->{StopCommunication}).

For the self repair of the ticket index, line 539-552 of Kernel/System/Console/Command/Maint/Elasticsearch/Migration.pm # create the ticket could be taken, which is quite simple in itself, some measurement has to be taken, though, to not cause infinite loops! :)

@svenoe svenoe added the enhancement New feature or request label Dec 1, 2021
@svenoe svenoe added this to the OTOBO 10.1.1 milestone Dec 1, 2021
@bschmalhofer
Copy link
Contributor

@svenoe
I started looking into this. My understanding is that there already is generic support for error handling. That is line 619 in _Kernel/GenericInterface/Requester.pm:

my $InvokerHandleErrorOutput = $Param{InvokerObject}->HandleError(
    Data => $HandleErrorData,
);

So the first line of investigation would be adding a HandleError() method in Kernel/GenericInterface/Invoker/Elasticsearch/TicketManagement.pm .
I was wondering how to provoke an error in submitting the ticket to ES. The most simple idea I had, is to shut down ES temporarily. Is there a better approach?

@svenoe
Copy link
Contributor Author

svenoe commented Dec 16, 2021

The problem is, that if I understand correctly the real error message does not reach this point because of

return {
Success => 0,
ErrorMessage => $ResponseError,
};
where $ResponseError contains some not so useful information. But probably(?) you are right and we should use this method, but give it useful data.

As to provoking errors, yes, shutting down the ES container is one option. In the end you probably should delete a ticket from the index, too, and update its queue, or so, anyways. I propose copying a normal request out of the Webservice debugger, altering it and sending it via curl. Maybe the copying part is not even needed - it looks relatively easy: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-delete.html (Or the more complicated version: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-delete-by-query.html).

@bschmalhofer
Copy link
Contributor

Investigated which clients can be used for manipulating Elasticsearch. Curator, https://www.elastic.co/guide/en/elasticsearch/client/curator/current/about.html, does not do the job, as it is intended for the maintenance of indices. Fell back to using a small script based on Search::Elasticsearch, https://metacpan.org/pod/Search::Elasticsearch.

Checked what happens when updating the queue of a ticket that has been deleted in Elasticsearch.

Response content: '{"error":{"root_cause":[{"type":"document_missing_exception","reason":"[_doc][4]: document missing","index_uuid":"8iUJUa2zR3eaxx_mJjzo3A","shard":"0","index":"ticket"}],"type":"document_missing_exception","reason":"[_doc][4]: document missing","index_uuid":"8iUJUa2zR3eaxx_mJjzo3A","shard":"0","index":"ticket"},"status":404}'

The error message is useful for readding the relevant ticket to Elasticsearch.

@bschmalhofer
Copy link
Contributor

While looking into this, I noticed that there is a mixup of message for the generic interface debugger. When a mapping module can't be initialized then the error message of a previous function result is written into the debug log.
The underlying cause is the reusing of the variable $FunctionResult.

Creating an extra issue for this, as this is a bug.

@bschmalhofer
Copy link
Contributor

bschmalhofer commented Dec 23, 2021

Digging deep into the Generic Interface code I find that there are at lease three opportunities for error handling.

  1. Error handling modules in Kernel/GenericInterface/ErrorHandling that were registered in the webservice configuration
  2. HandleError() methods in the Invoker backend
  3. HandleResponse() with the parameter ResponseSuccess => 0 in the Invoker backend

For the case 1 there is currently only Kernel::GenericInterface::ErrorHandling::RequestRetry which isn't used in the default configuration. The RequestRetry.pm mechanism is like:

  • RequestRetry.pm creates ReSchedule and PastExecution data and returns that data to the requester Kernel::GenericInterface::Requester.
  • The Requester creates a scheduled task with the same invoker
  • The scheduled task is eventually run by the OTOBO Daemon
  • The Invoker backend can in turn react to the PastExecutionData,

The case 2 is used for dynamic fields. In this case HandleError() is mapped to HandleResponse(). My impression is that this intended as a more straight forward callback mechanism.

The case 3 is a bit confusing. It seems to be redundant with regard to the case 2.

The goal of this issue seems to be more like case 1. It is mostly a retry with a twist. But there can be some nasty cases. When Elasticsearch is down and then restarted, then a lot of tasks can be queued up. When there are many independent changes to a ticket, then each change would cause a reindexing of the ticket in Elasticsearch.

An alternative approach would be:

  • whenever there is an indexing error then the ticket is marked as dirty in the database
  • there is a cronjob that reindexes the dirty issues

TODO;

  • investigate the feasability of the alternative approach, where tickets are marked as dirty

@svenoe
Copy link
Contributor Author

svenoe commented Dec 23, 2021

As a quick comment - I do not see it as a real "redo with a twist", yes you are right for the case of rebuilding the complete ticket it definitely is (and yes, care has to be taken to not run into loops), but it has other purposes, too, e.g. if the ingest plugin which extracts data from an attachment returns that it cannot handle encrypted attachments, I don't want an error message in the logs, maybe an info or debug statement, but the plugin complaining is totally fine. So in the end I see it more as a way to silently exit and just potentially start some additional action, as the rebuild.

@bschmalhofer
Copy link
Contributor

It looks the search in Elasticsearch and the search in the database are pretty much independent of each other. The database search, especially the article word index, is active per default. The Elasticsearch indexing must be set up as a web service.

There is the table attribute article.search_index_needs_rebuild which act as a dirty mark. Currently this is used only in the Console Commands Maint::Ticket::FulltextIndex.pm and Maint::Ticket::FulltextIndexRebuildWorker. In this setting only the values 0 and 1 are used. This means that this attribute could be abused to also mark the Elasticsearch Index as dirty. Of course adding a new table attribute is also a feasible option.

@bschmalhofer
Copy link
Contributor

Discussed this issue with @svenoe and we concluded:

  • Webservices are used predominately in extensions of OTOBO. Care must be taken that the interface is not changed.
  • do an early check of errors which are expected, those shouldn't be logged as errors. Example: encrypted mails
  • Maybe implement a general mechanism for error handling that mark errors as resolved
  • In some error cases reindex the article via a scheduled task. It is not decided yet, where that task is created.

@bschmalhofer
Copy link
Contributor

bschmalhofer commented Jan 7, 2022

Back to the drawing board as this is fairly confusing.
It looks like the annoying message is the

$DebuggerObject->Error(
    Summary => $Param{Summary},
    Data    => $Param{Data},
);

in Kernel::GenericInterface::ErrorHandling::HandleError() Currently this method is called in Kernel::GenericInterface::Requester::_HandleError()whenever the transport layer thinks that something is fishy. This can be changed by allowing error handling modules to specify whether the error message should be written to the communication log. The message is then written only when:

  • there are no error handling modules, which is up to now the most common case
  • not all error handling module, that have run, requested to suppress the message

It is expected that error handling modules that suppress the message also set StopAfterMatch.

@bschmalhofer
Copy link
Contributor

The tentative names for the new error handling modules are:

  • FailureAccept
  • ArticleReindex

@bschmalhofer
Copy link
Contributor

bschmalhofer commented Jan 13, 2022

Started working on the FailureAccept error handling module. Found that adding a new module means a lot of duplication, especially in the admin interface. suppose it makes sense to make the error handling modules more generic.
TODO:

  • Factor out common HTML snippets
  • Factor out common functions in the frontend modules

Edit: the TODO items won't be done

@svenoe
Copy link
Contributor Author

svenoe commented Jan 13, 2022

Please have a quick look at my message... ;)

bschmalhofer added a commit that referenced this issue Jan 20, 2022
bschmalhofer added a commit that referenced this issue Jan 20, 2022
bschmalhofer added a commit that referenced this issue Jan 20, 2022
- a more concise constructor
- using 'unless' in end of line control flow
- empty line before return
bschmalhofer added a commit that referenced this issue Jan 20, 2022
bschmalhofer added a commit that referenced this issue Jan 20, 2022
bschmalhofer added a commit that referenced this issue Jan 20, 2022
bschmalhofer added a commit that referenced this issue Jan 20, 2022
for custom callback that assess the validty of a request response.
For now this is only available for HTTP::REST transport objects.
There should be no functional changes yet.
@bschmalhofer
Copy link
Contributor

bschmalhofer commented Jan 21, 2022

Discussed this with @svenoe . We found that using Error modules is not worth it, especially as it would not affect the first message in HTTP::REST anyways. The new approach is to add a AssessResponse() method in TicketManagement.pm. The reindexing of the ticket will be done in HandleResponse().

bschmalhofer added a commit that referenced this issue Jan 24, 2022
Using the $ESObject instead.
bschmalhofer added a commit that referenced this issue Jan 24, 2022
bschmalhofer added a commit that referenced this issue Jan 24, 2022
Index attachment by attachment, so that error handling is easier
bschmalhofer added a commit that referenced this issue Jan 24, 2022
@bschmalhofer
Copy link
Contributor

The current state of the implementation appears to be stable and working. But it is expected that further instances of unhelpful errors will crop up. These should be handled in new issue. Closing this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants