Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Ingest Manager] Error handling in server APIs #66688

Closed
17 of 18 tasks
skh opened this issue May 15, 2020 · 6 comments
Closed
17 of 18 tasks

[Ingest Manager] Error handling in server APIs #66688

skh opened this issue May 15, 2020 · 6 comments
Assignees
Labels
Meta Team:Fleet Team label for Observability Data Collection Fleet team

Comments

@skh
Copy link
Contributor

skh commented May 15, 2020

Currently, various parts of the APIs provided by Ingest Manager have implemented error handling and logging in different levels of completeness.

Overall, we should do the following when an error happens:

  • in the place where the error happens, use the Kibana Logger to log a descriptive error message. The logger is available since [Ingest] Use Kibana logger for proper server-side logging #66017
  • in the place where the error happens, throw a Boom error with a suitable status code that is not 500, and the same error message that was logged
  • In the place where the error happens, throw an IngestManagerError with a suitable IngestManagerErrorType and a helpful and descriptive error message.
  • in the request handler, if the caught error is a Boom error, use its status code in res.customError().
  • In the request handler, if the caught error is instanceof IngestManagerError, use res.customError() to return the error to the caller. Pass the message, and get a suitable HTTP response code from the getHTTPResponseCode() helper.
  • In the request handler, also use the Kibana Logger to log the error message to the Kibana log.
  • In the request handler, if the error is not IngestManagerError, use status code 500. In that case, log an error to the console with the full error message, and also log the stack trace of the error.

For an example how this looks in implementation see #66541

Implementation example: #67278

Reasoning

  • Kibana platform code logs a stack trace whenever a request handler returns a status code 500. This stack trace comes from within platform code and is not helpful in debugging the error. I find it also confusing, because it implies that the error hasn't been caught and handled correctly, which is not entirely true in our code. (There is Add error logs for HTTP 500 error details #65291 open for that.)
  • We should therefore use informative HTTP response codes whenever possible, or at least 400 bad request instead.
  • In cloud, customers can't inspect the Kibana log, so as much information as is practical should be provided through API error messages. This way, we can ask users e.g. on https://discuss.elastic.co/ to inspect their browser dev tools network tab and find out what happened. (Once we're in production, customers can ask support, and support agents will be able to inspect the Kibana logs.)
  • The UI can still decide not to show too much error information if that is not desired.

Tracking

This task is to go through each of these APIs and ensure it handles and reports errors properly:

  • setup
  • agent_config
  • enrollment_api_key
  • agent
  • epm
  • datasource
  • data_streams
  • install_script
  • output
  • settings
  • app
@skh skh added Meta Team:Fleet Team label for Observability Data Collection Fleet team Ingest Management:beta1 labels May 15, 2020
@elasticmachine
Copy link
Contributor

Pinging @elastic/ingest-management (Team:Ingest Management)

@joshdover
Copy link
Contributor

  • in the place where the error happens, use the Kibana Logger to log a descriptive error message. The logger is available since [Ingest] Use Kibana logger for proper server-side logging #66017
  • in the place where the error happens, throw a Boom error with a suitable status code that is not 500, and the same error message that was logged
  • in the request handler, if the caught error is a Boom error, use its status code in res.customError().
  • in the request handler, only use status code 500 if there is no further information in the error. In that case, also log an error to the console with the full error.

A couple things here:

  • In general, we are moving away from any Hapi-specific APIs, including the Boom package. I do not suggest using this package.
  • Baking HTTP status codes deep in your business logic isn't generally recommended
    • We prefer not to couple our business logic to a specific consumer, like HTTP. I would instead recommend describing the exceptions in terms of your domain (for example, PackageMissingError) and then translating these domain-specific errors to the appropriate HTTP status code in your HTTP response handlers. This makes your business logic more flexible by allowing different endpoints to have different status code responses for the same error (which often makes sense in many situations).
    • You can also implement a generic error handler that you can wrap all of your HTTP endpoints in. This handler can be responsible for translating your domain-specific errors to HTTP responses.
  • Logging doesn't have to occur at the error throw site.
    • There should be no issues with moving your error logging up to the HTTP handler. Stack traces should be preserved so you can locate the original throw site. Logging at the throw site would require that you pass the logger everywhere, which is not recommended as it couples your code more tightly to the Core APIs.

Example Code

Business logic layer

class MyService {
  public async findPackage(packageName: string) {
    try {
      const package = await this.doApiCall(`/my-api`, { packageName });
      if (!package) {
        throw new PackageMissingError({ packageName });
      }
      return package;
     } catch (e) {
       // This is all made up for this example, but you get the idea
       if (e.timedOut) {
         throw new ElasticsearchUnavailableError(e);
       } else if (e.unauthenticated)
         throw new UnauthorizedError(e);
       }
     
       throw new UnknownError(e);
    }
  }
}

HTTP handler

router.get(
  { path: '/api/ingest/package', validation: { body: ... } },
  async (context, req, res) => {
    const { packageName } = req.body;
    try {
      const package = myService.findPackage(packageName);
      return res.ok({ package });
    } catch (e) {
      // This could be extracted into a generic `translateIngestError` util
      if (e instanceof PackageMissingError) {
        return res.notFound(); // don't log 'expected' or unexceptional errors
      } else if (e instanceof ElasticsearchUnavailableError) {
        log.warning(e);
        return res.unavailable();
      } else if (e instanceof UnauthorizedError) {
        return res.unauthorized();
      } else {
        log.error(e);
        return res.internalError();
      }
    }
  }
);

@skh
Copy link
Contributor Author

skh commented May 18, 2020

@joshdover Thanks for your comments! I'll adapt the original proposal.

What is the time frame for getting rid of Boom?

@joshdover
Copy link
Contributor

What is the time frame for getting rid of Boom?

No timeline right now, but we may consider removing Hapi altogether in the future to simplify our HTTP stack. Don't see this happening until some 8.x version.

@skh
Copy link
Contributor Author

skh commented May 25, 2020

I've updated the description, and opened #67278 with a minimal implementation example following @joshdover 's suggestions. (Thanks!)

I mostly agree. One notable exception is that I implemented the different error types as a typescript enum instead of having a class for each one. We don't really work with inheritance like this in the rest of the Ingest Manager code base and I didn't want to introduce yet another pattern.

Also, if we want to add more logic to the handlers (e.g like supporting different log levels) this should be easy to add later.

@jfsiii
Copy link
Contributor

jfsiii commented Oct 1, 2020

With #77975 & #74409 every route has the consistent logging and return values from this PR description.

Re-opened #65837 to track replacing our usage of Boom

@jfsiii jfsiii closed this as completed Oct 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Meta Team:Fleet Team label for Observability Data Collection Fleet team
Projects
None yet
Development

No branches or pull requests

4 participants